As regular readers of my blog would already know, I have just finished writing my book “Learn to Write DAX” and it will be released in November 2015. One of the topics in the book talks about the optimal shape of your data tables for DAX. It is such an important topic that I have decide to share some modified extracts from the book here.
When I say “shape” of your data, I am taking about things like how many tables you import into Power Pivot, how many columns are in each table, which columns are in each of the tables etc. One reason data shape is important is because the shape of data in transactional systems (or relational databases) is seldom the ideal shape for Power Pivot. When the IT department executes an Enterprise BI project, one of the important first steps is to shape the data so it is optimal for reporting. This step is normally completely transparent to the end user (ie you) and hence the end user is shielded from the need to do this. But you need to understand data shaping if you want to have efficient and effective Power Pivot data models. Just copying what you have got in your source data is unlikely to be optimal.
The Star Schema is the optimal shape
The generally accepted approach to bringing data into Power Pivot is to bring in your data in as a “Start Schema” format. This is a technical term coming from the Kimball methodology (also known as dimensional modelling) which describes the logical way data should be structured for optimal reporting performance. You can see why it is called a Star Schema by looking at the following image – note the star shape.
However I use and teach Excel users to follow the graphical layout approach developed and recommended by Rob Collie – let’s call it the Collie Methodology! That is to place the lookup tables at the top of the screen and the data tables at the bottom (see the image below).
If you compare the 2 images above, you will see they have exactly the same logical relationship (links) between the tables, it is just that they have a different visual layout.
Lookup tables and Data tables.
In the professional BI world, the Lookup tables would be referred to as Dimension tables and the Data tables would be referred to as Fact tables. For Excel users I suggest you think of them as Lookup tables and Data tables as it is a closer map to our old world of VLOOKUP.
You should have 1 Lookup table for each “object” that you need for reporting purposes. In the sample data above from Adventure Works, these objects are “customers”, “products”, “territories” and “time” (ie Calendar). A key feature of a Lookup table is it contains 1 and only 1 row for every individual item in the table and as many columns as you need to describe the object. Eg, there is only 1 row for each unique Customer in the Customer table. The Customer table has lots of columns describing each customer such as customer number, customer name, customer address etc, but there is only 1 row for each customer – each one is unique based on the customer number – no duplicates allowed.
The Data table in the Adventure Works data is the Sales table (it is possible to have many data tables in your data model but there is only one in this example). This Data table contains lots of rows (60,000+ in this case) and contains all of the transactional records of sales that occurred over several years. Importantly the data table contains one column “key” that matches to each of the “keys” in each Lookup table needed for reporting. So in this sample data, there is a date, customer number, product number, and territory key so that the data table can be logically joined to the Lookup tables.
The ideal shape of Data tables is to have very few columns but to have as many rows as needed to bring in all the data records. These Data tables normally have lots of rows (sometimes in the 10s or even 100s of millions).
Why you should layout your Star Schema using the Collie Methodology
There are a few reasons you should visually layout your data using the Collie Methodology.
- Firstly it gives you a visual clue to which ones are the “lookup” tables. They are the ones above that you have to “lookup” to see them.
- Also note that in the old world of traditional Excel, you would have probably written VLOOKUP formulae to bring the data from these Lookup tables into your single data table prior to creating a pivot table.
- Finally, filters only propagate in 1 direction in Power Pivot (in Excel version 2010/2013) and that is from the 1 side of the relationship to the many side of the relationship. The Lookup tables are on the 1 side, so that means the filters always flow “down hill”. They do not/can not flow “up hill” automatically (but note this can be done with advanced DAX). So this layout gives you a visual clue which direction the filters will flow.
A final word on Shaping
While the optimum shape is a Star Schema as show above, that doesn’t mean this is the only way it will work. It will also work with a Snowflake Schema too (where there are secondary lookup tables joined to the primary lookup tables), but the point is that these extra relationships required to make the “snowflake” can come at a cost of worse performance and possibly also confusion to the user (if they access the field list for Pivot Tables).
There is no 1 correct way to shape your data, but what you need to remember is
- There are 2 types of tables
- Data tables that contain your actual data that you want to analyse.
- Lookup tables that contain metadata about the objects you are going to analyse, like the name, address, suburb of each customer.
- The optimal way to shape your data is a Star Schema as shown above, but this is not the only way it will work.
August 14, 2015 7:59 am