Level: Beginners
One of the topics in my book “Supercharge Power BI: Power BI is Better When you Learn to Write DAX” talks about the optimal shape of your data tables. It is such an important topic that I have decided to share some modified extracts from the book here.
When I say “shape” of your data, I am talking about things like
- how many tables you import into Power BI (and Power Pivot for Excel),
- how many columns are in each table,
- which columns are in each of the tables, and
- should you unpivot your data so column headings become values in rows in your table.
One reason data shape is important is because the shape of data in transactional systems (or relational databases) is seldom the ideal shape for Power BI. When the IT department executes an Enterprise BI project, one of the important first steps is to shape the data so it is optimal for reporting. This step is normally completely transparent to the end user (i.e. you) and hence the end user is shielded from the need to do this. But you need to understand data shaping if you want to have efficient and effective Power BI data models. Just copying what you have got in your source data is unlikely to be optimal.
The Star Schema is the Optimal Shape
The generally accepted approach to bringing data into Power BI is to bring in your data in a “Star Schema” format. This is a technical term coming from the Kimball methodology (also known as dimensional modelling) which describes the logical way data should be structured for optimal reporting performance. You can see why it is called a Star Schema by looking at the following image – note the star shape.
However I use and teach Power BI users to follow the graphical layout approach developed and recommended by Rob Collie – let’s call it the Collie Methodology! That is to place the lookup tables at the top of the screen and the data tables at the bottom (see the image below).
If you compare the 2 images above, you will see they have exactly the same logical relationships (links) between the tables, it is just that they have a different visual layout.
Lookup Tables and Data Tables
In the professional BI world, the Lookup tables would be referred to as Dimension tables and the Data tables would be referred to as Fact tables. For Power BI users I suggest you think of them as Lookup tables and Data tables as it is a closer map to our old world of Excel VLOOKUP.
Lookup Tables
You should have 1 Lookup table for each “object” that you need for reporting purposes. In the sample data above from Adventure Works, these objects are “Customers”, “Products”, “Territories” and “Time” (i.e. “Calendar”). A key feature of a Lookup table is it contains 1 and only 1 row for every individual item in the table and as many columns as you need to describe the object. For e.g., there is only 1 row for each unique Customer in the Customers table. The Customers table has lots of columns describing each customer such as customer number, customer name, customer address etc., but there is only 1 row for each customer – each one is unique based on the customer number – no duplicates allowed.
Data Tables
The Data table in the Adventure Works data is the Sales table (it is possible to have many data tables in your data model but there is only one in this example). This Data table contains lots of rows (60,000+ in this case) and contains all of the transactional records of sales that occurred over several years. Importantly the Data table contains one column “key” that matches to each of the “keys” in each Lookup table needed for reporting. So in this sample data, there is a date, customer number, product number, and territory key so that the Data table can be logically joined to the Lookup tables.
The ideal shape of Data tables is to have very few columns but to have as many rows as needed to bring in all the data records. These Data tables normally have lots of rows (sometimes in the 10s or even 100s of millions).
Why you should layout your Star Schema using the Collie Methodology?
There are a few reasons you should visually layout your data using the Collie Methodology.
- Firstly it gives you a visual clue to which ones are the “Lookup” tables. They are the ones above that you have to “lookup” to see them.
- Also note that in the old world of traditional Excel, you would have probably written VLOOKUP formulae to bring the data from these Lookup tables into your single Data table prior to creating a pivot table.
- Finally, filters only propagate in 1 direction in the data model and that is from the 1 side of the relationship to the many side of the relationship. The Lookup tables are on the 1 side, so that means the filters always flow “down hill”. They do not/cannot flow “up hill” automatically (but note this can be done with advanced DAX and also by editing a relationship in Power BI to make it bi-directional). So this layout gives you a visual clue which direction the filters will flow.
Naming Conventions
It is a good practice to use simple English names for your tables and columns. Avoid using abbreviations and underscores wherever possible.
Example: Instead of a table called Pricing_History_Sum_Tbl, simply call it Pricing. It is much easier to read and write your DAX if you keep it simple.
Other best practices include:
- Use English business names.
- No spaces in Table Names (use CamelCase if necessary).
- Spaces are fine in Column Names, because you always refer to these in [square brackets], so it doesn’t matter.
- Keep the names of your tables and columns short and to the point.
Measures on the other hand should always use Spaces and not CamelCaseAsTheyAreHarderToReadThatWay. Also avoid abbreviations unless they are common in your business setting. Some examples of good Measure names are
[Total Sales]
[Total Sales YTD]
[Total Sales All Products]
A Final Word on Shaping
While the optimum shape is a Star Schema as shown above, that doesn’t mean this is the only way it will work. It will also work with a Snowflake Schema too (where there are secondary Lookup tables joined to the primary Lookup tables), but the point is that these extra relationships required to make the “snowflake” can come at a cost of worse performance and possibly also confusion to the user (if they access the field list for visuals).
There is no 1 correct way to shape your data, but what you need to remember is
- There are 2 types of tables
- Data tables that contain your actual data that you want to analyse.
- Lookup tables that contain metadata about the objects you are going to analyse, like the name, address, suburb of each customer.
- The optimal way to shape your data is a Star Schema as shown above, but this is not the only way it will work.
Hi, Matt,
thanks for a simple and clear explanation.
There’s one thing I am still not sure – if Dimension/Lookup tables may reffer to each other.
My case: I am reporting the sales volumes, business meetings are important aspects of the reporting as well.
I will definitelly have Dimension/Lookup tables “SalesPersons” and “Meeting”.
But what I wonder is how to define the fact that SalesPersons arrange Meetings (1:many)?
Should Meetings table refer to SalesPersons table directly or is it better approach the create a separate Fact/data table managing this relation between two Dimension/Lookup tables?
Thank you
Jakub Albrecht
These are hard questions in a blog comment section. To me, SalesPeople is a dimension table – it has a primary key – and meetings is a fact table. It can get complicated with your example where a Sales Person organises a meeting. Is “who” organises the meeting important to you for reporting? Is it always 1 person that organises the meeting? Do you record who “attends” these meetings, and are these the same sales people as in the salepersons table? If the attendees are important, then that would be a fact table of the meetings table, right? Generally you want 1 filter path from any table to another table. If you have more than 1, you may get unexpected results. Just because you “can”, doesn’t mean you “must/should”. The reasons you may want to join the “who organised the meeting” may include that you want to roll up reporting by department or manager of “who” organised the meeting. If you just want to report on who organised it by name, you can do that by using the column in the meeting fact table without the need to join it to the salespersons table.
Very interesting read. I have a quick question. Am I better using MULTIPLE SHORT NARROW tables or rather than ONE LONG NARROW table?
I ask because when I have multiple Indicator datasets with the same field names and with a denominator and numerator fields that I use to create the percentage measure, I always appended these Indicator datasets into the one long narrow Data table.
However, after reading your article, I am wondering if the design would be more efficient if I had an individual Data table for each indicator dataset. I could then have the links from the Lookup tables to each Indicator Data table, which is controlled by the Lookup tables.
mmmm. It’s complicated and there is no one correct answer. Things that you need to consider include.
Multiple data tables will mean extra relationships
Relationships with a high cardinality are more expensive than those with a small cardinatlity
If you need data from multiple data tables to do single calculations, there could me more storage engine queries (or more expensive queries) with multiple data tables than with single tables.
Sometimes, it is more efficient to store data in extra columns in a data table than in a single column. eg, if you have a very large data table 100+ million rows, with sales for multiple years, it may be more efficient to store the sales data twice in 2 columns, one for this year and one for last year, than to write time intelligence functions to fetch both year’s data.
As Marco always says – it depends.
Good tips. All of what you said makes sense. The only thing I dispute is in your introduction, you say “When the IT department executes an Enterprise BI project, one of the important first steps is to shape the data so it is optimal for reporting. ”
Yes data shaping would be important, but I doubt it would happen. I used to work in Enterprise IT. Unless things have changed a lot since then I would be surprised if they did any “shaping” other than setting up the data connection, or providing exported data. The “BI expert” would have to do that in Data Manager.
Very helpful, Thank you Matt.
Hi Matt,
excellent post, very much looking forward your DAX-book.
Although modelling this style, I’ve often seen people dragging the date-key from the data table instead of from the date-(lookup-) table which leads to wrong results in some cases. So hiding these columns from the client tools makes sense in my eyes.
Meanwhile I pushed this a little further: Putting my measures into dedicates measure tables, I’m now able to hide my complete data tables from client tools, which is also a good check if the model is complete.
I agree with you on the date thing. If I share my workbooks with power users (ie they use the PivotTable Field List) then I also hide the other date columns (and anything else that can get them into trouble). But as a rule I try to distribute reports where the users do not need to interact with the PivotTable Field List.
I have also used the Measures table in the past and find this quite good. It is interesting that when writing Cube Formulas, the Intellisense automatically adds the Measures table regardless. To me this looks like another of those things that Microsoft just needs to polish up a bit.
Great tips! Thanks Imke and of course Matt!