The objective of this article is to help you understand the differences between shaping your data (using Power Query) and modelling your data (using the modelling tools inside Power BI). You will need to do both to build a robust Power BI report. I first wrote this article in September 2016 and am now updating it to reflect the latest developments in Power BI.
The Four Phases of a Self Service BI Project
I like to think of a Self Service BI Project as having four logical phases as shown below.
The image above shows the logical flow of data from the source through to the delivery of final reports. While the image shows that the data flows from left to right, the reality is that when you build the solution, you will likely jump back and forth through all the different phases iteratively until the job is done. Two of the phases above are part of the topic today; Data Acquisition (Shaping) and Data Modelling. These are both IT BI terms that business users may or may not have heard before.
Shaping and Modelling Your Data
Data acquisition using Power Query is where the “shaping” is done, and data modelling using the Power BI Analysis Services engine is where the “modelling” is done. You need to do both to build a good Power BI report.
Definition of Shaping
Shaping is the process of extracting your data from the source, transforming the data into the shape you need, and then loading it into Power BI. This process is often referred to as ETL (extract, transform and load). The reason I like the word shaping is that it clearly describes what you are actually doing in this step. The process of shaping your data includes:
- Restructuring the data coming from your source tables into new tables that meet the requirements for the reporting database. The table structure in a relational database is seldom (if ever) appropriate for Power BI. Don’t fall into the trap of just loading what is in your source data.
- Selecting only the columns and rows that are needed. Load everything you need, and nothing you don’t. It’s easy to make changes later if needed.
- Pre-aggregating the data (if desired/needed), although I generally don’t recommend this in most cases (there are always exceptions).
- Naming/renaming the columns and tables to be business user friendly.
- Loading everything into the Power BI database (the data model).
Where possible, it is better to do this shaping work inside views in your relational database (if you have one). If this is not possible for any reason, Power Query is there to do this work for you.
Definition of Modelling
Modelling is the process of designing the structure of your tables of data, building the relationships between your tables, and writing measures to extract the business insights you need in your reports.
Data modelling includes:
- Deciding what data you load and what you do not load.
- Deciding on a table schema; is it one table, two tables, many tables? You should always prefer a star schema if possible.
- Deciding what names you give to the columns and tables
- Deciding if your tables are wide and short (many columns), or long and narrow (fewer columns)
- Joining tables using relationships (where needed)
- Writing the business logic in measures to extract the value adding insights from your raw data.
- Correctly formatting measures and columns to meet the business requirements (e.g. currency, percentage, appropriate number of decimals)
- Assigning good business names to the measures that accurately describes what is being calculated.
Note above, that the decision on what to load is part of modelling, but the process of structuring the data is part of shaping. Clearly these 2 concepts are closely linked and are mutually dependent.
Things Are Not Always Clear Cut
One thing that creates a lot of confusion for new Power BI users is that there are at least 2 ways to add columns to your data. You can add a column using Power Query or you can add a calculated column using DAX. Of course it may be possible for a third way; add the data back at the source database instead. Regardless of which method you use, the decision on whether to add the column (or not), and what should be in the column, is a data modelling decision. It is debatable if the process of actually adding a calculated column is shaping or modelling, but I think of it as a shaping task.
So Which Add Column Method Should You Use?
So there are 3 places you can add a column, so which should you use? Well it depends on what you are trying to do. My general advice on adding columns to your tables is to make the change as close to the source as possible. So in order of preference, you should:
- Add the column in your source data (eg your Data Warehouse) with the help of your IT friends if necessary.
- Add the column using Power Query on data load.
- Add the column inside Power BI using a calculated column.
The reasons for this recommended approach are:
If you add the column at the source, then you have access to it for your current BI needs but also for future BI needs and other BI users. Adding the column at the source provides the most reuse, and if you need to make a change down the track, it will be the least maintenance. The reality however is often you can’t get it added (or changed) at the source for various reasons, or at least you can’t get it to happen quickly, so, often this is not an option particularly for self service BI.
If you add the column inside Power Query (or at the source), it will be loaded and compressed with all other columns during data load. This normally means better overall table compression and hence a good outcome for the overall model. It also keeps all the shaping tasks together, which can make it more intuitive to find things and easier to maintain.
Normally you should only use a calculated column if the other two approaches are not good options for whatever reason. The most common reason you should choose to add a calculated column (using DAX) is if you need to leverage existing parts of the model to create the new data. An Adventure Works example here should help. Imagine you want to classify your customers in bands, of high sales, medium sales, or low sales. You need a new column in your customer table (a modelling decision) so you can use that column on a slicer in one of your reports. If you tried to do this task inside Power Query, it would be quite some additional work. You would have to work out the total sales for each customer in Power Query, and that would require you to
- Create a join between the customer table and the sales table
- Pre-aggregate the sales data for each customer
- Group the customers into the size bands using the business logic you need.
- Add the column (high, medium, low).
The important point is that items 1 and 2 above will most likely already exist in the data model itself. In the case of Adventure Works, the model already has a customer table, a relationship to the sales table, and a measure that aggregates the sales. These features of the model can be used to easily add the new calculated column using DAX. So in short, you should prefer a calculated column when it leverages the logic of your model (measures and relationships) so that you don’t have to repeat this logic inside Power Query.
Now, just because you can, doesn’t mean you should. Please don’t fall into the trap of using too many calculated columns instead of writing measures. I rant on about that here. Use calculated columns when they are needed, otherwise avoid them.
Hopefully this explanation of the role of shaping and modelling in Power BI has helped clarify some things for you. If you found this article useful, you may like to check out the other useful articles in my knowledge base here.