I was doing some work for a customer this week – they had a performance issue with a Power BI report. The data in the workbook wasn’t overly large, about 400,000 rows, yet the file size was 110 megabytes and the performance of the model was relatively slow given the number of records. When I looked at the report I noted that the report was using GUIDs between the primary and foreign keys on a number of tables. Generally speaking, it is not good practice to use a GUID to join tables, as GUIDs do not compress well and have a negative effect on the efficiency of physical 1 to many relationships.
What is a GUID?
GUID is an acronym for Globally Unique IDentifier. In short a GUID is a hexadecimal number that is the equivalent of a decimal number with 39 digits – something like this 1,000,000,000,000,000,000,000,000,000,000,000,000,000 (a Duodecillion). Stated differently, a GUID is a highly unique ID that is unlikely (based on probability) to be reproduced even if it were generated randomly – which they are. Read more about it here:
What is Wrong with Using GUIDs in Relationships?
A physical 1 to many relationship in Power BI is a fundamental part of the underlying database structure. The Power BI engine (Vertipaq) materialises and stores these relationships into the database and then uses them to rapidly propagate filters from one table to another (via the 1 to many relationship). It is very common (even desirable) that the relationship logic is loaded in to fast L1 or L2 cache memory on your PC chip so that it can do its job super fast. If the relationship doesn’t fit in cache memory, then the whole process is going to be slower. Things that can drive up the size of the relationship are the number of unique values in the columns used in the relationship, but also the data type used in the relationship. Suffice to say, a GUID is way way less efficient that an integer value as a key column in Power BI.
Enter a Surrogate Key
One way to solve this problem is to replace the GUID with a surrogate key. A surrogate key, as its name suggests, is a new “key column” that is a “surrogate” (or a replacement) for the original key column. In a perfect world where you have SQL server at the back end, and you have an IT department that can do the work for you, I would suggest you get your IT department to build the surrogate key for you and make it available in a view so that you don’t have to use the GUID. But the world we live in is not perfect hence this article will show you how to replace the GUID with a surrogate key using Power Query.
Steps to Complete
Assuming you can’t get the work done at the data source, this is essentially a problem for Power Query to Solve. The steps to complete this process are as follows
- create a connection to the raw dimension table
- create a branch in power query
- add a new integer ID column (surrogate key)
- re-join the surrogate key table back with the original dimension table and replace the GUID using merge
- repeat the step to replace the GUID in the fact table
This process assumes that your dimension table contains a complete list of keys that exist in your fact table . If this is not the case, you have a problem anyway. It is possible to vary this pattern to include the GUIDs coming from the fact table too, however it actually doesn’t solve the root problem (that you may have IDs in the fact table that are missing in the dimension table) hence I have not provided/suggested this as part of the solution. Besides, doing this will certainly slow the refresh time without solving the root issue.
Connect to the Data
As you can see in the image below, I have two queries (RawCustomer, RawSales shown as #1 below) that connect directly to my sample data (I have modified Adventure Works so that the Customer number uses a GUID #2 below). Note my two Raw Data queries are simple connections. This is the technique Ken teaches in the Power Query Academy training, and I think it is a great practice. From there I created two staging queries (shown as #3 below) that are simple references to the raw data queries
So at this point, it simply looks like this
Create a Branch in Power Query on the Dimension Table
- The next step I followed was to right click on the customer staging query and select “reference” to create a new branch. I called this new query CustomerGUIDs. In this query I simply kept the GUID column and removed everything else. As a safety step I remove duplicates just in case there was a duplicate in the GUID column (even though it is unlikely).
- I then added an index column starting at 1, and called it CustomerID.
The new CustomerID column is the surrogate key. The beauty of this approach is that the surrogate keys will grow over time if the original GUID list grows (which is highly likely). Also, if there are any deletions from the customer table, the surrogate key will just rebuild itself to the new set of data.
Re-join the Table of Surrogate Keys to Form a new Dimension Table
Next I took a second branch from the customer staging query (right click “reference”) and merged it with the CustomerGUIDs table.
- Then I extracted the surrogate key
- Then I deleted the original GUID from the final customer table.
You can see the before (#1) and after (#2) version of the Customer table below.
Repeat the Process to Replace the GUID in the Fact Table
- I created a new query from the SalesStaging query (right click, “reference”) and merged it with the CustomerGUIDs table.
- Then I extracted the surrogate key
- Then I deleted the original GUID from the final sales table.
My final query dependency view looks like this.
I made sure all tables were set so they did not load apart from the Customers and Sales tables.
The demo I have showed you here is with a very small dataset (19,000 rows). In my case the surrogate key approach reduced the file size by more than 30%, and there was only 1 table of surrogate keys in my file. In the case of my customer that had around 400,000 rows of data, 3 GUID key columns and a much higher number of unique GUIDs, the file size reduction was much greater, from 110MB to just 11MB (90% reduction with the surrogate key). Probably more importantly there were noticeable performance improvements after the cahnge.
Are There Any Negative Impacts?
At this point you might be asking yourself “won’t this slow down the refresh time?”. If you were thinking this, you are absolutely correct, it will most likely slow the refresh performance. However, it is much better to have a slower refresh time and a faster runtime performance than the other way around.
What if I need my GUIDs for Auditing?
A second issue is that you may need the GUIDs so you can trace the data in your report back to the transaction in the source system. If this is a common need with your data, then I suggest you still create and use the surrogate key, but also load the GUID in the dimension table as an additional column. That way, the GUID is available but not used in the relationship. Better still, remove the GUID from the dimension table and then bring it back later only if you have an issue that needs you to trace the source.
Wrap Up, and Where to Learn More
Here is a copy of the workbooks I used in this article in case you are interested in taking a closer look.
If you would like to learn how to be great using Power Query, I recommend you take a look at the Power Query Academy online training at Skillwave.training. Ken, Miguel and I have teamed up forces to create the best, most comprehensive Power Query training course available.