Replace GUIDs with a Surrogate Key for Better Performance - Excelerator BI

Replace GUIDs with a Surrogate Key for Better Performance

I was doing some work for a customer this week – they had a performance issue with a Power BI report. The data in the workbook wasn’t overly large, about 400,000 rows, yet the file size was 110 megabytes and the performance of the model was relatively slow given the number of records. When I looked at the report I noted that the report was using GUIDs between the primary and foreign keys on a number of tables. Generally speaking, it is not good practice to use a GUID to join tables, as GUIDs do not compress well and have a negative effect on the efficiency of physical 1 to many relationships.

What is a GUID?

GUID is an acronym for Globally Unique IDentifier. In short a GUID is a hexadecimal number that is the equivalent of a decimal number with 39 digits – something like this 1,000,000,000,000,000,000,000,000,000,000,000,000,000 (a Duodecillion). Stated differently, a GUID is a highly unique ID that is unlikely (based on probability) to be reproduced even if it were generated randomly – which they are.  Read more about it here:

https://betterexplained.com/articles/the-quick-guide-to-guids/

What is Wrong with Using GUIDs in Relationships?

A physical 1 to many relationship in Power BI is a fundamental part of the underlying database structure. The Power BI engine (Vertipaq) materialises and stores these relationships into the database and then uses them to rapidly propagate filters from one table to another (via the 1 to many relationship). It is very common (even desirable) that the relationship logic is loaded in to fast L1 or L2 cache memory on your PC chip so that it can do its job super fast. If the relationship doesn’t fit in cache memory, then the whole process is going to be slower. Things that can drive up the size of the relationship are the number of unique values in the columns used in the relationship, but also the data type used in the relationship. Suffice to say, a GUID is way way less efficient that an integer value as a key column in Power BI.

Enter a Surrogate Key

One way to solve this problem is to replace the GUID with a surrogate key. A surrogate key, as its name suggests, is a new “key column” that is a “surrogate” (or a replacement) for the original key column. In a perfect world where you have SQL server at the back end, and you have an IT department that can do the work for you, I would suggest you get your IT department to build the surrogate key for you and make it available in a view so that you don’t have to use the GUID. But the world we live in is not perfect hence this article will show you how to replace the GUID with a surrogate key using Power Query.

Steps to Complete

Assuming you can’t get the work done at the data source, this is essentially a problem for Power Query to Solve. The steps to complete this process are as follows

  • create a connection to the raw dimension table
  • create a branch in power query
  • add a new integer ID column (surrogate key)
  • re-join the surrogate key table back with the original dimension table and replace the GUID using merge
  • repeat the step to replace the GUID in the fact table

This process assumes that your dimension table contains a complete list of keys that exist in your fact table . If this is not the case, you have a problem anyway. It is possible to vary this pattern to include the GUIDs coming from the fact table too, however it actually doesn’t solve the root problem (that you may have IDs in the fact table that are missing in the dimension table) hence I have not provided/suggested this as part of the solution.  Besides, doing this will certainly slow the refresh time without solving the root issue.

Connect to the Data

As you can see in the image below, I have two queries (RawCustomer, RawSales shown as #1 below) that connect directly to my sample data (I have modified Adventure Works so that the Customer number uses a GUID #2 below). Note my two Raw Data queries are simple connections. This is the technique Ken teaches in the Power Query Academy training, and I think it is a great practice. From there I created two staging queries (shown as #3 below) that are simple references to the raw data queries

So at this point, it simply looks like this

Create a Branch in Power Query on the Dimension Table

  • The next step I followed was to right click on the customer staging query and select “reference” to create a new branch. I called this new query CustomerGUIDs. In this query I simply kept the GUID column and removed everything else. As a safety step I remove duplicates just in case there was a duplicate in the GUID column (even though it is unlikely).
  • I then added an index column starting at 1, and called it CustomerID.

The new CustomerID column is the surrogate key. The beauty of this approach is that the surrogate keys will grow over time if the original GUID list grows (which is highly likely).  Also, if there are any deletions from the customer table, the surrogate key will just rebuild itself to the new set of data.

Re-join the Table of Surrogate Keys to Form a new Dimension Table

  • Next I took a second branch from the customer staging query (right click “reference”) and merged it with the CustomerGUIDs table.

  • Then I extracted the surrogate key
  • Then I deleted the original GUID from the final customer table.

You can see the before (#1) and after (#2) version of the Customer table below.

Repeat the Process to Replace the GUID in the Fact Table

  • I created a new query from the SalesStaging query (right click, “reference”) and merged it with the CustomerGUIDs table.
  • Then I extracted the surrogate key
  • Then I deleted the original GUID from the final sales table.

My final query dependency view looks like this.

I made sure all tables were set so they did not load apart from the Customers and Sales tables.

Results

The demo I have showed you here is with a very small dataset (19,000 rows). In my case the surrogate key approach reduced the file size by more than 30%, and there was only 1 table of surrogate keys in my file. In the case of my customer that had around 400,000 rows of data, 3 GUID key columns and a much higher number of unique GUIDs, the file size reduction was much greater, from 110MB to just 11MB (90% reduction with the surrogate key). Probably more importantly there were noticeable performance improvements after the cahnge.

Are There Any Negative Impacts?

Refresh Time

At this point you might be asking yourself “won’t this slow down the refresh time?”. If you were thinking this, you are absolutely correct, it will most likely slow the refresh performance. However, it is much better to have a slower refresh time and a faster runtime performance than the other way around.

What if I need my GUIDs for Auditing?

A second issue is that you may need the GUIDs so you can trace the data in your report back to the transaction in the source system.  If this is a common need with your data, then I suggest you still create and use the surrogate key, but also load the GUID in the dimension table as an additional column.  That way, the GUID is available but not used in the relationship.  Better still, remove the GUID from the dimension table and then bring it back later only if you have an issue that needs you to trace the source.

Wrap Up, and Where to Learn More

Here is a copy of the workbooks I used in this article in case you are interested in taking a closer look.

If you would like to learn how to be great using Power Query, I recommend you take a look at the Power Query Academy online training at Skillwave.training.  Ken, Miguel and I have teamed up forces to create the best, most comprehensive Power Query training course available.

22 thoughts on “Replace GUIDs with a Surrogate Key for Better Performance”

  1. Matt,
    Thanks for this information. It bears directly on an issue we are having with out data warehouse on Snowflake and Power BI.

    We are facing an issue with truncation of numeric surrogate keys in Power BI when connected to Snowflake and were ironically discussing converting to GUIDs to address the problem.

    We currently generate Surrogate IDs on Snowflake for all tables. Snowflake stores both decimal and integer values as numbers, so our Integer IDs are represented as Number(19,0). Power BI interprets this as a decimal, which is only supported to 15 digits, so an ID of -9223191239154221888 is stored in Power BI as -9223191239154222000.

    When users are building Power BI data models with only imported data, we are not seeing collisions on the IDs (yet). Direct Query data models are also not affected.

    However, with composite models, we are seeing issues where the imported ID, which is truncated, does not match up with the ID on the database when passed as a filter in a direct query.

    We prefer not to require our uses to duplicate and transform columns to address this, so were considering transforming our surrogate key approach to use GUIDs instead to avoid the truncation of the key. Another option is to create a “text” version of the surrogate key, so that -9223191239154221888 is stored as a text vs a number. However, it seems like this would face similar issues with memory as a GUID?

    1. I would expect a text based version of your SK would have similar impacts as GUIDs. The best approach is to create new SKs that are small integers if possible. Best to do that in your DB if possible.

  2. David O'Neill

    Great post, I have implemented this on one of my models looking at Dynamics CRM Data, where I am connecting all my tables on GUIDs. I have not noticed the compression levels you have experienced in your example, my change in model size was only ~300kb per GUID column, (changed 2 columns with this method and got a 600Kb reduction) and that is with a fact table of 4.5M rows. Saying that, the query time improvements have been noticeable, with my queries running 50% quicker. I am going to play with this method in more depth to see what can be achieved with it on the compression side. Thanks again for this Matt.

      1. Absolutely, I am going to run through some more tests to see what is achievable ( I have many many reports I can test this on! ) Will definitely post back with any findings.

  3. Thanks for the detailed post Mark. My questions-

    1. As we are bringing in a non foldable operation of index creation, I guess this impact the refresh.
    No?

    2. Also, in case of incremental refresh of the fact, for any deletion in dim- if the surrogate key rebuilds itself, will it not break the dim- fact relationship ?

  4. Clear and practical post as usual! As a best practice, the creation of index/surrogate key should be pushed upstream into a data warehouse if possible, to reduce the refresh time in PowerBI workflow. If DW is not available, then try to push this operation into PBI dataflows to offload the data refresh time.

    1. Correct, however be aware that dataflows will not work with referenced queries unless you have premium. You can do the reallocation of the surrogate key in