Self Referencing Tables in Power Query - Excelerator BI

Self Referencing Tables in Power Query

I have had this idea in my head for over a year, and today was the day that I tested a few scenarios until I got a working solution. Let me start with a problem description and then my solution.

Add Comments to a Bank Statement

The problem I was trying to solve was when I download a digital copy of my bank statement, loaded it up into Excel using Power Query and then wanted to add some commentary to some of the transactions.  If you just add the comments to the source file then everything is easy.  But it is often not practical to add the comments directly into the source file for a number of reasons.  This could be because you get a new copy of the source data each time you refresh (eg you get a CSV file that replaces the old file, and the file just gets larger each month), or it could be if you are using a “combine multiple files” technique and you only want to deal with a summarised, cleansed version of the final data instead of dealing with the source files directly.

Once you have loaded the table in Power Query, it is possible to add a new column to the resulting table (shown below).  The problem with this approach is that the comments are not logically linked to the rows of data in the table.  Take the example below.  The data is loaded from table 1 using Power Query, into table 2 on the right.  I then manually added a new comment column and added some comments (as shown below).

comments

The problem that can happen is demonstrated below.  If the sort order of the source table changes and then you refresh the query, the comments no longer align with the original data.  And this is despite having a unique ID column in the original table.

sort

The problem is that the new column I manually added to the Power Query table on the right is not logically joined to the actual table, and hence the comments are actually in a column next to the table rather than part of the rows in the main table.

Enter Self Referencing Tables

I read this blog post from Imke quite some time ago, and that was what gave me the idea on how I could solve this problem. The idea is to load table 2 above a second time (after adding the comments), and then joining it back to itself, hence logically joining the manually added comments to the rows in the main table.

Note: Prior to completing the self referencing steps below, I used Power Query to create the second table from the source table. This was my starting point.

Then I loaded the resulting table 2 a second time as follows:

  1. Select in the second table
  2. Go to the Power Query menu
  3. Create a new query using “from table”.

image

I renamed this new query “Comments” and selected “Close and Load To” so that it only created a connection but didn’t load the new table to the worksheet.

image

The next thing to do is to merge the original query with the new query.  To do this,

  1. Go back and edit the original query for table 2
  2. Add a new step to the query using “Merge Query”
  3. Merge the second “Comments” query with the original query by joining on the ID column as shown below.

image

This will give you a new column in your query, and the column contains a table (as shown below).  You can then expand this new column to extract the comments.

image

I only extracted the comments column, and deselected the last option as shown below. Note, it is important that you deselect “Use original column name as prefix” so that the new column has the same name as the original source column.

Click OK.

Edit May 2020.  Power Query is always changing.  It seems there has been a change which introduces the need for another step at this stage. After expanding the new column, you will see that Power Query renames the new column and calls it Comments.1 (shown below in 1).

Rename this step

You need to edit the code shown in 1 and remove the .1 so the new column has exactly the same name as the original table you loaded.
After you have made this change, you can then close and load.

image

When you look at the table now, it all looks the same as before (see below), but there is one important difference.  The last column in the Power Query Table is now a part of the table and not a manually added on column.  You now have a self referencing table.  EDIT 2017.  If you get a new column called Comment2, just delete it.  Everything else will work. There was a change to the way Power Query works at some stage and this duplicate column with “2” appended now appears.  But you can (and should) just delete it – it will then all work.

image

To prove this, I completed the same sort test as before.  First I sort the source table, then refresh the Power Query table – you can see the results below.  Note how this time the comments stick with the row in the table – sweet!

sort2

Power Query Online Training

Incrementally Add Comments

Now that the queries are all set up, you can incrementally add comments to the Power Query table any time you like.  And you can also add new data to the source data – it will all work as you would expect as shown below.

final

Real Life Data

The above demo is of course test data.  A more realistic real life scenario would be to download a CSV file from your bank, and use one of the many “combine multiple files” techniques to import the data.  It is likely that your source data is messy, and contains lots of columns you don’t need. In this real life scenario, it is not easy to add comments to the source data unless you open each file and add the comments there – this is not ideal.  Another problem could be that your data source could be a CSV file that uses a “delete and replace with the new larger file containing the latest records” method.  In both of these scenarios, you will need to have your manually added comments located somewhere other than the source table.  The solution provided here is a simple and intuitive way to manage this for the user (albeit with a little bit of setup first).

The example shown in this post will work just fine as long as 2 conditions are maintained.

  1. You have a unique ID for each record
  2. You don’t accidentally load duplicate data – so don’t do that!

But What if I don’t have an ID column

The next logical issue you may have is a scenario where you don’t have an ID column.  I tested quite a few scenarios including adding an ID column inside the first Power Query step (add index column).  I haven’t been able to find a solution that works 100% of the time, but there is one that can work for most people.  If your data is such that each row in the table is unique (eg date, maybe a time column if you are lucky enough to have this, amount of transaction, remaining balance etc) you could create a unique ID by concatenating these columns into a single unique column.  This should solve the issue for most people in non-commercial scenarios (those in commercial scenarios are more likely to have a reference ID anyway).  You could even create an audit query that checks for duplicates and warns you if this occurs, but that sounds like another blog post.

I hope you find this article useful, and or it inspires you to come up with your own innovative ways to use Power Query.

 

If you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/

77 thoughts on “Self Referencing Tables in Power Query”

  1. Hi Matt,
    can you explain why this works? When you try to use the result of an Excel formula inside the formula, you’ll get an error. Somehow this solution feels like it would also create an error. Does it depend on the order in which the queries are processed? Is there any risk that future versions of Power Query will break my table-with-comments?

    1. Can I explain why?, not really; i just know it works. Effectively there are 2 copies of the table, to original and the one with the comments, and it sorts itself out. Power Query is not Excel. Excel formulas continue to iterate until there is an end point. Power Query doesn’t – it just does one iteration then stops once the table loads. It seems to always work. Yes, there is a risk it will stop working. It has already changed a few times, but it still works.

  2. Matt,

    This has to be one of the most glorious solutions I’ve come across. I’m a heavy user of Visio in my consulting role and needed to be able to make use of the data driven Visio cross functional flowcharts but needed to use the Excel table the Visio interacts with as a dynamic data source. As you may be aware Visio is notoriously finicky about any changes to the underlying table. For the project I’m on I needed to not only derive additional columns from the Visio data but also to extend the dataset out in order to build some semi-dynamic documentation. I’d been losing my mind trying to carefully extend the base spreadsheet only to keep having the link between the Visio and the Excel break.

    Finding this self referencing method has completely solved the issue because I can simply leave the original Visio Excel table alone. I’ve built a second spreadsheet that uses the original table as a source but also extends on that table for the additional fields I need using the Visio Object ID as the primary key. I’ve managed to get a combination of calculated fields, mixed with free text fields (as per your example) and it’s completely saved the day!

    Thank you so much for sharing this brilliant solution.

    Bravo!

    Stu
    ITSM/SIAM Consultant

  3. Thank you Matt for this post; Leila G pointed me to your site which I greatly appreciate (I took her PQ course took to find such nuggets – her course is great btw!!). I only found one other YouTube video referring to this topic (Doug H from exceltraining101.com). There are a lot of assumptions here (it should work, you should delete) which may not be straight forward! Have you tested the position of your “comment” column in reference to the original query? I would like my “comment” column to actually be almost at the beginning (2nd column to the left) of my original table (inserting the “comment” in column B); this seems to throw a curve ball to the process. I would love to hear from you if you have tested more scenarios on how the “comment” column position affects performance. My “source” table is a query from an external Excel file. Many thanks in advance! Yves

      1. Matt, I have tested a few options:
        1) I figured it is possible to add multiple “comment” columns (I needed to add 5 comment columns!)
        2) when adding the “comment” columns, I created them first at the end of the table (far right); when I attempted to insert them in the middle or before the first column, the “comments.1″,”comments.2” replicas would pop up on every refresh. This seems to be a key step.
        3) I created the simple self-referencing query as you suggest.
        4) I merged my original query with the self-referencing using several data columns for my merge – I had no unique ID).
        5) Upon the very first refresh, I deleted the extra “comments.1” to the right of my table.
        6) I went back in to query editor and re-ordered the columns as I needed (I wanted the comments columns at the beginning).
        The refresh now holds in place!
        Thanks again for the great pointers Matt!