Today I am combining a few techniques to show how to build a robust cleansing approach for your data using Power Query. This article will demonstrate the following Power Query techniques
- Joining tables (Merge Query)
- Self Referencing tables
- Adding Custom Columns
Sample Data
I am using some sample data as shown below. The Country data is the data I want to cleanse so there are consistent country values in the Country column.
I first load this data into Power Query and set “Close and Load to” so that it only creates a connection. I then have Power Query access to the source data table that can be reused multiple times.
Create a Table of All Entered Countries
The next step is to produce a list of the unique entries in the source data. To do this, I referenced the above table by right clicking (1 below), then Reference (2)
I then removed the Name column (1 below) and then removed the duplicates in the remaining Country Column (2 below).
I then sorted the column before loading the data back to Excel as shown below.
Create the Substitutes with a Self Referencing Table
I first covered this trick of a self referencing table in this blog article. Continuing on with my test data above, I have manually added a new column to the green table above and entered the substitutions as needed. Where there is no substitute needed I have simply left the substitute blank.
As it currently stands above, if the source table is sorted and then the green table is refreshed, the substitute column will fall out of alignment with the source table – that is bad. Now for the self referencing trick. I load this green table above back into Power Query as a new query, setting it to only create a connection as I did for the original set of data. I named this new query “Substitutes”
I then edited the original query that created the original green table and joined it to the substitute table (aka self referencing). While editing the Entered Names table, I selected Merge Queries (1 below), selected the Substitute table (2), then joined on the Country Columns (3, 4) with a left outer join (5).
I then expanded the new column to extract the substitutes.
This now gives me a self referencing table that will refresh while keeping the countries and substitutes correctly aligned regardless of what happens to the source table. This is the key point about a self referencing table (covered in my previous article).
Update the New Data
Now when new data is added to the source table, it will appear in the substitute table after refresh as follows.
What would be much better is to be clear which are the correct entries that have previously been checked so it is more obvious which are the new entries. So I decided to enter the correct country name for each entry as follows.
It is now much clearer that there are 3 new entries that need to be maintained, and I then updated the new data below.
Use the Substitute Table
Now that I have this substitute table working for me, all I need to do is create a new copy of my source table that uses the substitute country rather than the original country entered. I create a new query referencing my original source data (right click, reference).
Merge the queries as before, joining on the original country column.
expand the table to retrieve the substitute column
Then delete the original country column and rename the substitute column to be called “Country”
As you can see below, the substitute table is correctly remapping the data as needed.
When the new data is added to the source table, the substitute table will have blank entries as shown below.
As a nice touch, I went back into the final query and added a new custom column as follows
and then used this new column as the country column. Now when I refresh the data, it is clear that I need to do some maintenance on the substitute table.
If you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/
How do you see this working behind the scenes? Does the join between Cleansed Data and Substitutes persist? Or is the join run before (for example) Source is resorted?
I’m not sure what you mean by “persist”. The join is made and the substitutions are made. The substitution table remains as part of the model. If you refresh you data later and new data exists, the substitutions will work on that new data too. If you get new substitution requirements, you will need to add new records in the substitution tables
Very useful post. I kept struggling with Units of Measure in multiple versions ( eg. “Kg”, “kg”, “kgs”, “kg.” ).
I actually use Table.TransformColumns() instead of merging to save the Expand, Remove, and Rename Columns steps. Something like:
= Table.TransformColumns( YourTable, {{“Country”, each try Substitutes{ [Country = _ ] }[Substitute] otherwise _ }} )
Ok, performance can be an issue with my approach. Merging is extremely fast compared to accessing a record field for each row, at least with my data that has ~60k rows and 40 original values that are subsituted by 20 ‘correct’ values.
Hi! I was having trouble with the middle part, because it kept adding a new column to the green table [Substitute2], because I didn’t see that I had to delete the entered names table that we manually updated. Also, at the end, I think it is confusing where you switch to using the NAME instead of COUNTRY. The orange table ends up having people’s first name in the country column. Thank you for all that you do!
Ohhh, I have to admit I struggled to understand what you were talking about. Wow, yes this was a typo in the formula. I will fix it now. Thanks for letting me know
This is awesome, because I actually did something similar to this for a project just last month. I had a data set that might have client names or might not have client names, and the names may or may not be correct. Using that glorious, glorious self-referencing tables technique, I set it up so the user can update them or not as they choose, and another column where they noted what needed to happen for the accounting data on that line so they could see easily what had or hadn’t been looked at. I think I like your three table approach better for most cases though, so I’m bookmarking this one (grins)!
One thing I have been playing around with is trying to get the date of the last update on a line to show automatically, but I haven’t quite gotten that figured out yet.
I doubt you will get auto date working, as anything that auto adds the date will also auto update. You could use VBA, but maybe just as easy is Ctrl + ; (which is a great short cut I use all the time)
Yep, VBA seems to be the only option, but I think you’re right that it’s more trouble than it’s worth. And that’s a handy shortcut, I’ll add that in to the best practices list, thanks!
You can create in Excel a simple table with two columns (ParamaterName, ParameterValue). Your first row could be (Date, =today()). This will always give you the current date. If you want to be more precise, you can use (Timestamp, =now()), which will give you date and time. Load this table in PQ and call it Parameters. In any query which is used to get a snapshot of data and needs a timestamp, you simply create a new column and then merge that query with the Parameters query.