Cleansing Data with Power Query

Today I am combining a few techniques to show how to build a robust cleansing approach for your data using Power Query.  This article will demonstrate the following Power Query techniques

  • Joining tables (Merge Query)
  • Self Referencing tables
  • Adding Custom Columns

Sample Data

I am using some sample data as shown below.  The Country data is the data I want to cleanse so there are consistent country values in the Country column.

image

I first load this data into Power Query and set “Close and Load to” so that it only creates a connection.  I then have Power Query access to the source data table that can be reused multiple times.

image

Create a Table of All Entered Countries

The next step is to produce a list of the unique entries in the source data.  To do this, I referenced the above table by right clicking (1 below), then Reference (2)

image

I then removed the Name column (1 below) and then removed the duplicates in the remaining Country Column (2 below).

image

I then sorted the column before loading the data back to Excel as shown below.

image

Power Query Online Training

Create the Substitutes with a Self Referencing Table

I first covered this trick of a self referencing table in this blog article.  Continuing on with my test data above, I have manually added a new column to the green table above and entered the substitutions as needed.  Where there is no substitute needed I have simply left the substitute blank.

image

As it currently stands above, if the source table is sorted and then the green table is refreshed, the substitute column will fall out of alignment with the source table – that is bad.  Now for the self referencing trick.  I load this green table above back into Power Query as a new query, setting it to only create a connection as I did for the original set of data. I named this new query “Substitutes”

I then edited the original query that created the original green table and joined it to the substitute table (aka self referencing).  While editing the Entered Names table, I selected Merge Queries (1 below), selected the Substitute table (2), then joined on the Country Columns (3, 4) with a left outer join (5).

image

I then expanded the new column to extract the substitutes.

image

This now gives me a self referencing table that will refresh while keeping the countries and substitutes correctly aligned regardless of what happens to the source table.  This is the key point about a self referencing table (covered in my previous article).

image

Update the New Data

Now when new data is added to the source table, it will appear in the substitute table after refresh as follows.

image

What would be much better is to be clear which are the correct entries that have previously been checked so it is more obvious which are the new entries.  So I decided to enter the correct country name for each entry as follows.

image

It is now much clearer that there are 3 new entries that need to be maintained, and I then updated the new data below.

image

Use the Substitute Table

Now that I have this substitute table working for me, all I need to do is create a new copy of my source table that uses the substitute country rather than the original country entered.  I create a new query referencing my original source data (right click, reference).

Merge the queries as before, joining on the original country column.

image

expand the table to retrieve the substitute column

image

Then delete the original country column and rename the substitute column to be called “Country”

image

As you can see below, the substitute table is correctly remapping the data as needed.

image

When the new data is added to the source table, the substitute table will have blank entries as shown below.

image

As a nice touch, I went back into the final query and added a new custom column as follows

and then used this new column as the country column.  Now when I refresh the data, it is clear that I need to do some maintenance on the substitute table.

If you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/

30 thoughts on “Cleansing Data with Power Query”

  1. Nice post. I learn something new and challenging on sites I stumbleupon on a daily basis.

    It will always be interesting to read through articles from other writers and
    use something from their sites.

    Feel free to visit my web-site;

  2. Unquestionably imagine that which you stated. Your favourite reason seemed to be on the net the easiest factor to take into accout
    of. I say to you, I certainly get irked whilst other folks think about worries that they just don’t realize about.

    You managed to hit the nail upon the highest as smartly as defined out
    the whole thing with no need side-effects , other folks can take a signal.
    Will likely be again to get more. Thanks

    Feel free to visit my webpage … Mlm Company

  3. When I originally commented I clicked the “Notify me when new comments are added” checkbox and now each time a comment is added I get several e-mails with the same comment. Is there any way you can remove people from that service? Appreciate it!

  4. I am usually to running a blog and i really appreciate your content. The article has actually peaks my interest. I am going to bookmark your web site and preserve checking for brand spanking new information.

  5. Hey! This is kind of off topic but I need some advice from an established blog. Is it very difficult to set up your own blog? I’m not very techincal but I can figure things out pretty fast. I’m thinking about creating my own but I’m not sure where to begin. Do you have any points or suggestions? Appreciate it

  6. I do trust all the ideas you have presented to your post. They are very convincing and can certainly work. Still, the posts are very brief for starters. May just you please extend them a bit from next time? Thanks for the post.

  7. I really like your writing style, superb information, appreciate it for posting :D. “Nothing sets a person so much out of the devil’s reach as humility.” by Johathan Edwards.

  8. Thanx for the effort, keep up the good work Great work, I am going to start a small Blog Engine course work using your site I hope you enjoy blogging with the popular BlogEngine.net.Thethoughts you express are really awesome. Hope you will right some more posts.

  9. you are in point of fact a good webmaster. The website loading speed is amazing. It sort of feels that you are doing any unique trick. Furthermore, The contents are masterwork. you’ve done a wonderful task on this subject!

  10. I’ve been surfing online greater than three hours today, yet I by no means found any attention-grabbing article like yours. It’s beautiful price sufficient for me. In my view, if all web owners and bloggers made just right content material as you probably did, the net shall be a lot more useful than ever before.

  11. Appreciating the persistence you put into your blog and in depth information you offer. It’s awesome to come across a blog every once in a while that isn’t the same unwanted rehashed material. Wonderful read! I’ve saved your site and I’m adding your RSS feeds to my Google account.

  12. When I originally commented I clicked the -Notify me when new comments are added- checkbox and now each time a comment is added I get four emails with the same comment. Is there any way you can remove me from that service? Thanks!

  13. How do you see this working behind the scenes? Does the join between Cleansed Data and Substitutes persist? Or is the join run before (for example) Source is resorted?

    1. I’m not sure what you mean by “persist”. The join is made and the substitutions are made. The substitution table remains as part of the model. If you refresh you data later and new data exists, the substitutions will work on that new data too. If you get new substitution requirements, you will need to add new records in the substitution tables

  14. Very useful post. I kept struggling with Units of Measure in multiple versions ( eg. “Kg”, “kg”, “kgs”, “kg.” ).
    I actually use Table.TransformColumns() instead of merging to save the Expand, Remove, and Rename Columns steps. Something like:
    = Table.TransformColumns( YourTable, {{“Country”, each try Substitutes{ [Country = _ ] }[Substitute] otherwise _ }} )

    1. Ok, performance can be an issue with my approach. Merging is extremely fast compared to accessing a record field for each row, at least with my data that has ~60k rows and 40 original values that are subsituted by 20 ‘correct’ values.

  15. Hi! I was having trouble with the middle part, because it kept adding a new column to the green table [Substitute2], because I didn’t see that I had to delete the entered names table that we manually updated. Also, at the end, I think it is confusing where you switch to using the NAME instead of COUNTRY. The orange table ends up having people’s first name in the country column. Thank you for all that you do!

  16. This is awesome, because I actually did something similar to this for a project just last month. I had a data set that might have client names or might not have client names, and the names may or may not be correct. Using that glorious, glorious self-referencing tables technique, I set it up so the user can update them or not as they choose, and another column where they noted what needed to happen for the accounting data on that line so they could see easily what had or hadn’t been looked at. I think I like your three table approach better for most cases though, so I’m bookmarking this one (grins)!

    One thing I have been playing around with is trying to get the date of the last update on a line to show automatically, but I haven’t quite gotten that figured out yet.

      1. Yep, VBA seems to be the only option, but I think you’re right that it’s more trouble than it’s worth. And that’s a handy shortcut, I’ll add that in to the best practices list, thanks!

    1. You can create in Excel a simple table with two columns (ParamaterName, ParameterValue). Your first row could be (Date, =today()). This will always give you the current date. If you want to be more precise, you can use (Timestamp, =now()), which will give you date and time. Load this table in PQ and call it Parameters. In any query which is used to get a snapshot of data and needs a timestamp, you simply create a new column and then merge that query with the Parameters query.

Leave a Comment