Learn how to remove duplicates keep the last record on power query.
Today I was helping a customer with a problem that seemed quite simple on the surface. She had a data table containing historical customer sales orders (each customer has many orders on different dates). The objective was to filter this table in Power Query and just load one record for each customer – the one that was the last order date. To illustrate the problem more clearly, I have adapted the scenario using the Adventure Works database so you can remove duplicates keep the last record on power query.
Adventure Works Example
The Sales table contains all the historical sales transactions by customer (identified by CustomerKey) and each transaction has an Order Date. The objective is to filter this table in Power Query so as to keep only the last entry for each customer (the last entry is the most recent order date). At the first instance, the solution seems to be simple. In Power Query, you would think that you simply:
- Sort the table by Order Date in descending order.
- Select the customer key column and then remove duplicates.
But when you do this in Power Query, it does not work as expected. As you can see in the Sales table below, each customer has many transactions with different order dates.
In Power Query, I sorted by OrderDate descending, then removed duplicates as shown below.
But the solution is not correct – the order dates for some of the customers are actually not the last orders. The table on the left below shows the original data sorted by OrderDate for each customer. The table on the right below shows the results from power query. If you compare the full data on the left with the Power Query results on the right, you can see that PQ has returned the wrong order date for some customers.
Why Doesn’t it Work?
I can’t say that I have a deep technical understanding of the problem, but I do have a conceptual understanding. When you select “sort column”, it is reasonable to expect that the entire table is sorted before proceeding to the next step. In reality, it is only the data that is loaded in memory that is sorted. The remaining data on disk is not included in the sort. Power Query also uses a concept called “lazy evaluation”. In short this means that if you add a step in the code, and that step is not technically needed to produce the final result, then that step is actually never executed (even though it is there in the instructions) – weird I know, but very efficient.
Table.Buffer to the Rescue
Before I share this solution, let me point out there are other ways to solve the problem, specifically using group by. However the purpose of this article is to broaden readers understanding of Power Query and introduce the table.buffer function.
I am pretty sure I learnt this tip from Imke Feldman at The BIccountant (or possibly Chris Webb). Both are absolute wizes at this stuff. To solve the problem you will need to get in and make some manual changes to the M code. To do this, first make sure you turn on the formula bar. Go to the View menu and select “formula bar”.
When I click on the step that sorts the table (desc) by OrderDate, the M code was as follows:
To solve the problem, I need to force Power Query to load all the data into memory, forcing the sort to be completed now before proceeding. All I did was to wrap the line of code above inside the Table.Buffer( ) function as shown below.
The rest of the steps remain the same. The Table.Buffer( ) function forces the entire set of data to be loaded into memory after sorting and the hence the next step of removing duplicates works correctly on the entire data set.
The resulting table looks as follows:
These results are now correct as you can see in the table below. The OrderDate (Incorrect Solution) column is the result without using Table.Buffer( ) and the OrderDate (Correct Solution) column is the result of using Table.Buffer( ). You can see several customers have different results. The correct result can be manually validated against the raw data.
Here is the sample workbook and the source data that I used in this blog post.