Remove Duplicates and Keep the Last Record with Power Query 🏆

Remove Duplicates and Keep the Last Record with Power Query

Learn how to remove duplicates keep the last record on power query.

Today I was helping a customer with a problem that seemed quite simple on the surface.  She had a data table containing historical customer sales orders (each customer has many orders on different dates).  The objective was to filter this table in Power Query and just load one record for each customer – the one that was the last order date.  To illustrate the problem more clearly, I have adapted the scenario using the Adventure Works database so you can remove duplicates keep the last record on power query.

Adventure Works Example

The Sales table contains all the historical sales transactions by customer (identified by CustomerKey) and each transaction has an Order Date. The objective is to filter this table in Power Query so as to keep only the last entry for each customer (the last entry is the most recent order date).  At the first instance, the solution seems to be simple.  In Power Query, you would think that you simply:

  • Sort the table by Order Date in descending order.
  • Select the customer key column and then remove duplicates.

But when you do this in Power Query, it does not work as expected. As you can see in the Sales table below, each customer has many transactions with different order dates.

remove duplicates keep last record power query

In Power Query, I sorted by OrderDate descending, then removed duplicates as shown below.

remove duplicates keep last record power query

But the solution is not correct – the order dates for some of the customers are actually not the last orders. The table on the left below shows the original data sorted by OrderDate for each customer. The table on the right below shows the results from power query. If you compare the full data on the left with the Power Query results on the right, you can see that PQ has returned the wrong order date for some customers.

Removing Duplicates Expected Incorrect

Why Doesn’t it Work?

I can’t say that I have a deep technical understanding of the problem, but I do have a conceptual understanding.  When you select “sort column”, it is reasonable to expect that the entire table is sorted before proceeding to the next step.  In reality, it is only the data that is loaded in memory that is sorted.  The remaining data on disk is not included in the sort.  Power Query also uses a concept called “lazy evaluation”.  In short this means that if you add a step in the code, and that step is not technically needed to produce the final result, then that step is actually never executed (even though it is there in the instructions) – weird I know, but very efficient.

Table.Buffer to the Rescue

Before I share this solution, let me point out there are other ways to solve the problem, specifically using group by. However the purpose of this article is to broaden readers understanding of Power Query and introduce the table.buffer function.

I am pretty sure I learnt this tip from Imke Feldman at The BIccountant (or possibly Chris Webb).  Both are absolute wizes at this stuff.  To solve the problem you will need to get in and make some manual changes to the M code.  To do this, first make sure you turn on the formula bar.  Go to the View menu and select “formula bar”.

remove duplicates keep last record power query formula bar

When I click on the step that sorts the table (desc) by OrderDate, the M code was as follows:

Removing Duplicates 4

To solve the problem, I need to force Power Query to load all the data into memory, forcing the sort to be completed now before proceeding.  All I did was to wrap the line of code above inside the Table.Buffer( ) function as shown below.

Removing Duplicates 5

The rest of the steps remain the same. The Table.Buffer( ) function forces the entire set of data to be loaded into memory after sorting and the hence the next step of removing duplicates works correctly on the entire data set.

The resulting table looks as follows:

Removing Duplicates 2

These results are now correct as you can see in the table below.  The OrderDate (Incorrect Solution) column is the result without using Table.Buffer( ) and the OrderDate (Correct Solution) column is the result of using Table.Buffer( ).  You can see several customers have different results.  The correct result can be manually validated against the raw data.

Removing Duplicates Solution

Here is the sample workbook and the source data that I used in this blog post.

Removing Duplicates Sample

62 thoughts on “Remove Duplicates and Keep the Last Record with Power Query”

  1. Thanks. Saved my day. Stuff like this is hard to even know it’s not working right when you’re dealing with 1000’s or records. Only when things have gone too far wrong or it’s too late, we’ll know we need to investigate only to find out that the software decided to be “lazy”.

  2. Hi, I need to find a way to delete not only one of duplicated records but to delete all records which appear to be doubled. We are using this option to review all employees internal movements from NH and termination list. Removing duplicates is always leaving one record behind when i don’t need any of it. When employee moved internally should not be on NH list nor on Termination list. In regular excel file i just mark them with color and delete manually but I want to create query and i can’t seem to find a way to do it automatically. Help please 🙂

    1. Well, the short answer is this. What is the repeatable process you use to manually identify the records you don’t need? You need to be able to extract the logic from that manual process so you can write a query to do it for you. Of course, it depends on the situation, but that is what you need to do.

  3. This has solved a problem for me in seconds that I had been struggling with for days, Many thanks for the solution and explanation of what causes the issue in the first place

  4. I was fortunate enough to find this guide before implementing the solution and possibly getting incorrect results. It boggles my mind that Power Query would work in this “lazy evaluation” mode without some more or less explicit warning about this functionality. I wonder about other cases this can generate possible issues. Maybe anyone has some further insight? Thank you for the article!

  5. Perfect! I was tearing my hair out trying to figure this out. I knew it must have to do with query folding, but didn’t know a way around it. Such a simple and elegant solution! Thanks!

  6. Thank you so much for this article. This is the exact method o need to apply to my data set to obtain the unique results.

  7. It depends what you are trying to do when you say “processing” and “a long time”. Refresh typically takes 2 – 30 mins depending on column and complexity of transformation. Power BI is a reporting tool that has its own ETL. If the source is not a database, then ETL can be slow.

  8. Hi,
    I have about 1 million records of data in csv file and need to process them and create the reports.
    Is powerbi suitable for this mean?
    when i load data in powerbi and use power query for processing , this takes long time.

  9. Thanks a million Matt! I have a few suppliers sending me their stock on hand reports regularly via Excel. I have been consolidating them into one sheet. To get the latest stock on hand by each supplier, I could use max/maxx measures for visuals/reports in Power BI. But what I really wanted was for Power BI to only load the latest stock on hand data by each supplier. Your solution was exactly what I wanted in the first place. If this post had a title like “only load/filter/keep the latest/last record by customer/product/supplier via Power Query”, I’d have found it much sooner… LOL. Cheers Sean

  10. Thanks so much for this info…. I was wondering how could the sorting/duplicatesRemoval not work properly..
    This post showed me the full picture about the “partial sort”…..

    Really Great.

    Regards
    //Lünkes

  11. Amazing! I’ve spent so much time fighting with deduplication issues. This has brought my skills up another notch!

  12. Thank you so much for this article…this has been a problem for me for some time…problem solved…I was removing duplicates manually because I was not getting the expected results…You have saved me so much time going forward. Very much appreciated!!!

  13. I thought I had the problem solved, until I recently performed an audit and discovered otherwise. Wrapping the Table.Sort function in the Table.Buffer function worked like a charm! Can’t thank you enough! 🙂

  14. Thank You so much, Mr. Matt Allington First sorted data in descending order then applied to remove duplicate so removed old record data instead of a new one as per modified date. So once again thank you so much.

  15. Tsvetelina Lalova

    Thank you so much! This article is so helpful to me. I have been working so hard months and this solution helps me to finish my work.

  16. Thank you very much!!! Brilliant your explanation it saved my life. Good to know that even the computer is lazy sometimes.

  17. I’ve also found that adding an index column between the sort and remove duplicates steps also produces the correct result.

  18. Hi Matt,
    Thank you for this article and for this trick.
    In this example, if we have more than 1 article bought in the same date, I think it’s better to use the “Group By” method.

    Hoss

  19. Hi Matt,

    If Table.Distinct finds more than one row with duplicate values in the specified column list (which just contains column CustomerKey, in this case — Table.Distinct(#”Sorted Rows”, {“CustomerKey”})), I believe it’s free to return any one of those duplicate rows, at its convenience (documentation [https://docs.microsoft.com/en-us/powerquery-m/table-distinct] doesn’t promise that it will return the first or last row out of a sorted set of duplicates). This makes me think that the non-working solution is really working correctly–just not producing the desired output.

    If Table.Distinct happens to respect sort order in some circumstances (like when working with a buffered input), it seems like this would be an internal implementation detail that could change and might not hold true across all platforms (PBI, ADF, etc.).

    Ben

    1. Fair question, but not as Important as getting the correct answer. I suspect performance will be fine except for large data. The data has to be loaded anyway. Even if you use groupby, PQ still has to load all the data before proceeding.

  20. Hi Matt,
    Great post, thanks for sharing.
    I added Conditional Formatting in the Sales_IncorrectSolution.
    1 Added Column: RelCorrectDate = RELATED(Sales_CorrectSolution[OrderDate])
    1 Added Column:EqualYesNo = IF(RELATED(Sales_CorrectSolution[OrderDate])=Sales_IncorrectSolution[OrderDate];1;0)
    Conditional Formatting based on the column EqualYesNo

    Greetings

  21. Nice work Matt, I was working on a DiFOT report just yesterday and did the same thing you mentioned; sort by then remove duplicates. Then your email came out last night, talk about timing. Thank you.

  22. Matt, I add an INDEX column after the sort step. Fixes the issue. Think I read that on Reza’s blog once upon a time

    Like you say, a number of ways to address this. Maybe a sort should automatically trigger an index step? Things like Promote Headers triggers an automatic change type step.

  23. Nithyanandam Raman

    Hi Matt,
    Although there are multiple ways to resolve any problems, thanks for sharing this with an example, I learnt new and unique command(Table.Buffer()) today, which will be more helpful when I working in power query,

  24. Thanks for sharing this – it’s useful to learn the quirks and potential solutions. You never know when that information might come in handy.

  25. Sweet solution. I was not aware of Table.Buffer().

    side note: There is a phrase between brackets before the last table that is not supposed to be there 🙂

    regards

  26. I’m thinking that you should just reference the original table, group by customer key and do a max on the date field, and then use this result to inner join back to the original table on customer key and date. Seems like a straight forward solution. It probably pushes down to the database too. And no need to hack code?

    Thanks,
    Scott

  27. Try this trick 🙂 … Insert “stupid step” between Table.Sort and Table.Distinct… e.g. you can change type OrderDate on the type datetime

    It “freezing” table in sorted state…. and after that you can remove duplicates properly.

    Polish/English version :

    #”Posortowano wiersze” = Table.Sort(#”Zmieniono typ”,{{“OrderDate”, Order.Descending}}),

    #”Zmieniono typ1″ = Table.TransformColumnTypes(#”Posortowano wiersze”,{{“OrderDate”, type datetime}}),

    #”Usunięto duplikaty” = Table.Distinct(#”Zmieniono typ1″, {“CustomerKey”})

  28. Maybe I’m not understanding the full extent of the problem but I found simply grouping on CustomerKey and taking the Max of OrderDate produced the same result.

    let
    Source = Excel.Workbook(File.Contents(“C:\Projects\Consulting\PowerBI\Removing-Duplicates-Sample\5 Tables in Excel.xlsx”), null, true),
    Sales_Sheet = Source{[Item=”Sales”,Kind=”Sheet”]}[Data],
    #”Promoted Headers” = Table.PromoteHeaders(Sales_Sheet, [PromoteAllScalars=true]),
    #”Changed Type” = Table.TransformColumnTypes(#”Promoted Headers”,{{“ProductKey”, Int64.Type}, {“OrderDate”, type date}, {“CustomerKey”, Int64.Type}, {“SalesTerritoryKey”, Int64.Type}, {“SalesOrderNumber”, type text}, {“SalesOrderLineNumber”, Int64.Type}, {“OrderQuantity”, Int64.Type}, {“UnitPrice”, type number}, {“ExtendedAmount”, type number}, {“TotalProductCost”, type number}, {“TaxAmt”, type number}, {“Freight”, type number}, {“RegionMonthID”, type text}}),
    #”Grouped Rows” = Table.Group(#”Changed Type”, {“CustomerKey”}, {{“Last Order Date”, each List.Max([OrderDate]), type date}})
    in
    #”Grouped Rows”

    1. Yes, that will also work. But remember this is a simplified example to demonstrate the technique. What I didn’t say in the post was that in reality the customer need was a bit different and she wanted to keep the entire record with 15 columns. This can still be done with group by but is harder.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top