Pareto Analysis in Power BI - Excelerator BI

Pareto Analysis in Power BI

Level: Intermediate

Pareto Analysis is a statistical technique that applies the Pareto Principle to data. This is more commonly known as the 80:20 Rule. The Pareto Principle is based on the presumption that a relatively small number of inputs (20%)  have most impact on the results/output (80%).  The 80:20 rule can be applied to a wide variety of data in most businesses.  Examples include:

  • Which 20% of products make up 80% of sales
  • Which 20% of customers make up 80% of profit.
  • just to name a few

Although it is called the 80:20 Rule, the results normally do not come out exactly in the ratio of 80:20.  Anything close to this ratio is considered normal.  Armed with the results, you can focus business efforts on the 20% of the things that impact 80% of the results.

The Pareto Principle can be illustrated on a Pareto chart.  In Power BI you can use a combo line and column chart. The column in the chart represents the individual inputs, while the line chart represents the cumulative percentage of the inputs as they build towards 80%.

In this article, I will use the AdventureWorks database (as usual) to show that just 12% of the products (in this case) contribute to 80%of the total margin for the business.

Measures Used for A Pareto Chart

To start with I have created 3 measures in the Sales table.

Total Cost = SUM(Sales[TotalProductCost])

Total Sales = SUM(Sales[ExtendedAmount])

Total Margin = [Total Sales] – [Total Cost]

Next I needed to write the measure that would calculate the cumulative % of sales ranked from the best product to the worst.  This is a tricky formula, but I explain how it works below.

To understand how this formula works, I will need to refer to the results table shown below showing the product codes, Total Margin in $ for each product, and the Cumulative % of margin.  Note that the table is sorted from highest margin $ product to the lowest margin $ in descending order.

The first row in the table below is product 312.  It has a margin of approximately $474,151 which makes up 4% of the total margin for all products.  Each row in the table below is calculated completely independently of each other for both the Total Margin and Cumulative % measures.

Now, let me explain how a single row is calculated (the row highlighted in red below – ProductKey 363).

The highlighted row in the table below is filtered for ProductKey = 363. This is what the row section of a table does – it filters the data before the measure is calculated.

Line 2 in the measure above is a variable.  Variables are defined first as part of a measure before any other execution is completed.  So the variable thisProductMargin is assigned a value of $437,021 in this case.

Line 4 in the measure above is pretty straight forward.  It uses CALCULATE to remove the filter on ProductKey = 363 so it can calculate the total margin for all products.  The ALL function removes all filters on products.

The secret sauce of the formula is line 10.

  • Filter is an iterator and hence it has a row context (it can repeatedly evaluate a formula, one row at a time, as it steps through the specified table).  In this case it is going to step through the products table.
  • But there is a filter on ProductKey = 363, so FILTER needs to step through ALL(Products) instead of Products.  ALL removes the filter applied by the visual before FILTER starts it’s iteration process.  If I didn’t use ALL here, then there would be only 1 row in the table. ProductKey = 363 due to the filter coming from the visual.
  • Then FILTER steps through the unfiltered copy of the products table and checks each product, one at a time, to see which  individual products have a margin greater than the current product.  The current product is ProductKey = 363.
  • If  FILTER comes to a product that has a margin greater than Product 363, that product is kept – in other words that product forms part of the cumulative build of all products that make up the cumulative total.
  • FILTER continues to check every single product against this criteria.
  • At the end of the filter process, only products with a total margin greater than or equal to the product being compared (product 363 in this case) are kept.
  • Then the total margin for all those products is calculated.
  • Finally this total is divided by the total for all products.

The process then repeats for every single row in the visual, over and over again.

Creating a Pareto Chart in Power BI

As I mentioned above we can use the line and clustered column chart visual to display the Pareto chart.  To do that:

  • Add a Line and clustered column chart visual to the Report canvas.
  • Put ProductKey from the Products table on Shared axis.
  • Put Cumulative % from the Products table on Line values.
  • Place the [Total Margin] measure on Column values.
  • Sort the chart based on Total Margin, Sort Descending.

This will sort the columns in the chart in descending order as you can see above.

I then went ahead and wrote a measure to count the number of products that make up the 80% of margin.  That measure is as follows.

Count of Products Making 80% of Margin =
CALCULATE (
    DISTINCTCOUNT ( Sales[ProductKey] ),
    FILTER ( Products, [Cumulative %] <= 0.8 )
)

Here is my Pareto Analysis workbook if you would like to take a look.

26 thoughts on “Pareto Analysis in Power BI”

  1. I tried the following formula, getting blank value.
    Cumulative % =
    VAR CQ = [TQuantity]
    VAR TQ = CALCULATE([TQuantity], ALL(Orders))
    RETURN
    CALCULATE(
    [TQuantity],
    FILTER(ALL(Orders), [TQuantity] >= CQ )
    ) / TQ

  2. Hi Matt, This is a very interesting DAX calculation. I successfully implemented it on a 300k line dataset, I was very excited, but it didn’t work and ran out of memory. I end up doing the Pareto calculation as a fixed table using M Query. The performance is much better, but the trade off is that it’s less flexible (had to disable cross-filtering on a few charts), Thanks for sharing.

    1. Yes, you are right David. This is a very expensive calculation. If you can live with non- dynamic results then it is much better. You can also generate a calculated table in Power BI instead of using PQ

  3. Hi Matt,

    Thanks for this amazing post, it helps me a lot. I used the discussion with Dominik to create my calculated column to classify my product regarding Pareto.

    Below my calculate, if anyone want to use it :

    Pareto =
    VAR thisCustomersales = Articles[Total articles sales 12 rolling month]

    VAR TotalSalesAllcustomers = CALCULATE ( SUM ( Articles[Total articles sales 12 rolling month] ); ALL ( Articles ) )

    VAR PourcentageCumulesVentes = CALCULATE (
    SUM ( Articles[Total articles sales 12 rolling month]);
    FILTER ( Articles; Articles[Total articles sales 12 rolling month] >= thisCustomersales )
    ) / TotalSalesAllcustomers

    RETURN
    SWITCH(TRUE();
    PourcentageCumulesVentes 0,5 ; PourcentageCumulesVentes 0,8 ; PourcentageCumulesVentes 0,9 ; PourcentageCumulesVentes < 1 ) ; "Hyper 80/20")9 ; PourcentageCumulesVentes < 1 ) ; "Hyper 80/20")I have a question remaining : how the calculate knows what is the product with the more quantity sold ? I want to be sure that my "hyper 20/80 products" stay the same even if I sort out the column.Tks a lot Erwan

  4. Hi Matt,

    Thanks for this amazing post, it helps me a lot. I used the discussion with Dominik to create my calculated column to classify my product regarding Pareto.

    Below my calculate, if anyone want to use it :

    Pareto =
    VAR thisCustomersales = Articles[Total articles sales 12 rolling month]

    VAR TotalSalesAllcustomers = CALCULATE ( SUM ( Articles[Total articles sales 12 rolling month] ); ALL ( Articles ) )

    VAR PourcentageCumulesVentes = CALCULATE (
    SUM ( Articles[Total articles sales 12 rolling month]);
    FILTER ( Articles; Articles[Total articles sales 12 rolling month] >= thisCustomersales )
    ) / TotalSalesAllcustomers

    RETURN
    SWITCH(TRUE();
    PourcentageCumulesVentes 0,5 ; PourcentageCumulesVentes 0,8 ; PourcentageCumulesVentes 0,9 ; PourcentageCumulesVentes < 1 ) ; "Hyper 80/20")9 ; PourcentageCumulesVentes < 1 ) ; "Hyper 80/20")

    I have a question remaining : how the calculate knows what is the product with the more quantity sold ? I want to be sure that my "hyper 20/80 products" stay the same even if I sort out the column.

    Tks a lot
    Erwan

  5. Really great stuff!! The Cumulative% formula is priceless.

    I recently had need to do this and hacked my way through it. Now I know the best way to do it. Thanks Matt.

    1. 👍 there are definitely some performance issues when the number of items gets large (as can be seen by some of the comments). I may come back with an update to this post later with a way to cluster the data for improved performance.

  6. Hi,

    10x for great contnet.
    there is missing formala in the article – I tried with 3 diff browser in 3 computer.
    also the link for the workbook file is broken

    can u please check

    1. Thanks for letting me know, and sorry for the problem. I updated the website over the weekend and clearly this was a problem as a result. I found the problem and it should be fixed now.

  7. You talk about using the AdventureWorks database. Is there a way to get access to this data just so we can actually follow along? It also seems like the Cumulative % is done in VBA or some sort of editor. Can you explain that to me a little better? I’m just starting to learn Power BI and your newsletters have been extremely helpful

    1. My workbook is at the bottom of the post. You can go to the data view and copy any table you want into Excel. The Cumulative % is calculated on the fly by DAX. This is an intermediate level post and there is a lot to learn to understand everything. My best advice is to buy my book and read it from cover to cover. I don’t think there is a better way to learn (I may be a bit biased, but you could also read the reviews on Amazon for other people’s opinions). https://exceleratorbi.com.au/supercharge-power-bi-book/ You can also find the AdventureWorks DB at this link with my book details.

  8. Hi Matt, as it happens we were talking about a Pareto analysis (per customer) in our organisation last week, so the timing of your article is excellent!
    However, when adjusting the formula to my data model, I came across a simple problem: RAM.
    When calculating this in the desktop version over our customers (1,2 Million instead of 397 products) and sales (11m instead of 60k) it just stops working after some time, saying that there’s not enough RAM available.
    Any idea how to work around this problem in larger datasets?!

    1. Yes, I figured it may be an issue with larger data sets. Are you using 64 bit Power BI? If not, that will help. Let me know. The other option is to simply switch the measure to a calculated column. This will work as long as the ranked item is a dimension table. In your case (and mine), both Customer and Product are dimensions, so the same formula will work as a calculated column. In fact, you can remove the ALL() in line 10 too when using it as a column. Please let me know how it goes

      1. I’m using the lastest 64-bit version of PBI.
        Just tried to add the formula as a column in the customer table (and removed the ALL() in line 10)…calculation was running for 30 minutes without a result, so I killed the application.

        1. Thanks for posting back Dominik. I also have been doing some testing – even AdventureWorks with 18,484 was very slow, both as a measure and calc column (maybe 5 mins). I have been thinking about the problem and I have a new suggested approach for you to try. Please give it a go. It will still take a bit of time, but I think it will work for you. Please post back. with the time it takes. I will then update my blog with this information

          Add 2 calculated columns. The first is simply
          Total Cust Sales = [Total Sales]
          Cumulative % Customer Sales =
          VAR thisCustomersales = Customers[Total Cust Sales]
          VAR TotalSalesAllcustomers =
          CALCULATE ( SUM ( Customers[Total Cust Sales] ), ALL ( Customers ) )
          RETURN
          CALCULATE (
          SUM ( Customers[Total Cust Sales] ),
          FILTER ( Customers, Customers[Total Cust Sales] >= thisCustomersales )
          ) / TotalSalesAllcustomers

          1. Hi Matt, thanks again for your help!
            I added the two colums as suggested, however about 10 minutes after I started calculation for the second formula (Cumulative % Customer Sales) PBI stopped the process and the error message “There is not enough memory to complete this operation” popped up.
            I’m working with a MS Surface laptop with only 4 GB RAM (will be upgraded to a new one with incredible 8 GB soon…), so maybe that’s part of the problem too.

            1. Thanks for posting back. As you can tell, this is resource hungry. I appreciate your testing, and I would like to know it works before I update the article. There actually is a very big difference between 4gb and 8gb, is you will really notice it.

            2. Hi Dominik and Matt,

              I don’t believe you’ll succeed in making this work. At least not on the desktop. Running totals, Pareto Analyses and partly moving averages etc. applying huge dimensions will only be fully within reach when they are supported directly in the engine. Put in another way, only when we have DAX functions that will support these scenarios will we be able to do them in an acceptable way.

              Upgrading your laptop to 8GB RAM won’t help in this case, Dominik.

              That said, I don’t think you will ever do Pareto analyses this way. Think about it. If 20% of the customers make 80% of the margin you’ll have to show around 250,000 customers on the shared x-axis. There is no way you can get the big picture this way. Instead, you’ll set up slicers/filters that cross filter the customer table, so only a subset will be shown at any one time. This requires a slightly modified measure; however, it only takes a few thousand dimension values to make it a no go. How many depends on your hardware and data model.

              As Matt also said somewhere in the comments; I would love to be proved wrong.

              Jes.

            3. Thanks for your comments Jes. I have learnt a lot from Dominik’s example. I have actually developed what I think is an acceptable work around for this using customer banding and a calculated column. I have it ready to go and it works fine on 18,484 customers in Adventure Works. I am hoping that Dominik will test it first on his million rows of customers to see if it works. Then I will update this post.

    2. Danil Bogomazov

      Matt, thanks for this great example. One quick question – would it be better to use only one field (Products[ProductKey]) instead of using entire table (Products) to avoid the concept of table extension which is super expensive operation from RAM prospective. it won’t be noticed on 397 rows but should make big difference on 1.2M rows.
      If we rewrite line 10 (secrete source) I am curious if it works on 1.2M rows = FILTER ( ALL ( Products[ProductKey] ), [Total Margin] >= thisProductMargin )

      1. I would not expect ALL(Products[Key]) to be any more performant than ALL(Products) given that the key is the primary key of the table. I could be wrong and am happy to be corrected by someone. The result will be the same. It would be possible to modify a pattern such as this one to solve the problem in a different way. https://www.daxpatterns.com/abc-classification-dynamic/ I can’t be sure if this would be more performant or not – my guess is yes. Also, most of the “cost” in your example would be doing the calc for the 80% of the customers you don’t need. You could try putting a visual level filter on the key based on sales/margin or what ever and just keep say the top 200k in you case.

        1. Hi Danil and Matt,

          I agree with Matt. I don’t think there will be any performance difference between having the table ‘Products’ or the primary key of the ‘products’ table as the argument; however, there is a semantic difference that will matter.

          The measure :

          FILTER ( ALL ( Products[ProductKey] ), [Total Margin] >= thisProductMargin )

          will cross filter the [Total Margin] measure with values from the ‘products’ table if filters/slicers on some of the columns of the ‘products’ table’s filter domain exist ( e.g. subcategory, colour etc. ) whereas the expression from matt’s measure will not. If you want the exact same semantic behaviour as in Matt’s measure you’ll need to use this measure:

          FILTER (
          ALL ( Products[ProductKey] ),
          CALCULATE (
          [Total Margin],
          ALLEXCEPT (
          products,
          Products[ProductKey]
          )
          ) >= thisProductMargin
          )

          Jes.

        2. Marco Russo is in Australia at the moment and he spoke at the Sydney Power BI Meetup tonight. I asked him specifically to confirm this point. He confirmed that there should be no material difference between iterating a primary key of a dimension table, or the dimension table itself.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top