When to Use SUM() vs SUMX()

Level: Beginner/Intermediate

It has been 2 years since I first wrote an article about SUM() vs SUMX().  I decided to write a new article to refresh this topic and add in the things I have learnt over the last 2 years.

SUM() vs SUMX()

First let me start with an overview of both of these functions.

  • SUM() is an aggregator function. It adds up all the values in the single column you specify, in the current filter context. SUM() has no knowledge of the existence of rows – all it can do is add up everything in the one column it is presented with.
  • SUMX() is an iterator function. It works through a table, row by row, to complete the evaluation in the current filter context. SUMX() is aware of the rows in a table, and hence can reference the intersection of each row with any column in that table (see the simple illustration below).
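
As a simple illustration of the difference, here is a pair of measures written against a hypothetical Sales table with an ExtendedAmount column (the same column used in the examples later in this article). Both return the same grand total; the first aggregates the column directly, while the second iterates the Sales table one row at a time:

Total Sales via SUM = SUM(Sales[ExtendedAmount])

Total Sales via SUMX = SUMX(Sales, Sales[ExtendedAmount])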

Both functions can end up giving you the same result (maybe, maybe not), but they arrive at the answer in very different ways.  Before I move on, it is worth making sure you have a clear understanding of filter context.

Filter Context

Filter context simply means “the filters that are influencing the data model for this evaluation”.  This is important because in DAX, the exact same formula can (and normally will) return a different result with a different filter context.

Filter context can come from:

  1. The visual in your workbook: a Pivot Table if you are using Power Pivot for Excel, or any visual on the report canvas if you are using Power BI.  This is called the Initial Filter Context.
  2. The use of a CALCULATE() function.  CALCULATE() is the only function that can change the Initial Filter Context.  There may or may not be a CALCULATE() function in your DAX formula, but if there is one, it can add to, remove from, or modify the initial filter context (see the sketch below).
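
As a minimal sketch of the second point (assuming the Total Sales measure and the Products[Category] column used elsewhere in this article, with a hypothetical measure name), the following measure adds a filter for Products[Category] = "Bikes" on top of whatever initial filter context the visual provides:

Bike Sales = CALCULATE([Total Sales], Products[Category] = "Bikes")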

Note: Filter context can also come from an implicit CALCULATE() function.  Every measure has a hidden CALCULATE() function wrapped around it that you can’t see.  In certain circumstances (such as inside a row context), this hidden CALCULATE() function can give you a different result than you may otherwise expect. This is a complex topic and outside the scope of this article.

Filter context is important because it will affect the results you get from your formulas.  Let’s briefly look at how initial filter context works in Excel and Power BI.

Power Pivot for Excel

In the following Pivot Table example, the highlighted cell has an initial filter context of Products[Category] = “Bikes” coming from the rows of the pivot table (shown as 1).  Initial filter context can also come from Filters, Columns and Slicers in a pivot table.

[Image: Pivot table with Products[Category] = "Bikes" highlighted on Rows (1)]

Power BI Desktop

Power BI Desktop is similar to Excel, but the initial filter context can come from almost anywhere in the report.  The highlighted cell below is filtered by Products[Category]="Bikes" (shown as 1), just like in Excel above, but there is also a cross filter coming from Territory[Country]="Australia" (shown as 2).  Both of these filters are part of the initial filter context.  Filters can also come from Columns, Slicers, and the filter section on the right hand side of the Power BI report window.

[Image: Power BI visual filtered by Products[Category] = "Bikes" (1) with a cross filter from Territory[Country] = "Australia" (2)]

OK, now it is time to look more in depth at SUM and SUMX, one at a time.

The SUM() Function

Syntax: = SUM(<Column Name>)

Example: Total Sales = SUM(Sales[ExtendedAmount])

The SUM() function operates over a single column of data to aggregate all the data in that single column in the current filter context.  The filter context is applied first, and the SUM() works over the column after the filter has been applied.  Hopefully you can see now why the review of filter context is so important.

The SUMX() Function

Syntax: = SUMX(<Table>, <Expression>)

Example: Total Sales SUMX = SUMX(Sales, Sales[Qty] * Sales[Price Per Unit])

SUMX() iterates through the table specified in the first parameter, one row at a time, and completes the calculation specified in the second parameter (eg Quantity x Price Per Unit, as shown in the example above) for the current filter context.  Once it has done this for every row of the table in the current filter context, it adds up the results of all the row by row calculations.  It is this total that is returned as the result.

The only reason SUMX() can complete calculations this way is the existence of a row context.  Row context can be a confusing term for beginners, but if you think about it, the logic is obvious.  If I said "multiply column A by column B", how would you do that?  The only way is to multiply one row at a time.  That is all row context means.
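
As an aside, row context is also what makes calculated columns work. A minimal sketch, assuming the same Sales table used above (Line Total is a hypothetical column name): this calculated column is evaluated once per row, so it can legally reference both columns of the current row.

Line Total = Sales[Qty] * Sales[Price Per Unit]

SUMX() does the same row by row evaluation, but in a measure, without storing the intermediate results in the data model.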

Which One Should I Use?

Which you use really depends on your personal preference and the structure of your data.  Let's look at a few examples.

  1. Quantity and Price Per Unit
  2. Extended Amount
  3. Totals Don’t Add Up

1. Quantity and Price Per Unit

[Image: Sales table with Qty and Price Per Unit columns]

If your Sales table contains a column for Quantity and another column for Price Per Unit (as shown above), then you must multiply Quantity by Price Per Unit on each row to get Total Sales.  It is no good adding up the total quantity with SUM() and multiplying it by the average price with AVERAGE(), as this will give the wrong answer.
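
To make that concrete, here is a sketch of the formula you should not write (Total Sales Wrong is a hypothetical measure name, using the same Sales columns as above):

Total Sales Wrong = SUM(Sales[Qty]) * AVERAGE(Sales[Price Per Unit])

The total quantity multiplied by the average price is not the same as the sum of the row by row products; the two only agree by coincidence, for example when every row happens to have the same price.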

If your data is structured in this way (like the image above), then you simply must use SUMX()  – this is what the iterator functions were designed to do.  Here is what the formula would look like.

Total Sales 1 = SUMX(Sales, Sales[Qty] * Sales[Price Per Unit])

You can always spot an iterator function, as it takes a table as its first input parameter.  This is the table the function iterates over.

2. Extended Amount

If your data contains a single column with the Extended Total Sales for that line item (ie it doesn’t have quantity and price per unit), then you can use SUM() to add up the values.

[Image: Sales table with a single Total Sales column]

Total Sales 2 = SUM(Sales[Total Sales])

There is no need for an iterator in this example because it is just a simple aggregation over a single column. Note however that you "could" still use SUMX() (as shown below) and it will give you the same answer.

Total Sales 2 alternate = SUMX(Sales, Sales[Total Sales])

Despite what your intuition may tell you, this alternate formula using SUMX() is identical in performance and efficiency to the SUM() version.  More on that below.

3. Totals Don’t Add Up

There is another, less obvious use case where you simply must use SUMX().  When you encounter this problem, you will need an iterator like SUMX() to solve it.  I have created a small table of sample data to explain the problem.

[Image: sample VisitData table showing Spend per Visit and Number of Visits for 4 customers]

The table above shows 4 customers, with the average amount of money they spend each time they shop as well as the number of times they have been shopping. If I load this data into Power BI and then try to use aggregating functions to find the average spend across all customers as well as the total amount spent, I get the wrong answers (as shown below).

[Image: matrix showing Total Number of Visits, Avg Spent per visit Wrong and Total Spent Wrong measures]

Here are the measures from above.

Total Number of Visits = SUM(VisitData[Number of Visits])

Avg Spent per visit Wrong = AVERAGE(VisitData[Spend per Visit])

Total Spent Wrong = [Avg Spent per visit Wrong] * [Total Number of Visits]

The first measure [Total Number of Visits] is correct because the data is additive, but the other 2 measures give the wrong result. This is a classic situation where you can’t perform multiplication on the averages at the grand total level. Given the sample data that I started with, the only way to calculate the correct answer is to complete a row by row evaluation for each customer in the table as shown below.

[Image: matrix with the corrected Total Spent SUMX and Avg Spent per visit Correct measures]

In this second table above I have written a SUMX() to create the Total Spent SUMX (row by row) first. Only then do I calculate the average spend per visit as the final formula.

Total Number of Visits = SUM(VisitData[Number of Visits])
Total Spent SUMX = SUMX(VisitData, VisitData[Spend per Visit] * VisitData[Number of Visits])
Avg Spent per visit Correct = DIVIDE([Total Spent SUMX], [Total Number of Visits])

In this second case, SUMX is working through the table of data and is correctly calculating the result, even for the Total row at the bottom of the table.
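
A worked example with made-up numbers may help (my actual sample data is in the table images above, so treat these figures as purely illustrative). Suppose customer A spends $10 per visit over 2 visits and customer B spends $20 per visit over 8 visits. The average of the two averages is (10 + 20) / 2 = $15, but the correct average spend per visit is (10 × 2 + 20 × 8) / (2 + 8) = 180 / 10 = $18. SUMX() computes the $180 numerator row by row, which is exactly what the measures above do.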

Performance Implications

Preferring the Storage Engine

The last thing I want to talk about is the performance implications of using SUM() vs SUMX().  Given that SUMX() is an iterator, you may think it is inherently inefficient.  Generally speaking this is not true, as the software has been optimised to handle this scenario efficiently.  Having said that, bad DAX can definitely make SUMX() inefficient.  Let me explain.

Power Pivot has 2 calculation engines: the Storage Engine (SE) and the Formula Engine (FE).  The SE is faster, multi-threaded and cached.  The FE is slower, single-threaded and not cached.  This is a complex topic in its own right and I will only scratch the surface in this article, however the implication is that you should write your formulas to leverage the SE where possible.  Of course this can be hard if you don't know exactly how to do this, but there are a few simple tips that will help you.

  1. SUM() always uses the SE for its calculations, so there is nothing to worry about there.
  2. For most simple calculations (like Sales[Qty] multiplied by Sales[Price Per Unit]), SUMX() will also use the SE, so all good there.
  3. In some circumstances SUMX() may use the FE to do some or all of the calculation, particularly if you have a complex comparison statement in your formula.  If SUMX() needs to use the FE, then performance can be slow – sometimes very slow.

Regarding point 3, the best advice I can give you is to avoid writing complex conditional statements such as “IF Statements” within a SUMX function.  Consider the following 2 formulas:

Total Sales of Items more than $100 Bad 
      = SUMX(Sales,
          IF(Sales[ExtendedAmount] > 100, Sales[ExtendedAmount])
         )
Total Sales of Items more than $100 Good 
      = CALCULATE(
           SUMX(Sales, Sales[ExtendedAmount]),
           Sales[ExtendedAmount] > 100
        )

The first formula (Bad) has an IF statement within the SUMX().  This IF statement forces the Storage Engine to hand the evaluation over to the Formula Engine, which must check each individual row to see if it is greater than 100 before deciding whether to include it in the calculation.  As a result the Formula Engine must complete the task one row at a time, making the evaluation slow and inefficient.

The second formula (Good) first modifies the initial filter context coming from the visual, adding a new filter on Sales[ExtendedAmount] > 100.  This new filter is applied by the Storage Engine very efficiently.  After CALCULATE() modifies the filter context, the SUMX() can then do its job of adding up the remaining rows "in the current filter context" using the Storage Engine, not the Formula Engine.  As a result this second formula is very efficient.  In some simple testing I completed, the first (Bad) formula took 5x longer to complete than the second (Good) formula.  In other scenarios it could be hundreds or even thousands of times slower, so it clearly could be a problem.
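
As a side note, the filter argument in the Good version is shorthand. To the best of my knowledge, the engine expands a simple Boolean filter like this into an equivalent FILTER() over the values of the column, something like the sketch below (the measure name is hypothetical):

Total Sales of Items more than $100 Good Expanded 
      = CALCULATE(
           SUMX(Sales, Sales[ExtendedAmount]),
           FILTER(
              ALL(Sales[ExtendedAmount]),
              Sales[ExtendedAmount] > 100
           )
        )

The key difference from the Bad version is that the filter operates over the list of distinct values in a single column, which the Storage Engine can apply very efficiently, rather than a row by row IF test handed off to the Formula Engine.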

If you are in doubt as to which engine is actually being used to execute your formulas, the only sure way to check and confirm is to use a profiling tool such as DAX Studio or SQL Server Profiler to see what is happening under the hood.  Once again, this is a complex topic beyond the scope of this article.  I may come back and do an article on it another day, but until then, here is an introduction to DAX Studio and how it can be used for this purpose.

Compression Impacts on Performance

The second area that can impact performance is overall data model compression.  The more unique values that exist in a column in the data model, the less compressed the data will be. The less compressed the data, the more memory that is required and potentially the slower the calculations will be. Let’s take another look at the tables from earlier in this article.

Example Table 1

[Image: Example Table 1 – Sales table where every Total Sales value is unique]

The column of data Sales[Total Sales] has all unique values. This column would not compress well.

Example Table 2

[Image: Example Table 2 – Sales table with repeating values in the Qty and Price Per Unit columns]

In this table there are duplicate values in the Qty column and also in the Price Per Unit column.  The fewer the unique values in a column as a percentage of the row count, the better the compression.
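
If you want to check the cardinality of a column in your own model, one simple approach (a sketch, assuming the Sales table used above; the measure names are hypothetical) is to count the distinct values and compare that to the row count:

Unique Total Sales Values = DISTINCTCOUNT(Sales[Total Sales])

Sales Row Count = COUNTROWS(Sales)

The closer the first number is to the second, the worse the column will compress.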

Now of course these 2 sample tables are very small, but imagine the impact of this concept on very large tables (eg tables with millions of rows of data). For very large tables, the number of unique values in example table 1 could be significantly larger than the number of unique values in the columns in example table 2.  Loading the data as outlined in example 2 could therefore have a positive impact on total table size and hence on the performance of your data model.  Making this change may of course mean that you have to swap your measures from

SUM(Sales[Total Sales])

to

SUMX(Sales,Sales[Qty] * Sales[Price Per Unit])

however this usage of SUMX() instead of SUM() is perfectly fine and highly performant.  It is impossible to say what impact one vs the other will have overall; it depends entirely on your data.  One thing I would advise, though, is that you should not load all 3 of the columns shown in this simple example, ie don't load Qty, Price Per Unit, AND Total Sales.  You only need 2 of these columns loaded, because you can always derive the value in the third column from the other 2.  So if you are going to use Qty and Price Per Unit, then don't load Total Sales.  If you are going to use SUM(Sales[Total Sales]), then don't load both Qty and Price Per Unit.  I recommend you work out which 2 of the 3 columns you use the most, load those, and derive the third as needed on demand.

As always, there are exceptions to the rules.  Everyone’s data is different, so test out the techniques on your own data and see what gives you the best results.  If your data models are small and fast then it probably doesn’t matter.  If your data models start to get large and slow, then it is time to investigate the best options to try to maximise performance.


Comments

  1. My understanding was that
    SUM ( tblName[ColName] ) is just Syntax Sugar for SUMX (tblName, [ColName])
    My understanding was SUM is also an iterator – how can it be otherwise?

    • Hi Sam, I certainly understand your comment. I had the same thought when starting to learn DAX.

      SUM() definitely iterates through the values of whatever column is specified as the argument. That’s after all the only way to aggregate something. You need to have access to the individual values to do so.

      SUM() is not an iterator because it does not have the right to set the DAX engine into row context mode (one of the 2 contexts a DAX expression is evaluated under, even though either or both of them can be empty).  Only an iterator can set the DAX engine into row context mode, and SUMX() happens to be one.

      Operating under row context mode means that you can now refer to all the columns of a row instead of just one, as is the case with SUM().  In other words, it is now legal syntax to reference more than one column.  SUM(Sales[Qty] * Sales[Price Per Unit]) would not be legal syntax for the aforementioned reasons; SUMX(Sales, Sales[Qty] * Sales[Price Per Unit]) is legal syntax because SUMX() is an iterator and therefore operates in row context mode.  Both SUM() and SUMX() iterate through all the rows visible in the current filter context, but SUMX() does so with more degrees of freedom, so to speak.

      You can say that SUM ( tblName[ColName] ) is just syntax sugar for SUMX ( tblName, tblName[ColName] ) because only one column is involved.  The 2 expressions would give the same result in that case.  Well, as a matter of fact, the correct syntax sugar for SUM ( tblName[ColName] ) would be SUMX ( VALUES(tblName), tblName[ColName] ).  The difference is subtle and is only important if 'tblName' is a dimension table and an unknown member exists on this table.

      Best regards Jes.

      • @Oxenskiold – Thanks for your explanation. I understand what you mean by SUMX having access to any column of the table in the row that it is iterating.
        However your last line got me confused
        “SUM ( tblName[ColName] ) would be SUMX ( VALUES(tblName), tblName[ColName] ). The difference is subtle and is only important if 'tblName' is a dimension table and an unknown member exists on this table”

        Assuming tblName is a fact table, what is the difference between
        EVALUATE tblName and EVALUATE VALUES(tblName)?
        Both seem to return a table with the same number of rows despite having duplicate rows.

        • Hi Sam,
          “SUM ( tblName[ColName] ) would be SUMX ( VALUES(tblName), tblName[ColName] ). The difference is subtle and is only important if 'tblName' is a dimension table and an unknown member exists on this table”

          I almost regret writing this. Please stand by for a few days.

  2. This is one of the best articles on PowerPivot I have ever read, neatly incorporating every important concept in a clear and concise manner. Thanks a lot for helping us expand our understanding of how PowerPivot really works and how to use it most efficiently!

  3. To remember why you must use SUMX when multiplications between columns are involved (paragraphs 1 and 3 in your post), people should think of the order of operations in elementary school math (BODMAS / PEMDAS), where multiplication takes precedence over addition. SUMX is DAX’s way of enforcing proper order of operations.

  4. I just got 2016 installed, and am giving Power Pivot another try. Mostly, I love it (and I love articles like this one that give great watch-points!) My question: I want to change column order in a dataset in Power Pivot.

    I need to do that in the query, I understand that from the error message I get when I try to do it in Power Pivot. But when I change the column order in Power Query, some of the columns change to the new positions and some do not when I update the Power Pivot from the new query. I even tried deleting the table and bringing it in again from the new query after getting everything just as I want it.

    I checked the Query Advanced Editor to make sure the Reorder Columns statement is correct, and it is. Query is in proper order, Power Pivot model is not. I realize it doesn’t make any difference in the resulting formulas or pivot tables, but if I look at the tables, the order is not what I want. Thanks for your help!

    • I agree it is strange, but also agree it doesn't really matter. Could you use the column name list at the top left of the data view grid? That is always in alphanumeric order and will take you directly to the column you select.

  5. It might just be me being thick, but I’m not following why the example is using an average instead of a sum. SUM() vs. SUMX() is something I’ve been pondering for some time. I seem to get the same answer (If memory serves) whether I’m using:

    SUM(Financials[Obligations]) – SUM(Financials[Expended]) or

    SUMX(Financials,[Obligations]-[Expended])

    In this situation, which is the one to use?

    • In your example, both columns are additive so it makes no difference. Identical performance and results. But if one column is quantity and another is price per unit, then you must use sumx.

    • @Bruce
      SUM(Financials[Obligations]) is just shorthand for SUMX(Financials, [Obligations])
      So my guess would be
      SUMX(Financials,[Obligations]-[Expended]) would be faster

    • I have a production database with 39 million rows. I just tested three things
      1. SUM(column1) + SUM(column2)
      2. SUMX(table,column1) + SUMX(table,column2)
      3. SUMX(table,column1 + column2)

      1 and 2 delivered identical results with identical query plans averaging around 35 ms
      3 had a slightly different query plan and consistently took twice as long, at over 70 ms. But given the difference is only 35 ms, I guess it doesn't really matter – use whichever you want.

      This is my data and it may be different on other data

      • Thanks All! Good info to know. Honestly, how impressive is DAX if we are pondering 35 ms over 39 million rows… Hey Microsoft, Good on ya! (I’m not an Aussie but like the lingo.) Thanks again!

      • @Matt – This is very interesting and strange – I want to try this on my data as well.
        I just can't understand why SUMX(table, column1 + column2) should be slower than
        SUMX(table, column1) + SUMX(table, column2)
        Time to shine the "Help Italians" light

      • I don’t profess to be speaking based on my knowledge, but I can conceive how this could be. If we assume that there is an overhead to add 2 columns together, and sumx(table, column1 + column2) is likely to have more addition events than sumx(table, column1) + sumx(table,column2), then there would be less overhead in the second one. That is my guess

        • @Matt – Thanks for the explanation. I am aware of data compression depending on the cardinality of the columns.
          So my guess is the speed between the two approaches would vary greatly depending on the cardinality of the two number columns. I guess in most real life scenarios this cardinality would be high: eg Basic Amt + Tax Amt to arrive at a net amount would be slow.
          Whereas, let's assume Type + Rating = some fictitious metric, where Type has the values 1, 2 and 3 and Rating has 1, 2, 3, 4, 5 – that would be fast.

          • @Sam, that could be correct – I would just be guessing. I know the query plans for the 2 measures are different, so the engine is treating things a different way. Much of what the Italians do to optimise their DAX is done by reading the query plans and working out which is most efficient. Often this is a bit of trial, error and experience. There is a very good video where Alberto talks about this concept. I don't remember which one, but this one looks good http://www.sqlbi.com/tv/optimizing-dax-queries/

      • @Sam, like I said I can only share what I think (I am not an expert). But I have attended Marco Russo’s Optimising DAX training and I have some understanding of what is happening. With this in mind, here are some comments.
        A. Even though SUMX is an iterator, it doesn’t scan millions of rows. The engine is optimised and the data is compressed. The compression finds matches in the columns of data and stores it efficiently. Conceptually, if there were 2 unique values in a column but 4 million rows of data, the engine would only have to scan 2 rows, not 4 million (it is not exactly like that, but it should help conceptualise).
        B. There are 2 different tasks for this example. 1) Add up the values in the columns and 2) add the columns together. Given the way that compression works and given the different overhead of the different tasks, there may be more overhead to add the columns together row wise than to add the columns first on their own and then add them together at the end.
