Level: Beginners/Intermediate (Updated Oct 2018).
This is the third update to an article I first wrote three years ago and then updated a year ago. When I update my articles I simply change the original post with the latest content – that way only the current version is available, preventing people from stumbling on old advice.
Over the years that I have been teaching DAX, I have learnt new and improved ways to explain some of the more complex topics. This article is one of the most frequently read on my blog and I wanted to update it to continue to improve the value people get from my site. Even if you have read this article before, why not take another look and refresh your knowledge?
Before moving on to SUM() vs SUMX(), there are 2 important concepts about how Power BI and Power Pivot for Excel work that you must understand. Note: Both Power BI and Power Pivot share the same SQL Server Analysis Services (SSAS) engine, and that is why this article applies to both products (as well as SSAS of course).
When you write a formula in DAX, the result of the formula depends on which filters have been applied in the report. DAX is not the same as Excel. In Excel you can write slightly different formulas in each and every cell of the report, and each formula can itself point to different cells, creating different results. Every cell is standalone and unique. That is not how it works in DAX. In DAX you write a single formula such as SUM(Sales[Sales Amount]) and then use filters in the report to modify the results returned from the formula.
Filters can come from:
- The visuals in your workbook. A Pivot Table if you are using Power Pivot for Excel, or anywhere on the report canvas if you are using Power BI. The visuals in your report create the Initial Filters that impact your formulas.
- The use of a CALCULATE() function. CALCULATE() is the only* function that can change the Initial Filter behaviour of your report. Of course there may or may not be a CALCULATE() function in your DAX formula (or a precedent formula). If there is a CALCULATE() function, it can add to, remove from, or modify the initial filter behaviour of the formula.
Filters always get applied first, then the evaluation is completed. Filters first, evaluate second.
* Note: Filters can also come from an implicit CALCULATE() function. Every measure has a hidden CALCULATE() function wrapped around it that you can’t see. In certain circumstances (such as inside a row by row evaluation), this hidden CALCULATE() function can give you a different result than you may otherwise expect. This is a complex topic and outside the scope of this article. In addition, there is also a CALCULATETABLE() function and some other functions that use CALCULATE() “under the hood” that can modify the initial filter behaviour from your visuals.
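As a simple illustration of how CALCULATE() modifies the initial filters, consider the following pair of measures. This is just a sketch – it reuses the Sales[ExtendedAmount] and Products[Category] columns that appear later in this article.

```dax
Total Sales = SUM(Sales[ExtendedAmount])

// CALCULATE() adds Products[Category] = "Bikes" to the initial filters
// coming from the visual before [Total Sales] is evaluated.
Total Bike Sales = CALCULATE([Total Sales], Products[Category] = "Bikes")
```

The second measure returns bike sales in every cell of the visual, regardless of what the rows and columns of the visual are filtering on for Products[Category].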
The technical term for this filter behaviour is “Filter Context”, but I prefer to use the term “Filter Behaviour” as it is less intimidating for most people.
Filter behaviour is important because it will affect the results you get from your formulas. Let’s briefly look at how the initial filters work in Excel and Power BI.
Power Pivot for Excel
In the following Pivot Table example, the highlighted cell has an initial filter of Products[Category] = “Bikes” coming from the rows of the pivot table (shown as 1). Initial filters can also come from Filters, Columns and Slicers in a pivot table.
Power BI Desktop
Power BI desktop is similar to Excel, but initial filters can come from almost anywhere in the report. The highlighted cell below is filtered by Products[Category]=”Bikes” (shown as 1) just like in Excel above, but there is also a cross filter coming from Territory[Country]=”Australia” (shown as 2). Both of these filters are part of the initial filters. Filters can also come from Columns, Slicers, and the filter section on the right hand side of the Power BI report window.
Now onto a new topic.
Row By Row Evaluation
A second important topic for you to understand is ‘row by row evaluation’. Not every DAX formula is capable of doing calculations row by row. Some areas of DAX (such as a calculated column) have the ability to do this row by row evaluation, while other areas cannot. For example, you can write a single formula in a calculated column such as Sales[Qty] * Sales[Price Per Unit] and this formula is evaluated one row at a time down the entire column. If you were to write this exact same formula as a measure, you would get an error.
The technical term for this behaviour is “Row Context”, but I prefer to use the term “Row by Row Evaluation” as it is less intimidating for most people.
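To illustrate, here is the formula mentioned above written as a calculated column. This is a sketch only; the hypothetical column name Line Total is mine, while the Sales table columns are those used throughout this article.

```dax
// Valid as a calculated column – each row of the Sales table provides
// the row context, so the expression is evaluated once per row.
Line Total = Sales[Qty] * Sales[Price Per Unit]

// The same expression written as a measure returns an error, because a
// measure has no row context of its own; an iterator such as SUMX()
// is needed instead (covered below).
```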
SUM() vs SUMX()
Now that I have covered the foundation knowledge, let me get back to the actual purpose and topic of this article. I will start with an overview of both of these functions.
- SUM() is an aggregator function. It adds up all the values in a single column you specify after applying all filters that will impact the formula. SUM() has no knowledge of the existence of rows (it can’t do row by row evaluation) – all it can do is add everything in the single column it is presented with after the filters have been applied.
- SUMX() is an iterator function. It works through a table, row by row to complete the evaluation after applying all filters. SUMX() has awareness of rows in a table, and hence can reference the intersection of each row with any columns in the table. SUMX() can operate over a single column but can also operate over multiple columns too – because it has the ability to work row by row.
- SUM() operates over a single column and has no awareness of individual rows in the column (no row by row evaluation).
- SUMX() can operate on multiple columns in a table and can complete row by row evaluation in those columns.
Both functions can end up giving you the same result, but not always, and they arrive at the answer in very different ways. They often give the same results inside the rows of a matrix or visual, yet different results in the subtotal and total rows of that visual.
OK, now it is time to look more in depth at SUM and SUMX, one at a time.
The SUM() Function
Syntax: = SUM(<Column Name>)
Example: Total Sales = SUM(Sales[ExtendedAmount])
The SUM() function operates over a single column of data to aggregate all the data in that single column with the current filters applied – filter first, evaluate second.
The SUMX() Function
Syntax: = SUMX(<Table>, <expression> )
Example: Total Sales SUMX = SUMX(Sales,Sales[Qty] * Sales[Price Per Unit])
SUMX() will iterate through the table specified in the first parameter, one row at a time, and complete the calculation specified in the second parameter (e.g. Quantity x Price Per Unit as shown in the example above), with the current filters applied (i.e. still filter first, evaluate second). Once it has done this for every row in the specified table (after the current filters are applied), it then adds up the results of all of the row by row calculations. It is this total that is returned as the result.
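As a worked example, imagine the current filters leave just two rows in the Sales table (the numbers are hypothetical):

```dax
Total Sales SUMX = SUMX(Sales, Sales[Qty] * Sales[Price Per Unit])

// Row 1: 2 × $10 = $20
// Row 2: 3 × $5  = $15
// SUMX() then adds the row results: $20 + $15 = $35
```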
Which One Should I Use?
Which you use really depends on your personal preference and the structure of your data. Let’s look at a couple of examples.
- Quantity and Price Per Unit
- Extended Amount
- Totals Don’t Add Up
1. Quantity and Price
If your Sales table contains a column for Quantity and another column for “Price Per Unit” (as shown above), then you will necessarily need to multiply (one row at a time) the Quantity by the “price per unit” in order to get Total Sales. It is no good adding up the total quantity SUM(Quantity) and multiplying it by the average price AVERAGE(Price Per Unit) as this will give the wrong answer.
If your data is structured in this way (like the image above), then you simply must^ use SUMX() – this is what the iterator functions were designed to do. Here is what the formula would look like.
Total Sales 1 =SUMX(Sales,Sales[Qty] * Sales[Price Per Unit])
You can always spot an Iterator function as it always has a table as the first input parameter. This is the table that is iterated over by the function.
^Note: I say must, but actually this is where many (most) business people fall in a hole. To solve this problem, rather than using SUMX() as prescribed above, most business people tend to gravitate towards a calculated column to solve the problem. A calculated column solves the problem in the same way as SUMX(), but with one big difference – it permanently stores the row by row results in your workbook. This is generally bad and you should avoid this. I recommend you read my article Measures vs Calculated Columns for a more in depth coverage of this topic.
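To make the difference concrete, here are the two approaches side by side. The calculated column version is shown only so you can recognise it and avoid it; the hypothetical names Line Total and Total Sales CC are mine.

```dax
// Approach 1: calculated column (avoid) – the row by row result is
// computed once at refresh time and stored permanently in the workbook.
Line Total = Sales[Qty] * Sales[Price Per Unit]   // calculated column
Total Sales CC = SUM(Sales[Line Total])           // measure over that column

// Approach 2: SUMX() measure (preferred) – the same row by row logic,
// evaluated at query time with nothing stored in the workbook.
Total Sales 1 = SUMX(Sales, Sales[Qty] * Sales[Price Per Unit])
```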
2. Extended Amount
If your data contains a single column with the Extended Total Sales for that line item (ie it doesn’t have quantity and price per unit), then you can use SUM() to add up the values.
Total Sales 2 =SUM(Sales[Total Sales])
There is no need for an iterator in this example because it is just a simple aggregation across a single column and row by row evaluation is not required. Note however that you “could” still use SUMX() (as shown below) and it will give you the same answer.
Total Sales 2 alternate = SUMX(Sales, Sales[Total Sales])
Despite what your intuition may tell you, this alternate formula using SUMX() is identical in performance and efficiency to the SUM() version. More on that below.
3. Totals Don’t Add Up
There is another use case where you simply must use SUMX() that is less obvious. When you encounter the problem where the totals don’t add up as you need/expect, you will need to use an iterator like SUMX() to correct the problem. I have created a small table of sample data to explain.
The table above shows 4 customers with the average amount of money they spend each time they have shopped as well as the number of times they have been shopping. If I load this data into Power BI and then try to use aggregator functions to find the average spend across all customers as well as the total amount spent, I get the wrong answers in the total row (as shown below).
Here are the measures from above.
Total Number of Visits = SUM(VisitData[Number of Visits]) – the total is correct for this formula.
Avg Spent per visit Wrong = AVERAGE(VisitData[Spend per Visit]) – the total is wrong here.
Total Spent Wrong = [Avg Spent per visit Wrong] * [Total Number of Visits] – the total is wrong here too.
The first measure [Total Number of Visits] is correct because the data is additive, but the other 2 measures give the wrong result. This is a classic situation where you can’t perform multiplication on the averages at the grand total level. Given the sample data that I started with, the only way to calculate the correct answer is to complete a row by row evaluation for each customer in the table as shown below.
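A quick hypothetical with just two customers shows why multiplying the averages fails at the total level:

```dax
// Customer A: $100 per visit ×  2 visits = $200
// Customer B:  $50 per visit × 10 visits = $500
// Correct total (row by row, the SUMX approach): $200 + $500 = $700
// AVERAGE × SUM at the total level: (($100 + $50) / 2) × (2 + 10) = $900 – wrong
```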
In this second table above I have written a SUMX() to create the Total Spent SUMX (row by row) first. Only then do I calculate the average spend per visit as the final formula.
Total Number of Visits = SUM(VisitData[Number of Visits])
Total Spent SUMX = SUMX(VisitData,VisitData[Spend per Visit] * VisitData[Number of Visits])
Avg Spent per visit Correct = DIVIDE([Total Spent SUMX] , [Total Number of Visits])
In this second case, SUMX is working through the table of data one row at a time and is correctly calculating the result, even for the total row at the bottom of the table.
Preferring the Storage Engine
The last thing I want to talk about is the performance implications of using SUM vs SUMX. Given that SUMX is an iterator, you may think that SUMX is inherently inefficient. Generally speaking this is not true as the software has been optimised to handle the scenario efficiently. Having said that, bad DAX can definitely cause SUMX to be inefficient. Let me explain.
Power Pivot has 2 calculation engines, the Storage Engine (SE) and the Formula Engine (FE). The SE is faster, multi threaded and cached. The FE is slower, single threaded and not cached. This is a complex topic in its own right and I will only scratch the surface in this article, however the implication is that you should write your formulas to leverage the SE where possible. Of course this can be hard if you don’t know exactly how to do this, but there are a few simple tips that will help you.
- SUM() always uses the SE for its calculations, so there is nothing to worry about there.
- For most simple calculations (like Sales[qty] * Sales[price per unit]), SUMX() will also use the SE, so all good there.
- In some circumstances SUMX() may use the FE to do some or all of the calculation, particularly if you have a complex comparison statement in your formula. If SUMX() needs to use the FE, then performance can be slow – sometimes very slow.
Regarding this third point, the best advice I can give you is to avoid writing complex conditional statements such as IF statements within a SUMX() function. Consider the following 2 formulas:
Total Sales of Items more than $100 Bad = SUMX(Sales, IF(Sales[ExtendedAmount] > 100, Sales[ExtendedAmount]) )
Total Sales of Items more than $100 Good = CALCULATE( SUMX(Sales,Sales[ExtendedAmount]), Sales[ExtendedAmount] > 100 )
The first formula (Bad) has an IF statement within a SUMX. This IF statement forces the Storage Engine to pass the evaluation task over to the Formula Engine for a comparison check to see if each individual row is greater than 100 before deciding to include it or not in the calculation. As a result the formula engine must complete the task one row at a time making the evaluation slow and inefficient.
The second formula (Good) first modifies the initial filters coming from the visual using CALCULATE() to add an additional filter on Sales[ExtendedAmount] > 100. This new filter is applied by the Storage Engine very efficiently. After CALCULATE() modifies the filters, the SUMX() can then do its job of adding up the remaining rows with the new set of filters applied, using the Storage Engine rather than the Formula Engine. As a result this second formula is very efficient. In some simple testing I completed, the first (bad) formula took 5x longer to complete than the second (good) formula. In other scenarios it could be hundreds or even thousands of times slower, so this clearly could be a problem.
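For the curious, the simple boolean filter in the “Good” formula is shorthand that DAX expands internally into a FILTER() over that single column, which is what lets the Storage Engine apply it efficiently. The expanded form looks like this:

```dax
// Equivalent expanded form of the "Good" formula – the boolean filter
// Sales[ExtendedAmount] > 100 is shorthand for a FILTER() over the
// distinct values of that single column.
Total Sales of Items more than $100 Good =
CALCULATE(
    SUMX(Sales, Sales[ExtendedAmount]),
    FILTER(ALL(Sales[ExtendedAmount]), Sales[ExtendedAmount] > 100)
)
```

Note the FILTER() here iterates only the distinct values of one column, not the whole Sales table, which is why it stays fast.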
If you are in doubt as to which engine is actually being used to execute your formulas, the only sure way to check and confirm is to use a profiling tool such as DAX Studio or SQL Profiler to check what is happening under the hood. Once again this is a complex topic beyond the scope of this article. I may come back and do an article on this topic another day, but until then here is an introduction to DAX studio and how it can be used for this purpose.
Compression Impacts on Performance
The second area that can impact performance is overall data model compression. The more unique values that exist in a column in the data model, the less compressed the data will be. The less compressed the data, the more memory that is required and potentially the slower the calculations will be. Let’s take another look at the tables from earlier in this article.
Example Table 1
The column of data Sales[Total Sales] has all unique values. This column would not compress well.
Example Table 2
In this table there are duplicate values in the Qty column and also in the Price Per Unit column. The fewer unique values in a column, the better the compression.
Now of course these 2 sample tables are very small, but imagine the impact of this concept on very large tables (eg tables with millions of rows of data). It is possible that for very large tables, the number of unique values in example table 1 will be significantly greater than the number of unique values in the columns in example table 2. It is therefore possible that loading data as outlined in example 2 could have positive impacts on total table size and hence performance of your data model. Making a change may of course mean that you have to swap your measures from

SUM(Sales[Total Sales])

to

SUMX(Sales,Sales[Qty] * Sales[Price Per Unit])
This usage of SUMX instead of SUM is perfectly fine and highly performant. It is impossible to say what impact one vs the other will have overall; it depends entirely on your data. But one thing I would advise is that you should not load all 3 of the columns as shown in this simple example, i.e. don’t load Qty, Price Per Unit, AND Total Sales. As no doubt you can see, you only need 2 of these columns of data loaded because you can always derive the value in the third column from the other 2. So if you are going to use Qty and Price Per Unit, then don’t load Total Sales. If you are going to use SUM(Sales[Total Sales]), then don’t load both Qty and Price Per Unit. I recommend you load your most used column, then determine which of the other 2 columns you will use the most, and derive the third one as needed on demand.
As always, there are exceptions to the rules. Everyone’s data is different, so test out the techniques on your own data and see what gives you the best results. If your data models are small and fast then it probably doesn’t matter. If your data models start to get large and slow, then it is time to investigate the best options to try to maximise performance.
Want to Learn More from a Pro?
If you found it easy to learn from this article, then you may like to consider completing some more structured learning from me. My books, online Power BI training, online Power Query training and live training all have the same “easy to learn and understand” approach.