Level: Intermediate
There is a concept in the DAX language called “Syntax Sugar”. Simply put, the developers have created simplified versions of more complex formulas to make it easier for people to learn and use the DAX functions. Here are a couple of examples.
Total Sales Syntax Sugar = SUM(Sales[ExtendedAmount])
Total Sales Full Syntax = SUMX(Sales, Sales[ExtendedAmount])
Total Sales Prior Year Syntax Sugar = CALCULATE( [Total Sales], SAMEPERIODLASTYEAR('Calendar'[Date] )
Total Sales Prior Year Full Syntax = CALCULATE( [Total Sales], DATEADD('Calendar'[Date], -1, YEAR) )
Simple Syntax in CALCULATE DAX function
Another example of syntax sugar is the simple syntax inside CALCULATE. There are 2 different ways you can apply a filter inside CALCULATE.
= CALCULATE(<expression>, Table[Column] compared to a scalar value, ...)
= CALCULATE(<expression>, TableFunction(),. ..)
Consider the following example using the simple syntax.
Total Sales of Bikes Simple = CALCULATE( [Total Sales], Product[Category] = "Bikes" )
The above formula is syntax sugar for the following full DAX function formula.
Total Sales of Bikes Full Version = CALCULATE( [Total Sales], FILTER( ALL(Product[Category]), Product[Category] = "Bikes" ) )
As you can see, the simple version of the DAX formula is clearly easier for a beginner to learn and understand. Under the hood, the simple version is converted to the full version prior to execution.
But Why use ALL(Table[Column])?
One thing a curious mind may be interested in is “why does the full version use ALL(Table[Column]) and not ALL(Table)”? Before I answer that question, let me first talk about compression. The Vertipaq engine (used in Power BI and Power Pivot) is a column store database. Traditional databases (eg SQL Server) use row store technology – each row is stored one full row at a time. A column store database stores the data one column at a time. Some of the consequences (benefits) of a column store database are that it is very efficient to filter, iterate and operate over entire columns of data in a table.
Compression Explained
There are a few different types of compression used in a the Vertipaq database, and I am only going to discuss one of them here. Let’s assume you have a data table with 6 columns and 100 million rows, and one of the columns is “Quantity”. The objective is to sum the quantity and come up with a total. A row store database (without an index) would need to retrieve all 100 million rows of data, including those columns not needed in the calculation, to come up with the answer. A column store database can directly access the single column to get the answer. What’s more, the columns can be compressed by the Vertipaq engine prior to being stored and accessed. One compression method is called run length encoding (RLE).
In the image below, the original column of data contains various values (in this case, the numbers 1 to 5). The way RLE works is that it first sorts the column, then creates a compressed version of the column, similar to that shown below (illustrative).
As you can see above, the compressed version of the column is small compared to the original. In my example the original column/table has only 18 rows, but if the original table had 100 million rows with values from 1 to 5, it would still compress to exactly the same compressed version as shown above. The calculation to add up the columns in my sample above would therefore look something like this (pseudo code).
Value 1 exists 3 times (starting row 4 minus 1). Multiply 1 * 3 Value 2 exists 3 times (starting row 7 minus 4). Multiply 2 * 3 Value 3 exists 4 times (starting row 11 minus 7). Multiply 3 * 4 Value 4 exists 3 times (starting row 14 minus 11). Multiply 4 * 3 Value 5 exists 5 times (starting row 19 minus 14). Multiply 5 * 5 Add up all the totals
As you can no doubt understand, the above process is identical for an 18 row column and a 100 million row column. What’s more, the Vertipaq engine is mutli-threaded. If you have 4 cores on your machine, the calculations can be shared amongst the cores and completed in parallel before finally being combined to return the final result.
Uniqueness is Your Enemy
It follows that the more unique values in a column, the lower the compression. There also comes a tipping point where it is better just to store the uncompressed column than it is to try to compress the data. This is why it is better to remove un-needed precision and uniqueness from a column before loading. Examples include rounding to 2 decimal places rather than keeping unrounded decimal numbers, separating date and time into 2 columns and rounding the time to minutes rather than keeping the seconds (if seconds are not needed) etc.
Back to ALL (Table[Column])
Now, back to the syntax sugar. The reason it is more efficient to filter, iterate and operate over a column than a table is because the engine can iterate over the compressed version of the column. the ALL() DAX function returns a table that contains all the distinct values in a column. This is exactly what the “Value” column in the compressed table in the image above contains (except the blank row at the bottom). So that is the reason the syntax sugar is written the way it is – it is to leverage the efficiencies of the Vertipaq engine.
Final Advice for DAX functions
In the wise words of Alberto Farrari, “Never filter a table if you can filter a column instead”. Now you know why.
Wonderfully written piece. How did you figure out what the DAX “sugar” functions really called?
A combination of learning from others, speaking to Jeffrey Wang at Microsoft, and then once I had an understanding of how it works, you can actually guess.
I suspect that Syntax Sugar uses a boolean logic to retain efficiency due to the issue that FILTER is considered generally inefficient. After reading Microsoft article I think it explains the following error that you get when you try to use Syntax sugar with a measure
(For example: CALCULATE([Total Orders Measure], Big_Table[SomeColumn] > [Avg Price Measure])
The error that you will get: “A function ‘Calculate’ has been used on a True/False expression that is used as a table expression. This is not allowed”.
https://docs.microsoft.com/en-us/power-bi/guidance/dax-avoid-avoid-filter-as-filter-argument
No, that’s not it. FILTER is not generally considered inefficient, as long as you use it efficiently. The syntax sugar in this case ensures that FILTER is executed efficiently under the hood.
Hi Matt – Nice article. How were you able to determine the actual underlying code the DAX “sugar” functions are calling? This could be extremely educational and helpful in certain circumstances.
Well i know some people that know this stuff – Marco Russo, Jeffrey Wang. But if you use DAX studio, you can see the queries generated by Power BI. You can write the different measures and see that the same query gets executed. However it is not as simple as just referring to a list of syntax sugar.
Thank you Matt. DAX Studio will not show what I want. For example, if I call TOTALYTD in my code, DAX Studio shows that I called TOTALYTD but it does not show the code TOTALYTD is actually executing. This is what I’d love to see.
The only reason I can think of that you would need one column is if you want to create a relationship using that column. That implies you will also have a date/time calendar table. This is not a common requirement. And yes, don’t split the columns and then add a calculated column later. This is worse for 2 reasons. Firstly you now have 3 columns instead of 2, and second calculated columns are generally less well compressed (not always, but at least there is the possibility that they are less well compressed).
ThX Matt
This article made me think one step forward, regarding optimizing one’s Data:
In the article you gave an example that it will help separate Date from time, in to 2 columns, so there will be less unique members in the column on the expense of having 2 columns.
I understand this well but then I think:
If I must have one dimension (or column) of both (as in your example) of date and time, will it payoff (from the performance side only) to separate them, as you mentioned in to 2 columns, and then with DAX add a third calculated column of them joined together? or will all the advantage be loosed?
Excellent article. Understanding what is really happening behind the scenes helps me understand DAX better.
In your first set of examples, isn’t there still some syntax sugar to be teased out:
Total Sales Prior Year Full Syntax =
CALCULATE([Total Sales],DATEADD(Calendar[Date],-1,YEAR))
Isn’t *that* syntax sugar for this:
Total Sales Prior Year Full Syntax =
Total Sales Prior Year Full Syntax =
CALCULATE(
[Total Sales],
DATEADD(CALCULATETABLE(VALUES(‘Date'[Date])),-1,YEAR)
)
Possibly, I don’t know for sure. I can’t see what CALCULATETABLE does in your example, as VALUES already respects the filter context. Then of course the second and third parameters are doing something under the hood too. I guess at the nth degree, all code is just syntax sugar for machine code.
You’re right Ed.
The ‘Date'[Date] part of DATEADD is syntax sugar for CALCULATETABLE(VALUES(‘Date'[Date])). This is important to realize because DATEADD(Calendar[Date],-1,YEAR) behaves differently from ,DATEADD(VALUES(Calendar[Date]),-1,YEAR) when evaluated under a row context.
“Understanding what is really happening behind the scenes helps me understand DAX better.”.
I couldn’t have put it better.
If you’re a stickler for details you could argue that the real syntax sugar is CALCULATETABLE(DISTINCT(‘Date'[Date])); using DISTINCT instead of VALUES. This because the time intelligence functions generally ignore the blank row (Unknown member). (How would you shift/move the blank row/unknown member?).
Best regards Jes.
Great explanation on the column filter. The picture made it easy to understand
Matt, I am currently facing a problem of connection my pbix file to an sql adventureworks2012 installed on
my pc. I have tried a lot of combinations and permutations but to no avail. Can u please guide me in this.
Thanks in advance
I think it best if you post a question at community.powerbi.com
Again a great article, Matt. However one typo in the pseudo code: “Multiply 5 * 5” ?
Beautiful article, Matt. Run Length Encoding is analogous to ‘Range Slicing’ in Excel. I wrote something on Range Slicing some time back at Stack Overflow that might be of interest by way of comparison: https://stackoverflow.com/a/32488911/2507160