How Query Folding Works - Excelerator BI

How Query Folding Works

Level: Beginners

Power Query is a product that allows business users to extract data from any source, cleanse and transform that data, and then load the modified data to a new location, typically into an Excel spreadsheet or a Power BI data model.  Power Query is very easy to learn, and most tasks can be completed by using the buttons on the menus.

When the data source to be loaded is a file or files eg a csv(s), Power Query reads the data from the file(s) and processes the transformation tasks directly inside Power Query.  This is fine for relatively simple tasks (eg remove or rename columns) and relatively small files (say less than 1 million rows).  But bad things can happen when you try to load tens of millions of rows using Power Query from a csv file and then try to preform ‘heavy’ tasks, such as sorting, over the data to be loaded.  To sort 10 million rows from a csv, Power Query would first need to load every record into memory and then complete the sort task from there.  While it may be possible to do this, it certainly isn’t the best approach.

Persistent Storage

Power BI and Power Pivot for Excel do not have their own persistent storage of the source data.  These products are a database, but they are a reporting database, not a storage database.  You will need a place to persistently store your source data outside of Power BI and Power Pivot for Excel if you want to be able to refresh your data in the future.

You can of course keep your source files (csv, xlsx) as the persistent storage if you have relatively small amounts of data, but if you are going to process a lot of data you really should consider loading the source data into a relational database. This could be something like Microsoft Access but better still, you could use something like SQL Server if you have it available. Even if you don’t have access to SQL Server, you could consider installing SQL Server Express on your laptop (it is free).  Believe me SQL Server is not that hard to learn (with YouTube) and it is well worth the effort to better manage your persistent storage of data.

What is Query Folding

Inside Power Query, the term query folding refers to the the process of taking complex tasks generated by the Power Query UI (such as sorting 10 million rows of data) and pushing those tasks back to the database for execution.  I like to think of Power Query “writing down” some SQL code on a piece of paper, folding the paper in half and passing it over to the database as a set of instructions for execution.

Power Query will automatically execute query folding under certain conditions including:

  1. The source must be a database that can accept a folding request.  Most relational databases are included.  It can also be an OData source, SSAS and Google Analytics (thanks Miguel and Chris).
  2. If you start your query with SQL code you wrote yourself, no further steps inside Power Query will be folded.
  3. Once you perform a step that cannot be folded, query folding stops for the rest of the query.

Power Query Online Training

How do I Know if Query Folding is Active?

When you use Power Query, the steps you produce are listed in the right side “applied steps” panel. You can check if a query step has generated query folding (or not) by checking for the existence of Native Query code inside that step.  To do this, right-click on one of the query steps and then select “view Native Query”. If you are able to do this, ie if view Native Query is enabled, it means Power Query has successfully performed query folding.

Do the Heavy Lifting First

Your objective when using Power Query over a relational database should be to do all the heavy lifting first.  You basically want to make sure query folding is enabled for as long as possible so the harder tasks are performed more efficiently by the database and not by Power Query itself.  The best way to check is to check if the last step in your query uses query folding.   If the last step uses query folding, then every step uses query folding.  However if you find that that last step is not using query folding, then you should work you way back up the list of applied steps to find out where query folding stopped.  If possible, change the order of execution of your steps so that all the foldable tasks are completed first before you complete any step that cannot be folded (e.g. Split Column).

I have created a brief video to show how Query Folding works and how to check where folding stops (there is no sound in this video, so you can look at it now 🙂 ).

Power Up Your Skills

You can learn this stuff from online articles, but why not excelerate your learning (see what I did there?) by doing some formal training.  I have an excellent online Power Query training course that will teach you query folding and so much more with over 7.5 hours of structured video instruction using real world examples.  You can checkout my training course here => Power Query for Excel and Power BI – Online Training

23 thoughts on “How Query Folding Works”

  1. Adolfo Socorro

    Hi Matt, how come query folding for SQL Server does not work if one enters a SQL statement when connecting to the database? Does that mean one always needs to start with a connection to a table or a view? Thanks!

    1. I don’t know for sure “why”, but I can guess. Query folding is essentially PowerQuery automatically writing a SQL statement for you. I guess this is relatively easy when starting from blank, but difficult when starting from any of an infinite number of pre-written queries written by you (that may or may not be well written). So yes, you need to start with a simple connection.

  2. In short, it depends. My best tip is to filter asap in the query. Right click the applied step to see if query folding is active, or use a profiling tool on the db to see exactly what happens. There is no other way to be sure. Sorry.

  3. You are close in both statements. This is how I would describe it.

    Power Query will load what ever data is needed in order to produce a sample table showing the top 1000 rows of data after applying all the applied steps. The top 1000 rows are then cached in memory and disk and used to build out the new query steps in the editor. If a query step then requires the query to load more data (such as a sort descending step) then it will repeat the process before storing the new 1000 rows in cache memory and disk.

    The advantage of this is it doesn’t have to execute all the applied steps on the entire table during editing. Imagine if you had a 100m row table and you did a Merge Column step in the query. With the current approach, the step is recorded and the preview shows the results on the top 1000 rows – it doesn’t have to execute the full 100m rows until you click Close and Apply.

    Hope that helps

    1. Hi Matt,
      I too am trying to get my head around the query editor. I would really appreciate some clarification.

      I have taken the question a step further.

      While I am in Query Editor, I filter the query for records after a certain date. For example Sales transactions after 1st April 2019. Will the query editor still bring in to Power BI all records including Pre and post 1st April 2019. However, will only display in report view(after I click on close/apply) records dated after 1st April 2019. As a consequence, no matter how filters are placed, Query Editor will pull out of the API all records in that table. For example will put out 1,000,000 rows to store in Query Editor but will only display in report view of Power BI, 80,000 row. (sales records after 1st April).
      Thank you in advance

  4. Brilliant read! Thank you for this Matt.
    Did I understand correctly through?
    I have a query (data set) in Power Query which the source data comes from 5 CSV’s (over 5M records) appended together through a folder connector. Then, I return two more Queries (data sets) from a SQL DB Source using a SQL query, conduct a LEFT OUTER MERGE from the CSV data set as it requires some fields from the two SQL DB Sources, the performance will be slow as Power Query must load every record into Memory first and then complete the Left Out Merge.
    However, if I load the CSV’s into a SQL DB Table, therefore I will no longer be loading CSV’s into Power Query, but instead loading from a SQL DB Source, performance should increase as Power Query will now be pushing back those tasks (Multiple MERGES) to the database for execution.
    Thanks,
    Laz

    1. Yes Laz, that’s correct. It is easy to test if you have access to SQL. The final performance from SQL will depend on normal SQL issues, such as the existence of suitable indexes and other performance tuning features

      1. Lazaros Viastikopoulos

        Hey Matt,

        I am diving much deeper into Power BI and trying to understand performance much more, which reminded me of this article you shared. If you can shed any light on this, it would be very much appreciated.

        This was my initial assumption:

        “Power Query only shows a preview of data and DOES NOT load data into memory. Data is only loaded into memory once we click on ‘Close and Apply’ and Power Query loads data into Power Pivot.”

        When I read the above to myself, it seems incorrect, as surely the transformations must take place before loading anything into Power Pivot (Model). So, this now leads me to believe this:

        When applying transformation to data in Power Query, the data is first loaded into memory and then the transformation is applied. After we click the ‘Close and Apply’ button, the Data is once again re-loaded into the model. This is why the loading prompt appears.

        Thanks in advance,

        Laz

  5. I love query folding. A comment, and a plea for help.
    Comment: If you change the data type as a separate step, it will most likely break query folding, but if you include the data type change in the step by manually editing the formula in the formula bar, it will change the data type and preserve query folding. The below will preserve query folding.

    #”Added Commission” = Table.AddColumn(dbo_oeordlin_sql, “Commission”, each [unit_price] * .05),
    #”Changed Type” = Table.TransformColumnTypes(#”Added Commission”,{{“Commission”, type number}}),

    That preserves query folding because the data type was already a number and it is just a visual in PQ to change ABC/123 to 1.2 in the data type header.

    This will break query folding:
    #”Added Commission” = Table.AddColumn(dbo_oeordlin_sql, “Commission”, each [unit_price] * .05),
    #”Changed Type” = Table.TransformColumnTypes(#”Added Commission”,{{“Commission”, Int64.Type}}),
    It had to change a decimal number to an integer. It won’t fold that, so folding breaks, but the below will preserve query folding:

    = Table.AddColumn(dbo_oeordlin_sql, “Commission”, each [unit_price] * .05, Int64.Type)

    Just insert “, Int64.Type” in the formula bar before the last paren manually, and it works. Note that editing the formula might destroy that, and you’ll have to put it back.

    As for the plea for help, I would love it if anyone had any ideas how to convert a YYYYMMDD integer into a valid date and not break folding. I’ve tried dozens of methods, and it always breaks. Converting to a date is no issue, but folding gets killed.

    1. Can you get someone to create a view in SQL that has the date column you need? I’m not even sure if SQL can do that task efficiently – I’m not knowledgeable enough on TSQL to know.

    2. Has anyone found a solution to the YYYYMMDD date issue mentioned above by edh? In my case, the YYYYMMDD field comes across from the source in String format. So, anytime I transform the column it breaks query folding.

  6. Hi Matt,
    Nice article, but there are a few inaccuracies. First, there are a few other sources apart from relational databases and OData that support query folding, for example SSAS and Google Analytics. Second, your point about no manual M coding allowed is wrong – there’s no difference between M code you write manually and M code generated by the UI, some things can be folded and some things can’t but it depends on what you do in the code, no how you write the code. You’re right though that using your own SQL query prevents any further folding.

    Chris

  7. Matt- great tutorial. In this example, it is straightforward. Any way you can do a blog (or video) of query folding when doing a left outer join? I am having trouble understanding this:

    You connect to Table 1 (75MB) in a database and do some transformations and close and load as a connection.
    Then you connect to Table 2 (100MB) in the same database and do some transformations and close and load as a connection.
    So now we have two queries in power query. When I do a left outer join (Table 1 is the left) it takes a long time. Obviously query folding is not taking place. What steps using the Power Query interface can I do in this scenario to do the join with query folding so it wont take 30 minutes?
    That seems like a much harder query folding task than simply sorting rows. Any help on this would be super helpful. Thank you.

      1. Matt,
        Thanks for putting up a follow up video. Both videos have were really enlightening to me. Doing the transformations after the left outer join was the missing piece for me. Appreciate you taking the time to educate all of us. Thanks again.

  8. hola Mat siempre es genial tus publicaciones, seria genial que los cursos estuvieran disponibles en español. cuidate mucho, exitos en todo. 🙂

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top