Level: Beginners
I have blogged before about how to combine multiple files in a folder using Power Query. The way I have always done it is to use a Function, and this is not hard to do if you follow the instructions I provided on my last blog on this topic at PowerPivotPro.com.
Recently I had a specific requirement where I needed to do a quick validation check over some CSV files to make sure that no data was missing. I could have spent 2 minutes writing a quick function and then import all the files but it occurred to me that there is another way.
Why is combining multiple files hard at all?
Power Query has an inbuilt feature that allows you to import data from a folder (shown below), so why is this a problem at all?
Well there are 2 reasons.
- Firstly it is common for the files you want to combine to have a header row in each file. If it wasn’t for this header row then you could just combine the contents of a folder into a single import and you would be done (as long as the files are all the same shape).
- The second issue is that Power Query only loads a few thousand rows of data into the preview window. So sometimes you can’t even see the next header row to filter it out. All looks fine when you write the query but then it fails on refresh.
So what to do? – well there are a couple of tricks you can use
I have created some sample data – I have 3 csv files all of the same format all the same folder. Each of these sample files has 100,000 rows of data – sales for 1 day in each file. You can download the sample files here if you want to use them to work through this yourself.
I open a blank workbook and select Power QueryFrom FileFrom Folder and navigate to the folder where my files are stored.
I then select to expand the “Content” column that shows “Binary” in the rows.
As you can see below, Power Query only loads about 1,000 rows (shown as 1 below) and indicates that the list of values may be incomplete (shown as 2 below). So I click on load more (shown as 3 below).
And I get an error message. “ [DataFormat.Error] We couldn’t parse the input provided as a Date value.” The issue is that Power Query automatically changed the data types of my columns for me based on the first 1,000 rows. But my files have 100,000 rows of data, and the first row of the next file and subsequent files will always be another header. It is this second (and subsequent) header(s) that cause the error.
So the next thing is to delete* this “Changed Type” step by clicking the cross next to the step in the Applied Steps window.
Now when I click on the “load more” button, I get a full list of values including the next header record. All I need to do is deselect “date” and all the files will load successfully. In my sample data my header row has “date” as the header in this column. I could apply this same step in either of the other columns to filter out the header row, but of course I would have to filter out the word “Product” from the second column or “Qty” from the third column.
After deselecting Date, you should then go back and manually apply the “Changed Type” step again to change the data types.
*Note: If you prefer, you could keep the original “changed type” step and simply insert a step in the process before the data types were changed – it is up to you.
But what if you have lots of large files that are slow to load?
Sometimes files are slow to load into Power Query, particularly when you have a lot of very large/wide Excel files. When this happens, clicking “load more” can be quite a slow process. There is an alternate trick that you can use instead of the steps show above.
First deselect any item in the list. it doesn’t matter which one – you just do this to force Power Query to create the line of code that I will manually edit.
After you have completed the above step, you will be able to see the Power Query Formula Language code in the formula bar (shown as 1 below). If you can’t see the formula bar then you can turn it on by selecting ViewFormula Bar (steps 2 and 3 below).
In my sample, the code generated by Power Query is.
= Table.SelectRows(#"Promoted Headers", each ([Date] <>
“15-Aug”
))
The only important thing to note is the bit after “not equal to” ie after “< >”. The last step then is to replace the value “15-Aug” in this line of code with the value you want to exclude – in this case it is the text “Date” (because this is the text in the header row that I want to exclude).
= Table.SelectRows(#"Promoted Headers", each ([Date] <>
“Date”
))
It is also possible to use this trick to remove Null entries from your data. To do this, you would use the keyword null with no quotes as follows
= Table.SelectRows(#"Promoted Headers", each ([Date] <> null))
Wrap Up
And there you have it. A method to quickly combine multiple files in a folder with Power Query without having to write a function.
If you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/power-query-online-training/
You are brilliant. I’m reading your book Supercharge Power BI and I appreciate your way of demonstrating the measures and the metaform you use. It helps me a lot. Not only to learn Power BI but to learn communicating different subjects with my auditors. Thanks Matt!
Awesome Matt!
It helped in solving my problem.
Thanks
Mirza
Hi Matt,
I have come across approaches for combining multiple files in a folder (via Chris Webb and here I think) where the loading for one is set up, then a second query runs to process each file in turn.
Is this method, or the one you outline above a better one to use?
Cheers
Phil
I wouldn’t say one is better than the other, just different. If you have simple file structure with a single header row in each file followed by many rows of data, then the approach on this post is quick and easy. If you have something more complex with the need to clear out some rows from the header section before processing the file, then using a function is probably better. I recommend you learn both techniques. If you read the post I did on PowerPivotPro.com (linked in this blog above) where I combine multiple sheets from 1 workbook using a function, you will learn the easiest way to combine multiple objects using a function. I may even rewrite that PPP blog here using multiple files – for completeness.
Hope that makes sense