Level: Intermediate
[Edit: April 2019]
Microsoft made the PDF Connector in Power BI generally available in April 2019. This is another instance to show that Microsoft is listening to the community to add or improve the Power BI features. The PDF File connector will appear under the File category in the Get Data dialog as shown below.
Select the PDF connector and specify the path to your PDF file. Power Query will extract tables automatically and present them to you in the Navigator dialog, where you can preview and select one or multiple tables.
My original article about the workaround prior to the PDF Connector in Power BI follows below.
Importing Tabular Data from a PDF Document into Excel (or Power BI) using Power Query
Today I am sharing a process I developed that allows you to import tabular data from a PDF document into Excel (or Power BI) using Power Query. I didn’t want to purchase software to do this task so I started experimenting on how I could do it with the tools I already have, and I am sharing my solution here today.
Note: This will only work for tabular data in a PDF – exactly the same as downloading from a Web Page must be in tabular form.
I also demo some neat Power Query tricks further down in this post, so make sure you read through to the end. These tricks are not hard, and you may be surprised at what you can do.
Process Overview
Here are the steps I use to grab the table from the PDF.
- Open the PDF in Microsoft Word.
- Save the file as a single file web page.
- Import the single file web page into Power Query as HTML.
- Go through the normal cleansing and transformation process you would normally do (plus of course the cool tricks I use below).
Worked Through Example
I did a quick Google to find a PDF containing something I could use in my demo from the Web. I found this PDF that contains a list of US States with their abbreviations
http://www.siue.edu/postal/pdf/Abbreviation-List.pdf Note the data I want from this file is in tabular format.
I saved the PDF to a known location on my PC.
Convert the PDF to a Web Page File
I opened Microsoft Word and then I opened the PDF file from within Word. I got this warning message below. I just clicked “don’t show this message again” and clicked OK.
I then selected File\Save As and saved the file as a Single File Web Page (shown below).
Import into Power Query
There are a couple of ways you can do the next step. I found the following to be the easiest.
Create a new query that connects to a text file (example shown below – Excel 2016, but it is similar in all other versions and Power BI)
In the dialog, change the file filter so it can find all file types (as shown below), then browsed until I found my file.
Then select the MHT file created earlier and click “import”
Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.
Power Query now correctly identifies the table in the document. I expanded the one identified as “3DTableGrid”.
In your own PDF documents, you may need to use some trial and error to find the right table.
I then clicked on the Table I wanted (shown as 2 above).
Cleansing the Data
The data is starting to look good at this stage. But note there is a bit of “noise” in the file (see below). Power Query is a great tool for fixing that.
First I got rid of the first row (Remove Top Rows, 1).
Next I copied the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”. Then I selected all 4 columns and did a Transform\Replace Values and replaced <= /span> with nothing.
I didn’t need the bottom 11 rows, so I did Remove Rows\Remove Bottom Rows\11
Now for Some Power Query Tricks
Power Query is a formula language. If you analyse each of the steps in the Applied Steps window, you will notice that each step is a formula, and the formula takes the output of the previous formula as the input step to the next formula. The UI assumes each step will take the previous step as an input to the new step, but it doesn’t have to be that way. You can override that if you want – I show you how below.
Turn on the formula bar before proceeding.
At this point, I have my data in 4 columns, but I really want all the data in just 2 columns.
This is easy to fix with some simple Power Query tricks, with very little PQL coding at all.
First I renamed this step to be called All4Columns. I like to rename significant steps in my Applied Steps window to make it easier to find the ones I need later.
Then I removed the last 2 columns using the UI. Select the columns to remove, right click and then remove columns.
Before moving on I renamed the 2 columns to be called “State” and “Abbreviation”, and renamed the step to be called First2Columns using the same renaming approach as before.
Now the trick. At any point in your query, you can add a custom query step by clicking the button shown in 1 below.
Power Query assumes you want to add a new step that starts from the previous step. So Power Query automatically adds the previous step as the starting point (as shown below).
But you don’t have to keep the previous step. In this case I want to refer to the step “All4Columns” instead. So I just typed in the name of the step I wanted (as shown below) to replace what Power Query added for me.
= All4Columns
Now that I had all 4 columns back, I went through the following steps.
- removed the first 2 columns
- renamed the columns to be “State” and “Abbreviation” as before
- renamed the step to be called Second2Columns
This then left me with 2 non-sequential steps (First2Columns, Second2Columns) that each contained half of the data.
Append The 2 Tables into a Single Table
I didn’t know how to write the code to do this, so I just used the UI again. I selected “Append Queries” and appended the query to itself. That forced the UI to write the code for me as shown below.
Not only did the UI write the code for me, but I learnt a single PQL function that I think I can probably remember Table.Combine( ) and hence I may be able to do the same step by hand next time around (this is how you can learn too). Then all I needed to do was change the code above so that instead of appending to itself, it would append the First2Columns to the Second2Columns. You don’t need to be a rocket scientist to work out how to do this :-). It looks like this after I made the change.
= Table.Combine({First2Columns, Second2Columns})
Wrap Up
I hope you found this article useful, and you have learnt a few things.
- How to import data from a PDF
- How to refer to non sequential rows in a query
- How to use the UI in Power Query to help you write more advanced queries.
- How to refer to the Formula Bar to start to build your knowledge of the Power Query Formula Language.
If you are into Power Pivot and you haven’t already done so, you may like to download my paper “Ten Things I Wish I Knew When I Started With Power Pivot” from the link below.
Edit: 24th Nov 2016
A friend of mine Dave Marriott was using my approach above but had a stack of PDFs that he needed to access. He decided to write some Word VBA code that automates the task of converting the PDFs to MHT files. Dave is a professional VBA programmer based in Adelaide, Australia if you would like to hire him to help you be more efficient in your business. You can contact him at LinkedIn
Copy this VBA code into Microsoft Word. Why not add it in your Normal Template so it is always available? If you haven’t done this before, you can follow my instructions on how to do this in Excel (it is exactly the same process in Word). Run the code to launch a dialog. You can multi-select as many PDFs as you like and the code will convert them all for you.
Sub ConvertToMHT() Dim f_dialog As FileDialog Dim FileChosen As Integer Dim FileName As String Dim i As Integer Set f_dialog = Application.FileDialog(msoFileDialogFilePicker) f_dialog.InitialView = msoFileDialogViewList f_dialog.AllowMultiSelect = True FileChosen = f_dialog.Show If FileChosen = -1 Then For i = 1 To f_dialog.SelectedItems.Count Documents.Open FileName:=f_dialog.SelectedItems(i) ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive ActiveWindow.Close Next i End If End Sub
if you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/
This is Awesome!! Thanks a looooooot
Awesome post. I’m going to use it in a training session next month.
thanks!!!
thanks
how can i save this query and reuse it again?
all queries can be copied from the Advanced Editor. Just copy the code and save it somewhere (I use OneNote). Then cut and paste it back into a new query (advanced editor) when you are ready
I commented that I learned what was related to the steps but something simpler occurred to me when you combined and performed the steps of the two columns, I did the same but with the source file until the step where you delete the two columanas, so duplicate both files and delete the rows as you did only in the duplicated files, then combine them, I found it much easier but this thanks to your excellent explanation. I send you a big hug. Beware.
wow great Matt, thanks.
Hi,
Thanks for the macro!
For some reason when importing the mht file into power query few lines disappeared.. so after many tries I turned of the RelyOnCSS option from the web options before saving the file and the problem was fixed..
“With ActiveDocument.WebOptions
.RelyOnCSS = False
.OptimizeForBrowser = False
.OrganizeInFolder = True
.UseLongFileNames = True
.RelyOnVML = False
.AllowPNG = False
.ScreenSize = msoScreenSize1024x768
.PixelsPerInch = 96
.Encoding = msoEncodingGreek
End With”
Matt, Thank you saved my day .
Wow, Great trick !! Thanks !
@Matt – Open PDF in word is available only from Office 2016 onward correct ?
No, this is definitely available in at least office 2013 – I saw someone use it yesterday.
Matt,
Wow. Simply amazing. Sure a clear and useful blog post.
You’re the Willie Wonka of Power Query solutions.
Thank you for sharing.
Excellent Post! Thanks!
Really good stuff there Matt, I’ve certainly learnt a lot of new things from this post.
I marvel at how you think of these things. The first2columns, Second2columns steps are simple genius.
Awesome, thanks !
Actually, I voted for this Power BI idea a couple of weeks ago : https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/6973371-tables-in-pdf-files
It would be even simpler to have PDF as a data source or, at least, the ability to easily pull tabular data from it !
Clicked the link and they started 3 days ago!
Actually it is further advanced than that. Miguel demo’d this at the Apps Summit
Another excellent and highly practical post. Thank you Matt.
Handy! Thanks for the tip.