Import Tabular Data from PDF using Power Query - Excelerator BI

Import Tabular Data from PDF using Power Query

Level: Intermediate

[Edit: April 2019]

Microsoft made the PDF Connector in Power BI generally available in April 2019. This is another instance to show that Microsoft is listening to the community to add or improve the Power BI features. The PDF File connector will appear under the File category in the Get Data dialog as shown below.

Pdf Connector

Select the PDF connector and specify the path to your PDF file. Power Query will extract tables automatically and present them to you in the Navigator dialog, where you can preview and select one or multiple tables.

clip_image044

My original article about the workaround prior to the PDF Connector in Power BI follows below.

Importing Tabular Data from a PDF Document into Excel (or Power BI) using Power Query

Today I am sharing a process I developed that allows you to import tabular data from a PDF document into Excel (or Power BI) using Power Query.  I didn’t want to purchase software to do this task so I started experimenting on how I could do it with the tools I already have, and I am sharing my solution here today.

pdf-to-pq

Note:  This will only work for tabular data in a PDF – exactly the same as downloading from a Web Page must be in tabular form.

I also demo some neat Power Query tricks further down in this post, so make sure you read through to the end.  These tricks are not hard, and you may be surprised at what you can do.

Process Overview

Here are the steps I use to grab the table from the PDF.

  • Open the PDF in Microsoft Word.
  • Save the file as a single file web page.
  • Import the single file web page into Power Query as HTML.
  • Go through the normal cleansing and transformation process you would normally do (plus of course the cool tricks I use below).

Worked Through Example

I did a quick Google to find a PDF containing something I could use in my demo from the Web.  I found this PDF that contains a list of US States with their abbreviations

http://www.siue.edu/postal/pdf/Abbreviation-List.pdf    Note the data I want from this file is in tabular format.

image

I saved the PDF to a known location on my PC.

Convert the PDF to a Web Page File

I opened Microsoft Word and then I opened the PDF file from within Word.  I got this warning message below.  I just clicked “don’t show this message again” and clicked OK.

image

I then selected File\Save As and saved the file as a Single File Web Page (shown below).

2016-11-18_121653

Import into Power Query

There are a couple of ways you can do the next step. I found the following to be the easiest.

Create a new query that connects to a text file (example shown below – Excel 2016, but it is similar in all other versions and Power BI)

image

In the dialog, change the file filter so it can find all file types (as shown below), then browsed until I found my file.

image

Then select the MHT file created earlier and click “import”

image

Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.

image

Power Query now correctly identifies the table in the document.  I expanded the one identified as “3DTableGrid”.

image

In your own PDF documents, you may need to use some trial and error to find the right table.

I then clicked on the Table I wanted (shown as 2 above).

Cleansing the Data

The data is starting to look good at this stage.  But note there is a bit of “noise” in the file (see below). Power Query is a great tool for fixing that.

image

First I got rid of the first row (Remove Top Rows, 1).

Next I copied the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”.  Then I selected all 4 columns and did a Transform\Replace Values and replaced <= /span> with nothing.

I didn’t need the bottom 11 rows, so I did Remove Rows\Remove Bottom Rows\11

image

Power Query Online Training

Now for Some Power Query Tricks

Power Query is a formula language. If you analyse each of the steps in the Applied Steps window, you will notice that each step is a formula, and the formula takes the output of the previous formula as the input step to the next formula.  The UI assumes each step will take the previous step as an input to the new step, but it doesn’t have to be that way.  You can override that if you want – I show you how below.

Turn on the formula bar before proceeding.

formula

At this point, I have my data in 4 columns, but I really want all the data in just 2 columns.

image

This is easy to fix with some simple Power Query tricks, with very little PQL coding at all.

First I renamed this step to be called All4Columns.  I like to rename significant steps in my Applied Steps window to make it easier to find the ones I need later.

image

Then I removed the last 2 columns using the UI.  Select the columns to remove, right click and then remove columns.

image

Before moving on I renamed the 2 columns to be called “State” and “Abbreviation”, and renamed the step to be called First2Columns using the same renaming approach as before.

image

Now the trick. At any point in your query, you can add a custom query step by clicking the button shown in 1 below.

image

Power Query assumes you want to add a new step that starts from the previous step. So Power Query automatically adds the previous step as the starting point (as shown below).

image

But you don’t have to keep the previous step.  In this case I want to refer to the step “All4Columns” instead.  So I just typed in the name of the step I wanted (as shown below) to replace what Power Query added for me.

= All4Columns

Now that I had all 4 columns back, I went through the following steps.

  • removed the first 2 columns
  • renamed the columns to be “State” and “Abbreviation” as before
  • renamed the step to be called Second2Columns

This then left me with 2 non-sequential steps (First2Columns, Second2Columns) that each contained half of the data.

Append The 2 Tables into a Single Table

I didn’t know how to write the code to do this, so I just used the UI again.  I selected “Append Queries” and appended the query to itself.  That forced the UI to write the code for me as shown below.

image

Not only did the UI write the code for me, but I learnt a single PQL function that I think I can probably remember Table.Combine( ) and hence I may be able to do the same step by hand next time around (this is how you can learn too).  Then all I needed to do was change the code above so that instead of appending to itself, it would append the First2Columns to the Second2Columns.  You don’t need to be a rocket scientist to work out how to do this :-).   It looks like this after I made the change.

= Table.Combine({First2Columns, Second2Columns})

Wrap Up

I hope you found this article useful, and you have learnt a few things.

  1. How to import data from a PDF
  2. How to refer to non sequential rows in a query
  3. How to use the UI in Power Query to help you write more advanced queries.
  4. How to refer to the Formula Bar to start to build your knowledge of the Power Query Formula Language.

If you are into Power Pivot and you haven’t already done so, you may like to download my paper “Ten Things I Wish I Knew When I Started With Power Pivot” from the link below.

Edit: 24th Nov 2016
A friend of mine Dave Marriott was using my approach above but had a stack of PDFs that he needed to access. He decided to write some Word VBA code that automates the task of converting the PDFs to MHT files. Dave is a professional VBA programmer based in Adelaide, Australia if you would like to hire him to help you be more efficient in your business. You can contact him at LinkedIn

Copy this VBA code into Microsoft Word.  Why not add it in your Normal Template so it is always available?  If you haven’t done this before, you can follow my instructions on how to do this in Excel (it is exactly the same process in Word).  Run the code to launch a dialog. You can multi-select as many PDFs as you like and the code will convert them all for you.

Sub ConvertToMHT()

Dim f_dialog As FileDialog
Dim FileChosen As Integer
Dim FileName As String
Dim i As Integer
Set f_dialog = Application.FileDialog(msoFileDialogFilePicker)
f_dialog.InitialView = msoFileDialogViewList
f_dialog.AllowMultiSelect = True

FileChosen = f_dialog.Show
If FileChosen = -1 Then
     For i = 1 To f_dialog.SelectedItems.Count
          Documents.Open FileName:=f_dialog.SelectedItems(i)
          ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive
          ActiveWindow.Close
     Next i
End If

End Sub

if you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/

21 thoughts on “Import Tabular Data from PDF using Power Query”

    1. all queries can be copied from the Advanced Editor. Just copy the code and save it somewhere (I use OneNote). Then cut and paste it back into a new query (advanced editor) when you are ready

  1. I commented that I learned what was related to the steps but something simpler occurred to me when you combined and performed the steps of the two columns, I did the same but with the source file until the step where you delete the two columanas, so duplicate both files and delete the rows as you did only in the duplicated files, then combine them, I found it much easier but this thanks to your excellent explanation. I send you a big hug. Beware.

  2. Hi,

    Thanks for the macro!

    For some reason when importing the mht file into power query few lines disappeared.. so after many tries I turned of the RelyOnCSS option from the web options before saving the file and the problem was fixed..

    “With ActiveDocument.WebOptions
    .RelyOnCSS = False
    .OptimizeForBrowser = False
    .OrganizeInFolder = True
    .UseLongFileNames = True
    .RelyOnVML = False
    .AllowPNG = False
    .ScreenSize = msoScreenSize1024x768
    .PixelsPerInch = 96
    .Encoding = msoEncodingGreek
    End With”

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top