Import Tabular Data from PDF using Power Query

Level: Intermediate

Today I am sharing a process I developed that allows you to import tabular data from a PDF document into Excel (or Power BI) using Power Query.  I didn’t want to purchase software to do this task so I started experimenting on how I could do it with the tools I already have, and I am sharing my solution here today.

pdf-to-pq

Note:  This will only work for tabular data in a PDF – exactly the same as downloading from a Web Page must be in tabular form.

I also demo some neat Power Query tricks further down in this post, so make sure you read through to the end.  These tricks are not hard, and you may be surprised at what you can do.

Process Overview

Here are the steps I use to grab the table from the PDF.

  • Open the PDF in Microsoft Word.
  • Save the file as a single file web page.
  • Import the single file web page into Power Query as HTML.
  • Go through the normal cleansing and transformation process you would normally do (plus of course the cool tricks I use below).

Worked Through Example

I did a quick Google to find a PDF containing something I could use in my demo from the Web.  I found this PDF that contains a list of US States with their abbreviations

http://www.siue.edu/postal/pdf/Abbreviation-List.pdf    Note the data I want from this file is in tabular format.

image

I saved the PDF to a known location on my PC.

Convert the PDF to a Web Page File

I opened Microsoft Word and then I opened the PDF file from within Word.  I got this warning message below.  I just clicked “don’t show this message again” and clicked OK.

image

I then selected File\Save As and saved the file as a Single File Web Page (shown below).

2016-11-18_121653

 

Import into Power Query

There are a couple of ways you can do the next step. I found the following to be the easiest.

Create a new query that connects to a text file (example shown below – Excel 2016, but it is similar in all other versions and Power BI)

image

In the dialog, change the file filter so it can find all file types (as shown below), then browsed until I found my file.

image

Then select the MHT file created earlier and click “import”

image

Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.

image

Power Query now correctly identifies the table in the document.  I expanded the one identified as “3DTableGrid”.

image

In your own PDF documents, you may need to use some trial and error to find the right table.

I then clicked on the Table I wanted (shown as 2 above).

Cleansing the Data

The data is starting to look good at this stage.  But note there is a bit of “noise” in the file (see below). Power Query is a great tool for fixing that.

image

First I got rid of the first row (Remove Top Rows, 1).

Next I copied the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”.  Then I selected all 4 columns and did a Transform\Replace Values and replaced <= /span> with nothing.

I didn’t need the bottom 11 rows, so I did Remove Rows\Remove Bottom Rows\11

image

Now for Some Power Query Tricks

Power Query is a formula language. If you analyse each of the steps in the Applied Steps window, you will notice that each step is a formula, and the formula takes the output of the previous formula as the input step to the next formula.  The UI assumes each step will take the previous step as an input to the new step, but it doesn’t have to be that way.  You can override that if you want – I show you how below.

Turn on the formula bar before proceeding.

formula

At this point, I have my data in 4 columns, but I really want all the data in just 2 columns.

image

This is easy to fix with some simple Power Query tricks, with very little PQL coding at all.

First I renamed this step to be called All4Columns.  I like to rename significant steps in my Applied Steps window to make it easier to find the ones I need later.

image

Then I removed the last 2 columns using the UI.  Select the columns to remove, right click and then remove columns.

image

Before moving on I renamed the 2 columns to be called “State” and “Abbreviation”, and renamed the step to be called First2Columns using the same renaming approach as before.

image

Now the trick. At any point in your query, you can add a custom query step by clicking the button shown in 1 below.

image

Power Query assumes you want to add a new step that starts from the previous step. So Power Query automatically adds the previous step as the starting point (as shown below).

image

But you don’t have to keep the previous step.  In this case I want to refer to the step “All4Columns” instead.  So I just typed in the name of the step I wanted (as shown below) to replace what Power Query added for me.

= All4Columns

Now that I had all 4 columns back, I went through the following steps.

  • removed the first 2 columns
  • renamed the columns to be “State” and “Abbreviation” as before
  • renamed the step to be called Second2Columns

This then left me with 2 non-sequential steps (First2Columns, Second2Columns) that each contained half of the data.

Append The 2 Tables into a Single Table

I didn’t know how to write the code to do this, so I just used the UI again.  I selected “Append Queries” and appended the query to itself.  That forced the UI to write the code for me as shown below.

image

Not only did the UI write the code for me, but I learnt a single PQL function that I think I can probably remember Table.Combine( ) and hence I may be able to do the same step by hand next time around (this is how you can learn too).  Then all I needed to do was change the code above so that instead of appending to itself, it would append the First2Columns to the Second2Columns.  You don’t need to be a rocket scientist to work out how to do this :-).   It looks like this after I made the change.

= Table.Combine({First2Columns, Second2Columns})

Wrap Up

I hope you found this article useful, and you have learnt a few things.

  1. How to import data from a PDF
  2. How to refer to non sequential rows in a query
  3. How to use the UI in Power Query to help you write more advanced queries.
  4. How to refer to the Formula Bar to start to build your knowledge of the Power Query Formula Language.

If you are into Power Pivot and you haven’t already done so, you may like to download my paper “Ten Things I Wish I Knew When I Started With Power Pivot” from the link below.

Edit: 24th Nov 2016
A friend of mine Dave Marriott was using my approach above but had a stack of PDFs that he needed to access. He decided to write some Word VBA code that automates the task of converting the PDFs to MHT files. Dave is a professional VBA programmer based in Adelaide, Australia if you would like to hire him to help you be more efficient in your business. You can contact him at LinkedIn

Copy this VBA code into Microsoft Word.  Why not add it in your Normal Template so it is always available?  If you haven’t done this before, you can follow my instructions on how to do this in Excel (it is exactly the same process in Word).  Run the code to launch a dialog. You can multi-select as many PDFs as you like and the code will convert them all for you.

Sub ConvertToMHT()

Dim f_dialog As FileDialog
Dim FileChosen As Integer
Dim FileName As String
Dim i As Integer
Set f_dialog = Application.FileDialog(msoFileDialogFilePicker)
f_dialog.InitialView = msoFileDialogViewList
f_dialog.AllowMultiSelect = True

FileChosen = f_dialog.Show
If FileChosen = -1 Then
     For i = 1 To f_dialog.SelectedItems.Count
          Documents.Open FileName:=f_dialog.SelectedItems(i)
          ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive
          ActiveWindow.Close
     Next i
End If

End Sub

if you want a comprehensive lesson on how to use Power Query, checkout my training course here https://exceleratorbi.com.au/product/power-query-excel-power-bi-online-training/


Share?

Comments

  1. I marvel at how you think of these things. The first2columns, Second2columns steps are simple genius.

  2. Matt,

    Wow. Simply amazing. Sure a clear and useful blog post.

    You’re the Willie Wonka of Power Query solutions.

    Thank you for sharing.

  3. Hi,

    Thanks for the macro!

    For some reason when importing the mht file into power query few lines disappeared.. so after many tries I turned of the RelyOnCSS option from the web options before saving the file and the problem was fixed..

    “With ActiveDocument.WebOptions
    .RelyOnCSS = False
    .OptimizeForBrowser = False
    .OrganizeInFolder = True
    .UseLongFileNames = True
    .RelyOnVML = False
    .AllowPNG = False
    .ScreenSize = msoScreenSize1024x768
    .PixelsPerInch = 96
    .Encoding = msoEncodingGreek
    End With”

Leave a Reply

Your email address will not be published. Required fields are marked *

Ten Things I Wish I Knew When I Started With Power Pivot.   Learn  Without The Pain!

x