Using Power Query for Extracting Non-Tabular Data from Web Pages

Matt Here:  I first met Ivan via Twitter when he was living in Russia, and then in person when he moved to Australia to live in Sydney with his family. Ivan is a talented Power BI professional and I have learnt many things from him over the last few years.  Today Ivan is writing a guest blog and he will share with you how he used Power Query to help him with his move to Australia. If you like what you read, be sure to subscribe to Ivan’s blog (link at the bottom of the page).  I will hand over to Ivan.

Introduction

Ivan: I recently moved to Australia with my family. While searching for a place to live I had to look for a good pair of suburb and school.  I found interesting data at https://www.myschool.edu.au/. However, there is no option to compare schools. It just gives some statistics about Australian schools, one page at a time. The information presented on a typical school profile page (e.g. Burwood Public School) is as shown below.

It means that in worst case, we can manually copy / paste data to Excel and perform own analysis. Of course, this can be done only for limited number of schools. Even for twenty schools it would take a bit of time.  However, nowadays, when we have such tools as Power Query (Get & Transform) in Excel, we can collect data from web sites efficiently, and easily “refreshable”.  Everything that I describe and share in this post is created for non-commercial educational purpose only.

Getting Data of One School

Let’s take Burwood school as an example. Copy URL from browser

https://www.myschool.edu.au/SchoolProfile/Index/104736/BurwoodPublicSchool/41369.

And query data from this URL in Excel.

Note: I usually do all queries in Excel. If needed, I copy/paste them to Power BI.

Power Query automatically detects all tables available on web page, and presents options so we can choose the one we are interested in. When we need more than one table we need to create another query.

For this post, let’s say we are interested in parameter “Language background other than English”. But in this case, Power Query detects only one table on the school’s page, not the data I want!

The data I want is just a value on a chart and is not in a table (#1 below) and hence cannot be used for comparison purpose.

So the required data is not in tabular format. You can only retrieve tables from Web pages using Get Data from Web.  This means we have to find a different way to get this data from a web page.  Fortunately, there is a way to do this and I am sharing it with you in this post.

The Method

In Google Chrome, we can Inspect elements of web page by right-clicking and then selecting Inspect (#1 below).

In the HTML code below, we can see that “97%” (#1 below) is a text between tags “<tspan>” (#2 below) that in turn belongs to parent tag “<text>” (#3 below), and all of this is part of “<div id=”nonEnglishSpeakingStudents”…” (#4 below). Probably, this is what we need.

This may look scary, but these tags will help us to find the data we want in the HTML text of the web page. So we need to first get that text.  Going back to Power Query, “From Web”, right-click on the URL (#1 below) and then click on Edit (#2 below).

Power Query automatically wraps content received from web into Web.Page function as shown below.

But what we need is just the text. So we have to change the setting “Open file as” to “Text File” in the Source step.  Click on the cog to edit the source step.

When we use “Text File”, Power Query loads the web page html rows in one column.

We can filter the column in order to find the line that contains the text “nonEnglishSpeakingStudents”.

The step above leaves 1 single row in the table.  All other rows have been removed.  Selecting the single row that remains after filtering, we can see the required “97%” as part of the kept text.

Now, we can easily extract this text by using standard function from Power Query “Text Between Delimiters”.

All we need to do is specify Start and End delimiters, as show below. Very easy, no need to program own functions, just use the UI to help.

This adds a new column with the value we are looking for.

Getting Data for Another School

In theory, if we change the URL with another school ID, it should work the same way, because HTML text will have similar structure on the other school’s profile page.
Click on “Advanced Editor” and locate the URL with school ID in the ‘M’ language code.

To apply our query to another school’s profile page we need to change only the URL.
Let’s take, for example, Concord Public School. I changed the query and now the query returned 69 (as shown below), which is same as on the school’s page – it works!

Convert URL to Parameter

Now it works, I will convert the URL to a parameter instead of it coded as a string. It is better to create a parameter to make it more flexible.  Home\Manage Parameters\New Parameter

Fill necessary fields (Parameter Name = URL, Text, Any Value)

I use another school here (Meadowbank Public School https://www.myschool.edu.au/school/41257) to check that solution works regardless of the school selected.

Now that I have URL as a parameter, I need to change the Power Query code to make it point to the parameter instead of hard coded string.

To do this, go to settings of step Source of the query. And select Parameter in drop-down list.

Creating a Function from Query

The reason why we started this task is to compare schools. So we need links for all the school profile pages.

The most effective way to do this in Power Query is to create function from existing query. In Power Query Editor, right-click on the query and select “Create Function”.

Provide a name to your function.

Function replicates M-code used in the initial query but adds an option to call itself with a parameter URL. Note that the original query is still there, plus new function.

Get URLs of Multiple Schools

Current version of website allows to search schools, for example, by suburb name or postal code.

While we search, in the browser we can see URL that contains search parameters –
https://www.myschool.edu.au/school-search?FormPosted=True&SchoolSearchQuery=ryde&SchoolSector=G&SchoolType=P&State=NSW

It uses following parameters:
SchoolSearchQuery=ryde – e.g. suburbs around Ryde
SchoolSector=G – Government
SchoolType=P – Primary
State=NSW
So, if needed, such string can be generated with a formula in Power Query.
In the same way as described earlier for ‘Getting data of one school’, we can query this search results web page and extract useful information from the text.

For example, we can get all the information displayed on the search results page along with the schools’ URLs that we require as parameters as shown below.

Having the URL of each school, we can extend this table with information from each school profile page, e.g. with “Non English Speaking Students %” as shown below by calling the function as an added column.

That’s it. The required information to compare schools is at hand now.

You can apply the same technique to any website where data is published in non-tabular format.

Practice Exercises

Now that you know how to query non-tabular data from web, try to get currency exchange rates from the following pages:

https://www.xe.com/currencytables/?from=AUD&date=2018-06-19

http://www.floatrates.com/daily/AUD.xml

https://www.exchange-rates.org/converter/AUD/EUR/1

https://www.x-rates.com/table/?from=AUD&amount=1

Simple Way to Get Data About Australian Schools

Though the intention of this post is to demonstrate what is possible with Power Query, in case you became inquisitive to know more about Australian schools, there is a simpler option to retrieve schools’ data.
In the footer of https://www.myschool.edu.au website you can find a link to http://www.acara.edu.au/contact-us/acara-data-access where all the data is available in Excel format.

About Author

Ivan Bondarenko (aka Ivan Bond) is a specialist in Excel-based and Power BI reporting, VBA developer, author of open-source SAP Business Objects Automation Tool and Excel-based solution for scheduling update of Excel files (aka Power Refresh).

If you are keen to read about Power Query techniques from time to time, subscribe to Ivan’s blog or Twitter @_Ivan_Bond.

Sample File

The Sample file for this post can be downloaded here.

8 thoughts on “Using Power Query for Extracting Non-Tabular Data from Web Pages”

  1. What a great article / technique. This is something that I wanted to do for quite a while and wasn’t able to figure it out so I gave up. Now I can go back and pull information from lots of web pages. THANKS!

  2. I do not have the option to “Open file as” There are no options in my source step. Running Microsoft Office 365 ProPlus, Excel Version 1803 (Build 9126.2227)

    Is there a solution to add this feature?

    Thank you.

    Robert

    “But what we need is just the text. So we have to change the setting “Open file as” to “Text File” in the Source step. Click on the cog to edit the source step.

  3. Wow! Well articulated and extremely useful. I’ve stumbled into similar scenarios but you’ve made wonderfully explicit your process – thank you so much!

Leave a Comment

Your email address will not be published. Required fields are marked *