Matt Here: I first met Ivan via Twitter when he was living in Russia, and then in person when he moved to Australia to live in Sydney with his family. Ivan is a talented Power BI professional and I have learnt many things from him over the last few years. Today Ivan is writing a guest blog and he will share with you how he used Power Query to help him with his move to Australia. If you like what you read, be sure to subscribe to Ivan’s blog (link at the bottom of the page). I will hand over to Ivan.
Introduction
Ivan: I recently moved to Australia with my family. While searching for a place to live I had to look for a good pair of suburb and school. I found interesting data at https://www.myschool.edu.au/. However, there is no option to compare schools. It just gives some statistics about Australian schools, one page at a time. The information presented on a typical school profile page (e.g. Burwood Public School) is as shown below.
It means that in worst case, we can manually copy / paste data to Excel and perform own analysis. Of course, this can be done only for limited number of schools. Even for twenty schools it would take a bit of time. However, nowadays, when we have such tools as Power Query (Get & Transform) in Excel, we can collect data from web sites efficiently, and easily “refreshable”. Everything that I describe and share in this post is created for non-commercial educational purpose only.
Getting Data of One School
Let’s take Burwood school as an example. Copy URL from browser
https://www.myschool.edu.au/SchoolProfile/Index/104736/BurwoodPublicSchool/41369.
And query data from this URL in Excel.
Note: I usually do all queries in Excel. If needed, I copy/paste them to Power BI.
Power Query automatically detects all tables available on web page, and presents options so we can choose the one we are interested in. When we need more than one table we need to create another query.
For this post, let’s say we are interested in parameter “Language background other than English”. But in this case, Power Query detects only one table on the school’s page, not the data I want!
The data I want is just a value on a chart and is not in a table (#1 below) and hence cannot be used for comparison purpose.
So the required data is not in tabular format. You can only retrieve tables from Web pages using Get Data from Web. This means we have to find a different way to get this data from a web page. Fortunately, there is a way to do this and I am sharing it with you in this post.
The Method
In Google Chrome, we can Inspect elements of web page by right-clicking and then selecting Inspect (#1 below).
In the HTML code below, we can see that “97%” (#1 below) is a text between tags “<tspan>” (#2 below) that in turn belongs to parent tag “<text>” (#3 below), and all of this is part of “<div id=”nonEnglishSpeakingStudents”…” (#4 below). Probably, this is what we need.
This may look scary, but these tags will help us to find the data we want in the HTML text of the web page. So we need to first get that text. Going back to Power Query, “From Web”, right-click on the URL (#1 below) and then click on Edit (#2 below).
Power Query automatically wraps content received from web into Web.Page function as shown below.
But what we need is just the text. So we have to change the setting “Open file as” to “Text File” in the Source step. Click on the cog to edit the source step.
When we use “Text File”, Power Query loads the web page html rows in one column.
We can filter the column in order to find the line that contains the text “nonEnglishSpeakingStudents”.
The step above leaves 1 single row in the table. All other rows have been removed. Selecting the single row that remains after filtering, we can see the required “97%” as part of the kept text.
Now, we can easily extract this text by using standard function from Power Query “Text Between Delimiters”.
All we need to do is specify Start and End delimiters, as show below. Very easy, no need to program own functions, just use the UI to help.
This adds a new column with the value we are looking for.
Getting Data for Another School
In theory, if we change the URL with another school ID, it should work the same way, because HTML text will have similar structure on the other school’s profile page.
Click on “Advanced Editor” and locate the URL with school ID in the ‘M’ language code.
To apply our query to another school’s profile page we need to change only the URL.
Let’s take, for example, Concord Public School. I changed the query and now the query returned 69 (as shown below), which is same as on the school’s page – it works!
Convert URL to Parameter
Now it works, I will convert the URL to a parameter instead of it coded as a string. It is better to create a parameter to make it more flexible. Home\Manage Parameters\New Parameter
Fill necessary fields (Parameter Name = URL, Text, Any Value)
I use another school here (Meadowbank Public School https://www.myschool.edu.au/school/41257) to check that solution works regardless of the school selected.
Now that I have URL as a parameter, I need to change the Power Query code to make it point to the parameter instead of hard coded string.
To do this, go to settings of step Source of the query. And select Parameter in drop-down list.
Creating a Function from Query
The reason why we started this task is to compare schools. So we need links for all the school profile pages.
The most effective way to do this in Power Query is to create function from existing query. In Power Query Editor, right-click on the query and select “Create Function”.
Provide a name to your function.
Function replicates M-code used in the initial query but adds an option to call itself with a parameter URL. Note that the original query is still there, plus new function.
Get URLs of Multiple Schools
Current version of website allows to search schools, for example, by suburb name or postal code.
While we search, in the browser we can see URL that contains search parameters –
https://www.myschool.edu.au/school-search?FormPosted=True&SchoolSearchQuery=ryde&SchoolSector=G&SchoolType=P&State=NSW
It uses following parameters:
SchoolSearchQuery=ryde – e.g. suburbs around Ryde
SchoolSector=G – Government
SchoolType=P – Primary
State=NSW
So, if needed, such string can be generated with a formula in Power Query.
In the same way as described earlier for ‘Getting data of one school’, we can query this search results web page and extract useful information from the text.
For example, we can get all the information displayed on the search results page along with the schools’ URLs that we require as parameters as shown below.
Having the URL of each school, we can extend this table with information from each school profile page, e.g. with “Non English Speaking Students %” as shown below by calling the function as an added column.
That’s it. The required information to compare schools is at hand now.
You can apply the same technique to any website where data is published in non-tabular format.
Practice Exercises
Now that you know how to query non-tabular data from web, try to get currency exchange rates from the following pages:
https://www.xe.com/currencytables/?from=AUD&date=2018-06-19
http://www.floatrates.com/daily/AUD.xml
https://www.exchange-rates.org/converter/AUD/EUR/1
https://www.x-rates.com/table/?from=AUD&amount=1
Simple Way to Get Data About Australian Schools
Though the intention of this post is to demonstrate what is possible with Power Query, in case you became inquisitive to know more about Australian schools, there is a simpler option to retrieve schools’ data.
In the footer of https://www.myschool.edu.au website you can find a link to http://www.acara.edu.au/contact-us/acara-data-access where all the data is available in Excel format.
About Author
Ivan Bondarenko (aka Ivan Bond) is a specialist in Excel-based and Power BI reporting, VBA developer, author of open-source SAP Business Objects Automation Tool and Excel-based solution for scheduling update of Excel files (aka Power Refresh).
If you are keen to read about Power Query techniques from time to time, subscribe to Ivan’s blog or Twitter @_Ivan_Bond.
Sample File
The Sample file for this post can be downloaded here.
I can’t do this on MacOs. Can you help me?
Power Query is only available to Mac Office 365 users at present.
If you don’t have access to Power Query in your version of Excel for Mac, then your best bet is to use Data Flows.
Data Flows allows you to use Power Query online, and then you can access your transformed data by connecting to the Data Flow, or by downloading its results.
You can find out more about Data Flows here:
https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-create
Thank you very much for this article. Instant bookmark!
Thank you. Thank you. Thank you. You’re an absolute CHAMPION!
Hi Ivan,thank you for this demonstration.I applied the same procedures to extract data from not tabular web page,for the first original page it works, however when I change URL to extract data in another web with the same format as the first one,it displays nothing.How can I fix this?
Have a look at this article.
Scraping Data from JavaScript Websites
This may assist, depending on the cause of the issue
Hi, great information in this post. One question I have, when trying to reproduce, I get a first page on the URL query to accept terms and conditions. When browsing thru a browser I just check the box and get to the page you mention in the article. But in Power Query, how do I got thru that instance of interaction to get to the page of the information to extract?
Thanks in advance,
Gabriel
Sounds bad to me. I am not sure it can be done. It may be a job for Python.
I’m not finding the Paramater field when doing this step below
“To do this, go to settings of step Source of the query. And select Parameter in drop-down list.”
oh found it !
thanks for the awesome technique !
Really Helpful! Just used this as I couldn’t figure out how to get to the source code of a website!
What a great article / technique. This is something that I wanted to do for quite a while and wasn’t able to figure it out so I gave up. Now I can go back and pull information from lots of web pages. THANKS!
Excellent post Ivan. Love the ending. Started for the query and was interested in the topic, then you closed with the data in Oz. Thanks again for the great post.
Good one Ivan… Very well written and explained.
I do not have the option to “Open file as” There are no options in my source step. Running Microsoft Office 365 ProPlus, Excel Version 1803 (Build 9126.2227)
Is there a solution to add this feature?
Thank you.
Robert
“But what we need is just the text. So we have to change the setting “Open file as” to “Text File” in the Source step. Click on the cog to edit the source step.
Oooops. I missed the step to Right Click Source. That was the solution.
Well done.
What do you think about the new “From Web” experience – could it help there?
Wow! Well articulated and extremely useful. I’ve stumbled into similar scenarios but you’ve made wonderfully explicit your process – thank you so much!
Great blog post, Ivan! Matt knows how to pick guest writers 🙂