Extracting Website Data
As discussed in the previous article, prior to the development of the New Web Table Inference feature in Power BI, extracting tabular data from websites only worked well when the page had an underlying HTML table presenting the data. But with the new Web Table Inference feature in your toolbox, Power Query analyses the underlying code in the rendered page of the target website to work out what data to extract. At the time of this writing, the New Web Table Inference feature is still being developed and hence doesn’t work for all web pages, and sometimes needs a little massaging, but the potential of this feature is very exciting.
To activate this feature, go to File\Options and Settings\Options\Preview Features and turn on New Web Table Inference.
Scraping Dynamic Data
The data I want to extract is details of the Fortune 500 Companies, as listed by Fortune.com.
Before Turning on New Web Table Inference
If I try to extract the list without using the new preview feature, Power Query presents only a table captured from the footer of the page.
Note how Power Query was unable to identify the tabular structure of the core content of the page and instead offers a single object (Table 1) revealing the data from the Sections table of the footer contents. This standard approach to scraping is insufficient to get data from such a complex website and hence we need to use the new preview feature.
The Power of New Web Table Inference
With the new feature enabled, there are 2 table objects Power Query has detected. While Table 1 looks very similar to the contents of the dynamic table I am attempting to scrape, it is missing some of the columns, so I need to implement the “Add table using examples” button to train Power Query exactly what I am targeting.
NOTE: If you are attempting this at home, and only see one table (Sections data) presented even after the New Web Table Inference feature is enabled, just close your Get Data wizard, and attempt to connect to the web data source again. Also, try adjusting the url to connect via http:// rather than https://. We’re pushing the capabilities of this new feature to the limits by asking it to scrape such a complicated website, and as the feature is still in development it doesn’t work perfectly every time.
Training the query
By clicking “Add table using examples”, Power Query allows you to build your own table by showing it what you expect the table to reveal. In this instance I am targeting the complete table, but it is best practice to just scrape the data you need.
After clicking the button you are presented with a split screen. At the top is an interactive version of the web page of your target website, and on the bottom will be a blank spreadsheet. As you start to fill in the spreadsheet data, Intellisense helps you select the valid fields, presenting a list of possible values. As you work through the spreadsheet, Power Query is learning which values in the website relate to the table, and by the time you have completed a second row value, Power Query will be able to recognise the pattern of the web data and auto populate the remaining rows. Then just click “OK” and load the data into Power BI.
NOTE: While the New Web Table Inference feature has an amazing ability to scrape data from these complicated websites, one limitation that currently exists with the technology is that it is only capable of scraping from what is shown on the default web page instance. So, in this example, even if you were to click and expand the data set to 100 rows, the Power Query engine is still only going to detect the first 10.
Refresh Your Query in Power BI
Finally, in the case that Power BI loaded empty tables after completing the Get Data wizard. Simply refresh your query to scrape the page again.