Extracting Data from Complex Web Pages with Power BI - Excelerator BI

Extracting Data from Complex Web Pages with Power BI

Back in 2018 I wrote an article showing how to extract data from websites using Power Query in Excel. For simple websites this process is very straightforward. For more complex websites the process is not always that easy. Microsoft is working on improving the “from web” experience and is building new capabilities. New features in Power Query will always be delivered as part of Power BI before being made available in Excel. In this article I will cover a new Power Query feature that is only available in Power BI that makes the process of extracting data from complex web pages far easier. The new feature is called “new web table inference”.

Extracting Complex Web Page Data

Extracting tabular data from a web page previously only worked well if there was an underlying HTML table presenting the data onto the screen. The “new web table from inference” feature is a step change improvement in the way Power Query extracts data.  In short, Power Query looks at the final rendered web page on the screen, then it analyses the underlying code (php, css, js etc) to work out how to extract the correct data.   This feature is still being developed and hence it wont work for all web pages, but it is pretty damn good already.

Before you can use this feature, you need to turn it on in the preview features settings.  Go to File\Options\Options and Settings\Preview Features and then turn it on as shown below.

new w

An Example of the New Web Table from Inference

In this article I will show you how to use this new preview feature to extract data from websites using the same example that I used in my previous article. The data I want to extract is the number of views of the posts by topic in the PowerPivot forum that I own and operate. When I searched the website for my posts, I found 1,067 matches.  As you can see in the image below, these posts are delivered to me 10 posts per web page across 107 web pages.  For this demo, I will extract a table containing the forum name, the topic and the number of views (#2 below) from the first page.

Ppf Posts 1

Of course it is possible to then create a function and extract all 1,067 articles, but I have covered that in other posts.

Before Turning on New Web Table from Inference

If I try to extract the list of forum articles above without using the new preview feature, this is what Power Query presents to me.

Before

Note how Power Query cannot identify the tabular structure of the page and instead shows a single table (#1 above) that is simply some HTML technical data (#2 above).  This standard approach is not going to solve the problem and hence I will need to use the new preview feature instead.

Enter the New Web Table from Inference Preview Feature

With the new feature, there are 2 options to extract the table I am after.

Suggested Tables

As you can see in the image below, there are now 7 new suggested tables that could be imported.  This is the new feature – it infers the tables by analysing the page.  Now if one of these tables contains everything you need, then you can simply start with that suggested table and make the required changes from there.  This will work fine as long as one of the suggested tables contains a superset of everything you need, missing nothing.  If at least 1 column of data is missing, you will need to use the other approach and build your own table.

Suggested

As it turns out in my demo example, I could have used Table 1 above, but instead I will use the second approach.

Using the “From web by example” option

This option allows you to build your own table by showing Power Query what you expect.  This feature has been available for a while already, but it is being continuously improved.  Even if you used it before and it didn’t work for you, it is worth taking another look.

To use the manually created table, click on the “Add table using examples” in the bottom left corner (#1 below).

Ppf Posts 2

After clicking you will be presented a spreadsheet style layout where you can type in examples of what you want to see in the final table (see the bottom half of the image below). Note that Intellisense helps you and presents a list of possible values as you start to enter data – simply select the one you want from the list of options presented.

Ppf Posts 3

In the image above, I am extracting the forum name (#1 above) into column 1.  To do this I need to type the exact name of the forum, then pick the correct representation I want from the list. Try to avoid typing these out in full – any spelling error or typo will prevent Power Query from completing the task hence it is better to pick from the list.

Once this is done, I did the same for the topic (#2) and views (#3) as can be seen below.

Ppf Posts 4

It is common that you will need to enter a second row of sample data so that Power Query can be sure of the pattern of the web data. You should choose the next consecutive entry on the web page, otherwise the process will not correctly extract all the data for the page.  The image below shows that I have selected the values for the first 2 columns in the second row – Power BI filled in the third column value and also the rest of the table (the other 8 rows).  Once the table is auto populated with the correct values, you can be sure that Power Query has correctly identified the pattern.

Ppf Posts 5

Now you can just click “OK” and load the data.

What next?

I really like this feature and I think it is a great example of how Microsoft is improving the end user experience by using artificial intelligence techniques under the hood.  You can expect a lot more features like this in the future, I think.

15 thoughts on “Extracting Data from Complex Web Pages with Power BI”

  1. Hi Matt,

    Is this feature still available in Power BI? If not, any pointers on what could be the other way of extracting data out of large html table embedded on a web page?

    Thanks!
    Sharman

    1. Power BI is constantly improving. I’m using the latest version (Nov 2020) and it finds the table just fine on its own without the need for me to manually specify the one I want. It extracts the first 60 players. You can get the next 60 by changing the URL parameter – this can then be done with a parameter by passing the numbers 60,120,180,240 to the main query.

  2. Daniel Aubert Damico

    Yes, I did. Unfortunately I always get refresh errors 🙁
    I hope they make this GA soon then.
    Thank you

  3. Daniel Aubert Damico

    Hi Matt, excellent post as always.
    My question is, once you publish this to the PBI service, you can’t refresh the dataset automatically, correct?
    Every time I use the “add table using examples” it works fine in PBI desktop but it doesn’t work in the service (dataset refresh).

    Do you have a workaround for this?

    1. I didn’t realise this. Have you tested it lately? I assume they haven’t migrated the code into the service as yet, but that situation doesn’t normally persist for long. Certainly once it is out of preview it will work.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top