Archive for Power Query

Use Power Query to Compare Database Records

I was helping a user on community.powerbi.com this week. The user had a list of service desk tickets for multiple customers and the requirement was to identify when a customer opened a second new ticket within 7 days of closing a previous ticket.  This is not an uncommon scenario so I thought it would make a good blog article.

image

One of the challenges that new users have when using Power BI is to decide if they should use Power Pivot (DAX modelling) or Power Query (PQL shaping) to solve each problem.  And if the answer is DAX, then they also need to decide if it should be a measure or calculated columnThere is no one single correct answer to these questions and it always depends on the circumstances.  I generally recommend to shape your data properly prior to loading to Power Pivot, that way the DAX formulas are easier to write and the reporting database will be more performant.   I looked at the options for this problem and decided to use Power Query to shape the data before loading.  Power Pivot and DAX are not really designed to do ordinal row comparisons, so this task is better handled with Power Query.

For this particular problem I thought I would produce a video that demonstrates one way in which this problem can be solved.  Here it is.

My Online Power Query Video Training

One of my new year’s resolutions was to baton down and create an online Power Query training course.  This has been on my to-do list for almost 2 years now and I am well under way.  The video above is a sample of my video course.  I start from the beginning and walk the viewer through the foundation skills they need to know to be a Power Query guru.  Towards the end of the course I will have a stack of videos like the one above where you can broaden your awareness of how you can use your new skills in real world situations.  If you would like to sign up and be notified once the course is finished (no obligation), then please enter your contact details in the form below.

Best Practices for Power Pivot, Power Query and Power BI

Level: Beginners

There are many best practices for Power BI, Power Pivot and Power Query.  I know these things so intuitively now that it is very easy to forget how I incrementally learnt these things along the journey.  Most of these things are not “right” vs “wrong” – they are more often simply “better”.  I thought there would be value in producing a page that clearly outlines the important best practices as a learning reference for everyone.  In all cases I have outlined why it is a best practice to help the understanding.  You would be well placed to adopt these best practices as part of your DAX journey.

Naming Conventions

Naming your Columns and Measures

  • Always write a Column in the format TableName[Column Name]
  • Always write a Measure in the format [Measure Name]

This is the foundation of all the DAX formulas you will write.  Both columns and measures use the same square bracket syntax.  It is technically possible to write measures and columns both including the table name as follows.

  • TableName[Column Name]
  • TableName[Measure Name]

These 2 examples above do not cause a problem because their names makes it easy to know what they are, but if you always place the table name at the front of every measure and every column, then it will be impossible to tell them apart by reading the formula.  Take the following example

  • TableName[Total Sales]

Is the above a column or a measure?  It is impossible to tell unless you are using the best practice naming convention.

Measures and columns are very different in the DAX language.  It is essential that you can tell at a glance which is which.  In addition if you hard code a measure with the table name and then you later move the measure to another table, any formulas referring to this measure will stop working.

Give Tables a Single Noun Name

  • Don’t just accept the table name from your source system.  Preferably give the table a single word noun description/name.

Many BI data sources will have long table names like fctSalesTransactionsHistory or dimCustomerMasterFile.  This is a common practice in IT and is related to the Kimball dimension modelling methodology.  The problem is that with Self Service BI these table names, column names and measure names are more visible to business users than ever before.  Given many business users are going to be reading these table names, it is much easier for them to comprehend the “Sales” table rather than the “fctSalesTransactionsHistory” table.  In addition earlier versions of Power Pivot do not have fully featured intellisense – you are required to refer to columns by starting to type the table name from the beginning.  If every table starts with either fct or dim, you have just added 3 additional characters you have to type for each formula before Intellisense can help you.

Also PowerBI.com has a natural langauge query tool that allows you to ask quetions of your data.  If your table names are ‘words’, then you are helping the engine find what you are looking for.

Using Spaces in Names

  • Don’t use spaces in table names
  • Do use spaces in column names
  • Do use spaces in measure names

If you use spaces in table names you will be forced to add single quotes around the table name each time you refer to it in a formula.  This makes the code longer, harder to read and “untidy” (IMO anyway).  It is better to use underscore_characters or CamelCase instead of spaces (or better still use a single noun name as mentioned above).

Columns and measures always need to be wrapped in [Square Brackets] anyway and hence adding spaces does not make the code any more complex.  Columns and measures are easier to read if they have spaces

Don’t Overly Abbreviate Business Terms

  • Give your tables, columns and measures descriptive business names without overly short abbreviations.

Firstly you should use language and abbreviations that are commonly used in your organisation.  So if “Year to Date” is commonly abbreviated to YTD, then for sure you can use this abbreviation in your measure names eg [Total Sales YTD].  However if you develop a new measure called [Total Sales Last Rolling Quarter] and this is not a common concept across the organisation, then you are just making it hard for yourself if you call your measure [Ttl Sales LRQ].  You will simply have people calling you asking what it means.

Secondly Power BI has a feature called Q&A that allows a user to ask a natural language question about data.

eg.  What were the total sales for bikes last year

This natural language algorithm looks for matches in words in the question against the data model to help answer the question.  If you abbreviate your measure names to for example [TtlSales] instead of [Total Sales], you are making it hard for Q&A to do its work.  You can help Q&A using synonyms, but do yourself a favour and don’t over abbreviate your name.s

Measures or Calculated Fields

  • Measures is a better name than Calculated Fields

The term measures has been a Microsoft term for a BI formula for many years.  In the first release of Power Pivot in Excel 2010, Microsoft adopted this term.  Unfortunately in Excel 2013 somehow a decision was taken to rename “measures” to be called “calculated fields”.  This was a bad decision and thanks to lobbying from many people Excel 2016 reverted to using the term measures (as does Power BI).  I always now use the term measures and never refer to Calculated Fields unless I am explaining to Excel 2013 users why they are stuck with a bad name.

Loading and Shaping Data

Push Shaping as Close to the Source as Possible

  • Wherever possible, you should do your data shaping as close as possible to the data source.

There are many ways that you can shape your data in the Microsoft BI stack.  Power Query is a great tool to reshape your data however you can also use Power Pivot (Calculated Columns, Filters on load) and Power BI also includes Calculated Tables.  And you can always write SQL code and paste that into the tools to extract the data that way.  The main problem with these approaches is you are effectively hard coding a solution for a single data set.  If you want to build another data set in the future, the work needs to be done again (either copy or re-write).  The data shaping tools are designed to allow you to do what ever you need without having to rely on a third party – use these tools if you need to.  However if you have a common need for data in a particular shape and you can get support (from IT or otherwise) to shape your data at the source so you can easily get what you need, then there is definitely value in doing that.

Shape in Power Query, Model in Power Pivot

Power Query and Power Pivot were built to do 2 completely different tasks.  Power Query is built for cleansing and shaping while Power Pivot is built for modelling and reporting.  It is possible that you can shape your data in Power Pivot (eg you can write calculated columns, you can add calculated tables (in the newer versions) etc).  But just because you can do these things in Power Pivot, doesn’t mean you should.   For example it is possible to write letters to people using Excel, but Word is a much better tool for this task (I knew someone that once did that!).

Best practice is that you should use Power Query to shape your data before/during load, and then use Power Pivot for measures and reporting. I have deeper coverage on this topic here.

Use A Calendar Table

  • If you want to any sort of time calculations, get a Calendar table

It is possible that you can analyse your data in a single flat table without using any lookup/dimension tables.  A Calendar table is a special type of lookup/dimension table because it can be used to perform time intelligence functions.  I have an article on time intelligence here and another on Calendar tables here.  Bottom line – get a Calendar table.

A Star Schema is Optimal

  • Power Pivot is optimised to use a Star Schema table structure

I have an in-depth article about star schemas here that you can read if need be.  I am not saying this is the only layout that will work, or that other designs will always be slow.  I am saying that if you start out thinking about a star schema and aim to build that design you will be well under way to success.  Two key things you should know.

  • Don’t just bring in what is in your source transactional database – that would likely put you into a world of pain.
  • There is no need to create a lookup/dimension table just for the sake of it.  If your sales table has customer name and you don’t care about anything else about the customer (eg city, state etc), then there is no need to create a lookup table just for the sake of creating a star schema.  If you have 2 or more columns relating to the same object in your data table, then it is time to consider a lookup table.

You Should Prefer Long and Narrow Tables

  • Short wide tables are generally bad for Power Pivot but long narrow tables are great.

image

There are 2 main reasons why loading data this way is a good idea.

  • Power Pivot is a column store database.  It uses advanced compression techniques to store the data efficiently so it takes up less space and so it is fast to access the data when needed.  Simplistically speaking, long narrow tables compress better than short wide tables.
  • Power Pivot is designed to quickly and easily filter your data.  It is much easier/better to write one formula to add up a single column and then filter on an attribute column (such as month name in the green table above) than it is to write many different measures to add up each column separately.

Only Load the Data You Need

  • Load all the data you need, and nothing you don’t need.

If you have data (particularly in extra columns) you don’t need loaded, then don’t load it. Loading data you don’t need will make your workbooks bigger and slower than they need to be.  In the old world of Excel we all used to ask IT to “give me everything” because it was too hard to go back and add the missing columns of data later.  This is no longer the case – it is very easy to change your data load query to add in a column you are missing.  So bring in all of what you need and nothing you don’t.  If you need something else later, then go and get it later.  It is even advisable to use a tool like PP Utilities to show you which columns are not in use so you can delete them.  Focus mainly on your large data tables – the lookup/dimension tables tend to be smaller and hence are generally less of an issue (not always).

Don’t use Linked Tables

It is possible to add your data to a table in Excel and then use a Linked Table to load it into Power Pivot.  You simply select the data, go to the Power Pivot Menu (1 below) and click Add to Data Model (2 below).

The trouble with doing this is that you end up with 2 copies of the data in your workbook. The Excel table is an uncompressed copy and then you also have a compressed copy inside Power Pivot.  In the example (shown as 3 above) there are many thousands of rows of data.  The uncompressed data can take up 6 – 10 times more space than the equivalent compressed data.  If you have a small table with a couple of columns and 10 or 20 rows then it is fine.  However if you have anything more than that you are better to have 1 workbook for your data and then import the table directly into Power Pivot without storing it in Excel at all.

Modelling

Avoid Bi-Directional Relationships

  • Avoid using the default bi-directional relationships in Power BI unless you need them.

image

Microsoft introduce bi-directional filter propagation in Power BI (this is currently not available in Excel).  This is a good feature for beginners and those that don’t really understand how the plumbing works.  But it comes at a cost in that:

  • The performance can be negatively affected
  • If there is more than 1 data table, there can be circular relationships created (just like in cells in Excel)

I recommend you turn your bi-directional relationships to single direction (double click to change) and only turn them back on if you really need them.

Measures are Better than Calculated Columns

  • Wherever possible you should write Measures in Preference to Calculated Columns Where Possible

I have been a strong proponent of this for many years.  It mainly applies to Excel users that don’t have any formal learning about database design.  It is very easy for an Excel user to think they should write a calculated column (because they know how to do that) and not a measure (because that is a bit foreign to an Excel user).  I am not going to cover this in depth again now as I have already covered it here.  Just do yourself a favour Excel folk and start with the assumption that a measure is what you should write unless you know why a calculated column is a better fit.

For the record there are times when a calculated column is the best option, but 99.9% of all use cases coming from new Excel users won’t need them.  The main exception is if you need to use the formula in a slicer to filter your data – then you will need a column.

Store Measures in the Table Where the Data Comes from

  • You have a choice where to store your measures, so place them in the table where the data comes from.

Take for example a measure like this.

Total Sales = SUM(Sales[Extended Amount])

The data for this measure is coming from the [Extended Amount] column in the sales table.  You should therefore store the measure in the sales table.   If you place it in (say) the Calendar table, Excel will likely give you a warning similar to shown below.

image

This issue doesn’t occur in Power BI.

Break Measures into Interim Parts

  • Break your DAX problems into manageable pieces and solve each piece one at a time.

You wouldn’t use a single cell in a spreadsheet to build a financial model.  The cells are there to be used and make it easier to build a solution that meets your needs.  You should think the same way about measures.  The following formula is valid however it is hard to read, write and debug.  It also repeats the same line of code multiple times throughout the measure.  Having said that it will give you the % change in sales vs last year.

 

It is much easier to write interim measures and then join the pieces together to solve your problem.  Plus you get each interim measure available for reuse elsewhere in your model.  I am sure you will agree the following set of 4 measures are much easier to understand.

Total Sales = SUMX(Sales,Sales[Qty] * Sales[Unit Price])

Total Sales LY  = CALCULATE([Total Sales],SAMEPERIODLASTYEAR(Calendar[Date]))

Chg vs LY = [Total Sales] – [Total Sales LY]

% Chg vs LY = DIVIDE ( [Chg vs LY], [Total Sales LY] )

Don’t Break Calculated Columns into Interim Parts

  • It is good to have interim measures but it is bad to keep interim columns.

Interim measures are calculated on the fly on demand, they take up little space and make it easier to write your DAX.  As with measures, I t is easier to write calculated columns using interim calculated columns, however the problem is that every column is pre-calculated and stored on disk, and each additional column makes the data model take up more space on disk and memory and hence makes it less efficient.   By all means write interim columns if you need to in order to create a calculate column (not withstanding the earlier comments of columns vs measures) however once you have worked out the correct syntax, concatenate all the code into a single “Mega DAX formula” in a single column. This is an Excel concept I learnt from John Walkenbach.

Other Advice

You Can’t Start to Learn DAX by Reading Alone

I say this up front in my book “Learn to Write DAX”.  If you think you are going to learn a new skill like Power Pivot, Power Query or Power BI by reading a book and not getting your hands in the system, let me tell you “you can’t”.  The exception is if you are a professional SQL Server database user and have a solid background in reporting and analytics database technology, then I am sure you can learn this way. For the rest of us Excel folk, there is no substitute for practicing what you read – so do yourself a favour.

Use 64 bit If You Can

This one causes a world of pain for many people.  Power Pivot is the ONLY MS Office product that can benefit from 64 bit, but unfortunately it is all or nothing.  Most organisations have deployed 32 bit and will be very reluctant to give you 64 bit Office.  You can read my article on this topic here and also read up about my work arounds including installing 64 bit Power BI Desktop with 32 bit Office, and then also using Power BI Desktop as a local server for Excel.

Cleansing Data with Power Query

Today I am combining a few techniques to show how to build a robust cleansing approach for your data using Power Query.  This article will demonstrate the following Power Query techniques

  • Joining tables (Merge Query)
  • Self Referencing tables
  • Adding Custom Columns

Sample Data

I am using some sample data as shown below.  The Country data is the data I want to cleanse so there are consistent country values in the Country column.

image

I first load this data into Power Query and set “Close and Load to” so that it only creates a connection.  I then have Power Query access to the source data table that can be reused multiple times.

image

Create a Table of All Entered Countries

The next step is to produce a list of the unique entries in the source data.  To do this, I referenced the above table by right clicking (1 below), then Reference (2)

image

I then removed the Name column (1 below) and then removed the duplicates in the remaining Country Column (2 below).

image

I then sorted the column before loading the data back to Excel as shown below.

image

Create the Substitutes with a Self Referencing Table

I first covered this trick of a self referencing table in this blog article.  Continuing on with my test data above, I have manually added a new column to the green table above and entered the substitutions as needed.  Where there is no substitute needed I have simply left the substitute blank.

image

As it currently stands above, if the source table is sorted and then the green table is refreshed, the substitute column will fall out of alignment with the source table – that is bad.  Now for the self referencing trick.  I load this green table above back into Power Query as a new query, setting it to only create a connection as I did for the original set of data. I named this new query “Substitutes”

I then edited the original query that created the original green table and joined it to the substitute table (aka self referencing).  While editing the Entered Names table, I selected Merge Queries (1 below), selected the Substitute table (2), then joined on the Country Columns (3, 4) with a left outer join (5).

image

I then expanded the new column to extract the substitutes.

image

 

This now gives me a self referencing table that will refresh while keeping the countries and substitutes correctly aligned regardless of what happens to the source table.  This is the key point about a self referencing table (covered in my previous article).

image

Update the New Data

Now when new data is added to the source table, it will appear in the substitute table after refresh as follows.

image

What would be much better is to be clear which are the correct entries that have previously been checked so it is more obvious which are the new entries.  So I decided to enter the correct country name for each entry as follows.

image

It is now much clearer that there are 3 new entries that need to be maintained, and I then updated the new data below.

image

Use the Substitute Table

Now that I have this substitute table working for me, all I need to do is create a new copy of my source table that uses the substitute country rather than the original country entered.  I create a new query referencing my original source data (right click, reference).

Merge the queries as before, joining on the original country column.

image

expand the table to retrieve the substitute column

image

Then delete the original country column and rename the substitute column to be called “Country”

image

As you can see below, the substitute table is correctly remapping the data as needed.

image

When the new data is added to the source table, the substitute table will have blank entries as shown below.

image

As a nice touch, I went back into the final query and added a new custom column as follows

and then used this new column as the country column.  Now when I refresh the data, it is clear that I need to do some maintenance on the substitute table.

Data Modelling with OneDrive Source Files

Last week I shared how it was possible to easily create an online survey using Excel for O365.  Once such a survey is created, the data can be imported into PowerBI.com and used to create a simple report.  The biggest issue with directly connecting to OneDrive like I showed last week is that you can’t do any serious data modelling.  You can’t write any DAX measures or join other tables into the data model etc.  Not being able to build a proper data model with my data set last week created the following issues (I am sure there are more).

  1. I couldn’t easily create a sort order column for my “how long have you been reading” question (hence why I needed to prepend a letter A, B, C etc to get the answers to sort).
  2. Without data modelling, there was no easy way for me to cleans the data. eg Some people said they came from “United States” and others said US, but that didn’t work in my map (it had to be consistent).
  3. image<

  4. I couldn’t write proper measures such as countrows(survey) to find out how many surveys there were.  Instead I had to rely on “Count of Name”.  If someone left the name field blank, it would not be counted.
  5. I couldn’t change the column names to be something more suitable for reporting – I was stuck with the actual survey questions in the visualisations.

All of these issues (and more) are solvable if you can simply build a regular data model, but this is not possible when you directly connect PowerBI.com to a flat table of data in the cloud.

This article explains how you can create a data model with a cloud based service (like OneDrive) and also how I then went about improving the reporting from last week given I could crate a proper data model.

Connecting Power BI Desktop to OneDrive

The alternative to connecting PowerBI.com to the OneDrive file is to use Power BI Desktop instead.

  • First build a data model using Power BI Desktop connecting to the cloud copy of the OneDrive file.
  • Publish the Power BI Desktop file to the service in the cloud (after you have done your data modelling of course)
  • Configure Power BI cloud service to directly authenticate to OneDrive in the cloud so that it can directly access the OneDrive data without further configuration or gateways.  Power BI automatically refreshes the data for you approximately once every hour or so.

Let me demonstrate how to do this using the same survey from last week.  If you haven’t already completed the survey you can do so by clicking here and then see you survey response included in the reporting below.  You can read about how to create an Excel survey like this from the blog article last week.

Copy the URL to your OneDrive File in the Cloud

There are a few ways to do this but the way I did it was to copy the URL of my OneDrive XLSX file and then use that to import the data “From Web”.

Get the URL

The tricky bit is finding the correct web URL of your file on OneDrive.  The way to do this is to open the Excel file in Excel on your PC and extract the URL from there.  You can either open the file from OneDrive in the cloud or the copy on your PC (I used the one on my PC).

Once you have the file open in Excel, click on the File menu, then Info (1 below), click on the OneDrive Link (2 below) and then click “Copy link to Clipboard” (3 below).  This placed the link into the clipboard.

image

With the URL still in the clipboard, I created a new Power BI Desktop report using Get data, Web as shown below.

image

I then pasted the URL from the clipboard and manually deleted everything from the question mark to the right (highlighted below).

image

So I was left with the following

image

I then got the following complicated set of options for Authentication. I took the last option to sign into my Organizational Account, and signed in with my Office 365 credentials

image

image

After authenticating, I was able to import the table of data into Power BI Desktop and Load as normal.

image

Load to Power BI and Configure Credentials

I will continue the process of building out my data model further down the page, but first let me complete the process to publish and configure the online credentials process.

First I saved the query (close and apply), saved the workbook and published it to my PowerBI.com account.  Once the dataset was in the cloud, I needed to configure my credentials for the dataset.    I clicked on Settings (1 and 2), Datasets (3) and then selected my dataset (4 below)

image

Because this is an online data source, I was able to edit the credentials as shown below.

image

After clicking Edit Credentials, I selected OAuth2 and signed in.

image

And I then got this confirmation.

image

Improving the Reporting Using Data Modelling

Back to the data modelling in Power BI Desktop.  The first thing I did was turn off data load for the table I had previously loaded.  This was quite easy. Right click on the table, then uncheck “Enable Load”.  This keeps the connection to the table but doesn’t load it in the current form.

image

The rest of this article explains how I changed the data to make it more useable in my final report.

Cleanse the Country Data

The next thing I did was create a data cleansing table for the list of countries. Ideally the survey tool would provide a list of all valid countries to select from a list, however this is not available using the simple online Excel survey tool I am using.  Instead I created a substitution table where I can manually create substitutes for data that I wanted to change.

I created a new query with reference to the first.  Right click on the first query (1 below) and then select Reference (2 below).

image

I then deleted all columns other than the country column, removed duplicates and sorted the column.  Finally I set it to disable load.  It is now possible to see all the data points entered and identify the dirty data that was included (shown below).

image

I then created a manual mapping table to make it easier to control the data cleansing process.  To do this, I simply created a new manual table in Power BI Desktop as show below (and also turned of data load).

image

Sorry to the non descript North American if you happen to be Canadian, Mexican or of some other Nationality.  I decided to map this to the USA.

Now I had the mapping table, the next step was to replace the original country data with the cleansed data. I created a new query referencing Table1, then joined on the country columns to the CountryCleanse table.

image

Using a Left Outer Join as shown below

image

I then expanded the new column to extract the cleansed version of the data.

image

This gave me a new column that contains the substitutes in the case there was a match in the cleansing table.

image

I then created a new custom column that uses this cleansed column if it exists, otherwise the original column.

image

image

I then deleted the other 2 columns and renamed my new column so that I had a cleansed copy of the Country.

Cleanse and Sort “How Long” Answers

In my survey question, I asked how long had the participant been reading my blog.  When the answers were placed in Power BI they didn’t sort based on length of time (they sorted alphabetically).  To fix this (using the original approach I blogged about last week) I was forced to add a prefix to force the answers to sort logically.  Using Data Modelling, it is possible to solve this problem properly.  For variety (in this blog), I decided to use a different mapping table solution. I first created a mapping table with the original survey choices, a new set of labels and a sort column as follows.

image

Then I loaded this mapping table as a lookup table to my data model and joined it on the Original column as the key.

image

Then to avoid confusion in the final report, I hid the original columns in both tables.  Right click on the column (1 below) and then select Hide in Report View (2 below).

image

Of course I sorted the Tenure column

image

Adding “All of the Above” into the other 3 choices

The last problem I wanted to solve was that many people chose “All of the above” when asked about their favourite topics.  When this was put in a visual (shown as 1 below) it is not possible to see the overall most popular item.

image

There are many ways to handle this problem.  The measure I wrote was as follows:

Count Most Useful =
COUNTROWS ( Data )
    + CALCULATE (
        COUNTROWS ( data ),
        Data[Which topics do you find most useful?] = "All of the above"
    )

The measures simply adds the count for each item and also adds “All of the above” as well (shown as 3 above) and I then simply filtered out the “All of the above” answer from the visual (shown as 4 above).

My report was starting to look good (see below) but I still wasn’t happy.

image

Ideally I should be able to show “what is most useful” and also “What do you want to see more of” in a single bar chart.   But the problem was that there were 2 different measures and the “bars” are coming from 2 different columns in the data source.  This could not be solved without thinking differently about the problem.

I created a new table that has the 3 responses and loaded it to the data model.  I didn’t connect this table to any other tables in the data model.

image

I placed this new column of data as the Axis of my visual and place the 2 measures as Values.  This then gave me a chart that looked like I needed, but of course there was no filtering, because the new “Types” table is not connected to anything in the data model.

This is important:  The Axis is part of the filter context, the Types table is being filtered by the visual, but this filtering is not being used in the measures I wrote.

image

So I changed the measures so they would “harvest” the value on the Axis to filter the results.

Count Most Useful =
IF (
    HASONEVALUE ( Types[Product] ),
    CALCULATE (
        COUNTROWS ( Data ),
        Data[Which topics do you find most useful?] = VALUES ( Types[Product] )
    )
        + CALCULATE (
            COUNTROWS ( data ),
            Data[Which topics do you find most useful?] = "All of the above"
        )
)

The final visual looks like this, which is much easier to read and understand.

image

And here is the final report

Easy Online Surveys with Power BI Reporting

I think today’s article will be of interest to my readers even though it is a little astray from my normally pure Power BI, Power Pivot and Power Query content. I will show you how to quickly and easily create an On-Line Survey that you can distribute to anyone that has an Internet connection, and then use Power BI to report on the results from your survey.

Office 365 Excel Surveys

You may not be aware that O365 has an inbuilt Excel survey tool that is very easy to use.  And who wouldn’t want to use Excel to solve a problem if they could?  You can find the survey tool by going to OneDrive online.  I am using OneDrive for business for this demo, but it also works for OneDrive Personal.

Here is Survey – Please Participate

For demonstration purposes, I have created a survey to ask my readers what they would like to see more of on my website and to see which countries the readers live in.  Edit 7 Feb 2016.  The survey is now closed, but you can still read on.

image

Now that you have completed the survey, read on to see how you can do the same yourself and to see the latest results via an embedded a Power BI report at the bottom of the page.  Note that it can take up to 1 hour for your specific survey response to be available in the report (ie it is near real time – not real time).

How I Created the Survey

First Open OneDrive Online

Frankly I get confused with all the logins for Microsoft sites. The easiest way I know how to login to OneDrive Online is to Open Windows Explorer, right click on the OneDrive folder and then select View Online.

image

Create Your Excel Survey

Creating an Excel survey couldn’t be easier.  Just click New\Excel Survey as shown below.

image

I immediately got some survey header information to fill in as shown below.

image

There are no user manuals required to work out how to do this.  I just updated the information and moved on to the next section.  It is easy to enter the questions you have, and there are a number of intuitive response types that you can choose to meet your needs.

image

Survey Results Database

After you have completed your survey, you can see the underlying Excel spreadsheet that stores your survey results.  It is possible to add new columns to the right of the results table.  Additional columns added manually can be used to store information you want to add without it being part of the survey.

image

Connecting the Data to PowerBI.com

There are a few ways you can create a Power BI Report.  If you want a fully featured data model including the ability to import other tables of data and write your own DAX, then you need to import this new data source into Power BI Desktop as you would with any other report.  For this demo today, I have use the “Get Data” feature from directly inside Power BI Service to connect to the Excel workbook on OneDrive.  To do this, I first logged into PowerBI.com and selected get data from files as shown below.

image

Then I selected OneDrive – Business as shown below.

image

Then I selected my survey file and clicked Connect

image

I was then prompted to either import the data or connect.  The connect option works just like Excel Online – not what I wanted.  So I selected Import.  When you use import in this way, Power BI will check your OneDrive account about once each hour to check for new data and will automatically update your report.

image

Power BI then created a new Dataset for me (1 below).  You can see the table of data shown in 2 below.  It also created a Dashboard, however this was not much use so I simply deleted the dashboard.

image

Here is the Final Power BI Report

Once I had the data connected, I simply created a few simple visualisations using the capabilities of Power BI Service.  You can see the final interactive report below.

Be sure to read next week’s post where I cover how you can use Power BI Desktop to model your online data. http://exceleratorbi.com.au/data-modelling-onedrive-source-files/

Import Tabular Data from PDF using Power Query

Today I am sharing a process I developed that allows you to import tabular data from a PDF document into Excel (or Power BI) using Power Query.  I didn’t want to purchase software to do this task so I started experimenting on how I could do it with the tools I already have, and I am sharing my solution here today.

pdf-to-pq

Note:  This will only work for tabular data in a PDF – exactly the same as downloading from a Web Page must be in tabular form.

I also demo some neat Power Query tricks further down in this post, so make sure you read through to the end.  These tricks are not hard, and you may be surprised at what you can do.

Process Overview

Here are the steps I use to grab the table from the PDF.

  • Open the PDF in Microsoft Word.
  • Save the file as a single file web page.
  • Import the single file web page into Power Query as HTML.
  • Go through the normal cleansing and transformation process you would normally do (plus of course the cool tricks I use below).

Worked Through Example

I did a quick Google to find a PDF containing something I could use in my demo from the Web.  I found this PDF that contains a list of US States with their abbreviations

www.siue.edu/postal/pdf/Abbreviation-List.pdf    Note the data I want from this file is in tabular format.

image

I saved the PDF to a known location on my PC.

Convert the PDF to a Web Page File

I opened Microsoft Word and then I opened the PDF file from within Word.  I got this warning message below.  I just clicked “don’t show this message again” and clicked OK.

image

I then selected File\Save As and saved the file as a Single File Web Page (shown below).

2016-11-18_121653

 

Import into Power Query

There are a couple of ways you can do the next step. I found the following to be the easiest.

Create a new query that connects to a text file (example shown below – Excel 2016, but it is similar in all other versions and Power BI)

image

In the dialog, change the file filter so it can find all file types (as shown below), then browsed until I found my file.

image

Then select the MHT file created earlier and click “import”

image

Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.

image

Power Query now correctly identifies the table in the document.  I expanded the one identified as “3DTableGrid”.

image

In your own PDF documents, you may need to use some trial and error to find the right table.

I then clicked on the Table I wanted (shown as 2 above).

Cleansing the Data

The data is starting to look good at this stage.  But note there is a bit of “noise” in the file (see below). Power Query is a great tool for fixing that.

image

First I got rid of the first row (Remove Top Rows, 1).

Next I copied the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”.  Then I selected all 4 columns and did a Transform\Replace Values and replaced <= /span> with nothing.

I didn’t need the bottom 11 rows, so I did Remove Rows\Remove Bottom Rows\11

image

Now for Some Power Query Tricks

Power Query is a formula language. If you analyse each of the steps in the Applied Steps window, you will notice that each step is a formula, and the formula takes the output of the previous formula as the input step to the next formula.  The UI assumes each step will take the previous step as an input to the new step, but it doesn’t have to be that way.  You can override that if you want – I show you how below.

Turn on the formula bar before proceeding.

formula

At this point, I have my data in 4 columns, but I really want all the data in just 2 columns.

image

This is easy to fix with some simple Power Query tricks, with very little PQL coding at all.

First I renamed this step to be called All4Columns.  I like to rename significant steps in my Applied Steps window to make it easier to find the ones I need later.

image

Then I removed the last 2 columns using the UI.  Select the columns to remove, right click and then remove columns.

image

Before moving on I renamed the 2 columns to be called “State” and “Abbreviation”, and renamed the step to be called First2Columns using the same renaming approach as before.

image

Now the trick. At any point in your query, you can add a custom query step by clicking the button shown in 1 below.

image

Power Query assumes you want to add a new step that starts from the previous step. So Power Query automatically adds the previous step as the starting point (as shown below).

image

But you don’t have to keep the previous step.  In this case I want to refer to the step “All4Columns” instead.  So I just typed in the name of the step I wanted (as shown below) to replace what Power Query added for me.

= All4Columns

Now that I had all 4 columns back, I went through the following steps.

  • removed the first 2 columns
  • renamed the columns to be “State” and “Abbreviation” as before
  • renamed the step to be called Second2Columns

This then left me with 2 non-sequential steps (First2Columns, Second2Columns) that each contained half of the data.

Append The 2 Tables into a Single Table

I didn’t know how to write the code to do this, so I just used the UI again.  I selected “Append Queries” and appended the query to itself.  That forced the UI to write the code for me as shown below.

image

Not only did the UI write the code for me, but I learnt a single PQL function that I think I can probably remember Table.Combine( ) and hence I may be able to do the same step by hand next time around (this is how you can learn too).  Then all I needed to do was change the code above so that instead of appending to itself, it would append the First2Columns to the Second2Columns.  You don’t need to be a rocket scientist to work out how to do this :-).   It looks like this after I made the change.

= Table.Combine({First2Columns, Second2Columns})

Wrap Up

I hope you found this article useful, and you have learnt a few things.

  1. How to import data from a PDF
  2. How to refer to non sequential rows in a query
  3. How to use the UI in Power Query to help you write more advanced queries.
  4. How to refer to the Formula Bar to start to build your knowledge of the Power Query Formula Language.

If you are into Power Pivot and you haven’t already done so, you may like to download my paper “Ten Things I Wish I Knew When I Started With Power Pivot” from the link below.

Edit: 24th Nov 2016
A friend of mine Dave Marriott was using my approach above but had a stack of PDFs that he needed to access. He decided to write some Word VBA code that automates the task of converting the PDFs to MHT files. Dave is a professional VBA programmer based in Adelaide, Australia if you would like to hire him to help you be more efficient in your business. You can contact him at LinkedIn

Copy this VBA code into Microsoft Word.  Why not add it in your Normal Template so it is always available?  If you haven’t done this before, you can follow my instructions on how to do this in Excel (it is exactly the same process in Word).  Run the code to launch a dialog. You can multi-select as many PDFs as you like and the code will convert them all for you.

Sub ConvertToMHT()

Dim f_dialog As FileDialog
Dim FileChosen As Integer
Dim FileName As String
Dim i As Integer
Set f_dialog = Application.FileDialog(msoFileDialogFilePicker)
f_dialog.InitialView = msoFileDialogViewList
f_dialog.AllowMultiSelect = True

FileChosen = f_dialog.Show
If FileChosen = -1 Then
     For i = 1 To f_dialog.SelectedItems.Count
          Documents.Open FileName:=f_dialog.SelectedItems(i)
          ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive
          ActiveWindow.Close
     Next i
End If

End Sub

Find Duplicate Files on Your PC with Power BI

Level: Beginners

If you want to learn new skills using a new tool, then you simply must practice.  One great way to practice is to weave the new tool into you daily problem solving.  If you have something meaningful to do with the new tool, then you are much more likely to be motivated to practice.  And the new tool I am talking about of course is Power BI.

Last week I showed how easy it is to use Power BI to help you track down large files saved in Dropbox so you could manage the overall space usage.  As a result of that article, Graham Whiteman posted a comment suggesting it would be a good next step to find duplicate files.  I think that is a great idea, so I decided to test it out on my PC.  Read on to see how I did it, and how you can do it too.

Create a Query to Fetch All PC files

I started a new Power BI Desktop file, then connected to my PC documents folder

image

image

I immediately selected Edit query as shown in 1 below.

image

The only time you would immediately select Load (2 above) is if the data you are imported is already in the correct shape for Power BI.

The only columns I need are the file name, date modified, attributes and path (shown below).  I Multi selected the columns I wanted to keep, then I right clicked and select “remove other columns”.

image

The next step was to extract the file size from the attributes list. To do this, I expanded the list of attributes (1 below), deselected all the columns and then reselected the file size (3 below).

image

Then I renamed the query (1 below) and changed the query so it didn’t load to Power BI by right clicking on the query and un-checking the enable load option.

image

This created a query that links to the PC, keeps the columns of data needed but didn’t load anything to Power BI yet.

Create a New Query that Accesses the File List

The next step was to create a new query that references the File List.  I right clicked on the first query (1 below) and then selected Reference (2 below).  Note how the File List query is shown in Italics indicating that it won’t load to Power BI.

image

The next step was to merge this data with the itself by going to the Home Ribbon and selecting Merge Queries.

image.

In the Merge Queries dialog, I joined the list of files File List (2) with the original query File List so that it was joined to itself on 3 columns (the File Name, Modify Date and File Size) but not the File Path as shown below.

join file list

The above steps added a new column to the query.  I then expanded the new column as shown below making sure to keep the original column name prefix.

image

Find The Duplicate Files

The second query now looked like this.  As you can see in the image below, the query returned all the files (name column) along with the folder paths from the query “File List” shown as 1, and a second column containing the folder paths from the query “File List (2)” shown as 2 below.

image

The next step was to get rid of all rows in this query where the 2 folder paths are identical.  Doing this is easy with a custom column.  I added a custom column (steps 1 and 2), and wrote a formula to return TRUE if the 2 folder paths were identical.

image

I then filtered out everything that returned a TRUE in the new column using the filter button as shown below.

remove matches

I then deleted this custom column as it was no longer needed.  I just right clicked and selected remove.

Format the Number Columns

It is very important in Power BI to set the number formats before loading the data.  Any numeric column that has a data type “Any” should be changed to a suitable numeric format (as shown below).

image

I did this, renamed the query to be called “Duplicates” and then selected  “Close and Load” to get the data into Power BI.

Time to Write Some DAX

Now the data is loaded, you of course I could just drag the one or more of the columns to the Power BI canvas.  But remember half the reason of doing this is to get some new skills.  So instead of dragging the Size column and creating an implicit measure, I wrote some DAX – it isn’t hard to get started with such simple formulas.  Here’s how to do it.

Select the Size column, go to the Modelling Ribbon and select New Measure.

image

The formula I wrote is as follows

File Size MB = sum(Duplicates[Size])/(1024 * 1024)

image

Note a few things that I was able to do by writing this measure myself

  1. I converted the units of the result from bytes to megabytes by dividing by (1024 x 1024).
  2. I gave the measure a more meaningful name “File Size MB”
  3. I was able to set the formatting to comma separated with 1 decimal place

And of course I practiced my DAX.

And the Results

I simply then added the file size, File Name, Folder Path and Second Folder Path to a table in Power BI like shown below. image

I then discovered I had around 9 GB of duplicate files on my PC.  I sorted the table by File Size descending and discovered that I had multiple identical copies of a contoso.pbix.  It looks above like there are 6 copies of contoso.pbix but this is deceiving. Every copy of a file will find a match with every other copy.  If  you note in the Folder Path column, there are only 3 unique folder paths, hence 3 files.

The next thing I did was add a Tree Map as shown, with the file name in the Group section and File Size MB in the Values section.

image

To find out accurately how many copies of each file there were, I had to write some more DAX.  This formula is a bit more involved (intermediate DAX).

2016-10-31_120903

Let me explain this formula starting from the inside out.  There are 4 functions in this DAX formula and I describe their role below.

  1. SUMX is an iterator.  It iterates over a table specified as the first parameter (VALUES in this case).  You can read more about SUMX here.
  2. The VALUES function returns a table of unique file names (in this case it is unique values in the column Duplicates[Name]).  So SUMX above will iterate over each file name in the name column.
  3. SUMX is iterating over a Virtual Table (VALUES).  The CALCULATE is required to force context transition.
  4. Then for each file name in the table (in 2 above), DISTINCTCOUNT will count how many unique folder names there are.

I then added the new File Count measure to the Colour Saturation section of the Tree Map (1 below).  This does 2 things.  Firstly it shows the high folder count files as being a darker colour, and secondly it adds the file count to the tool tips (visible when you hover the mouse over the visual).

image

And Now Some Fun

I’ve been looking for an excuse to do this for some time.  I want to find the fattest fish in my pond (aka most space taken by file name).  I went to visuals.powerbi.com and downloaded the Enlighten Aquarium custom visual.

app.powerbi.com/visuals/show/Aquarium1442671919391

I then imported the custom visual into Power BI Desktop

image

The I copied my Tree Map visual (Ctrl-c, Ctrl-v), selected the copy and changed the visualisation to be the Aquarium.  This visual is showing the largest individual files regardless of location or how many copies.  I am not saying this is the best way to visualise data, but surely it is one of the most creative.

fish

Here is my final workbook canvas

image

For the purists out there, I wrote a new file size formula as follows.

final

 

The original formula I wrote double counts the file size when there are multiple duplicates.  The above formula is almost identical to the File Count I explained above.  The only difference really is the inclusion of MAX(Duplicates[Size]).  This is a “trick” to handle the fact that for each file name there will be multiple records in the data model.  Each file will have the exact same file size, so by selecting MAX I simply get to access the file size.  I could have used any other aggregator (eg Min, Avg, Sum) and got the same outcome.

I haven’t shared the actual workbook here. The whole idea is for you do try this yourself so you get 3 benefits; more disk space, some practice with Power BI Desktop and have some fun.

Use Power Query to Manage Dropbox Space

Level: Beginners

I got this dreaded Dropbox email recently as shown below.

image

I needed to clear out some of the files I have loaded in Dropbox so I didn’t have to upgrade my account.  It occurred to me that I could make this process a lot easier by using Power BI to quickly show me where my big files were located in Dropbox.  This post today explains how I did it. What I ended up with is a report like this that allowed me drill down on the large sub folders to easily find my big files.

dropbox size

Note, there is a great tool called WinDirStat that you can download here that does this too – I use WinDirStat all the time. But I never want to miss an opportunity to do something with Power BI.

Process to Build the “File Space Usage” Tool

First I created a new Power BI report and connected it to my Dropbox folder.

image

You can of course use the same process on any other folder on your computer, or even the entire Hard Disk if you want.

I then imported the columns I thought would be of use, and loaded them into the data model.

image

I figured the Hidden flag and Date Accessed might be useful at some stage, so I brought those in too.

I then wrote some measures that I thought would be useful.

image

I encourage you to write your own measures rather than use the implicit measures created when you drag a column of values to the report.  By writing your own measures, you “Learn” how to write DAX and that will help you become a Power BI ninja.

The last thing I did was to create a report that made it easy to see where my big files were located and find out what they were.

image

I have loaded a short 4 minute video that shows how quick and easy it is to do this from scratch.

What Obtuse uses have you found for Power BI?

I would love to hear from others about how they are using Power BI in ways that are not immediately obvious.

Who Needs Power Pivot, Power Query and Power BI Anyway?

Level: Beginners

One of the great challenges Microsoft has faced with its “new” suite of Self Service BI tools (particularly Power Pivot) is that most people that could benefit from the holy trinity (Power Pivot, Power Query and Power BI) don’t even know these tools exist, let alone how the tools can help them succeed in their jobs.  The situation is definitely getting better as Power BI starts to get a presence in the market place, however I still talk to people who have heard of Power BI, but have no idea what Power Pivot or Power Query are, and what’s more they don’t know why they should care.  I personally believe a big part of the awareness problem is that Microsoft insists on shipping Microsoft Excel with the Power Pivot plugin disabled.  There is no reference to Power Pivot when you excitedly receive your brand spanking new version of Excel 2016 – what a marketing opportunity gone begging!

image

I have been an Excel nerd for 30 years.  There is nothing I used to enjoy more than installing a new version of Excel, and clicking through every menu item to find something shiny and new that would make my life easier.  By not turning on the Power Pivot menu by default, Microsoft is missing one of the best silent selling opportunities is has for this fabulous addition to Excel.

Given there is no “pull through” on the menus, many people don’t know what these products are or why they should care.  I am often asked by people “who can benefit from these tools?”.  This post sets out to explain who can benefit and why.  Note when I say “who can benefit”, I am not talking about “consumers of reports” here, I am talking about “authors of reports”.  It is clear that people that consume quality reports and analysis will benefit, whatever the tool.  This article is focused on the benefits to those people that are responsible for building the reports and analysis that others will end up consuming.

Power BI

Who can benefit from Power BI is probably the easiest to understand.  The product is well marketed and has a clear role to play.   Power BI is a complete self service BI tool.  It is designed to bring business intelligence capabilities to the masses instead of the elite (e.g. instead of just highly skilled IT MDX programmers).  Rob Collie wrote a good article last week about the democratisation of BI tools.  Power BI will add value to people who:

  1. Have problems sharing reports with others because the file size is too large.
  2. Need to share data with people on the go, that maybe only have a Tablet or a Mobile phone.
  3. Have large data sets that can’t be managed in traditional Excel.
  4. Are spending too much time each week/month manually maintaining reports with new source data and/or new visualisation requests.
  5. Can’t get the (timely) support they need from their IT department using traditional Enterprise BI tools.

Power BI is great because it puts capabilities across the end to end BI stack into the hands of end users (authors), including:

  1. Extraction of data from the source (using the Power Query engine)
  2. Transformation of that data into a new shape that is optimum for BI reporting and analytics (Power Query again).
  3. Data modelling capabilities, so you can turn the raw data into business meaningful insights (using the Power Pivot engine).
  4. A reporting and analytics front end allowing you to build reports to visualise your data (Power BI Desktop and Power BI Service).
  5. A fully integrated cloud based delivery framework so you can easily share with anyone over the internet (Power BI Service).
  6. A full set of Mobile applications across the major operating systems (Power BI Mobile).

Notice how steps 1 and 2 use Power Query, and step 3 uses Power Pivot.  So if you want to learn about Power BI, you really need to learn about Power Pivot and Power Query too.

Power Pivot

Conversely, Power Pivot is the hardest to understand – I.e. it is the hardest for individuals (potential authors) to understand “what does Power Pivot do for me and why do I need it?”.  I have had people enquire about Power BI training courses that have not been interested in Power Pivot or DAX*. But the truth is, if you want be able to write your own reports in Power BI, you really need to learn at least some basic Power Pivot skills.

Power Pivot is a data modelling tool.  It is the underlying reporting engine that enables Power BI and Modern Excel to delivery those modern funky reports that can help you succeed in business.  The Power Pivot engine allows you (the report author) to take your business knowledge and to configure the reporting tools so that Power BI and Excel Pivot tables can be used to find and report on insights in your data.

Most business users have never heard of the term “Data Modelling” before, and the reason for this is quite simple – it has always been the IT department that has been responsible for data modelling.  Power Pivot brings the power of data modelling and puts it in the hands of competent business/Excel users.    An example of data modelling will make it easier to understand.

Example of Data Modelling

Consider a scenario where you download sales data from your company transaction system and it looks something like this in a spreadsheet.

image

You can see the sell price and the cost price information exists in the table above.  But there is nothing about the $ Margin per product and nothing about the % Margin, let alone insights like Sales Year to Date, Top 3 selling products, Fastest growing product etc.  In a traditional Excel world you would simply write formulas in your spreadsheet(s) to enhance this raw data and extract the additional insights.  The trouble with this approach is that all these formulas are only ever “one off”.  You write them for this report, and then when you create another report later, you have to write the formulas again.  Power Pivot handles this problem in a different way.  Power Pivot is a data modelling tool, and instead of writing formulas in your final reports, you write them “under the hood directly in the Power Pivot engine” as part of the data modelYou build the “rules” that describe how to calculate $ Margin and % Margin ONCE and only once directly in the Power Pivot engine.  Once you have created the rules for each insight (e.g. $ Margin, % Margin etc), it is forever available for you (and others) to use over and over again in any report, chart, Pivot Table or any other visualisation you can think of.  Never again to you have to write/copy a new formula every time you create a new report.

*What About DAX?

Data Analysis Expressions (DAX) is the formula language of Power Pivot.  DAX is very similar to the Excel formula language yet there are also a lot of differences that you will have to learn from scratch.  Simply put, if you want to learn Power Pivot (and/or Power BI), you will need to learn at least some DAX as well as lots of other things about how the Power Pivot engine works.

Who needs Power Pivot and why?

People that need this tool are typically Microsoft Excel users/report authors that analyse or report on data, particularly when the following conditions apply:

    1. There are lots of data that stretch the capacity of traditional Excel (file size, re-calculation speed etc).
    2. If you use Pivot Tables a lot to analyse your data.
    3. If you are writing a lot of VLOOKUP (or INDEX/MATCH) formulas to join data from different tables so you can analyse in a Pivot Table.
    4. If you have had to create bespoke reports in traditional Excel, but then have to spend hours rebuilding the report when asked for a different view of the data (e.g. you have a report that shows sales by half year, and then you are asked to produce the same report by month – and have to start again almost from scratch).
    5. Anyone that wants to start using Power BI for anything more than the most simple aggregation reports.  The “out of the box” capabilities of data modelling in Power BI are limited, and you will quickly realise that you need to learn some new skills (Power Pivot/DAX) to be able to leverage the strengths of Power BI.

Power Query

Power Query is a user friendly ETL (Extract, Transform, Load) tool.  Unfortunately Microsoft muddied the waters by renaming Power Query “Get and Transform” in Excel 2016, and “Get Data” in Power BI Desktop.  That aside, Power Query is used to:

  1. Extract:  Connect directly to the data source and ingest the data (into Power BI or Excel) so you can work with it.  It retains a connection to the source so when the source data is updated you can automatically “refresh” the data without having to go through the import process again from scratch. It is amazing.
  2. Transform:  You can clean and reshape the data so that by the time you are ready to use it, it already is in the format you need instead of the format you are given.  Again, you do this only once, and then you can automatically refresh when the data is updated later.
  3. Load the data directly to the place where you want to use it.  The end location for the data will typically be inside Power Pivot or Power BI, but it is also an invaluable tool for use with traditional Excel.

Don’t think of Power Query as simply a tool for self service BI reporting.  I have used Power Query to automatically audit information (eg XLSX files, csv extracts from other systems etc) and produce exception reports directly in Excel.  Once you understand what Power Query can do, you will start to realise all the ways you could use it to make your life easier.

Who needs Power Query and why?

People that need this tool typically are trying to solve the following problems.

  1. They regularly receive file extracts from someone (or some system) and need to manually manipulate this data before they can use it.
  2. They want to use Power BI/Power Pivot but can’t get the IT department to shape the data they need prior to loading it into Power BI/Power Pivot.
  3. People that are doing repetitive tasks such as matching invoices on a remittance advice against a bank statement (or similar) – Power Query eats this kind of work.
  4. Anyone that is given data in the wrong shape (example below).  And yes I quickly used Power Query to un-pivot the data in the blue table so it looked like the green table.  It is worth getting and using Power Query just to get this un-pivot feature alone!
    image

Wrap Up

Hopefully this overview has helped demystify how Power Pivot, Power Query and Power BI can help you in your job.  If you would like to learn more about Power Pivot, you can read my book “Learn to Write DAX“.  If you want to learn more about Power Query, I recommend Ken and Miguel’s book “M is for Data Monkey“.  If you live in Australia, you can attend one of my live training courses where I cover what you need to know about Power Pivot, Power Query and Power BI.

Query Dependency View in Power BI Desktop

I have been on holidays taking a much needed break from work, so it has been very quite on my blog over the last few weeks.  But I am back now! I woke up this morning to a very pleasant surprise – Microsoft has delivered the long waited for Query Dependency View in Power BI Desktop.

“What is this and why are you so excited Matt” I hear you ask.

What is Dependency View

Simply stated, it is a “table view” of all of your queries showing the linkages between the tables/queries.  Now if you are simply loading data directly into the data model with no transformation, then this really doesn’t add any value.  However if you are doing a lot of transformation using staging queries etc, then this is a must have tool.

How to Access Dependency View

Download and install the latest version of Power BI Desktop.  Open a workbook that has data loaded, and go into the edit query mode.  Do this via Home, Edit Queries, Edit Queries as shown below.

image

Then click on View, Query Dependencies.

image

What it Looks Like in Action

The best way I can demonstrate why I think the Query Dependencies view is required is to show you the Dependencies View from a workbook that I developed for a client last year.  As you could probably imagine, this workbook was very difficult to manage without a view like this.  What I ended up doing was to develop my own dependencies view using Microsoft Visio.  This took quite some time and effort, and thankfully is no longer required.

image

It is worth pointing out that this workbook was not a reporting tool using Power Pivot/Power BI, but it was an auditing tool.  It was used to import data from many spreadsheets, compare this data against master data, and identify data errors and exceptions.  This tool was built using lots of interim queries with lots of cross dependencies (as you can see).

The Dependencies Query view is still not available in Power Query for Excel however it is easy to import your workbook into Power BI Desktop to immediately leverage the new feature.

Improvements Still Needed

So this is version 1 of this view, and it is simply great.  There are some things that still need improvement in my view, including.

  • There is no maximise button for the window.
  • You currently can’t reposition the tables on the screen manually
  • It could benefit from some user controlled colouring capabilities to highlight different types of tables, such as those that are loaded into the data model etc.
  • Deliver it in Power Query for Excel of course.

But having said that, it is still great now so get to it and start using it.

Shaping vs Modelling in Power BI

Level: Beginners

Power Pivot, Power Query and Power BI are 3 products that are closely related to each other and were all built for the same purpose – enabling Self Service Business Intelligence.  I first learnt to use Power Pivot for Excel, then Power Query for Excel, and finally Power BI.  But there is a new wave of users that have never heard of Power Pivot and Power Query, and their first exposure to the underlying technologies is with Power BI.  This can be confusing and hence I explain what you need to know below.

Power BI Can be Confusing to First Timers

I realised recently that people who have never used Power Pivot and Power Query could easily be confused by the different options to do things in Power BI.  There are lots of reasons to be confused, including:

  • There is no reference to the term “Power Query” inside the Power BI tools.  Microsoft stopped using this name in favour of terms like “Get Data”, and “Edit Queries”.  I think this was a bad decision and it has fuelled the confusion. When I tell people to do something in Power Query, they therefore don’t know what I am talking about.
  • There is no reference to the term “Power Pivot” inside Power BI either.  The most relevant proxy for Power Pivot inside Power BI is the “Modelling” menu. Use of the term Power Pivot therefore doesn’t make any sense to “Power BI only users” either.

Shaping and Modelling Your Data

The implications of the above is that there is no clear way to delineate (with language/labels) between the process of loading the data and the process of modelling the data.  I like the term “Shaping” as a replacement for Power Query/Get Data and “Modelling” as a replacement for Power Pivot.

Definition of Shaping

Shaping is the process of extracting your data from the source, and transforming the data so that you can load it into Power BI (often referred to as ETL).  The reason I like the word Shaping is that it clearly describes what is normally needed as the first step in any BI project.  The shape of your data refers to

  • What data you extract
  • How your data is loaded into one or more tables, and which tables you import
  • What names you give to the columns and tables
  • Whether your tables are wide and short (many columns), or long and narrow (fewer columns)

For example data tables (also called fact tables) are normally best shaped into long, narrow tables (lots of rows, and a minimum number of columns), whereas lookup (dimension) tables can be as wide as needed to for the task.

image

All tables should only be loaded with data that you actually need.  One big difference between Power Pivot/Power Query and Excel, is it is easy to go back and get the data you missed later.

Definition of Modelling

Modelling is the process of building the relationships between your tables and converting the business requirements into measures and calculated columns using the DAX language.  The measures contain

  • The business logic that transforms your raw data to a usable, value adding calculation
  • The formatting required for the numbers (currency, percentage)
  • A good business name for the measure that accurately describes the measure

Prior to Self Service Business Intelligence, data modelling was always done by expert IT professionals using a complex language like MDX.  The complexity of such languages put it out of reach of users, as it was simply too hard to learn casually.

2 Ways to Add A Column – Wait What?

A perfect example of the type of confusion Power BI first timers face is that there are 2 ways to add columns to your data.  You can add a column during the shaping process, and you can add a column during the modelling process.

image

So which should you use?  Well it depends on what you are trying to do.  My general advice on adding columns to your tables is to make the change as close to the source as possible. So in order of preference:

  1. Add the column in your source data (eg your Data Warehouse) with the help of your IT friends if necessary
  2. Add the column during the shaping process
  3. Add the column during the modelling process

If you add the column at the source, then you have access to it for your current BI needs but also for future BI needs.  Adding the column at the source provides the most reuse, and if you need to make a change down the track, it will be the least maintenance.  But sometimes you can’t do that for various reasons, or you can’t do it quickly.

If you add the column during shaping, it will be loaded and compressed with all other columns during data load.  This normally means better overall table compression and hence a good outcome for the overall model.

The most common reason you would add a column during the modelling process is if you need to leverage other parts of the model to create the data.  An example would be if you wanted to use a measure (eg Total Sales) in your data model to create the new column in the model (eg Banded groups of Customers based on High, Med, Low annual sales).  But don’t fall into the trap of using too many calculated columns instead of measures. I rant on about that here.  Use them when they are needed, otherwise avoid them.

Wrap Up

Hopefully this explanation of the role of shaping and modelling in Power BI has helped clarify some things for you.  Don’t forget to check out the other articles in my knowledge base here.

Power Query to Combine Web Pages

Level: Intermediate

There was an interesting question this week on powerpivotforum.com.au asking if there was a smarter way to user Power Query over multiple identical web pages to scrape the data in a single query.  I have been meaning to blog about this for a while, so it is a great opportunity for a mini-Friday post.

The process of typing all the steps on how to do this is a lot longer than simply recording a video.  So instead of a long post, I have embedded a video below from my YouTube channel.

First the Problem

Take a look at this web page  www.calendar-australia.com/holidays/2016/

It contains all the public holidays for the calendar year 2016.  But there are 2 similar web pages with the same structure but different results

www.calendar-australia.com/holidays/2017/

www.calendar-australia.com/holidays/2018/

How to do it in Power Query

I’m sure I first learnt this technique watching Miguel Llopis from Microsoft, probably at PASS SQL BA Conference in 2015.  The technique can be reused in many different scenarios like this one. The trick is virtually identical to the technique I showed in this post about combining multiple Excel workbooks into a single query. The solution works equally well in Power Query for Excel as well as in Power BI (Get and Transform). The steps are

  1. Build a query for one of the sets of data
  2. Turn the query into a function using the simple parameter step
  3. Modify the query to use the parameter instead of the hard coded set of data
  4. Create a table with a column of the parameters you want to use (in this case the data must be text, not numbers).
  5. Add a new column that takes the text from the first column and processes the function for each row.
  6. Expand the column into a single table.

Here is a video showing how easy it is to do.

Seems there is a special birthday this weekend.  Should be fun!

Power Query Over a Command Screen Output File

Level: Intermediate Power Query.

I spent a lot of last week helping to configure Power BI in preparation for go live for a client.  One of the important things to do when designing a Power BI solution is to make sure you have a good design for your user security access.   Today I am going to share the approach I used to track user security configuration with Active Directory Groups.

This post is more about Power Query than it is about Active Directory, so even if you don’t have this specific scenario, it is still worth reading this blog post to see how I dealt with some tricky issues in cleaning data that was extracted into a text file using Power Query.

Using Active Directory Groups

I don’t want to go into a lot of detail here, but in short, we set up dashboards and reports targeted at different groups of usersRather than assign users access directly to the dashboards in Power BI (by adding their email addresses one by one), a much better solution is to create an Active Directly Group, assign the users to the active directory group using the normal business process, and then assign the AD group to the Dashboard in Power BI.

image

Note that you must assign an email address to the Active Directory group to be able to do this.

The benefit of this approach is that you can have a single corporate process for assigning user access, and this is normally Active Directory (for medium to large companies).

The Problem – How to Keep Track of Which Users Have Access

If you are an Active Directory Administrator, then I am sure you have all sorts of tools to work out who has what access.  But let me tell you – I don’t have those tools (or don’t know about them anyway), and I quickly found out it was a pain to try to check user access when you don’t have the right tools.

Command Line Tricks to Check Users in an AD Group

A quick question around the office, and Lester told me how to run a command line request to find out which users are in a group.  To do this:

  • Open a new command prompt on your PC (that is connected to the company network of course).
  • type the following command
net group /domain "GROUP NAME IN INVERTED COMMAS"

The above command line string will send a list of all users assigned to the group to the screen like shown below

image

Checking One Group at a Time is a Pain too

This was great, but soon I realised I wanted to get ALL the groups I wanted, and then look at the data in Excel.  I didn’t want to have to go to the command line each time, so I developed the following process to create a simple to use Excel file.  This is a summary of the steps I took to achieve this outcome.

  1. Created a batch file that ran the command line code for all the groups I needed to interrogate (one after the other), and then sent the results of each of these commands to a single text file.
  2. Imported and cleansed the data in the file to Excel using Power Query
  3. Set up some slicers on top of my final Excel table to make it easy to access the information I needed.

This is what I ended up with

ad groups

Below I step through the details on how to complete each step to achieve this final report.

Step 1.  Move the Manual Code into a Batch File

This is easy, even if you haven’t done it before

  • Open notepad.exe and paste the first line of code that I showed above.
  • I appended some more code at the end of the line to tell the command line to send the results to a file (instead of sending the results to the screen).
    Note I added a single > symbol pointing to the name of the output file in the first line of code below.
  • Then I duplicated the first line of code multiple times (cut and paste), once for each group I needed – then changed the group name of course.  Also note that the > symbol is a >> for lines 2+ in the batch file (> means send the data to a new file, >> means append the data to an existing file).  I had many more groups in real life.
net group /domain "FINCUBE GROUP" > c:\users\allingtm\Documents\list.txt
net group /domain "FINCUBE ALL" >> c:\users\allingtm\Documents\list.txt
net group /domain "FINCUBE ADHOC" >> c:\users\allingtm\Documents\list.txt

Now save the text file with a *.bat extension in an easy to find location (note, first change the file type filter in Notepad to be *.* instead of *.txt).  From the command line, execute the batch file by typing the name of the file (including the full path and excluding the bat extension).  The command line will then chug away and produce the output file with all the Active Directory information needed.  This is what the final file output looks like in a txt file (sample of test data only of course).  In my real scenario there were many more names and many more groups which is of course exponentially harder to manage manually than this simple test data.

image

Step 2. Import into Excel Using Power Query

I am not going to go through every step I used in Power Query, one at a time here as this is an intermediate post.  I have however attached the workbook at the bottom of the page – download it and take a look if you want more details about each step in the transformation.  I am just going to call out some of the more interesting things I had to do to get the final output.

The Default File Output is not Very Excel Friendly

The first thing to note is that this output is great for a command line screen, but not very good for Excel.  So you have to deal with these issues in Power Query the best you can.  It is easy to deal with blank lines and dotted lines etc with a simple filter.  Just deselect anything that is not of value for the task at hand as shown below.

image

All of the lines I have deselected (shown in the image above) are repetitive lines that are created for each of the AD Group commands in my batch file.

Once I had done this, I had a single column of text data as shown below.  Note the name of the groups appear first, followed by one or more rows of user names.  This pattern is then repeated down the page.  Each row of names has 1 or more user names per row – tricky!!

image

I then added a new custom column in Power Query by hand writing an IF formula.  An If formula in Power Query Language (PQL) is in the format

if something1 = something2 then something3 else something4

Note PQL is case sensitive for the formulas and the text it is looking at.

See my formula below along with the output in a new custom column.

image

Basically I am checking the first column to see if the row is a Group Name or a list of User Names.  If it is a Group Name, then return that Group Name.  If it is a User Name, then return null.  The reason null is required is important – it is then easy to copy the group name down the list as shown below

image

Now that every row has a group name as part of the row record, it is now OK to go ahead and delete every row in the first column that starts with the text “Group name”.  I did this with a filter rather than manually selecting each row – that way it is future proof if I add new group names down the track.

image

Remember everything is case sensitive, so typing “group name” would fail here as my data uses “Group name”.

Split the User Names into Columns

This is what I have now (after some reorg).  Time to split the users into columns.

image

This looked easy, but there was a surprise waiting for me. When I split on “space” repeatedly, I discovered the number of spaces between each name varied by line – anywhere from around 17 spaces up to about 21 spaces in total.  This stumped me – how was I going to iterate over this file and remove a seemingly random number of spaces to parse out the names?  I gave up and went to lunch.

During a nice kebab with my family, I had a brain wave – I wondered if the file may have been a fixed width file format, adding more or less spaces depending on the number of characters in each user name.  I then remembered an article (Chris Webb I think) talking about the new monospaced font feature in Power Query – just what I needed.  As soon as I got home, I found the setting for the mono spaced font and turned it on – BAM!  My hunch was correct.

image

It wasn’t until later that I realised that I should have known this from looking at the original text file – oh well.

I then selected one of the cells of user names (shown as 1 below) and then selected the results pane (shown as 2 below).  Once the cursor is in the results pane, I used my arrow keys to move the cursor left and right to count how many characters made up each “column” of user names (25 as it turned out).

image

So I applied a “split column by number of characters” step as shown below.

image

I then un-pivoted the resulting user name columns, trimmed out the trailing spaces, removed the columns I didn’t need and did a general clean up. This is what I then sent to a table in Excel

image

The last thing I did was add a slicer for each of the columns.  This gave me a very usable tool where I could select either a group or a user name, and see the relationship between the two in the resulting filtered table.  See below.

ad groups

And given I had set this up using a batch file and a Power Query report, the process is infinitely repeatable with very little effort.  20 seconds to re-run the batch file, 40 seconds to refresh the Power Query workbook, and in as little as 1 minute I can have a refreshed set of data.

You can download the sample text extract, the sample batch file and my Power Query workbook here if you would like to look in more detail at the process.

Self Referencing Tables in Power Query

I have had this idea in my head for over a year, and today was the day that I tested a few scenarios until I got a working solution. Let me start with a problem description and then my solution.

Add Comments to a Bank Statement

The problem I was trying to solve was when I download a digital copy of my bank statement, loaded it up into Excel using Power Query and then wanted to add some commentary to some of the transactions.  If you just add the comments to the source file then everything is easy.  But it is often not practical to add the comments directly into the source file for a number of reasons.  This could be because you get a new copy of the source data each time you refresh (eg you get a CSV file that replaces the old file, and the file just gets larger each month), or it could be if you are using a “combine multiple files” technique and you only want to deal with a summarised, cleansed version of the final data instead of dealing with the source files directly.

Once you have loaded the table in Power Query, it is possible to add a new column to the resulting table (shown below).  The problem with this approach is that the comments are not logically linked to the rows of data in the table.  Take the example below.  The data is loaded from table 1 using Power Query, into table 2 on the right.  I then manually added a new comment column and added some comments (as shown below).

comments

The problem that can happen is demonstrated below.  If the sort order of the source table changes and then you refresh the query, the comments no longer align with the original data.  And this is despite having a unique ID column in the original table.

sort

The problem is that the new column I manually added to the Power Query table on the right is not logically joined to the actual table, and hence the comments are actually in a column next to the table rather than part of the rows in the main table.

Enter Self Referencing Tables

I read this blog post from Imke quite some time ago, and that was what gave me the idea on how I could solve this problem. The idea is to load table 2 above a second time (after adding the comments), and then joining it back to itself, hence logically joining the manually added comments to the rows in the main table.

Note: Prior to completing the self referencing steps below, I used Power Query to create the second table from the source table. This was my starting point.

Then I loaded the resulting table 2 a second time as follows:

  1. Select in the second table
  2. Go to the Power Query menu
  3. Create a new query using “from table”.

image

I renamed this new query “Comments” and selected “Close and Load To” so that it only created a connection but didn’t load the new table to the worksheet.

image

The next thing to do is to merge the original query with the new query.  To do this,

  1. Go back and edit the original query for table 2
  2. Add a new step to the query using “Merge Query”
  3. Merge the second “Comments” query with the original query by joining on the ID column as shown below.

image

This will give you a new column in your query, and the column contains a table (as shown below).  You can then expand this new column to extract the comments.

image

I only extracted the comments column, and deselected the last option as shown below. Note, it is important that you deselect “Use original column name as prefix” so that the new column has the same name as the original source column.

Click OK, then close and load.

image

When you look at the table now, it all looks the same as before (see below), but there is one important difference.  The last column in the Power Query Table is now a part of the table and not a manually added on column.  You now have a self referencing table.

image

To prove this, I completed the same sort test as before.  First I sort the source table, then refresh the Power Query table – you can see the results below.  Note how this time the comments stick with the row in the table – sweet!

sort2

Incrementally Add Comments

Now that the queries are all set up, you can incrementally add comments to the Power Query table any time you like.  And you can also add new data to the source data – it will all work as you would expect as shown below.

final

Real Life Data

The above demo is of course test data.  A more realistic real life scenario would be to download a CSV file from your bank, and use one of the many “combine multiple files” techniques to import the data.  It is likely that your source data is messy, and contains lots of columns you don’t need. In this real life scenario, it is not easy to add comments to the source data unless you open each file and add the comments there – this is not ideal.  Another problem could be that your data source could be a CSV file that uses a “delete and replace with the new larger file containing the latest records” method.  In both of these scenarios, you will need to have your manually added comments located somewhere other than the source table.  The solution provided here is a simple and intuitive way to manage this for the user (albeit with a little bit of setup first).

The example shown in this post will work just fine as long as 2 conditions are maintained.

  1. You have a unique ID for each record
  2. You don’t accidentally load duplicate data – so don’t do that!

But What if I don’t have an ID column

The next logical issue you may have is a scenario where you don’t have an ID column.  I tested quite a few scenarios including adding an ID column inside the first Power Query step (add index column).  I haven’t been able to find a solution that works 100% of the time, but there is one that can work for most people.  If your data is such that each row in the table is unique (eg date, maybe a time column if you are lucky enough to have this, amount of transaction, remaining balance etc) you could create a unique ID by concatenating these columns into a single unique column.  This should solve the issue for most people in non-commercial scenarios (those in commercial scenarios are more likely to have a reference ID anyway).  You could even create an audit query that checks for duplicates and warns you if this occurs, but that sounds like another blog post.

I hope you find this article useful, and or it inspires you to come up with your own innovative ways to use Power Query.

Conditional Columns in Power BI Desktop

Microsoft recently announced a new conditional column feature in Power Query (Power BI Desktop version) that warrants a blog post.  This feature makes the process of creating banded columns much easier for the average user.  I blogged about banded columns using DAX here, and now there is this second, potentially easier approach.  In the demo below, I will add a new column of “age bands” just like I did with the DAX example above.

This column is then part of the data model and can be used in Power BI Desktop to slice and dice the visualisations.

The New Conditional Column Menu Item

I am using Adventure works and importing the Customer table (shown as 1 below) for this demo.  Note on the Add Column menu (2) there is a new menu item “Conditional Column” (3).

image

Add an Age Column

The first thing I need to do is create an Age column.  To do this, I first created a “Year of Birth” column by selecting the Birth Date column and then selecting Date\Year\Year.  This will add a new column containing the year of birth (keeps the original Birth Date column).

image

I then edited the code created by Power Query directly in the Formula Bar.

image

The original Code created by Power Query was this:

= Table.AddColumn(_Customers, "Custom", each Date.Year([BirthDate]), type number)

I simply changed the code directly in the Formula Bar to be like this:

= Table.AddColumn(_Customers, "Age", each 2016 - Date.Year([BirthDate]), type number)

Once you get the hang of the way Power Query formulas are constructed, you can easily make small changes like this to streamline the process.  Of course this is a hard coded approximate age, and it will not be 100% correct after the person has a birthday, but it is good enough for the purpose of this demo.

Add the Conditional Column

Then I clicked the new Conditional Column menu button (shown earlier) and the following dialogue box appeared.

image

It is just a matter of using the drop down menu items to build the logic for the banded column using standard Boolean If/Then logic.

image

The main thing to note is that you need to build the logic in a way that it incrementally checks each step and then it exits the logical test once it finds the right band.  In my example above, I started with the oldest band, and each test is a “Greater Than”.  I could have also started with the youngest band and applied  “Less Than” test.  But it would not work if I started with the youngest band and did a “Greater Than” test, or if I started with the oldest band and did a “Less Than” test.

The Final Result

Note how the new banded column now appears (shown as 1 below) and also note the code in the Formula Bar (shown as 2 below).

image

It has always been possible to write this code (shown in the formula bar 2 above) using the “Add Custom Column” feature.  What Microsoft has done here is to create the user interface to help the user write this custom column without having to learn the PQL language.  In fact this is what Power Query is all about – it is simply a user interface driven tool that sits on top of a power formula language, and the user interface does most of the heavy lifting for the user.

The product keeps getting better – keep up the good work Microsoft.

Sydney Modern Excel and Power BI User Group is Growing

Power Bi User Group_NEW 150In October last year, Iman Eftekhari and I set up the Official Microsoft Power BI User Group (PUG) here in Sydney Australia.  Since then we have gone from strength to strength and as of the time of this post the user group consists of a total of 330 members.  If you live in Sydney and would like to join the User Group, then we would love to have you on board.

Objectives of the User Group

The primary objective of this group is to share and learn about data analytics & reporting using Modern Excel and the Microsoft Power BI suite of tools, including

  • Power BI in the Cloud (what is Power BI?)
  • Power BI Desktop
  • Power Query for Excel
  • Power Pivot for Excel

Our Meetup sessions typically include a quick update about what is new in Power BI, tips & tricks, followed by one major presentation.  Major presentations may be an introduction to one of the above tools, a demonstration or tips/advice/best practice or a special guest speaker. Our meetings include lots of opportunities to ask questions, share your own experiences, or ask for help and meet like minded data experts who are going through the same challenges as you. Beginners are welcome as are people that are well into the Microsoft suite of Power Tools.

Our April Meetup this week is already Sold Out

I would love to invite readers along to our Meetup this week but we are already at 100% capacity with 90 users planning to attend.  This week we have a great session from Gin Jia (Agile BI) who will be sharing some of the techniques he has used to leverage the power of Power BI.  In recent months we have had presentations from Marco Russo, Ken Puls, Bill Jelen (aka Mr Excel) as well as official presentations from Microsoft.

Sign up now to be part of our Community

Now is a good time to sign up to be part of the Sydney Modern Excel and Power BI User Group.  There are 2 places to register.

image

When you get to the above link, click on the “join group” button (shown as 1 above) which will take you to our Meetup site for registration.  All of our Meetup events will be communicated at this site including meeting registration and confirmations so you should sign up there.  You should also join the Microsoft Power BI Community and claim your PUG badge (shown as 2 above) indicating your membership to the Sydney PUG.  This affiliates all of your Power BI Community activity with the Sydney PUG, and that helps us get official recognition from Microsoft which in turn helps us get access to exclusive resources to make the PUG more valuable to members.

If you don’t live in Sydney, there are lots of other Power BI User Groups around the world that you can join.  Just go to the link provided above and find the nearest group to you.  And if you find there is not an Official PUG in your area, then why not set one up.

What is Power BI

This is the third in my “what is” series having previously blogged about “What is Power Pivot” and “What is Power Query”.  This topic is the hardest of the three to give a straight answer, however it is also the most important to understand (as I will explain below).

Why Power BI is so Important

Microsoft Power BI is the single most important thing to happen in the data analytics market since the introduction of Microsoft Excel 30 years ago.  The reasons it is so important include:

  1. It is a Microsoft product, and it is a strategic priority for the company.
  2. It is built on the heritage of two of the most mature and best BI products available (SQL Server Analysis Services and Microsoft Excel)
  3. Despite the heritage from SSAS and Excel, Power BI is actually being built/rebuilt using the latest technologies including HTML 5.0, cloud computing, column store databases and modern smart phone mobile apps.
  4. Microsoft is listening to the community to determine what it builds and delivers next.  Let me state that again – Microsoft is listening and delivering!
  5. Microsoft has opened up the visualisation layer to open source contributions via the custom visuals gallery, so there is (and will continue to be) be lots of community added value.  Read about that in my blog here.

Gartner has recognised the trend towards self service business intelligence and also Microsoft’s leading position in this space.  You can look at the 2016 report on this topic here if you are interested.

Why it is Hard to Explain What Power BI is

Microsoft is known for its great products, however it doesn’t always have a great record when it comes to naming products.  The naming of Power BI products has been confusing to say the least.  I wont spend time going back through all the confusing history but instead I will provide my spin (below) on the least confusing way to consider Power BI products today.  I first covered this framework in my blog about the best training for Business Analysts.

So, What is Power BI?

Power BI is a suite of Power BI branded products plus an extended family of umbrella products that collectively change the way people use and analyse data.  All Power BI products (regardless of which group they are in) have all the attributes of a robust enterprise strength BI solution but in addition they all have best in class self service BI capabilities What this means to the average business analyst is that you don’t need to run to IT every time you need something done hence avoiding the normal time, cost, and money conflicts that so often prevent success.

If you want to learn to be a Power BI Ninja, you will need to learn skills in all of the following areas:

  • Extract and Load your data ready for analysis – you use Power Query for this.
  • Enhance the data you have loaded (using Power Query) with relationships and calculations (measures) that can be used to show and find business insights – you use Power Pivot for this.
  • Create reports that bring out the insights from the data, and allow users to easily interact with the data – you use Power BI Desktop for this
  • Distribute the reports  and Dashboards to those that need it – you use Power BI Service and Power BI Mobile for this.

Power BI Branded Products

There are 3 Power BI branded products.

Power BI Service

Power BI ServiceThis is the cloud solution that makes sharing live business analytics a snap.  Need to share a 250MB report – no problem!  It is possible to share small or large reports and dashboards internally within your organisation, externally with third parties, and also share non sensitive information publicly over the web (as I have done with this demo).

The Power BI Service has a completely new visualisation front end built from scratch. The capabilities are still relatively immature however the pace of improvement is stunning, and the addition of the Open Source visualisation tools discussed earlier make it a sure bet.

The service can store your data online, and/or connect to your data in the cloud via Azure (or similar), and/or it can connect to your in house data via a Power BI Gateway.

Power BI Desktop

This is the free Power BI authoring tool that allows you to quickly and easily analyse your data and find insights directly on your PC. The general process is to connect to your data, reshape the data if needed, load the data into the tool, then build striking visualisations that help you analyse your business.  You then can save the workbook (just like you do in Excel), distribute the file for free to other people (eg email, place on a network drive) or better still – publish the file directly to the Power BI Service with just a couple of clicks.  Once the analysis is loaded to the Power BI Service, it is easy to share with others.

You should author your Power BI reports in Desktop over Service, as you can backup your Desktop files but you can’t backup your Service reports.

Power BI Mobile

Power BI MobileThis is the free mobile tool that allows you to consume any of the reports that are available to you in the Power BI Service.  There are versions for all the major tablets and smart phones.  This software is also fairly new and still being enhanced.  It is not perfect, but it is good enough to get started and it will only get better from here.

Power BI Umbrella Products

There is a suite of underlying “products” and technologies that support the Power BI branded products.  These umbrella products covered below are the supporting technologies sitting behind Microsoft’s Self Service BI Strategy (and also Enterprise BI in some cases).

Power Pivot

Power PivotPower Pivot is a data modelling tool.  It allows you to take tables of data, create relationships between those tables so they work together (without VLOOKUP!), and then enhance the data model so that it includes specific calculations that turn your raw data into useful business insights.  This work has historically been done by the IT department, however it can now be done by competent business users using the language of Power Pivot – DAX (Data Analysis Expressions).

Power Pivot is available in Excel 2010+, SSAS 2012+, and Power BI Desktop/Service.  It is the same technology across all of these products.  If you learn it once, you can use the same skills across all of these products.  You can read more about Power Pivot here.

Power Query

Power QueryPower Query is a data acquisition and transformation tool.  It helps business users grab data from anywhere it resides, then cleanse, reshape and prepare the data ready for the analytics work to come – no IT guru needed any more. Power Query is a user interface driven tool that makes is easy for anyone to manage and load data, however it is supported by a powerful programming language that makes it super powerful.  You can read more about Power Query here.

Power Query is available in Excel 2010+ and Power BI Desktop/Service.  It  (annoyingly) has been renamed “Get & Transform” in Excel 2016 and “Get Data” in Power BI Desktop.  What were they thinking?

L2WD banner ad

What is the right way to learn Power BI?

There are lots of different learning styles, but here are the best resources I know of for learning these tools.

Live Training

Reading Books

I keep an up to date recommended reading list in my knowledge base here.  As well as information about books, I keep a list of great websites to learn from down the bottom of the page.

Online Learning

There are lots of online resources to learn. Some of the best I have found are:

What is Power Query

I am still astounded by the number of people I meet that have never heard of Power Query and Power Pivot.  The good news is there are lots of people out there in for a pleasant surprise.  I think Power BI will solve the “lack of awareness of Power Query” problem over the coming months and years.  Power BI is much more visible as can been seen in this Google Trends chart that I first created in Dec 2014.

I have written a blog titled “What is Power Pivot” before and I refer people to that page all the time – this helps me quickly give people a detailed summary of what they need to know.  Today I am writing about Power Query for the same reason – to have something that I can point people towards when they are just starting out.

So What is Power Query (aka Get and Transform)?

Power Query is an Extract, Transform and Load tool (ETL) that helps Excel and Power BI users connect to data, transform the shape of the data as required and then load the data for further use, typically into Excel, Power BI, CSV etc.  ETL tools are not new – there are many enterprise strength tools in the marketplace that are used by IT professionals.  What makes Power Query such a ground-breaking tool is it is the first tool built specifically for business users.

Who can benefit from Power Query?

If one or more of the following scenarios applies to you, then Power Query is a great tool to make your life better.

  • You spend hours and hours of repetitive work combining data from multiple files (CSV, Excel) into a single Excel workbook before you can start the real work.
  • You have data in a file but it is the wrong shape for the task at hand.  You need to spend time “reshaping” the data before you can use it.
  • You write lots of VLOOKUP formulas over multiple tables of data to “join” the data together.
  • You manually have to “cleanse” your data every month removing data and errors that you already know about but you can’t permanently solve back at the source.
  • You need to get data from web pages into Excel so you can analyse the data. You cut and paste the data and spend hours reformatting it to remove the noise from the data.

What Can Power Query Do?

Power Query can be used for the following (and probably many more things too).

  • A data loading tool
  • A data cleansing tool (remove unwanted columns, rows, characters etc)
  • A data reshaping tool including a very powerful “unpivot data” function (example shown below)

unpivot

  • A tool to combined data from multiple worksheets or workbooks into a single file/table

combine

  • A tool to join data from different sources into 1 coherent data set or table.
  • An auditing tool – compare lists of data and find the similarities and differences.
  • A website data scraping tool – easily load tabular data from a web page into Excel or Power BI.
  • A data creation tool (eg it can create a calendar table for Power Pivot).  Read my blog posts about how create a regular Calendar table here, or a 445 Calendar table here.

What Makes Power Query So Great?

Power Query is great for many reasons including:

  • It has a simple User Interface that any Excel user can learn.  Note the simple menu driven features in the image below.
    PQUI[3]
  • The user interface is built on top of a very powerful formula language that has immense capabilities well beyond what most users will ever need.  If you do ever need something more complex than the UI can support, the language is there (as shown above) and can be used to build a working solution without switching tools.
    image
  • Power Query records repeatable “Process Steps” that can be reused over and over without destroying the underlying data.  You can go back through the Process Steps and see the data before the transformation took place.  You can even insert new transformation steps earlier in the process flow if needed.
    Process Steps
  • Repeatable process steps means that you can build the steps for one set of data and then you can “reapply” those same process steps to a different set of data (provided both sets of data are fundamentally the same “shape”).  This is great for updating weekly or monthly data (for example).
    reusable process

Versions of Power Query

Excel 2010/2013

Power Query was first released as a free Addin for Excel 2010 and Excel 2013.  After you install the Addin, you get a new menu on the Excel Ribbon shown below.

image

You can find the download details for Excel 2010/2013 on my downloads page here.

Excel 2016

In Excel 2016 the product was fully integrated into Excel and renamed “Get and Transform” – unfortunately.  To find it in Excel 2016, you need to go to the Data tab (#1 below) and then you will see the Get and Transform group (#2 below).

image

Power BI Desktop

Power Query is also fully integrated into Power BI Desktop where it is called “Get Data” – unfortunately.  You can find it by navigating to the Home Tab (#1 below) and then clicking Get Data (#2 below).  Power Query (Get Data) is the only way to load data in Power BI Desktop, so if you have used this tool to build a new workbook then you have already used Power Query (Get Data).

image

Why Microsoft couldn’t stick with the name “Power Query” is anyone’s guess.  Maybe they thought users were “scared off” by the “Power” term – who knows.
L2WD banner ad

The Power Query Formula Language (PQL)

The official name of the underlying language of Power Query is the “Power Query Formula Language”.  This language was developed from another language called “M”.  Both these names are “unfortunate” as the former is a mouth full and the later is very difficult to search for in Google.  For these reasons, I have come up with my own name (which I hope will catch on) called Power Query Language – PQL  (pronounced Peequel – as in SQL Seequel).  PQL is a much better name, easy to search on Google and has a good fit to is cousin SQL.

Where to learn more

I often blog about how to get value from Power Query.  You can subscribe to my blog via email at the top of this page.  You can also explore the previous Power Query blog posts that have have written here.

There is a great book by Ken Puls and Miguel Escobar – I wrote a book review here.  I am also a big fan of Ken’s blog that you can find here.

What is the Best Training for Business Analysts?

The rate of change in the tools available to a Business Analyst has never been faster than it is in 2016.  The market is changing so quickly that many Business Analysts (BAs) are not even aware of the great new tools that are available to help them succeed in their jobs and careers.  Regular readers of my blog will already know what is hot, but let me cover off a bit of background to what is happening for the benefit of others before I answer the question about what training a BA should be doing (hyperlink to the bottom of the page if you can’t wait that long).

I am going to cover:

Read More

Combine Excel Workbooks with Power Query – Method 2

Edit: Feb 2017.  Microsoft has released an update that solves the problem described in this article.  This article still has learning value however you should also take a look at the announcement from Microsoft here blogs.office.com/2017/02/08/january-2017-updates-for-get-transform-in-excel-2016-and-the-power-query-add-in/

There are many ways to combine data from multiple files into a single file/table in Excel.  In my first article in this series a few weeks ago, I described how to use a function to combine data from identical Excel Workbooks into a single file using Power Query.  Then last week I showed how to combine multiple CSV files into a single workbook/table using the “import from folder” method.  This week I will cover the second method to combine Excel Workbooks together.

Combining Excel files using the Folder Method

The process is essentially the same as covered last week with the CSV method with one difference covered below.

Read More