Data Modelling for DirectQuery

In my last article I introduced the three different connectivity modes that are available for Power BI to fetch data from a source, with a specific focus on when and where to use DirectQuery connections.  In this article I am going to go deeper and introduce the modelling aspects of a DirectQuery data source.

DirectQuery Connection

As a quick recap before we get started: there are different methods for connecting to a data source in Power BI, and it is important to choose the right one for your reporting needs. Import mode, where Power BI loads a copy of the data from the source into its VertiPaq database, is the default and most common connection type, and for the most part it is what you should be using. A Live connection, used in thin workbooks connected to Golden Datasets, is the one to use when you need multiple reports to spin off a single source of truth; it retains the efficiencies of an Import model while keeping the data external to the report. Finally there is the DirectQuery connection, where Power BI does not store any of the data, just a local copy of the data source’s schema (table names, column names, relationships, etc.), and as a result must query the source database every time an interaction with a visual occurs.

In my experience, most people who have built reports on a DirectQuery connection to their data source did so because of a lack of understanding of what the DirectQuery connection was designed to achieve.  For the vast majority of reports, Import mode is the best solution for working with data in Power BI.  DirectQuery should really only be applied when you are trying to solve one of the following challenges.

  1. Real-time Data – you need to see the latest available data from the source
  2. Huge Datasets – you have many billions of rows of data (more than 10 GB), so you just can’t import it into Power BI
  3. Regulatory Compliance – the data must stay in the source for data security/privacy reasons

Working With a DirectQuery Source

Okay, so you’ve had a solid think about your data import options, and you have come to the conclusion that you are going to buy your Database Administrator (DBA) a beer and become their new best friend, because you absolutely have to use DirectQuery.  That being the case, the very next thing you need to realise and discuss with your DBA is that the data must be modelled to Power BI best practice at the data source, as you will not have an opportunity to tune, prepare and shape it as it is loaded into Power BI.  If you’d like to find out more about preparing your data, we’ve blogged about Shaping and Modelling for Power BI best practice previously.

Because you no longer have the option to import the data into Power BI’s VertiPaq database, the data transformations you can achieve with Power Query against your DirectQuery data source are significantly reduced, and for the most part you should adopt the mindset that you simply no longer have the option to use the Power Query editor to transform your data in any way.  The reason for adopting this mindset is very simple.  Any transformation made by Power Query over a DirectQuery connection must be folded into every query that gets sent to the underlying data source, rather than being applied once at data refresh.  This means the transformations must be simple enough to reasonably translate into a single native query.  If you apply a transformation that is too complex, you get an error, and you must either delete the step or switch the model to Import mode.
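To make that folding idea concrete, below is a minimal Power Query (M) sketch, assuming a hypothetical SQL Server source named SalesDW with a dbo.FactSales table (the server, database, table and column names are all illustrative). Simple row filters and column selections like these can fold into a single native query; a step that cannot fold, such as adding an index column, is the kind of transformation that triggers the error described above.

    let
        // Hypothetical server and database names, for illustration only
        Source = Sql.Database("myserver", "SalesDW"),
        Sales  = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

        // Row filters and column selections are simple enough to fold into
        // one native SQL statement (assumes OrderDate is a DATE column)
        Filtered = Table.SelectRows(Sales, each [OrderDate] >= #date(2023, 1, 1)),
        Kept     = Table.SelectColumns(Filtered, {"OrderDate", "ProductKey", "SalesAmount"})

        // A step such as Table.AddIndexColumn(Kept, "Index") generally cannot
        // be folded, and over DirectQuery would have to be removed
    in
        Kept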

In a similar vein, you should also avoid calculated columns: over a DirectQuery connection each one must be evaluated inside the queries sent to the source, rather than being computed once at refresh.

Given these significant limitations, if you need to use DirectQuery the data must already be properly prepared, in the optimal shape for Power BI, at the source before you attempt to connect, and that means your DirectQuery data source should be a dimensional Star Schema model.

I am going to say that again because this point is critical to understand.

If you need to use DirectQuery, you cannot simply connect to your operational Transactional Database.  The data source you are connecting to via DirectQuery should be structured as a Reporting Database, and that means a Star Schema.

Transactional vs Reporting Database

The important thing to realise here is that there are two different types of databases: one is suitable for use with Power BI DirectQuery, and the other is not.

Transactional Database

A Relational or Transactional Database is designed to serve the specific purpose of inserting new data, updating existing data and deleting existing data (INSERT, UPDATE, DELETE).  These databases have what is known as a Normalised structure, which means they have low-to-no data redundancy.  In practice, wherever a data value would otherwise be duplicated, the data is broken apart into separate related tables, ensuring that every related table contains unique information.  This structure ensures that updating and editing the data in the database is quick and efficient.

In a nutshell, Transactional Databases are designed to capture and store unique transactions quickly and accurately.  In a large organisation, a Transactional Database could be expected to capture and store hundreds of unique transactions within seconds of each other.  This means the database needs a structure that is capable of supporting HIGH Frequency, LOW Complexity interactions.

The image below shows the relational model for the Adventure Works SQL Server database.  This is a Transactional Database and is not suitable for use in Power BI.

Transactional Db

Reporting Database

Whether you are importing data into Power BI or using DirectQuery, Power BI expects to be working with a Reporting Database, not a Transactional Database.  While there are variations on the structure and shape of a reporting database, the most common and easiest to work with in Power BI is the Star Schema.  The Star Schema separates business process data into facts, which hold the measurable, quantitative data about the business, and dimensions, which hold the descriptive attributes related to that factual data.
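As a minimal sketch of that separation, the Power Query (M) snippet below splits a hypothetical flat dbo.SalesFlat extract, in which product details repeat on every row, into one dimension and one fact. All names are illustrative; in a real DirectQuery scenario this work happens at the source (for example in the warehouse ETL), and in an Import model each table would normally be its own query, so the record returned at the end is only there to keep the sketch self-contained.

    let
        // Hypothetical flat extract with descriptive values repeated on every row
        Flat = Sql.Database("myserver", "SalesStaging"){[Schema = "dbo", Item = "SalesFlat"]}[Data],

        // Dimension: one row per product, descriptive attributes only
        DimProduct = Table.Distinct(
            Table.SelectColumns(Flat, {"ProductKey", "ProductName", "Category"})),

        // Fact: the measurable events, keeping just the key back to the dimension
        FactSales = Table.SelectColumns(
            Flat, {"OrderDate", "ProductKey", "Quantity", "SalesAmount"})
    in
        [DimProduct = DimProduct, FactSales = FactSales]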

The principal reason why Power BI is designed around a Star Schema comes down to the role the table relationships play.  In a normalised relational database, relationships between the tables do several different things depending on the task at hand.  In Power BI, a relationship between two tables exists for a single purpose: to pass filters from one table to another.  That’s it!

It is for this reason that the design of your model should have a structure that ensures this job can be performed quickly and efficiently, and that is most easily achieved when there is a single relationship through which the descriptive data filters the quantitative data.

In a nutshell, Reporting Databases are designed to report on those unique transactions quickly and accurately.  A reporting database is unlikely to have to return many separate answers within fractions of a second of each other.  It is far more likely to be asked for a single complex insight across thousands to millions of rows of unique transactions, returned within seconds of the question being asked.  This means the database needs a structure that is capable of supporting LOW Frequency, HIGH Complexity interactions.  The exact opposite of the Transactional Database!

Reporting Db
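To give a feel for one of those LOW Frequency, HIGH Complexity questions, here is a hedged Power Query (M) sketch that asks for total sales by product category over the hypothetical FactSales and DimProduct tables from the earlier sketches. Over a DirectQuery connection, steps like these fold into a single native query that the source executes.

    let
        // Hypothetical star-schema source used in the earlier sketches
        Source     = Sql.Database("myserver", "SalesDW"),
        FactSales  = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
        DimProduct = Source{[Schema = "dbo", Item = "DimProduct"]}[Data],

        // Follow the single fact-to-dimension relationship...
        Joined = Table.NestedJoin(
            FactSales, {"ProductKey"}, DimProduct, {"ProductKey"},
            "Product", JoinKind.Inner),
        Expanded = Table.ExpandTableColumn(Joined, "Product", {"Category"}),

        // ...then compute one complex answer across millions of rows
        ByCategory = Table.Group(Expanded, {"Category"},
            {{"TotalSales", each List.Sum([SalesAmount]), type number}})
    in
        ByCategory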

Final Comments

When you are designing the data model for your Power BI report, you will ultimately have to choose between Import, DirectQuery, or Live connectivity to the data source.  Before you can make that decision, however, you need a clear understanding of the current shape of the data source you intend to connect to.  If the data source is a relational database used in daily operations to capture and store data, it is unlikely to be fit for use in Power BI in its native state, and you will need to determine how to get it fit for use.  In most situations this will lead you to building an Import model, as that gives you the opportunity to transform the raw data into a Star Schema as it is loaded into Power BI.  Power BI then stores this data in the VertiPaq database, allowing the in-memory engine to analyse your data.  If you are not in a position to import and transform the raw data yourself, because you need to harness a DirectQuery connection, you will need to ensure that the owner of the data source has created an appropriate schema for Power BI to use.  Connecting directly to a Transactional Database is likely to result in poor performance, as well as making your DAX significantly more difficult to write.


3 thoughts on “Data Modelling for DirectQuery”

  1. Patrick Murphy

    Hi Jason, nice article. Like many of your readers and ‘Data Professionals’, the most difficult issue developers run into is using the de-normalized reporting data source for BI. IT won’t support or develop access to ‘Star Schema’ data, thus the developer must use it as is or re-normalize it for BI use. Continued discussions providing tools, how-tos, and guidance on going from the de-normalized source to one that is BI optimized would be a great new thread and greatly appreciated by all.

    1. Jason Cockington

      Hi Patrick,
      I’m glad you enjoyed the article.
      Yes. This is not an uncommon situation and is exactly why I wrote this article.
      The point to take away from the article is, if your data source is not in a star schema, you should not be chasing a DirectQuery solution, and should instead stick with the default Import model.
      As for your suggestion about a How-To Guide for de-normalising a Transactional Database or normalising a set of spreadsheets, this is a great idea, and I will definitely chalk it up on the ToDo list.
