I have been doing some work with a customer over the last couple of weeks preparing to produce a new data model. Part of the process is to get an extract of invoices (header detail and line item detail) from the mainframe so that I can easily access all the records I want in SQL Server, and then bring them into PowerPivot as needed.
As you can see in this example, there are more than 19 million rows of data (40 columns in total).
When I look at the file size in SQL Server Express, you can see it is 5.4 GB of data.
The next thing I did was import the entire database (19 million rows) into PowerPivot. As the data was importing, I started to wonder about something that became the title of this post – just how good is PowerPivot compression? Have a guess what the file size is for exactly the same data.
No, your eyes are not deceiving you. The PowerPivot version of exactly the same data is 424.4 MB. That is a compression ratio of 92.2% (the new file is just 7.8% the size of the original DB). I am sure you will agree that this is nothing short of amazing, and it is due to the way that PowerPivot leverages its xVelocity memory-optimised columnstore.
How does a Columnar Data Store Work?
The way xVelocity achieves such great compression is to store the data physically on disk (and in memory) in columns of data instead of rows. When data is stored in rows, there is often very little opportunity for compression. That is because the data in each cell in a row is often very different to the data in the cell next to it. However, when data is stored on disk in columns, the data in each cell in a column is often very similar, particularly if the column is sorted first. Let's take the example of a column that contains a Goods and Services Tax code. For this GST code, there are only 2 possible values, "Y" or "N". So if you sort this column of data before it is saved to disk, there will be massive compression saving opportunities.
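To get a feel for why a sorted column compresses so well, here is a minimal sketch of run-length encoding (RLE), one of the classic techniques a columnar engine can apply. This is purely illustrative – it is not the actual xVelocity algorithm, and the row counts are made-up numbers:

```python
def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return [(v, c) for v, c in runs]

# A GST column with millions of rows, sorted first (hypothetical split):
gst_sorted = ["N"] * 12_000_000 + ["Y"] * 7_000_000

# 19 million cells collapse to just two (value, count) pairs.
print(run_length_encode(gst_sorted))  # [('N', 12000000), ('Y', 7000000)]
```

Stored row by row, those 19 million "Y"/"N" values would sit scattered between unrelated columns; stored as a sorted column, they reduce to a couple of tiny run descriptors.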
Another truly amazing story about PowerPivot.