Johnny take a tour of your data, and tune…
…up your columns and your rows, to free up some room
You’ve been following some rules, with the former Vertipaq
Now you’re seeing some results, that you don’t understand
(Click image for the original music and lyrics)
Modifying Lyrics From U2’s Best Album Ever
Yes, “Achtung Baby” is U2’s best album ever. Yes, I am deliberately trying to pick a fight.
(“Former” VertiPaq? Why yes, Microsoft recently renamed the storage engine under the hood of PowerPivot to be xVelocity. Yes, VertiPaq was a better name. Yes, there were marketing reasons for the rename, to create an “umbrella brand” over many technologies. No, this does not help any of us out here in the real world. I wonder how long I will keep calling it VertiPaq. Probably a long time.)
An Exception to One “Rule,” and the Reinforcement of Another
Longtime readers of this blog have seen me tout the benefits of reducing columns in your data, even if that means adding a lot more rows (see also David cover it here).
Well, I recently discovered an exception to that “rule.” And that same data set also yielded a VERY dramatic example of the benefits of sorting your data prior to import.
Let’s cover one at a time. But first, some perspective.
Disclaimer: This Post is Probably Just Brain Candy
The data sets I am working with are in the tens of millions of rows and therefore don’t reflect “normal” usage for most people reading this (although I’d argue that is changing, to an extent). What I’m seeing in this data is very unlikely to repeat itself in smaller data sets.
That said, the impact of reducing columns (even at the expense of more rows) is not something you will typically notice with small data sets. If you only have, say, 20 thousand rows of data in a table, I would not worry too much about reducing column sets – you still shouldn’t import columns that you don’t need, but I wouldn’t go to great lengths to reshape your data set.
So file this post under “interesting brain candy that may be relevant someday” rather than “something you intrinsically need to know.” This is 90% fun, with a slice of 10% relevance. OK, time to dig in.
Original Data Set: Two Wide Tables
I had a two wide tables in my data. Yes they look VERY similar but trust me, they are different tables:
Total File Size: 561 MB
OK, Now Make Them Tall and Narrow!
I was naturally tempted to try importing these same tables as tall and narrow, by introducing an amount type column.
Total File Size: 1,207 MB
Whoa, it Doubled the File Size!
Yes, a normally-reliable trick has betrayed us here. Why?
Well, I have not confirmed this with my former colleagues at Microsoft, but I suspect it has something to do with VertiPaq storing data in 1M-row partitions.
That’s right – if you have more than one million rows in a table, they are not all compressed and stored in a single unit. Your data is first broken into partitions of one million rows each, and then each partition is compressed and stored as an individual partition (“chunk” in Rob-speak).
“Chunking” does reduce the effective compression rate. In some cases we would get MUCH better overall compression if all the data were stored in a single chunk, but that comes with other tradeoffs that I don’t have room to go into here.
Why Does the One Million Row “Chunk Size” Matter?
Well first of all, in the original “wide” data set, I had 45M and 44M rows. So, a total of 89 partitions/chunks.
In the new data set, I have 366 chunks. So each chunk would need to be 4x smaller than the original chunk size just to result in a similar file size.
Given that the file size doubled, we can guess that each new chunk was half the size of the original chunks.
Impact of Sort Order
Let’s take that knowledge of 1M row chunks and consider the impact of sort order as you import data.
My database is sorted by the leftmost columns in the table – Date, PeriodType, Product, etc.
That means my new AmountType column is basically unsorted – look at the pictures above and you will see that even in the tiny number of rows I snapshotted, AmountType is varying whereas all of the columns to the left display the same values over and over.
That means my AmountType column is very “fragmented” in each of the 1M row chunks, and I am not getting as much savings from the shape change as I would if it were not fragmented.
So, let’s try sorting the data by AmountType before importing it!
Sorting by Amount Type Decreases File Size by 35%!
Note that This Was Just Table #1
(I Deleted Table #2 Before Running This Sorting Test)
Marco and Alberto were the first people to bring sort order to my attention. In their experiment, they found a 25% savings on sorted versus unsorted data, but not much difference when they sorted by different columns.
In my case, the data was already sorted. Changing which column I sorted by, BEFORE import, made a 35% difference. So, your mileage may clearly vary.
Only Sort BEFORE Import Matters!
Just want to be overly clear here: once you import the data, sorting in the PowerPivot window has NO impact on file size. Actually, it has no impact whatsoever, except that it helps you navigate your data in the PowerPivot window.
Only sorting the data in the source, before importing into PowerPivot, has an impact.
In This Case, I Went Back to Wide Tables
All right, even with the 35% reduction from sorting, my resulting files would still be bigger than the original “wide” format.
I also ran some quick performance tests and both formats were returning my test pivot in about 4 seconds.
So in this one case, going tall/narrow didn’t help. But that won’t discourage me from trying it again next time. I am pretty sure this data set is an outlier – it currently only has two values in the date column for instance.
The moral of the story: the tall/narrow trick is generally reliable. But always test to make sure.