Speed: Another Reason to “Trim” Calendar Tables

February 25, 2014


An 11,000-Row Calendar Table Spanning from 2000 to 2030:
Most of the Time This is Harmless Overkill

A 60x Speed Improvement From a Most Ordinary Place

I’ve been doing some work lately for a client who really pushes the DAX envelope.  One of the top-three models I’ve ever worked on in terms of complexity, no doubt.  And really, my role is just to help fine-tune it and add a few bells and whistles.  They built this sucker themselves and I am way impressed.

Crazy stuff.  Formulas that use outlier dates from one Data table (“fact” table) to then filter another Data table (via its related Calendar table), but then wrap that up inside a MAXX inside a SUMX…  and it all makes perfect business sense.  It’s magic.

But speed ain’t its strong suit.  We tried all the usual tricks – “de-importing” unneeded columns, replacing calculated columns with imported versions, etc.

And it was still way too slow.  Then we tried something even simpler, and things got 60x faster.



Stars and Snowflakes and Bears, Oh My!

January 31, 2013

 
Stars and Snowflakes in PowerPivot.  Snowflakes are slow and evil.

A PowerPivot Model with “Single-Step” Relationships is a “Star,” and one with “Daisy-Chained” Relationships is a “Snowflake.”  Snowflakes are Often Slow.  And Evil.  Don’t trust them.
(Images taken from page 194 of the book).

“Hey, Why Did Things Get So Slow?”

Got a great question this week from Dave Boylan:

“I think I read in your book that you should always use dimension (lookup table) data in your slicers and page fields (even if the same data is in your fact/data table) because they behave better and perhaps use less memory.  This makes sense as my fact table has 2MM rows and the slicer has three distinct values.”

Yes – if you have a lookup table, you should use fields from that lookup table on your pivot, rather than their equivalent fields from your data table.  My recent post on using fields from your date table on the pivot is just one example of this.



PowerPivot Compression: Mysterious Ways

March 29, 2012

 

Johnny take a tour of your data, and tune…
…up your columns and your rows, to free up some room
You’ve been following some rules, with the former Vertipaq
Now you’re seeing some results, that you don’t understand
(Click image for the original music and lyrics)

Modifying Lyrics From U2’s Best Album Ever

Yes, “Achtung Baby” is U2’s best album ever.  Yes, I am deliberately trying to pick a fight.

(“Former” VertiPaq?  Why yes, Microsoft recently renamed the storage engine under the hood of PowerPivot to be xVelocity.  Yes, VertiPaq was a better name.  Yes, there were marketing reasons for the rename, to create an “umbrella brand” over many technologies.  No, this does not help any of us out here in the real world.  I wonder how long I will keep calling it VertiPaq.  Probably a long time.)

An Exception to One “Rule,” and the Reinforcement of Another

Longtime readers of this blog have seen me tout the benefits of reducing columns in your data, even if that means adding a lot more rows (see also David’s coverage of it here).

Well, I recently discovered an exception to that “rule.”  And that same data set also yielded a VERY dramatic example of the benefits of sorting your data prior to import.

Let’s cover one at a time.  But first, some perspective.

Disclaimer:  This Post is Probably Just Brain Candy

The data sets I am working with are in the tens of millions of rows and therefore don’t reflect “normal” usage for most people reading this (although I’d argue that is changing, to an extent).  What I’m seeing in this data is very unlikely to repeat itself in smaller data sets.

That said, the impact of reducing columns (even at the expense of more rows) is not something you will typically notice with small data sets.  If you only have, say, 20 thousand rows of data in a table, I would not worry too much about reducing column sets – you still shouldn’t import columns that you don’t need, but I wouldn’t go to great lengths to reshape your data set.

So file this post under “interesting brain candy that may be relevant someday” rather than “something you intrinsically need to know.”  This is 90% fun, with a slice of 10% relevance.  OK, time to dig in.

Original Data Set:  Two Wide Tables

I had two wide tables in my data.  Yes, they look VERY similar, but trust me, they are different tables:

[Screenshots of the two wide tables]

Total File Size:  561 MB

OK, Now Make Them Tall and Narrow!

I was naturally tempted to try importing these same tables as tall and narrow, by introducing an amount type column.

[Screenshots of the same tables reshaped tall and narrow, with a new Amount Type column]

Total File Size: 1,207 MB

Whoa, it Doubled the File Size!

Yes, a normally-reliable trick has betrayed us here.  Why?

Well, I have not confirmed this with my former colleagues at Microsoft, but I suspect it has something to do with VertiPaq storing data in 1M-row partitions.

That’s right – if you have more than one million rows in a table, they are not all compressed and stored as a single unit.  Your data is first broken into partitions of one million rows each, and then each partition is compressed and stored individually (a “chunk” in Rob-speak).

“Chunking” does reduce the effective compression rate.  In some cases we would get MUCH better overall compression if all the data were stored in a single chunk, but that comes with other tradeoffs that I don’t have room to go into here.

Why Does the One Million Row “Chunk Size” Matter?

Well first of all, in the original “wide” data set, I had 45M and 44M rows.  So, a total of 89 partitions/chunks.

In the new data set, I have 366 chunks – roughly 4x as many (366 ÷ 89 ≈ 4).  So each chunk would need to be about 4x smaller than the original chunks just to result in a similar file size.

Given that the file size doubled, we can guess that each new chunk was half the size of the original chunks.

Impact of Sort Order

Let’s take that knowledge of 1M row chunks and consider the impact of sort order as you import data.

My database is sorted by the leftmost columns in the table – Date, PeriodType, Product, etc.

That means my new AmountType column is basically unsorted – look at the pictures above and you will see that even in the tiny number of rows I snapshotted, AmountType is varying whereas all of the columns to the left display the same values over and over.

That means my AmountType column is very “fragmented” in each of the 1M row chunks, and I am not getting as much savings from the shape change as I would if it were not fragmented.

So, let’s try sorting the data by AmountType before importing it!
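If your data is coming from SQL Server, the easiest way to do that is to bake an ORDER BY into the import query itself.  Here’s a minimal sketch – the table name and column list are just stand-ins for my real ones:

SELECT
    [Date],
    PeriodType,
    Product,
    AmountType,
    Amount
FROM dbo.FactAmounts        -- hypothetical source table name
ORDER BY AmountType         -- sort by the low-cardinality column you want "de-fragmented"

PowerPivot receives the rows in that order, so each 1M-row chunk now contains long runs of identical AmountType values.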

Sorting by Amount Type Decreases File Size by 35%!


Note that This Was Just Table #1
(I Deleted Table #2 Before Running This Sorting Test)

Marco and Alberto were the first people to bring sort order to my attention.  In their experiment, they found a 25% savings on sorted versus unsorted data, but not much difference when they sorted by different columns.

In my case, the data was already sorted.  Changing which column I sorted by, BEFORE import, made a 35% difference.  So clearly, your mileage may vary.

Only Sort BEFORE Import Matters!

Just want to be overly clear here:  once you import the data, sorting in the PowerPivot window has NO impact on file size.  Actually, it has no impact whatsoever, except that it helps you navigate your data in the PowerPivot window.

Only sorting the data in the source, before importing into PowerPivot, has an impact.

In This Case, I Went Back to Wide Tables

All right, even with the 35% reduction from sorting, my resulting files would still be bigger than the original “wide” format.

I also ran some quick performance tests and both formats were returning my test pivot in about 4 seconds.

So in this one case, going tall/narrow didn’t help.  But that won’t discourage me from trying it again next time.  I am pretty sure this data set is an outlier – it currently only has two values in the date column for instance.

The moral of the story:  the tall/narrow trick is generally reliable.  But always test to make sure.


Another Angle to “The Three Seconds of Now”

March 15, 2012

 

Yesterday, Tom LaRock posted an insightful followup to my post on Moore’s Law and Inflation.

He’s a lot better at brevity than I am, so I encourage you to read it.  I really liked this part:

“…he took my idea and was able to quickly gather data, perform analysis, postulate his own theory, and return the entire results back in a very short amount of time.

This is the future for data professionals. It isn’t about racking servers. It’s about getting people access to the data they want, in a meaningful way, and quickly.”

Another Secret to PowerPivot’s Success

In my full post today, I described how the “three seconds of now” should guide your construction of reports.

But the things Tom said reminded me of something else, too:

When you are building something in PowerPivot, the line between “report” and “model” is perfectly blurred.  You transition between them seamlessly, very much in the moment of now.  You try something out on a report, see the results, and realize you need a new measure all within a 3-second window.

Because everything is (generally) so fast and seamless, you never leave that moment.  “Now” consumes your entire work session.  And because that suits our biology to a tee, the results are phenomenal.

“PowerPivot:  in tune with the rhythms of the soul.”  Too much?  Probably.  Anyway I am late for my drum circle.


Analysis in the Three Seconds of Now

March 15, 2012

 

“NOW”, by Alison Farmer
(Click to visit her site)

Our Perception of “Now” is Three Seconds in Length

For several years now this has been one of my absolute favorite paintings.  I liked it long before I knew the meaning, but when it was explained to me I was doubly hooked:  the string that she is holding is divided into three equal lengths:  three seconds.  The moment of “now.”

Apparently there is a lot of research indicating that three seconds is the “magic” window in MANY human and animal behaviors, from hugs to goodbye waves to babies’ babbling. 

Go read this quick article in Science.  It’s short, fun, and interesting.  I will wait.

Where I Used to See It

Why is Tommy Chong Sitting in Donald's Office???
I used to see this painting nearly every day, at Microsoft, in Donald Farmer’s office.  Yes, that Donald Farmer.  The one who works as a stunt double for Tommy Chong (or vice versa).  The genetic donor for PowerPivot Yoda.  And he is also the one who last year left Microsoft for QlikView.

Since he is married to the artist, he gets to use his office as a private gallery of sorts.  And he was kind enough to send me some pictures of the painting last night when I asked.

 

Why This is Relevant

Someone at Microsoft recently emailed me to ask my opinion: 

“For large PowerPivot workbooks, how long do you think users will expect to wait when they click a slicer?”

My answer was:

“They don’t care that there is a lot of data behind it.  If it isn’t fast, they won’t engage.  The limits of human patience are not the least bit sympathetic to our data volume problems.”

In other words, people expect the click to finish “now,” in three seconds or less, no matter what.  They don’t even think about it – this is biological.

Two Things Happen When Something Takes Too Long…

If a slicer click or related interaction takes too long, two things happen:

1. The user’s train of thought is broken while waiting on the click to complete.  Their mind wanders off topic and they often flip back over to email while they wait.  They sometimes forget to come back. 

They do not “commit” to the experience, do not get absorbed, and generally decide to remain “shallow” in their thoughts toward it.

2. If they grow to expect “long” wait times, they will ultimately decide not to click at all.  If they know that conducting a slicer click exploration of the data is going to take 15 seconds a click, and they may have to execute 10 clicks over the course of their exploration, they simply decide not to do it at all. 

Yeah, if they had invested that 2.5 minutes, they might have discovered something amazing or revolutionary in the data.  Tough.  Humans aren’t built for that.  They want their three seconds, even more than they want their MTV.

Let that sink in for a moment…

The speed of the report heavily impacts the quality of the thinking someone will do when using it. 

It also will impact whether they use the report at all.

It does NOT just impact how long it takes someone to get something done with your report.

Moral of the Story:  Make Your Reports Fast

If you are building interactivity into your reports for others to consume, don’t skimp on the speed.  If something is slow, here are a few simple tricks to consider:
 

  1. Slicer cross filtering is very often the #1 source of slowdown.
  2. Make your bigger tables tall and narrow.
  3. Don’t use calc columns in your big tables if you can avoid it (do the calc columns in the db if you have one – see the sketch just below this list)
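For #3, “do the calc columns in the db” usually just means wrapping your source table in a view that adds the computed column, so PowerPivot imports it as a plain (and cheaply compressed) column.  A minimal sketch – the table, view, and column names here are all made up for illustration:

CREATE VIEW dbo.v_SalesForImport AS
SELECT
    s.OrderDate,
    s.ProductKey,
    s.Quantity,
    s.UnitPrice,
    s.Quantity * s.UnitPrice AS ExtendedAmount   -- the calc that used to live in PowerPivot
FROM dbo.FactSales AS s;

Then point PowerPivot at the view instead of the raw table, and the “calc column” comes along for free.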

It may also be time for better hardware.  We certainly spend a lot of time on hardware selection at Pivotstream.  I like to say that every unit of time or money invested in speed pays off.  I never stop pressing for speed.

Gotta keep things in the moment of now.


SQL UNPIVOT Makes My Data Skinny

November 29, 2011

 

Guest post by David Churchward

If you’re anything like me, you will have eagerly sucked data into PowerPivot, keen to get cracking with solving all of your reporting needs.  If you’re an Accountant (amongst others) doing this, you will no doubt have brought your trial balance dataset in and found endless columns of values representing monthly values.  “How do I get time intelligence to work with this?” is probably one of the first issues you’ve encountered and then, further down the line, when you’ve conjured up complex measures to resolve this problem, the words “why can’t I get this to run quicker” pass your lips.  On this note, everyone should pay close attention to Less Columns, More Rows = More Speed! where Rob proves that the tall and skinny table is much more effective than a short and wide version.

In this post, I’ll explain how SQL can solve this problem by transforming the data during the extraction process, using a SQL operator called UNPIVOT – and it’s available to everyone, not just developers!

This is PowerPivot, using pivot tables – why on earth would I want to UNPIVOT?  Just like the kid who ties their shoelaces wrong, Mum has to untie them in order to do it properly!  The fact is that data transformation has to happen in many situations to leverage the full power that PowerPivot can provide.

What’s this SQL Stuff – Surely that’s for Developers?

I was reading Rob’s recent posts on Why PowerPivot is Better Fed From a Database and it made me realise that there’s a grey area that represents how far an Excel Pro can go in data transformation.  I’m going to leave the “Developer V Excel Pro” debate to Rob (but believe me, he’s right).  As an Accountant who learnt SQL but who is an Excel Pro through and through, my view is that Excel Pros should be able to transform their data to the extent that a SQL view can.

Some of you will be aware, but others won’t, that there is a SQL query editor in PowerPivot.  If I’m upsetting Developers by revealing this, I’m sorry, but, if you’re worried, please take this as an opportunity to provide your Excel Pros with a secure Dev environment!

Accessing The SQL Query Editor

You can access query editor either when you first import data or by editing a table that you’ve already imported.

Accessing Query Editor on First Import

  1. From the PowerPivot window, select From Database > From SQL Server and set up your connection as you would do normally.
  2. Click Next and select the option Write a query that will specify the data to import


Accessing Query Editor From a Table Already Imported

  1. Select the table that you want to affect in the PowerPivot Window and select Design > Table Properties
  2. Change from Table Preview to Query Editor


And there you have it.  The script that runs to import the data you’ve specified.  We can now edit this to transform the data.

Before I go ahead, I just want to set a benchmark regarding the data that I’m using.  When I’ve brought this data in, I’ve got 38,183 records using 14 columns.  We’ll see what the UNPIVOT does to this in the end.  Everyone’s expecting a lot more rows but fewer columns, right?


The SQL “UNPIVOT” Script

So, here it is, the script to use in Query Editor mode:

SELECT
    -- These are the fields that I want in my final dataset
    NCODE,
    NNAME,
    VALUE_CATEGORY,    -- NEW FIELD 1: the field that takes my value column names
    VALUE              -- NEW FIELD 2: the field that takes my values from those columns
FROM
    -- I specify where these fields come from
    (
    SELECT
        NCODE,
        NNAME,
        NTURNOVER_C1,    -- Value Field
        NTURNOVER_C2,    -- Value Field
        NTURNOVER_C3,    -- Value Field
        NTURNOVER_C4,    -- Value Field
        NTURNOVER_C5,    -- Value Field
        NTURNOVER_C6,    -- Value Field
        NTURNOVER_C7,    -- Value Field
        NTURNOVER_C8,    -- Value Field
        NTURNOVER_C9,    -- Value Field
        NTURNOVER_C10,   -- Value Field
        NTURNOVER_C11,   -- Value Field
        NTURNOVER_C12    -- Value Field
    FROM dbo.AZZ_V_POWERPIVOTPRO_TB    -- Substitute this with the name of your source table or view
    ) PIVOT_TABLE
-- Unpivot the table
UNPIVOT
    (
    VALUE FOR VALUE_CATEGORY IN    -- These are the names given to the value field (NEW FIELD 2)
                                   -- and the column-name field (NEW FIELD 1)
        (
        -- Enter value fields as per the SELECT above
        NTURNOVER_C1,
        NTURNOVER_C2,
        NTURNOVER_C3,
        NTURNOVER_C4,
        NTURNOVER_C5,
        NTURNOVER_C6,
        NTURNOVER_C7,
        NTURNOVER_C8,
        NTURNOVER_C9,
        NTURNOVER_C10,
        NTURNOVER_C11,
        NTURNOVER_C12
        )
    ) AS UNPVT
WHERE VALUE <> 0
ORDER BY NCODE, VALUE_CATEGORY

Note – the lines beginning with “--” are comments, included to provide some direction as you use this script.  You can copy them into the query editor too – PowerPivot will ignore them.

How to Use This Script

  1. Copy and paste the script into your query editor
  2. In the first SELECT, enter the field names that you require in your final dataset.  These should correspond to fields in the table, except where I have noted “NEW FIELD 1” and “NEW FIELD 2”.  These “NEW FIELDS” will carry the value elements that we are unpivoting.
  3. In the second SELECT, list the fields you already named in the first SELECT, together with the column names carrying the values that you will unpivot.  Also change the table or view name in the FROM clause to reference the source data that you are using.
  4. In the UNPIVOT section, detail all of the value fields again that you are unpivoting.
  5. In the ORDER BY clause, select the fields that you wish to use for dataset ordering.

When you save this, your dataset will have been narrowed and you’ll have a new field denoting the data type (or value column reference).


You’ll notice that we’re down from 38,183 records to 7,103.  This won’t always happen.  The reality is that we’re able to strip out the zero-value cells by using the WHERE VALUE <> 0 clause.  Any cell that was zero-valued simply doesn’t appear as a record in the new dataset, which is why we get fewer records in this case.  In any event, your data is now a lot more streamlined and ready for more efficient measures to be written.  You don’t have to include the WHERE clause if you prefer, for whatever reason, to bring all records in.  If that’s the case, just delete that clause.

One Further Step

To make this truly useful, we need to translate VALUE_CATEGORY into something more meaningful.  I would normally expect this to be a date, but most trial balances use column names such as “CURRENT1” so that the system isn’t date dependent.  You’d need another table that the VALUE_CATEGORY column can link to – one that translates each category into a real value (typically a date) – thereby opening up the whole time intelligence functionality.
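As a rough illustration only – the table name, categories, and dates below are made up, and yours will depend on how your trial balance labels its value columns – that translation table might look something like this:

-- One row per unpivoted value column, mapped to a real period end date
CREATE TABLE dbo.ValueCategoryLookup
(
    VALUE_CATEGORY   varchar(50) NOT NULL PRIMARY KEY,
    PERIOD_END_DATE  date        NOT NULL
);

INSERT INTO dbo.ValueCategoryLookup (VALUE_CATEGORY, PERIOD_END_DATE)
VALUES ('NTURNOVER_C1', '2011-01-31'),
       ('NTURNOVER_C2', '2011-02-28'),
       ('NTURNOVER_C3', '2011-03-31');   -- ...and so on through NTURNOVER_C12

Import that table alongside the unpivoted data, relate it to VALUE_CATEGORY, and the date column is then available for time intelligence.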

A Quick Warning

This UNPIVOT approach has given you the power to transform your data into a more streamlined dataset.  However, by using this approach you’ve created a local version of the script.  You won’t be able to reuse it without copying and pasting it into other workbooks (or at least I haven’t found a way).  Those Developer friends of ours, though, can give you a centralised version that can be controlled and reused more easily.  If you ask these guys nicely, they might create a view or stored proc to do this for you.  I’ve found with my team that salt-based snacks, chocolate and the occasional Spag Bol for lunch does the trick and, if it doesn’t, I go ahead and do it anyway!
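If they do take the bait, the view version is essentially the script from above with CREATE VIEW wrapped around it, minus the ORDER BY (which isn’t allowed inside a view).  Something along these lines – the view name is made up, and the source table is the illustrative one from my script:

CREATE VIEW dbo.v_PowerPivotPro_TB_Unpivoted AS
SELECT
    NCODE,
    NNAME,
    VALUE_CATEGORY,
    VALUE
FROM
    (
    SELECT
        NCODE, NNAME,
        NTURNOVER_C1, NTURNOVER_C2, NTURNOVER_C3, NTURNOVER_C4,
        NTURNOVER_C5, NTURNOVER_C6, NTURNOVER_C7, NTURNOVER_C8,
        NTURNOVER_C9, NTURNOVER_C10, NTURNOVER_C11, NTURNOVER_C12
    FROM dbo.AZZ_V_POWERPIVOTPRO_TB
    ) PIVOT_TABLE
UNPIVOT
    (
    VALUE FOR VALUE_CATEGORY IN
        (NTURNOVER_C1, NTURNOVER_C2, NTURNOVER_C3, NTURNOVER_C4,
         NTURNOVER_C5, NTURNOVER_C6, NTURNOVER_C7, NTURNOVER_C8,
         NTURNOVER_C9, NTURNOVER_C10, NTURNOVER_C11, NTURNOVER_C12)
    ) AS UNPVT
WHERE VALUE <> 0;

Every workbook can then import from that one view, and any future changes happen in a single place.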


Less Columns, More Rows = More Speed!

August 1, 2011

 

“Stay close!  Come together!  Staggered columns!  Staggered columns!

I mean…  less columns!  More rows!”

-General Maximus Speedicus Queryus

 
 

A Long-Held Belief, Quantified

For a long time now I have been advocating that everyone minimize the number of columns in their large tables.  At a minimum, you shouldn’t import columns unless they are going to be used.

But I also recommend that if you have a lot of numerical columns in a table, it’s often better to replace them all with a SINGLE numerical column, and add a second column for “type.”  Even though that results in many more rows, the theory is that PowerPivot’s VertiPaq engine will return pivot results more quickly against the “tall, narrow” table than against the “short, wide” table.

I’ve simply “known” that to be true, in my bones, for a long time.  I was so certain of it, in fact, that I have never bothered to test a “before and after” case until now.

We had an opportunity this weekend to quantify the “before” and “after” impact of making such a change, so I thought this time I’d record the results and share them.

The Original Data

PowerPivot Wide Table Lots of Columns

With that data shape, the [Sales] measure needs to be the sum of a specific column:

PowerPivot Measure Wide Table

And then, a very large pivot reporting against this data ran in 5.7 seconds.

OK, so there’s our baseline.

The Taller, Narrower Table

PowerPivot Tall Table Less Columns

Whereas the original table used a single row to store all 9 numerical values, this new table stores each numerical value in its own row, and tags each row using the Value Type column to identify which numerical value it is.

(Note that in this new shape, zero values can be completely omitted if you’d like.  And we did, which is why there are not 9 * 700k rows in the resulting narrow table – we just dropped rows that were going to be zeroes.)

I then added a very small table that “decodes” the ValueTypes into something I can understand, and related it to the tall table:

PowerPivot Tall Table Lookup Table

Lookup Table – Helps Write More Readable Measures

Each row in this lookup table corresponds to one of the original numerical columns in the Wide table, of course.

I then wrote a simple measure for [Amount]:

PowerPivot Amount Measure Tall Table

Base Measure – Just a Raw Sum of the Value Column,
Serves as a Basis for Other Measures

And then the Sales measure can simply be a filtered version of Amount:

PowerPivot Measure Tall Table

[$ Sales] = The [Amount] Measure Filtered to “Ttl_Dollars” Type Rows

The Results

I have to admit, I was a bit nervous – I had colleagues watching me, and they knew I was making a precise comparison and was going to share the results.  Would this turn out as expected?  Would it be faster?  I experienced a moment of doubt as I was about to click the mouse and time the query.  I mean, we were looking at nearly 7x as many rows in the new set.  I was starting to sandbag my own expectations, telling myself it would likely be the same.  I should not have doubted.

It ran in 2.4 seconds, more than twice as fast as the original 5.7 seconds.

Boom.  It was worth it, big time.  Even more impressive since there is definitely some fixed-cost overhead involved in Excel sending the query, populating the grid with thousands of rows of results, etc.  There may even be a full second of overhead involved, in which case raw query time went from 4.7 to 1.4 seconds.

Why Faster?  Will This ALWAYS Work?

Remember that VertiPaq is a COLUMN storage engine, and it compresses the data VERY efficiently.  (I’ve covered this topic briefly here as well).

With that in mind, here are two quick answers.  The long answers would be too long for this post :)

1) By removing a lot of the “noise” presented by the 9 numerical columns, this change enabled VertiPaq to do a much better job compressing the columns that remained.  And that compression translates into better query speed as well.

2) No, I do NOT expect that this trick will always speed things up.  It all comes down to “cardinality,” or how many unique values per column.  And I’m pretty sure that the three columns dedicated to “Units” in the original data set had lower cardinality than the SKU column for instance.  If we hadn’t had those Units columns, I suspect this change would NOT have been faster.

As a side note, the file size of tall and narrow was NOT smaller than short and wide.  I’ve mentioned many times that file size is a ROUGH guide to performance but not 100%.  Maybe we’ll discuss this in the comments section, because this post has run too long already.

No, I don’t think this will always work.


Two tips from “the Old Country”

June 10, 2011

   

“Don Corleone, we have some information for you.”

For a while now I’ve been meaning to share a few tips from the two gentlemen I call “the Italians” – Alberto Ferrari and Marco Russo.  I reviewed their book a while back, if you recall.  Over the past several months they have discovered a few things that I think you will appreciate.

 

 

Tip 1 – <CTRL>-Scroll Wheel to zoom the measure window

OK, we all know that the formula measure editor window uses a small font:


PowerPivot Measure Editing Font is Small

Alberto discovered that if you hold down the <CTRL> key on your keyboard, and scroll the mouse wheel, you can increase the font size!


Bigger Font in the PowerPivot Measure Editor

Neat huh?  Funny thing is, this does not work on all of my computers.  I use different mice, different versions of Windows, etc. – I scramble all of the variables.  So it’s hard to know why it doesn’t work everywhere.  (In fact, the two screenshots above were taken on different machines).

While we’re on the topic, you may also want to consider writing your measures *outside* of this dialog, so take a look at this post on using Notepad++ written by Colin Banfield.

Tip 2 – Sort data before it’s imported to improve perf

Another Alberto discovery:  Believe it or not, when PowerPivot imports data, the sort order of the incoming data actually has an impact.

The original post is here.  Bottom line:  you can shave 25% or more off your workbook size if the data coming in is sorted ahead of time, and Alberto’s results suggested that it doesn’t matter much which column you sorted by.

NOTE:  We are NOT talking about sorting data AFTER it’s been imported.  We mean sorting it during the import process.  Sorting after import has ZERO effect.

Tip 2a – Smaller workbooks are faster workbooks

He also noted that pivot tables built on those smaller workbooks are faster than pivot tables built on the workbooks created from unsorted data.

That 100% matches my expectations, because PowerPivot’s compression is not just an on-disk compression – that compression is also maintained when the data is loaded into memory, AND is used to speed up queries!

So I have been telling people for a long time:  smaller workbooks are generally faster workbooks.  You can use workbook size as a rough indicator of how fast it will be.

Tip 2b – The column you choose DOES matter, if…

At Pivotstream we recently had a SQL table that just refused to import.  PowerPivot basically just locked up every time we tried to import it.

The table was sorted by a column with millions of distinct values (the dollars column).  When we sorted it by basically ANY other column, it imported just fine.

And this was only a 1.5M row table.  It seems that certain data sets can just chew up massive amounts of RAM during import if they are not sorted properly.

This discovery was what triggered me to go back and share all of the tips above, so it was a fortunate find.


Buying a PowerPivot Server? Talk to Me First!

March 2, 2011

PowerPivot Servers - A Joint Collaboration Between Pivotstream and um... a leading technology partner :)

PowerPivot *IS* Fast… but takes some tuning

Most of us have seen the demos of 100 million rows in PowerPivot, and slicer clicks taking well under a second against those sources.  Super slick.

And whenever I first load a large data set like that and start working with it, I do indeed experience sub-second response times.  But once I go and build sophisticated measures, and put, say, 5-6 slicers on the report, things start slowing down, sometimes by a lot.

When you think about everything PowerPivot is DOING in those cases, it’s staggering.  But the report consumers don’t care – they want speed, not explanations.

So at Pivotstream, we spend as much effort tuning our models and reports for speed as we do creating them in the first place.  We’ve gotten much better at that over time, so it’s almost second nature to us now.

Multiple Times Faster.  AND More Cost Effective.

The quest for speed didn’t end with our modeling and reporting techniques though.  Remember, we’ve been at this for over a year now, and eventually we started to brainstorm and experiment on hardware platforms.  The results have been surprising – in a very good way.

Being a relatively small company, we needed help in this quest, so we approached a leading technology provider for assistance.  We weren’t sure what sort of response to expect – would they understand / take it seriously / appreciate the opportunity?  And even if so…  would our collaborative investigation yield results?

Yes, yes, yes, and BIG YES.

As a technology professional, this has been one of the most stimulating experiences of my entire career.  Real science fiction, cutting edge experimentation…  which is why for now, I’m going to refer to our partner as Cyberdyne.  (This is all so fresh that there are still some legal details to tidy up before I can disclose their true identity).

Anyway, bottom line:  what we’ve developed is both multiple times faster than off-the-shelf server hardware…  AND at the same time, almost certainly more cost effective.

Take the PowerPivot Pepsi Challenge
Our Version of the Pepsi Challenge :)

I’m quite confident that our Black Box (as we are calling it today) will blow the doors off of literally ANYTHING you can get your hands on through any other channel.

So, before you sink big money into PowerPivot server hardware, talk to me.  There’s a good chance that I can save you a lot of time and money.


PowerPivot Scheduled Refresh Pt 3: Thin Workbooks

February 1, 2011

 
PowerPivot Thin and Core Workbooks

What is a “Thin” Workbook?

At Pivotstream, we came up with the term “thin workbook” to describe an Excel workbook that does not contain a PowerPivot model, but that DOES connect to another workbook that does, which we call a “core” workbook.  In other words, thin workbooks use core workbooks as their data source.  Others have called this “hub and spoke,” and I may even like that phrasing better, but hey, we’ve been using “thin and core” at Pivotstream for nearly a year and it’s hard to change.

So, a thin workbook contains pivots, charts, cube formulas, etc. – it’s a report-only workbook.  And whereas our core workbooks may be 200 MB or more, our thins are often 5 MB or less.

Why create thin workbooks?

Here are a few reasons why you may want to use a thin workbook approach.

1) To avoid duplicating data.  If you want to create multiple workbooks that offer different views of the same data, which we do all the time, you probably don’t want to duplicate the PowerPivot data across all of those workbooks.  That duplication needlessly burns RAM on the server, takes extra time to run scheduled refreshes, and also increases your maintenance work (ex:  if you want to modify a measure, you have to do it multiple places instead of one).  I *highly* recommend you never duplicate PowerPivot models like that, and thin workbooks are a good alternative.

2) To lock down data.  In many ways this is another flavor of #1.  At Pivotstream we often create one workbook that contains a full set of measures, and then another that omits certain measures that are particularly sensitive.  Both can use the same core workbook as a data source, but if you properly lock down your permissions on SharePoint, you can give one set of users a different level of data than another set.

3) To avoid the refresh on open performance problem.  In the last post about refresh, I pointed out that the PowerPivot refresh service on SharePoint kinda “cheats” a little bit – it refreshes the data in the PowerPivot model, but the data in the pivots (the data in the Excel sheets themselves) is untouched.  To work around that, PowerPivot for SharePoint sets the “refresh on open” flag, forcing Excel to refresh all pivots whenever the workbook is opened.

That can be slow.  In many cases in our work at Pivotstream, it is VERY slow.  Since the PowerPivot refresh service does NOT touch the thin workbooks, “refresh on open” never gets set.

Of course, that creates another problem doesn’t it?  The data in the Excel sheets then remains stale, with no automatic way to refresh it.  That’s no good.

But we are busy little beavers at Pivotstream and we don’t give up easy :)  We started working on this problem in March 2010.  More on this later.

How do you create thin workbooks?

The first step is to publish your “Core” workbook to SharePoint.  You CANNOT connect a thin to a core sitting on your desktop – the desktop version of the PowerPivot engine does not support cross-workbook connections like that.

Once you have your Core published, there are two primary ways to create a thin workbook:  Start from Scratch, or Hack like Mad.

Boring:  Start from Scratch

In this method, all you do is go to the PowerPivot Gallery and click the highlighted button:

Creating a Thin PowerPivot Workbook from the Gallery 
…and that gets you started in Excel, in a new workbook, with a blank pivot connected to the Core model.

More Fun:  Hack Like Mad!

For those of you who prefer to color outside the lines, here’s another technique that I often find helpful.  You’ve created a single Core workbook, and it has lots of report sheets in it.  You don’t want to start from scratch and throw all of those report sheets away, right?

So here’s what you do:

  1. Create a copy of the workbook.  Just copy/paste the file in Windows Explorer.  Name one file something like Core.xlsx and the other Thin.xlsx
  2. Delete pivots from the Core workbook.  Two reasons for this.  One, it reduces confusion – when you open the Core, you know it’s the core because you don’t see any report sheets.  But deleting pivots also makes the scheduled refresh process on SharePoint run faster – after PowerPivot refreshes the model on the server, it then needs to refresh the thumbnails in the gallery.  To do that, it asks Excel Services to open the workbook, which triggers “refresh on open.”  Why pay extra time and CPU penalties on reports no one is going to see?
  3. Publish the Core workbook to SharePoint.
  4. Change the connection string in the Thin workbook.  Open it up in Excel, and in Connection Properties change the “$Embedded$” string to be the full URL to the core workbook on SharePoint.  Example:

Changing the Connection String to Point to a Published PowerPivot Workbook

When you click OK, Excel performs a full refresh against the new source.  Be patient.  Once that is done, your thin no longer connects to the PowerPivot data embedded in its own file, it now queries the server copy instead. 

Of course, now you probably want to make the file smaller and get rid of all that unused PowerPivot embedded data.  Remember where that lives?  This post explains in more detail, but here’s the image of the thin workbook renamed from XLSX to ZIP:

Embedded Data - One Reason Why PowerPivot Will Not Work With Excel 2007

Item1.data is the file you want to get rid of.  But if you delete it, you will corrupt the file.  So I just create a 0-byte Item1.data file on my desktop and copy it over the top of the file inside the ZIP.

Rename back to XLSX and you have a true “Thin” workbook with all of the original report sheets intact and functional!

Summary, and a hint at the next post

Thin workbooks are a VERY useful tool in a PowerPivot system.  For Pivotstream, they are absolutely essential.

The only trick, as mentioned above, has been getting those thin workbooks to always be up to date.  I’m happy to say we have that problem solved.

I’m pretty sure others are going to need this sort of thing, too, as well as some other internal tools and components we have built.  If you have interest in beta testing our PowerPivot Accelerators, as we are calling them, watch for the next post :)


Got a high end CPU? Please help w/ an experiment!

January 23, 2011

If anyone out there has a really high end CPU and wants to participate in a VERY quick experiment, please drop me a note – rob at pivotstream (dot) com.

Will take you five minutes.  And as a bonus, I will explain to you why it is relevant to PowerPivot – a glimpse behind the scenes, weeks in advance of the eventual blog post :)

May even have a gift or two to line up for you as well.

To be clear, a high end processor is one that ranks near the top of these rankings:

http://www.cpubenchmark.net/high_end_cpus.html

I want to conduct this experiment in the next 24-48 hours.

UPDATE:  results so far are a bit surprising.  One of the top-end procs is one of the worst performers so far.  The current champ is a mid-range, affordable proc that was just released in the past two weeks, based on the new “Sandy Bridge” architecture from Intel.

UPDATE #2:  More results trickling in, updated the table below (click for larger version).  I am going to leave this open for another day or two, so please get in touch if you would like to participate.  You will need 64-bit Office 2010 and 64-bit PowerPivot.

UPDATE #3:  New results and information as of Tuesday morning, Jan 25.

[Results table – click for larger version]

If you have a CPU not on that list, or one on that list and in a different RAM config (or overclocked), please leave me a comment or drop me an email – rob at pivotstream (dot) com.

Thanks!


PowerPivot Scheduled Refresh – Twists & Turns, Pt 1

January 3, 2011

Skipper Plans A PowerPivot Refresh Strategy
“Rico, we’ll need special tactical equipment. We’re going to face extreme peril. The private probably won’t survive.”

OK, with a new year it’s time to get serious on the blog.  (Yeah, nothing says “serious” like cartoon penguins, but really, I *do* mean it).  This is gonna kick off a series of inter-related posts that I’ve had in my head for a long time now.  So buckle up, Rico.

 

 

Let’s say you’ve built a bunch of nifty PowerPivot workbooks.  You’ve published them to SharePoint.  And naturally, now you want to set things up so those workbooks automatically refresh, with the latest data, every night (or perhaps once a week).

As a recap, the simplest way to get started is to go to the PowerPivot Gallery view and click the highlighted button:

PowerPivot Gallery Schedule Refresh Button

Which leads to this page:

PowerPivot Gallery Schedule Refresh Page

Refresh ASAP vs. Refresh Once

Most of the options in that scheduling page are quite straightforward.  The only one that’s ever tripped me up is “Once.”  “Once” does NOT mean “Now.”  Instead, “Once” typically means refresh will occur tonight – either after business hours or after your specified earliest time.

If you want a refresh to occur now, you either need to specify a time (from today) that has already passed, OR don’t use “Once” at all.

Instead, you can schedule a refresh to occur Daily/Monthly/Weekly, and then check the “Also refresh as soon as possible” checkbox.

Personally, I think the lack of a “One time, right now” option is an oversight.  As a workaround, I often find myself scheduling something to run once every 12 months just so I can check the “ASAP” checkbox.

Another Way to Turn “Once” into ASAP

At Pivotstream, we have found another way to schedule ASAP refreshes:  we’ve changed the centralized definition of “business hours” to be so narrow that there are essentially no business hours at all.  (Right now we have biz hours set to be 7-7:15 AM – I haven’t tried setting Start Time = End Time, but a 15 minute window is narrow enough for our purposes).

The downside of essentially disabling Business Hours, of course, is that it impacts everyone.  That may not be a good option for your organization, as a refresh scheduled for “after business hours” will end up running sooner than expected.  It works well for us at Pivotstream however.

For details on how to modify Business Hours, see Vidas’s item in the FAQ:  http://powerpivotfaq.com/Lists/TGPPF/DispForm.aspx?ID=51

Behind the Scenes – the Basics

When a scheduled refresh does occur, a lot more happens than meets the eye.  This rough diagram illustrates the steps behind the scenes:

What Happens During PowerPivot Refresh - The Basics

(Click for Larger Version)

Notes on the diagram

I won’t repeat the descriptions on the diagram above, so spend a couple of minutes going over it ok? :)

With that done, I do want to call out a few points:

  1. The timer service only checks about once per minute – so if you schedule an ASAP refresh but nothing happens immediately, just be patient, it will start shortly.
  2. You can set different schedules per connection – I just discovered this recently, thanks to the watchful eyes of a training/consulting client of mine.  George and company, you rock :)
  3. The whole process will fail if the original file is checked out, or otherwise open for editing, during Step 6.  So it’s extra important that you not leave your published PowerPivot workbooks checked out.  I recommend editing locally on your desktop for this reason (and also because SharePoint is not terribly speedy for open/save of large files…  and you should be saving often.)

Refresh is Actually TWO Refreshes

OK, one last piece of background in this post.  When PowerPivot runs a refresh, it is refreshing the PowerPivot model – which corresponds to the contents of the PowerPivot window in Excel client.

In order for report consumers to see the latest and greatest data when they first navigate to it, however, the pivots in the Excel sheets themselves ALSO need to be refreshed.  So there are TWO refreshes that need to happen – we’ll call the first one “PowerPivot Refresh” and the second one “Excel Refresh.”

PowerPivot Refresh vs Excel Refresh
PowerPivot Refresh and Excel Refresh

OK, that’s enough for part one.  Next time I’ll explain how this two-stage refresh can complicate things in some situations.

Click Here for Part Two >>

In the meantime, if you are interested in further reading on scheduled refresh, you can check out the following articles:

http://msdn.microsoft.com/en-us/library/ff976569.aspx (Mariano’s excellent whitepaper)

http://powerpivotgeek.com/2010/09/08/a-peek-inside-getting-the-most-from-data-refresh/

http://blogs.msdn.com/b/powerpivot/archive/2009/11/20/powerpivot-data-refresh.aspx