PowerPivot Refresh: CPU/RAM Spike at Completion

December 6, 2010

 
I mentioned this in a recent post, but I figure a picture illustrates it better.  This is a screenshot of task manager as a PowerPivot refresh completes.  This one ran for about 30 minutes, so this represents just the tail end of the process:

PowerPivot Data Refresh CPU and RAM Spike 

Note the two highlighted points – one of the 4 CPU’s pegged at 100% for awhile – and while the other 3 did go “quiet” at the very end, they WERE active for part of the time that the fourth CPU was at 100.  So – CPU gets precious near the end of refreshes (after hovering around 50% for much of the process).

Even more notably, RAM usage spiked by nearly 2 GB!  The PowerPivot file in question was 1.45 GB on disk when it was complete.  And that workbook was still in RAM after refresh completed, so the 2 GB spike was pure overhead during the final compression process.

(Side effect:  Even if you have enough RAM to load a workbook, that does not mean you have enough RAM to refresh it.)

Allocate your server RAM and CPU around the refresh process, folks.


Hardware/Capacity Planning: From the Trenches

November 30, 2010

So How Many Servers Do I Need in My PowerPivot SharePoint Farm?

As people increasingly move up from just dabbling with the addin, and decide to start leveraging the publish/schedule/share/secure benefits of the PowerPivot for SharePoint infrastructure, I am getting this question more frequently.

How many servers?  How much RAM?  And less often but just as important…  how many CPU’s?

Time for an old joke:  “Ever hear the one about the statistician who drowned in a river that was, on average, only 3 inches deep?”

That’s the whole joke.  And as time goes on, I only find it funnier.  And funnier.  But first, some basics.

The Three Primary Server Roles

For the vast majority of PowerPivot deployments, you will mostly need to concern yourself with three server types:  Excel Services, PowerPivot Engine, and Data Source DB Servers:

PowerPivot Server Roles Summary For RAM and CPU Planning 
(Click for Larger Version)

A few notes on the diagram up front:

  1. I recommend viewing the larger version – there is information on the diagram (in the notes) that I will not repeat below.
  2. Don’t read too much into the 3/4/2 ratio of Excel Services/PowerPivot/SQL – I included different numbers of servers in each tier intentionally, to illustrate that you can scale each tier out at independently.  But, for instance, I do NOT expect you will need 3 Excel Services boxes for every PowerPivot box (you will likely need less).
  3. Although separated above, server roles CAN be combined onto single servers – for instance, many folks can get by with an “all in one” server where everything above lives on a single box.  Also, even in a multi-server farm, one of the MS-recommended configs is to combine Excel Services and PowerPivot onto a single box, and then deploy as many of those combined boxes as needed.
  4. There are some elements missing from the diagram – for instance, there is a Web Front End (WFE) role, and a SharePoint Configuration Database role, but in my experience, PowerPivot does not put unique strain on those elements (except for config db disk space).  Your mileage may vary of course.

Early Planning Efforts

Back in the Spring, I set down to the very serious task of “how much hardware will we need on our SharePoint servers in order to handle user loads?”  And naturally, I made a very serious spreadsheet to model it out:

PowerPivot SharePoint Hardware Planning Spreadsheet v1

Wow, what a spreadsheet.  I mean, I even used the =POISSON.DIST() & =BINOM.DIST() functions – clearly, such spreadsheet horsepower indicates accurate results!

Turns out, that spreadsheet was nothing more than an amusing theoretical exercise with little bearing on the real world.

There were multiple problems with that spreadsheet:

  1. It relied on outright guesses as to how often/how intensely consumers would utilize the published reports.
  2. I was not yet aware of the CPU-gobbling power of slicers, a power that makes real-world queries many times more CPU-intensive.
  3. Most importantly, it assumed peak usage would occur in the morning, when consumers are most in need of fresh information.

Solution for 1) and 2)

These are in some sense the simplest to address.  Quite simply, put up a pilot solution and observe the usage characteristics.  You will learn a lot about usage patterns as well as what that does to CPU and RAM.

But even better:  you may choose to simply ignore these factors for now and focus on problem 3 instead.  Here is why:

Peak Load is Probably Experienced During Scheduled Refresh

If you’ve looked at the larger version of the diagram above, you have seen a hint as to what I’m about to say here: 

At Pivotstream, our nightly refresh process puts far more strain on our servers than our users do, and is what we now plan our hardware around.

I will explain further, so you can evaluate whether your situation will be similar.  We receive new data on a nightly basis (typically late at night).  That data gets ingested into SQL server, and then our PowerPivot refresh process begins.

Once PowerPivot refresh begins, we have about 6 hours, tops, to get all models and reports refreshed, so that when business opens in the morning, everyone has access to fresh insights.  6 hours sounds like a lot…  until you discover that it isn’t.

A lot happens during scheduled refresh!

Keep in mind that the PowerPivot model (the embedded database living inside your PowerPivot workbooks) must first be refreshed – let’s call this phase “Model Refresh.”  This can put a lot of strain on your SQL servers just in terms of supplying the data, since PowerPivot v1 pulls a fresh copy of all tables (no incremental refresh).

As the data flows into the PowerPivot servers, a LOT of CPU power goes into compressing that data into the storage format.  During the refresh process, RAM usage steadily climbs as well.

As model refresh nears completion, RAM usage spikes upward quite a bit, sometimes by as much as 50-100% the size of the resulting workbook.  CPU usage also spikes during this phase.

Once the model is done refreshing, PowerPivot for SharePoint then triggers a refresh (via Excel Services) of the Pivots and Cube Formulas in the workbook, so that when thumbnail screenshots are “snapshotted” for the report gallery, those reflect the latest data (let’s call this phase “Pivot Refresh”). 

Yes, that process initiates via Excel Services, but that refresh immediately results in a meaty set of queries sent back to the PowerPivot servers for processing.  So, CPU usage spikes again.  And the model in question is held in RAM while that happens, so that RAM can’t be recycled into the pool to be used for model refresh.

Remember, this all happens for a single workbook.  And if you hang a bunch of report-only workbooks off of a single “hub” model workbook (as we do all the time), the amount of time it takes to finish Pivot Refresh can actually exceed the time for Model Refresh.

If you have to cram all of this into a narrow nightly window, chances are that you will need more hardware to pull it off than you ever will need for normal daily usage!

Rough Guidelines for Your Own Situation

This post is running a bit long, so I will try to be succinct in closing:

  1. If a significant percentage of your PowerPivot models and reports will be refreshed nightly, refresh is likely going to be a larger peak strain than normal interactive usage
  2. Even if you only refresh weekly (or less often), if the execution window between “data is ready” and “reports must be ready” is basically still a single night, then refresh is still likely going to drive your peak hardware need.
  3. If you think “Peak Load = Refresh Process” is likely the case for your organization, I suggest ignoring interactive usage projections during your hardware planning process, and instead developing some prototype workbooks (with rich data sets and mutliple report sheets with 100% realistic slicer sets, measure complexity, etc.), and then putting those workbooks through the refresh process to get a baseline.

Contest: Many Charts, Shared Slicers, Fast Queries

September 20, 2010

 
Excel Pros Searching for a Worthy Challenge

“Shooting space garbage is no test of an Excel Warrior’s mettle!

I think it is appropriate to start this post with the following disclaimer: 

You do NOT have to be able to win this contest in order to be a monster at PowerPivot.

Think of this as brain candy.  Something that opens your eyes to some of the high-end power available when you merge PowerPivot calcs with native Excel calcs.  A 300-level technique.

***CONTEST UPDATE***  Check here for 3 small rules clarifications.

The Goal:  One Slicer Sheet Impacting Many Report Sheets

This is a pretty common desire.  If the report consumer wants to see ALL of the data broken out by the same set of filters, why make them repeat those filter selections on every single sheet of a workbook?

Isn’t it better, in those cases, to give them a single set of “master” slicers on a single sheet like this?

Slicer Home Page 
Master Slicers Sheet
(Sensitive Data Redacted as Always)

There are 14 slicers on that sheet.  The user can select the department, class, etc. that they care about up front.

Once they have made their selections, in this particular workbook they then have dashboard sheets that look like this:

Dashboard Sheet
Multi-Chart Dashboard Sheet Fed from PowerPivot 

We have several sheets like that in this one workbook, with 20-30 charts on each.  All charts react to the slicers on the Master Slicers sheet.

But you’ll also notice that 4 slicers appear on this sheet, too.  Those are duplicates of 4 of the slicers on the master sheet.  They are the same exact fields as the corresponding 4 on the master sheet, so they are always in sync – set them on one sheet and all other sheets reflect the same selections.

That is merely a convenience – we figured that most slicer selections would be made once, but that users would appreciate being able to bounce around the calendar a bit with the time slicers, without having to switch sheets.

There are also about 20 sheets in the workbook that look like this:

Full Chart Sheet
1 of 20 Single-Chart Sheets Fed from PowerPivot

Again, with the four time slicers repeated for convenience.

The Snag:  Slicer Cross-Filtering Performance

80 unique charts and 14 unique slicers.  What could go wrong?

Well, as I documented in the post on slicer cross-filtering, this is roughly the equivalent of having 80 * 14 = 1,120 pivots all updating at once in response to a single click.  (Maybe it’s not quite that bad, but it DOES get VERY complicated).

And that is very slow, even if the performance of any single pivot is super-fast.

But our report that we built at Pivotstream is fast.  It does NOT bog down on slicer cross filtering performance.  So, how did we do it?

The Contest:  Find Efficient Techniques!

After we built this workbook at Pivotstream I realized that our technique could be refined quite a bit, made simpler.  And while we are already doing that, I realized:

THIS IS AN EXCELLENT CONTEST FOR EXCEL PROS!

I still have two MSDN Subscriptions to hand out.  These basically are unlimited, free MS software for your own personal use, so they are a pretty damn good prize.

The top two submissions will win those.  Runners-up will win PowerPivot architecture posters from Denny Lee, assuming he has not run out :)

Rules Schmules!
 

  1. Use lots of charts, and have them all respond to a shared set of slicers
  2. Entries must include your PowerPivot workbook, plus an explanation of what you did
  3. Use any data source you want, but please use non-sensitive data and enough rows that we can spot performance differences
  4. Use at least 6 slicers, with cross-filtering enabled
  5. You do NOT have to use PivotCharts, but you can
  6. Macros can be used to design the report if you find that helpful, but should NOT be part of the update process when I click a slicer.  (Authoring time is ok.  Run time is not.)
  7. Entries will be judged on, in roughly descending order:
    1. Performance – I will compare query speed with slicer cross-filtering turned on vs. turned off.  Smaller differences are better for this contest.
    2. Polish – as I slice the report, do the charts all still look nice?  Ex:  Blank space in charts = not good.
    3. Ease – how hard is it to execute your technique?  We will turn winning entries into blog posts, and this will become a new reference technique here at PowerPivotPro.
    4. Originality – this never hurts but is not crucial.  Fun is good.  Unexpected benefits and features are even better.

Have fun :)


Slicers and pivot update performance

July 2, 2010

-Posted by Rob Collie

 
CHI0000402_P

OK, I was sitting at the 2nd-day keynote down in New Orleans last month, watching Amir Netz do a demo on some PowerPivot goodies we can expect in the next release.

He was showing off a 2 Billion row model, and its amazing performance, and while that was eye-opening, something he said in passing caught my attention in a big way.

 

In an effort to explain that what we were seeing (split-second pivot performance against said 2 Billion rows), he said the following:

“Now look at this report!  It has 4 pivotcharts and 6 slicers, each of which issues 2 queries whenever I click a slicer, meaning this report is actually querying the 2 Billion rows twenty times!”

Each slicer issues 2 queries against the data source????  It was revelation time.  I kinda missed the next couple minutes of what Amir was saying, as I digested the implications for our work at Pivotstream.

And, I was kicking myself for forgetting this, because I had once known most of this in Redmond: 

Slicers can, if used improperly in a report, end up slowing a report down by a factor of 5, 10, or more.

Tip #0:  Don’t overthink this!

Hey, if your report is fast, don’t obsess.  Move on to something else.  It is not worth monkeying around with your set of slicers to trim half a second.  Remember, slicers are the difference between report consumers loving your work and dismissing it as just more nerdy junk.

When a client recently told us “Reports are dead, now we have Pivotstream!” there was no way we would have received that reaction without slicers.

But every now and then you will find yourself with a report that doesn’t operate quickly enough for your purposes, and slow response times can drain the value and utilization out of a report quickly (well, slowly I guess).

In those cases, one of the first places you should look is your usage of slicers.

Why do slicers issues 2 queries each?

Well, I won’t go into great detail here, because  1) I don’t know every detail   and 2) Vidas covered the tech details quite well here.

What I will give you, though, is an intuitive sense of what a slicer has to do.

First of all, it has to populate itself.  I don’t know how often it checks to see if something has changed in the underlying model – I’ll have to run some tests.  But it’s something to think about:

Tip #1:  Are you using a field from your measure table as a slicer?

Say you have a 2 million row table, and one of the columns is Date.  When you park that Date column on a slicer, you are demanding that the slicer populate itself from all of the distinct values of the Date column.  That doesn’t happen for free – the PowerPivot engine is gonna have to work pretty hard just to generate that distinct list, and it might have to do that every time you interact with the report (click a slicer, change a page filter, etc.)

So, this is yet another reason to consider using a separate, smaller table that just contains all unique dates, and then relating that back to your large table.  Then you can use the field from the smaller table in your slicer, and PowerPivot won’t have to burn so much time doing something unnecessary.

Cross-filtering:  the incredibly useful but sometimes crippling feature

You know that cool feature of slicers where, as you make selections, other slicers update to show which tiles are valid (clickable) in those conditions, and which tiles are not?

For instance, consider this example:

PowerPivot Slicers and Cross Filtering

OK, in week 7 (of all NFL seasons combined in this data set), there were 8 receiving touchdowns scored by players weighing 179 pounds or less.

Now check out the FullName slicer – I have not clicked anything in it myself, but all player names have been disabled except four.  That’s because those four players are the only ones under 179 pounds who caught TD’s in week 7.  Cool!

Of course, the identification of those four players doesn’t come for free.  To drive that point home, I’m just gonna add a second measure to the pivot:

PowerPivot Slicers and Cross Filtering Depends on Measures

Hey look!  Now there are eight players enabled.  That’s because there were four players who weighed less than 179 pounds and recorded receiving yards in Week 7.

Stated more generally, any player who has data for ANY of the measures in the pivot, in the conditions dictated by the other slicers, will be enabled.

What does this mean to us?  It means quite a bit, actually, once you understand how it works.

Under the hood:  what Excel does w/ slicers during a pivot update

I’m sure I will miss a detail or two but that’s not the point.  The point is to give you an intuitive understanding.

  1. Excel takes all of your slicer selections, plus the layout of your pivot (rows, columns, measures), bundles all of that up into a query, and sends it off to the PowerPivot engine
  2. The data comes back and Excel can populate the pivot with numbers at this point
  3. But now Excel still needs to determine, for each slicer, which tiles to enable.
  4. So for every single slicer, Excel does the following:
    1. Takes all of the selections from every other slicer, bundles those up into a query with the measures from the pivot…
    2. …And then adds the field of the current slicer to the query, unfiltered, as if it were on the Row Labels axis of a pivot
    3. Sends that whole query off to PowerPivot, waits for a reply
    4. When the data comes back, only the values of that slicer field for which at least one measure has data, will be in the result
    5. Excel ignores the measure values returned and uses the list of returned slicer field values to enable/disable tiles
    6. Excel then moves onto the next slicer and repeats steps 1-5

That’s a lot huh?

Turning that knowledge into action

There are many tips we can derive from that understanding.

Tip #2:  Reduce the number of slicers you use on a pivot

From the above, you can tell that the query being sent for a given slicer is absolutely on par with the query that is used to populate the body of the pivot with data.  In fact in some cases, that slicer query can be MORE time-consuming that the pivot itself!

So, if you remove a slicer from a report that your consumers rarely use, you can take a HUGE chunk off of the time it takes for the report to respond to user interaction.

Furthermore, every time you add a slicer to a report, you also make the queries issued for the other slicers more complex, so removing a slicer might make the other slicers’ queries speed up as well.

Tip #3:  Watch out for slicers with lots of tiles

I have not confirmed this but feel confident enough to share it anyway:  I’m pretty sure that slicers with long lists of tiles are more expensive to update than those with a small number of tiles.  Customer Email Address, for instance, is probably much more impactful on performance than Gender.

Tip #4:  Reduce number and complexity of measures

Just something to think about.  A measure that uses the FILTER function, for instance, is going to be a lot more time-consuming to crunch than a straight SUM.  And you might already be 100% cognizant of that.

The point here is that when you add one such complex measure to a pivot, you are not running that measure once per report interaction.  You are running it once for the pivot AND once for each slicer, so it adds up faster than you think.

Tip #5:  Use a report filter instead of a slicer

Traditional report filters don’t have this cross-filtering behavior, so they don’t impact performance in the ways outlined above.  You can use report filters without fear of slowing down your report.

But report filters are ugly, clumsy, and they scream Windows 3.1 – if report filters were so great, we never would have built slicers in the first place.

So I have a better suggestion…

Tip #6:  Disable cross-filtering for slicers that don’t need it

Here ya go.  Right click a slicer and click Slicer Settings.

In the dialog, you can uncheck the highlighted checkbox:

Slicer Settings Disable Cross Filtering Items with no Data

…and now you won’t get cross-filtering anymore.  All players will be enabled for example:

PowerPivot Slicers and Cross Filtering Disabled

Well, for a list like Players, that may not be the best idea.  There are thousands of them, and cross-filtering is REALLY useful for helping me narrow down the list and find the players I want.

Then again, having a field with thousands of values in it as a slicer runs contrary to Tip #3, so maybe it’s not such a good idea to have such a slicer in the first place.  Can you live without it?  It’s worth asking yourself.

Tip #7 Consider using other, shorter fields for slicers

Sometimes I like to use the same field in a slicer and in the pivot itself:

Same Field on Slicer and on Row Labels PowerPivot

This is just a lot more convenient for limiting the set of players (sometimes) than using the Row Labels filters.

But again, this can incur performance cost, especially for long lists, which is precisely where you may be tempted to “double park” a field like above.

The alternative that’s worth thinking about:  maybe a field that’s just the first letter of the player’s last name would be sufficient.  Or the team they play for.  Or whatever.  The idea is to reduce the number of tiles, per tip #3, and sometimes you can get what you want by using a different field.