SQL UNPIVOT Makes My Data Skinny

November 29, 2011

 

Guest post by David Churchward

If you’re anything like me, you will have eagerly sucked data into PowerPivot, keen to get cracking with solving all of your reporting needs.  If you’re an Accountant (amongst others) doing this, you will no doubt have brought your trial balance dataset in and found endless columns of values representing monthly values.  “How do I get time intelligence to work with this?” is probably one of the first issues you’ve encountered and then, further down the line, when you’ve conjured up complex measures to resolve this problem, the words “why can’t I get this to run quicker” pass your lips.  On this note, everyone should pay close attention to Less Columns, More Rows = More Speed! where Rob proves that the tall and skinny table is much more effective than a short and wide version.

In this post, I’ll explain how SQL can solve this problem by transforming the data on the data extraction process using a SQL function call UNPIVOT and it’s available to everyone – not just developers!

This is PowerPivot using Pivot tables – why on earth do I want to UNPIVOT.  Just like a kid who ties their shoelaces wrong, Mum has to untie in order to do it properly!  The fact is that data transformation has to happen in many situations to leverage the full power that PowerPivot can provide.

What’s this SQL Stuff – Surely that’s for Developers?

I was reading Rob’s recent posts on Why PowerPivot is Better Fed From a Database and it made me realise that there’s a grey area that represents how far an Excel Pro can go in data transformation.  I’m going to leave the “Developer V Excel Pro” debate to Rob (but believe me, he’s right).  As an Accountant who learnt SQL but who is an Excel Pro through and through, my view is that Excel Pros should be able to transform their data to the extent that a SQL view can.

Some of you will be aware, but others won’t, that there is a SQL query editor in PowerPivot.  If I’m upsetting Developers by revealing this, I’m sorry, but, if you’re worried, please take this as an opportunity to provide your Excel Pros with a secure Dev environment!

Accessing The SQL Query Editor

You can access query editor either when you first import data or by editing a table that you’ve already imported.

Accessing Query Editor on First Import

  1. From the PowerPivot window, select From Database > From SQL Serverand set up your connection as you would do normally.
  2. Click Next and select the option Write a query that will specify the data to import

image

Accessing Query Editor From a Table Already Imported

  1. Select the table that you want to affect in the PowerPivot Window and select Design > Table Properties
  2. Change from Table Preview to Query Editor

image

image

And there you have it.  A script that is running to import the data that you’ve specified.  We can now edit this to transform the data.

Before I go ahead, I just want to set a benchmark regarding the data that I’m using.  When I’ve brought this data in, I’ve got 38,183 records using 14 columns.  We’ll see what the UNPIVOT does to this in the end.  Everyone’s expecting a lot more rows but fewer columns, right?

image

The SQL “UNPIVOT” SCRIPT

So, here it is, the script to use in Query Editor mode:

SELECT

– These are the fields that I want in my final dataset

NCODE,

NNAME,

VALUE_CATEGORY, – NEW FIELD 1 – The field that takes my value column names

VALUE – NEW FIELD 2 – The field that takes my values from those columns

FROM

– I specify where these fields come from

(

SELECT

NCODE,

NNAME,

NTURNOVER_C1, – Value Field

NTURNOVER_C2, – Value Field

NTURNOVER_C3, — Value Field

NTURNOVER_C4, – Value Field

NTURNOVER_C5, – Value Field

NTURNOVER_C6, – Value Field

NTURNOVER_C7, – Value Field

NTURNOVER_C8, – Value Field

NTURNOVER_C9, – Value Field

NTURNOVER_C10, – Value Field

NTURNOVER_C11, — Value Field

NTURNOVER_C12 – Value Field

 

FROM dbo.AZZ_V_POWERPIVOTPRO_TB –Substitute this with the name of your source table or view

) PIVOT_TABLE

– Unpivot the table               

UNPIVOT

(VALUE FOR VALUE_CATEGORY in        – These are the names given to the

       — value field (NEW FIELD 1) and value column name (NEW FIELD 2)

(

– Enter value fields as per the select above

NTURNOVER_C1,

NTURNOVER_C2,

NTURNOVER_C3,

NTURNOVER_C4,

NTURNOVER_C5,

NTURNOVER_C6,

NTURNOVER_C7,

NTURNOVER_C8,

NTURNOVER_C9,

NTURNOVER_C10,

NTURNOVER_C11,

NTURNOVER_C12

)

) AS UNPVT

WHERE VALUE <> 0

ORDER BY NCODE, VALUE_CATEGORY

Note – green text within the script are comments to provide some direction as you use this script.  You can copy these into query editor too – PowerPivot will ignore them.

How to Use This Script

  1. Copy and paste the script into your query editor
  2. In the first SELECT, enter the field names that you require in your final dataset.  These should correlate to fields in the table except where I have noted “NEW FIELD 1” and “NEW FIELD 2”.  These “NEW FIELDS” will carry the value elements that we are unpivoting.
  3. In the second SELECT, detail the fields that you have already detailed in the first select together with the column names carrying the values that you will unpivot.  Also change the table or view name in the FROM clause to reference the source data that you are using.
  4. In the UNPIVOT section, detail all of the value fields again that you are unpivoting.
  5. In the ORDER BY clause, select the fields that you wish to use for dataset ordering.

When you save this, your dataset will have been narrowed and you’ll have a new field denoting the data type (or value column reference).

image

You’ll notice that we’re down from 38,183 records to 7,103.  This won’t always happen.  The reality is that we’re able to strip out the zero value cells by using the WHERE VALUE <> 0 clause.  If you’re carrying a number of cells that are zero valued, we won’t see that item as a record in this new dataset.  As a result, we get fewer records in this case.  In any event, your data is now a lot more streamlined and ready for more efficient measures to be written.  You don’t have to include the WHERE clause if you prefer, for whatever reason, to bring all records in.  If that’s the case, just delete that clause.

One Further Step

To make this truly useful, we need to translate VALUE_CATEGORY into something more meaningful.  I would normally expect this to be a date but most trial balances use column names such as “CURRENT1” so that the system isn’t date dependent.  Another table would be required that the VALUE_CATEGORY column can link through to which translates that value, thereby opening up the whole time intelligence functionality.

A Quick Warning

This UNPIVOT approach has given you the power to transform your data into a more streamlined dataset.  However, by using this approach you’ve created a local version of the script.  You won’t be able to reuse this without copying and pasting it into other workbooks (or at least I haven’t found a way).  However, those Developer friends of ours can give you a centralised version that can be controlled and reused more easily.  If you ask these guys nicely, they might create a view or stored proc to do this for you.  I’ve found with my team that salt based snacks, chocolate and the occasional Spag Bol for lunch does the trick and, if it doesn’t, I go ahead and do it anyway!


Why PowerPivot is Better Fed from a Database, Pt2

November 24, 2011

 
image

Think of a Database as a Bodybuilding Supplement
for Your PowerPivot Regimen

In part one I covered the following three benefits of pulling data into PowerPivot from a database, as opposed to using other sources such as Excel itself:

  1. Data-shaping – much easier in a database than in basically any other tool, this is the first benefit you will see.
  2. Auto-refresh – by removing manual data shaping from the system, you can then rely 100% on PowerPivot’s server-side automatic refresh to keep your reports up to date.  The first time in “Excel history” that Excel pros aren’t slaves to their own spreadsheets.
  3. More quality, less errors – the combination of PowerPivot’s “portable formulas” and performing your data shaping in a database greatly reduces the opportunity to make mistakes.

OK, time for benefits 4-7.

Benefit #4:  Complex Calc Columns

Let’s say you want to stamp your customers into quintiles – the top 20% in terms of dollars spent, the bottom 20%, etc.

And you don’t want that to be dynamically re-calculated every time you slice or filter a pivot – you just want it statically calculated according to total overall sales, so that you can then use the quintile as a slicer or row/column axis in your pivot.  You want to be able to offer a pivot report like this:

image

To do that, you need a calculated column in your Customers table:

image

But that’s a reasonably tricky calculated column, as it has to be calculated per-customer, against their matching rows in the Sales table, and then ranked against other all other customers.  If you do that in Excel, you’ve introduced another manual step which foils auto-refresh and in turn introduces more chances for error, etc. 

And doing it in PowerPivot is awkward, at least in v1.  I’m not saying you can’t do it in PowerPivot, because you can, and that DOES auto-refresh.  But it’s just so much easier in a database, if you have someone around who knows databases, and there are more complex examples than this one which get even harder in PowerPivot.

Benefit #5:  Centralized, Re-Useable Logic

This one is both its own distinct benefit as well as a multiplier for other benefits (such as data-shaping and complex calcs).  Anything you do in Excel is manual.  Anything you do in PowerPivot is “tied” to the workbook in which you did it.  But logic that you put in a database can be re-used across many different workbooks.

Which, of course, is very nice.  No need to re-write it every time.  No chance that it accidentally diverges from other models (a source of error).  And if you need to modify that logic in the future, you only have to do it in one place.

At this point it’s worth pointing out that a PowerPivot BI environment DOES share some characteristics with a traditional BI environment.  These db-centric benefits are a hallmark of traditional BI, and these are the things you want to preserve from the traditional approach.

Benefit #6:  “Magic” Sliding Windows and Parameters

This is one of my favorites.  Oftentimes, an organization operates according to a calendar, or even multiple calendars, that are not quite the same as the calendar hanging on your wall.  And that leads you to a place where your measures like “growth versus prior year” can get pretty complex.

I won’t go into great detail here because it would take awhile, but imagine a slicer driven from a table that is purely calculated in the database before input:

image

To get a sense of some similar techniques, which revolve around using “unconnected” slicer tables as inputs, see this post and this post.

Benefit #7:  Compression

It’s a fact:  imported columns compress much better in PowerPivot than calculated columns.  So if you have a choice between doing a calc column in your db (and then importing it) or importing a table and then adding that calc column via a DAX calc column in PowerPivot, DO IT IN THE DATABASE!

Better compression means smaller files, less memory consumption, and often faster response times to slicer clicks.  (Note that the bigger the table, the bigger the difference you will see).

Again, deserving of its own post in the future.


Delayed today, but post still coming

November 24, 2011

It’s Thanksgiving today in the US, and I’ve been taking it a bit easy this morning.  Plus my wife talked me into playing ice hockey with her last night and I did a lot of this last night:

image

Anyway, part two of “why PowerPivot is better fed from a database” is coming later today.  The blog is on an eleven-week streak of consistent Tuesday/Thursday posts and I’m not about to mess that up Smile


Why PowerPivot is Better Fed From a Database, Pt 1

November 22, 2011

An Excel Pro’s Two Year Journey in Database Land

In a post last week I mentioned that when you use a real database as a source for PowerPivot, a number of unexpected doors open up.  As a longtime Excel pro who has spent the past two years working closely with a database team, I can tell you that it’s been an eye opening experience.

So I have been advocating to Excel/PowerPivot pros for quite some time that they cozy up to their database teams (if they have them).  Of course, this works both ways, and I have also been advocating to Database/BI pros that they embrace and cooperate with Excel/PowerPivot pros.

Awareness, Flexibility, and Sometimes…  Outright Resistance

It’s an interesting ambassadorship.  On one hand, Excel pros are often accustomed to doing everything on their own, with no external help whatsoever, and so it’s mostly a matter of opening their eyes to benefits they’ve never imagined. 

But the other camp often doesn’t expect to learn anything new and relevant about the role of Excel, that unruly little brother of “real” BI.  In conversation, that camp splits into two subgroups rather quickly – one that is willing to consider new ideas and possibilities, and another that regards Excel as the devil.  Interactions with that second group can get ugly.  I also tend to fear for them a bit.

Today, I’m going to focus on the Excel crowd and explain why a db pro can make a huge difference in their lives.  While I have danced around the topic for two years on the blog, this will be my attempt to provide a definitive list of the benefits.

And you die-hard Excel haters out there, don’t worry, I’m coming back around to you soon Winking smile

Benefit #1:  Data Shaping is Easier in Databases.  MUCH Easier.

Hey, Excel is just a collection of individual cells when you come right down to it.  Cells go into formulas, and cells come out.  But turning a wide and short table into a tall skinny table (or vice versa) for instance can consume the better part of an Excel pro’s day in some cases.  That same operation might take a db pro 5 minutes or less.  And even better, the next time might take them no time at all…

Benefit #2:  Auto-Refresh!

Once the db pro has a script in place (or query or view or sproc or whatever it is that they deem best), it can now be run automatically in response to your refresh request from PowerPivot. 

If you are running PowerPivot on your desktop, hey, now it’s just one click (refresh) to pull in the latest, properly-shaped data.  But even better, if you have access to a PowerPivot-enabled SharePoint server, you can put your workbook up there and schedule it to refresh itself!

Even if the manual shaping that you do today only takes you a few minutes each time, it’s still worth it to outsource it to a db.  The SharePoint option means you don’t have to do anything each day to merely update the reports, and the difference between “small” and “nothing” is huge.  Like, “you can go on vacation without repercussions” huge.  Or “you don’t have to come in early each day” huge.

Benefit #3:  Quality

Did you know there are entire conferences devoted to the topic of spreadsheet errors?  Spreadsheet errors are a fact of life in traditional spreadsheets, but they are primarily due to specific problems that a good PowerPivot “ecosystem” eliminates.  One is the lack of convenient named reference – “what did D$14 refer to again?” crosses the inner monologues of Excel pros worldwide millions of times a day.  PowerPivot fixes that – everything is referenced by table/column/measure name.

But raw repetition is the real killer.  If you perform the same spreadsheet task every day for a year, pure statistics tells us you will make mistakes.  And if the task is tedious, you will make even more.

The lack of “portable formulas” in traditional spreadsheets is an underappreciated source of repetition.  Even a single iteration of a reporting task explodes into repetitious subtasks and provides lots of opportunity for error.  (PowerPivot fixes that one too).

So we are left with raw repetition – performing the same task every day – as our primary source of error.  Once you have PowerPivot, data shaping is the biggest source of repetition.  Outsourcing that shaping logic into a database, then, doesn’t just save time.  It prevents mistakes. 

Why?  Because the db logic is written only once, and it doesn’t change no matter how many times you run it.  Even if there were mistakes made in the db logic, you will catch them sooner or later (usually immediately), and once you fix them, they stay fixed.

Come back Thursday for part two, with items four through seven.


Comparing Access to PowerPivot

November 17, 2011

 
I don't think the artist had data apps in mind, but this pic rocks!

Here’s a question that comes up with increasing frequency:  “PowerPivot seems kinda similar to Access in many ways, what’s the difference?”

Why Does the Question Come Up?

Some of you are thinking “yeah, I can see why people would ask that.”  And others of you are thinking “WHAT???  That questions makes ZERO sense, they are NOTHING alike!”

You are both right.

The biggest reason why the question comes up, I think, is the longstanding symbiosis between Access and Excel – Access as data source, Excel as analysis tool.  Remember, Excel used to be limited to about 64,000 rows of data.  And many Excel pros learned to import large data sets into Access rather than Excel, manipulate and prep the data in Access, and THEN import from Access into Excel for PivotTables, charting, etc.

With Excel expanding from 64k to 1M rows in 2007, that tradition has already begun to fade, to an extent.  But even for data sets that fit into Excel, there is still a good reason that drives Excel pros into Access:  VLOOKUP is slow.

VLOOKUP = Excel Acting Like a Database, and Driving Folks to Access

Even for large data sets, arithmetic calculations in Excel can be blindingly fast.  After all, Excel is designed for that.  But VLOOKUP, and its more advanced cousin INDEX/MATCH, is not arithmetic.  It’s a search – “go find me a value that looks like X, and when you find it, return value Y from the same row.”

Even when you’re dealing with row counts merely in the thousands, that can get slow in Excel.  Because “search and retrieve” is what databases are designed for.  And Excel is not a database.  When it comes to finding values, Excel isn’t terribly more efficient than Word.  (A risky thing for me to say, I expect to be corrected in three…  two…  )

But since many data sets inherently “arrive” as multiple separate tables, you can’t avoid trying to splice them together, and that means VLOOKUP, or using a real database product.  I know the SQL snobs will say that Access doesn’t qualify, but Access IS a real db.

An Understandable Question

With that in mind, it’s easy to see why longtime Excel pros see their first PowerPivot demo, and come away asking this question.  At a high level, this is what they see:

Access Versus PowerPivot

OK, so what’s the answer?

PowerPivot Does Things That Access Will Never Do

You had me at DAX MeasuresThe first time you write a “Sort By Slicers” or an “Iffer-Blanker” or  a Set of Greater/Less Than Slicers or a Custom Calendar Running Total or even just worked with ALL(), you know this isn’t Access.  Oh, and you are doing that in Excel pivots, not in a separate window.

There are other benefits of course, for sure, but DAX Measures are a gamechanger.  Let’s leave it at that for now.

And Yes, You Can Replace “Access as Data Source” With PowerPivot*

The two biggest reasons that drove Excel pros into Access in the past are in fact alleviated with PowerPivot.  No 64K row limit in PowerPivot.  No 1M row limit either.  I commonly demo a 300M PowerPivot row workbook on my laptop!  You can load a lot more data into PowerPivot than you can into Access.

And VLOOKUP isn’t something you even need anymore in PowerPivot.  Got multiple tables?  Fine!  Leave them as separate tables, link them via relationships, and you are done.  That’s not even just a convenience – leaving them as separate tables is actually even just better, for many reasons.

All of that “go find me a match in another table” stuff is taken care of by PowerPivot.  Lightning fast in fact.

But whoa there, I put an asterisk on that statement above.  For good reason.

PowerPivot Goes Better with Databases!

image

Database –> PowerPivot –> Excel = Happy Happy

Fact is, if you work in an organization that uses SQL Server or another industrial strength database product, you are better off connecting PowerPivot to that db.

Or more accurately, you will get even more out of PowerPivot if you have the cooperation of a database professional.

Why is that?  I can (and will) write many posts on that.  For now let’s keep it simple and just point out two reasons:  1) Databases are inherently a very good place to do data “shaping,” which is not something you can do at all in PowerPivot.   and 2) Databases are great places to perform complex row-wise and cross-row business calcs.  They centralize those calcs for re-use, often take the db pro 5 mins to do versus much longer for you, and result in faster and more compact workbooks than if you use calc columns.

I’m not saying you need a db pro to get amazing things out of PowerPivot.  But there’s another level even beyond amazing, and it opens up when you cooperate with a db pro.

Final Note:  Be Thankful PowerPivot Wasn’t Built By Office

I say this because it would have been questioned to death.  It’s hard to imagine, but as a product like PowerPivot is taking shape at Microsoft, no one is really sure how to describe it yet, or even what it’s going to turn out to truly be.  A hundred people in the Office org would have had the same question – are we cannibalizing the Access business, and there would have been as much time spent answering that as designing the actual product.

Now, in hindsight, no one in Office is worried about that.  Access always had a much bigger mission than carrying around data for Excel.  And Access’s current mission has evolved quite a bit from what it was even a few years ago.

But those inevitable nagging questions early on would have saddled PowerPivot with a number of “thou shalt not cross this line” concessions.  Concessions which ultimately were not needed, and that would have hurt the product.

Gives new meaning to the term “Office Politics.”


The Ultimate Date Table

November 15, 2011

 
image

“Looks like it’s time for me to get myself a date.”

-Ace Ventura, PowerPivot Detective

The Importance of a Date/Calendar Table

I get a lot of questions from people who are struggling with the time intelligence functions in DAX.  And nine times out of ten, the answer is that they don’t have a proper date table.

I know it’s tempting.  You’ve got your sales table, and hey, there’s a Date column in there!  So you use it, and pass that column as a parameter to, say, DATESBETWEEN, or DATEADD.

Sometimes that will give you an error.  And other times, it won’t…  but the results will be funky.

You need a separate Dates table, or perhaps you prefer to call it a Calendar table.  A separate table, whose only purpose is to store dates (and the properties of dates, like DayOfWeek, etc.)  And it contains consecutive dates – no “gaps.”  Even if your business is never open on weekends, you need unbroken ranges of dates.

Oh, and then you need to relate it to your Sales table.  (Or whatever fact/measure tables you have).

Much More Than a Single Column

A single-column table that contains merely dates is enough to make the time intelligence DAX functions operate smoothly.  But you will almost certainly want other fields too.  Like Year.  MonthName.  DayOfWeek.  The list goes on.

Maybe something like this:

image

And yes, you can cobble this together on your own in Excel.  Tedious work though.

Would You Like One for Free?  Try DateStream from Boyan Penev!

Imagine just being able to open up PowerPivot and always having three nice date tables awaiting import:

image

That’s what Boyan Penev has put together for you.  Three great calendar tables that you can download directly into PowerPivot, for free.

He published them to Azure DataMarket, a service from Microsoft where data providers can actually sell you their data sets – things like weather, demographics, etc.

Boyan did this for free though – I suspect half as a service to the community, and half as a project to learn how to provide a service on DataMarket.

It’s pretty damn cool, and really, the story should end there.  If you’ve used DataMarket before, then it DOES end there.  Go get the date tables and try them out. 

But if this is your first exposure to DataMarket, it takes a few minutes to get it set up.  It’s not bad as long as you don’t make the mistakes I did.

How To Get It – Short Version

Hey, it’s on Azure DataMarket.  The URL is in the next section below, or you can just go to Azure DataMarket and search on “DateStream.”

DataMarket is going to be a wonderful service someday, but right now it has a few warts, so there is a Long Version too.

How To Get It – Long Version with Occasional Snarky Commentary

Step 1:  Go to the DateStream page on DataMarket.

image

Step 2:  Get confused.  OK, now is where things get choppy, because frankly, the DataMarket site itself has a terrible user interface.  I sent a full page of feedback to the DataMarket team about a month ago and as far as I can tell, they ignored it.  (Which is pure karma – I used to be one of the people at MS who ignored 90% of the feedback coming in, and now I get to be the one who is ignored).

I don’t want this to be a tutorial on how to navigate their website, or even how NOT to design a website.  So let’s just hit the highlights and try to get to Boyan’s date tables as soon as we can.

Step 3:  Get an account.  OK, this isn’t bad.  Another MS site that requires a Live ID.  Most of us have three of those by now.

Step 4a:  Scan the DateStream page looking for the “Download to PowerPivot” button or link.

Yeah that’s right.  There is no such link – you can get to one by navigating a few levels deeper but I’m going to skip that.  Don’t despair though, good things await you!

Step 4:  Find the URL of the DateStream Service

This is NOT the same as the URL of the DateStream page.  But it IS displayed on the page.  Here’s the URL you need:

image

And here again in text:  https://api.datamarket.azure.com/BoyanPenev/DateStream/

OK, copy that.  You will need it.

Step 5:  Launch PowerPivot, Go Into the PowerPivot Window

And click this button:

 

image

If you don’t have that button, you need a newer version of PowerPivot.  Go get that from PowerPivot.com and resume the next step.

Step 6:  Fill in the Dataset URL From Step 5

image

Step 7:  Account Key

See that last text box in the picture above?  The one with the long code in it that I’ve partly blurred?  That’s my account key.  I highly recommend clicking that Find button.  It’s actually pretty damn useful.

Be careful – the DataMarket site has TWO long nasty codes like that for you.  One of them is the one you want, and the Find button takes you to that one:

image

THIS is Your Account Key

Do NOT, under any circumstances, do what I did, and confuse Account Key with Customer ID:

 

image

This is NOT Your Account Key.
Do NOT Be Tempted to Use This!

Step 8:  Click Next, and Pick Your Table or Tables

image

NOW we are on familiar ground.

Last Note:  Parameterization?

One thing I have not yet figured out is how to limit the date range I import.  The table starts in the year 1900, which goes back a bit far for my needs, and makes the dataset take a long time to download.

You’ll notice that when importing from DataMarket, the Preview and Filter UI lacks the filter dropdown buttons:

image

No Filter Dropdowns, Just Checkboxes

But the DateStream homepage DOES indicated that parameterization is possible:

image

So if you’ve got that figured out, drop me a note Smile


Six Observations from the 2011 PASS Keynote

November 10, 2011

The PASS Summit tends to be one of  Microsoft’s favorite venues for unveiling big news in the BI space.  As you may recall, the 2010 Summit revealed some amazing things for the PowerPivot world.

Yes, I know that the 2011 Summit was weeks ago, and I’m overdue on my observations.  And no, I did not attend in person this year.  But the keynote tends to be the vehicle for the big news, and it was available via streaming.  So I watched it later the same day.

It’s a couple hours long, vast stretches of it are dry wooden rhetoric, you can’t really fast forward it, and I don’t recommend watching the whole thing even though the highlights were worth it.  I’ll share those here to the best of my ability.

Point 1:  Denali Release Date “First Half of 2012”

OK, this means we will get the final production version of PowerPivot v2, the new Tabular BISM, and Crescent in first half of 2012.  I was kinda expecting them to say first quarter of 2012, so I was a little surprised.  I guess this means there is still time to get real feedback submitted Smile

Points 2-4:  Cloud and Big Data

A very distinctly “cutting edge” feel to this year’s keynote.  And honestly, there appears to be substance to it, not merely hype.  I would say that the SQL team is one of the most nimble orgs at Microsoft, and one of the most responsive to changing customer needs.

2) “The cloud world is a hybrid of your data center and the cloud”
      -
SQL VP Ted Kummert

image

This was a very deliberate and prominent statement.  It’s very interesting (and encouraging) that they said this – a sharp contrast to the MS reputation of “our offerings are the only things in the universe and are only designed to work with themselves.”  (BTW – that reputation, while deserved, derives from the academic mindset of MS employees rather than from arrogance, but at times that’s a fine line).

The meaning here is that we will be able to opt in “a la carte” rather than being forced to convert completely to Azure in order to make use of service X.  I like that a lot, because I expect some services to mature faster than others.

OK, maybe that’s not a big deal.  That’s just good business strategy and perhaps obvious.  But there’s a big difference between them stating this as a prominent theme (as they did) versus mentioning it as a detail, or merely bringing it up in Q&A (which is often the case).  They were NOT saying this last year.  So I call this a very positive development.

3) “Reporting Services Will Be Available in Azure Sometime Next Year”
      
(…and then nothing was said about Analysis Services)

I forget who said this – it was either Ted or Amir, or maybe both.

The real information for me here was what was NOT said.  They said nothing about Analysis Services (SSAS), and the omission simply cannot be an oversight.  It was too obvious, the void in the next sentence was tangible.

That means they either already know that it will be 2013, or they are trying for 2012 but aren’t sure enough yet to promise it.  Either way, we can safely assume we won’t see SSAS Azure until late 2012 at the earliest.

Since PowerPivot is built on Analysis Services, that also means we won’t see any PowerPivot in the cloud until late 2012 at the earliest.  Furthermore, Office 365 won’t support PowerPivot until late 2012, or probably 2013.  That’s not a fact, but it’s a very safe guess.

4) Hadoop Support in PowerPivot!

PowerPivot and Hadoop:  Sounds Like Chocolate and Peanut ButterDo you use Hadoop?  I don’t either, at least not yet, but a number of our clients at Pivotstream do.  So my ears definitely perked up when they said that we will soon have an ODBC driver that connects directly to Hadoop sources.  And as a bonus, our boy Denny Lee got some stage time giving the demo.

Seems like a natural fit – PowerPivot’s ability to crunch large volumes of data coming together with the world’s most popular system for collecting massive amounts of web data.  And again, a departure from the MS norm.  I would typically expect MS to hastily invent a Hadoop competitor and rush it to market, then take five years to make it a credible competitor.  Maybe that’s still a long term goal, but to embrace something with open source and Google roots like this so prominently is again a very novel and mature move that we should salute.

I’m actually getting a more in-depth demo and update today, so I hope to report back with more detail soon.

5) Introducing Data Explorer!

image

How often do we get something 100% brand new?  Data Explorer allows you to take basically any collection of data sources – like an Excel file on my desktop, a sales data set in SQL Azure, and a demographics data source on DataMarket – and mash them together into a single table.

Even better, it then allows me to publish the resulting data set, in Azure, so that others can consume it.

I have a LOT of questions about this new offering, but very little time to explore it.  I have asked a member of the Data Explorer product team if I can interview them on the blog.  If that doesn’t work out, maybe one of you out there would like to investigate it and submit a review to the blog.

Point 6:  Crescent is now named Power View

image

Just like PowerPivot was known as Gemini until late 2009, we knew Crescent would eventually get a real name.  And that real name is Power View.  Yes, the space is official.

Point 6a:  Live interactivity in PowerPoint (yes, the slides app) is going to be included in the Denali release after seeming like it was going to get cut.  Pretty cool.

Point 6b:  Purely my opinion, but Power View seems aimed at putting a more glamorous face on traditional BI scenarios – it’s a very “field list oriented” tool which in my experience means that only “data people” will take to it initially.

But I also DO believe that as Excel pros get more and more comfortable with publishing PowerPivot models to SharePoint, they will start opportunistically exploring what Power View can do for them, since Power View can be connected directly to a PowerPivot model and used as an alternative front end (or complement to) Excel Services.


Trended Moving Averages

November 8, 2011

Guest post by David Churchward

image

I’ve always been a firm believer that moving averages probably give a better insight into trends within a business than a simple trend line associated to a set of values such as monthly sales (although I tend to review these two values together).  The reason for this is that a trend can be skewed by one or two values that may not be representative of the underlying business such as spikes associated to seasonality or a specific event.  When BillD highlighted a query regarding this concept in his comments on Profit & Loss (Part 2) – Compare and Analyse, I thought it would be a great idea to flex our P&L dataset to provide some Moving Average capability.

In this post, I will explain what moving averages are intended to deliver and explain how to calculate them using the sales elements of the example data used in the Profit & Loss series of posts.  I will then add the flexibility for users to select the time frame that the moving average calculation should consider, the number of trend periods to be displayed and the end date of the report.

What is a Moving Average?

The most common moving average measure is generally referred to as a 12 month moving average.  In the case of our sales data, for any given period, this measure would sum the last 12 months of sales preceding and including the month being analysed and then divide by 12 to show an average sales value for that timeframe.  In financial terms, the equation is therefore quite simply:

12 Month Moving Average = Sum of Sales for Last 12 Months / 12

This all seems very straight forward but there’s a lot of complexity involved if we want to put the Moving Average timeframe (represented as 12 in the above example) in the hands of the user, give them the power to select the number of trend periods to be displayed and the month that the report should display up to.

The Dataset

The dataset that we’re using looks something like below.

image

Note – I’m using PowerPivot V1.  Design viewer is available in V2 but I’ve hashed this together – nothing clever!

You’ll notice that FACT_Tran (our dataset to be analysed) is linked to DIM_Heading1, DIM_Heading2 and DIM_DataType to provide some categorisation to our dataset.  I’ve also linked to Dates which is a sequential set of dates that more than covers the timespan of our dataset.  This table carries some static additional information based on the date:

Date_Month_End = EOMONTH(Dates[Date],0)

Date_Next_Month_Start = Dates[Date_Month_End]+1

Once again, we’re not quite registering on Rob’s spicy scale!  Rest assured that you’ll be getting a more intense DAX workout as we go on.

As these date measures aren’t expected to be dynamic, I’ve coded them in the PowerPivot window.  This allows them to be calculated on file refresh but they won’t need to recalculate for each slicer operation which removes performance overhead from our ultimate dynamic measure.

For reasons that I’ll come on to later, I also need the month end date on my fact table as I can’t use the Month End Date on my Dates table in my measures.  I can however pull the same value across to my FACT_Tran table using the following measure:

Fact_Month_End_Date = RELATED(Dates[Date_Month_End])

So What Are These Unlinked MA_ Tables?

The reason for these tables should become apparent as we go on.  In brief, they’re going to be used as parameters or headings on our report.  The reason that they exist and that they’re not linked to the rest of our data is simply because I don’t want them to be filtered by our measures.  Instead, I want them to drive the filtering.

Initial PivotTable Setup

I’m going to be displaying a series of data organised in monthly columns.  The user will be given slicers to set Month End Date (the last period to be shown on the report), Number of Periods for Moving Average (this will ultimately be part of our divisor calculation) and Number of Periods for Trend (this will be the number of monthly columns that we will display on our trend).  We can establish these slicers straight away and link them to the pivot.

I obviously need a month end date as a column heading but which one?  To some extent I’ve given this away earlier on.  In short, I need to use my MA_Dates[Month_End_Date] field.  The reason is that this field isn’t linked to our dataset and therefore won’t be affected by any other filters.  If I use a date field that is part of my dataset or part of a linked table, the values available may be filtered down by the users selections.  I can get around this using an ALL() expression to give me the correct values, but the problem is that the column is still filtered and my results will all be displayed in one column.  It’s difficult to explain until you see it so please go ahead and try – it’s worth hitting the brick wall to really understand it!

Calculating Sum of Sales for Last X Months

The first part of our equation is to calculate the total value for sales across all periods within a dynamic timeframe to be selected by the user.  For this I use a Calculate function that looks like this:

CALCULATE(

[Cascade_Value_All],

DIM_Heading1[Heading1_Name]=”Sales”,

DIM_DataType[Data_Type_Name]=”Actual”,

DATESBETWEEN(

Dates[Date],

DATEADD(

LASTDATE(VALUES(MA_Dates[Next_Month_Start_Date])),

MAX(MA_Function_Periods[Moving_Average_No_Periods])*-1,MONTH

   ),

LASTDATE(VALUES(MA_Dates[Month_End_Date]))

         )

  )

I’m using a base measure called Cascade_Value_All that was created in Profit & Loss – The Art of the Cascading Subtotal.  I’m then filtering that measure to limit my dataset to records that relate to Sales and a data type of Actual (ie eliminating Budget).  This is simple filtering of a CALCULATE function.  However, it gets a bit more tasty with the third filter which limits the dataset to a series of dates that are dependent on the users selections in slicers and our date column heading.

The DATESBETWEEN function has the syntax DATESBETWEEN(dates, start_date, end_date) and works like this:

  1. I set the field that requires filtering (Dates[Data]).  I’ve found that this works best if this is a linked table of sequential dates without any breaks.  If you have any breaks, there’s a chance you might not get an answer as the answer that you evaluate to has to be available in the table.
  2. My start date is a DATEADD function that calculates the column heading date less the number of months that the user has selected on the “Moving Average No of Periods” slicer.  I use the LASTDATE(VALUES(MA_Dates[Next_Month_Start_Date)) function to retrieve the Next_Month_Start_Date value from the MA_Dates table that relates to the date represented on the column heading.  I then rewind by the number of months selected on the slicer using MAX(MA_Function_Periods[Moving_Average_No_Periods])*-1.  The “-1” is used to go back in time.  The reason I use Next_Month_Start_Date and a multiple of –1 is more clearly explained in Slicers For Selecting Last “X” Periods.
  3. My end date is simply the Month_End_Date as shown on the column heading of the report.  This is calculated using LASTDATE(VALUES(MA_Dates[Month_End_Date]).

image

That’s great, but my measure isn’t taking any account of my “Show Periods Up To” selection and the “Trend No of Periods” that I’ve selected.  We therefore need to limit the measure to only execute when certain parameters hold as true based on these selections.  I only want values to be displayed when my column heading date is:

  1. Less than or equal to the selected Month End Date on my “Show Periods Up To” slicer AND
  2. Greater than or equal to the selected Month End Date LESS the selected number of periods on my “Trend No of Periods” slicer.

To do this, I use an IF statement to determine when my CALCULATE function should execute.  Let’s call this measure Sales_Moving_Average_Total_Value

Sales_Moving_Average_Total_Value

= IF(COUNTROWS(VALUES(MA_Dates[Month_End_Date]))=1,

    IF(VALUES(MA_Dates[Month_End_Date])<=

LASTDATE(Dates[Date_Month_End])

&&VALUES(MA_Dates[Month_End_Date])>=

DATEADD(

LASTDATE(Dates[Date_Next_Month_Start]),

(MAX(MA_Trend_Periods[Trend_Periods])*-1),MONTH),

CALCULATE(

[Cascade_Value_All],

DIM_Heading1[Heading1_Name]=”Sales”,

DIM_DataType[Data_Type_Name]=”Actual”,

DATESBETWEEN(

Dates[Date],

DATEADD(

LASTDATE(MA_Dates[Next_Month_Start_Date]),

MAX(MA_Function_Periods[Moving_Average_No_Periods])*-1,MONTH

   ),

LASTDATE(VALUES(MA_Dates[Month_End_Date]))

         )

  )

)

  )

The IF statement works as follows:

  1. I first need to determine that I’m evaluating only where I have one value for MA_Date[Month_End_Date].  If I don’t do this, I get that old favourite error in my subsequent evaluation that says that a table of multiple values was supplied……
  2. I then evaluate to determine if my column heading date (VALUES(MA_Dates[Month_End_Date]) is less than or equal to the date selected on the Month End Period slicer (LASTDATE(dates[Date_Month_End])…AND (&&)
  3. My column heading date is greater than or equal to a calculated date which is X periods prior to the selected “Show Periods Up To” as selected on the Slicer.  I use a DATEADD function for this similar to that used in my CALCULATE function except we’re adjusting the date by the value selected on the “Trend No of Periods” slicer.

With this in place, we have the total sales for the selected period relating to the users selections.

image

So my table is now limited to the number of trend periods selected and represents the month end date selected.

So Now We Just Divide By “Moving Average No of Periods” Right? eh NO!

We’ve calculated our total sales for the period relating to the users selections.  You would be forgiven for suggesting that we simply divide by the number of moving average periods selected.  Depending on your data, you could do this but the problem is that the dataset may not hold the selected number of periods, especially if the user can select a month end date that goes back in time.  As a result, we need to work out how may periods are present in our Sales_Moving_Average_Total_Value measure.

Sales_Moving_Average_Periods

= IF(COUNTROWS(VALUES(MA_Dates[Month_End_Date]))=1,

    IF(VALUES(MA_Dates[Month_End_Date])<=

LASTDATE(Dates[Date_Month_End])

&&VALUES(MA_Dates[Month_End_Date])>=

DATEADD(

LASTDATE(Dates[Date_Next_Month_Start]),

(MAX(MA_Trend_Periods[Trend_Periods])*-1),MONTH),

CALCULATE(

COUNTROWS(DISTINCT(FACT_Tran[Fact_Month_End_Date])),

DIM_Heading1[Heading1_Name]=”Sales”,

DIM_DataType[Data_Type_Name]=”Actual”,

DATESBETWEEN(

Dates[Date],

DATEADD(LASTDATE(MA_Dates[Next_Month_Start_Date]),

MAX(MA_Function_Periods[Moving_Average_No_Periods])*-1,MONTH),

LASTDATE(VALUES(MA_Dates[Month_End_Date]))

         )

   )

)

  )

This measure is essentially the same as my Sales_Moving_Average_Total measure.  The only real difference is that we count the distinct date values in our dataset as opposed to calling the Cascade_Value_All measure.  I mentioned earlier that there was a reason why I needed the month end date to be held on my FACT_Tran table and this is why.  If I use any other table holding the month end date, that table isn’t going to have been filtered in the way that the core dataset has been filtered.  As an example, my Dates table has a series of dates that spans my dataset timeframe and more.  As a result, evaluating against this table will deduce that the table does in fact have dates that precede my dataset and there is therefore no evaluation as to whether there is a transaction held in the dataset for that date.

image

As you can see, since my dataset runs from 1st July 2009, I only have 9 periods of data to evaluate for my 31/03/2010 column.  If I had divided by 12 (as per my “Moving Average No of Periods” slicer selection), I would have got a very wrong answer.  Obviously, this is slightly contrived but it’s worthy of consideration.

And Now The Simple Bit

I can understand that the last two measures have taken some absorbing, especially working out when particular date fields should be used.  For some light relief, the next measure won’t really tax you!

Sales_Moving_Average_Value =

IFERROR(

[Sales_Moving_Average_Total_Value]/[Sales_Moving_Average_Periods],

BLANK()

    )

This is a simple division with a bit of error checking to avoid any nasties.

image

When It’s All Put Together

Since all of these measure are portable, I can create another Pivot Table on the same basis as the one above (with Sales_Moving_Average_Value given an alias of Moving Average), move some stuff around, add a measure for the actual sales value for the month (I won’t go into that now, but it’s a simple CALCULATE measure with some time intelligence) and I then reconfigure to look like the following:

image

I can then drive a simple line chart and apply a trend line to my “Actual” measure with the chart conveniently hiding my data grid that drives it.

image

As you can see, a trend on my Actual measure shows a steady decline.  My Moving Average, however, shows a relatively stable, if not slightly improving trend.  Seasonality of some other spikes are obviously therefore involved and the reality is that both measures probably need to be reviewed side by side.

For those of you reading this who are interested in seeing the workbook of this example, I’ll look to post this in a future post when I take this analysis one step further to cover the whole P&L.  Sorry to make you wait.

I hope this helps you out BillD…

One More Point to Note

Those eagle eyed DAX pros out there have probably noticed that my IF functions only contain a calculation to evaluate when the logical test reaches a True answer.  The reason is that the function assumes BLANK() when a false evaluation condition isn’t provided.  I haven’t worked out if there’s any performance impact using this method on large datasets.  It’s up to you what you chose to do and if anyone can convince me why coding the False condition as BLANK() is best practice, I will quickly change my habits!


Mini Post 4/4: Refreshing the FAQ, Recruiting New Contributors!

November 3, 2011

The Great PowerPivot FAQ still gets great traffic every day, but I recently realized that no one had added any new FAQ items for a year.

Dave Boylan asked me a question the other day, and that triggered it:  I have been neglecting it, and it could use an infusion of new content and contributors.

Submitting items to the FAQ is quite easy, quick, and straight to the point.  A great place to gain some experience with the community without committing to writing full blog posts.

Even the FAQ Page here on the blog is piling up unanswered questions in comments. Answering one or two of those and adding it to the FAQ is a great place to start.

And you always get public credit for everything you submit.  You also get to add yourself to the Contributors/Moderators list.  (I think there are some people in that list who never submitted anything, and I’m going to trim them out at some point, muhaha).

Dave Boylan and David Churchward both recently signed up.  Dave B then submitted this entry on the EARLIER function.  (And another on market basket analysis). 

If you are interested in becoming a FAQ contributor, drop me a note:  info@pivotstream.com


Mini-Post 3/4: DAX.xml Update for Notepad++ from Colin Banfield

November 3, 2011

(Rob’s note:  Apologies to Colin, he put this up here in draft form weeks ago and even though I promised to flip it to live two weeks back, I forgot.  So Colin…  a thousand pardons.  This is awesome!)

From Colin:

For those of you that use DAX.xml with Notepad++, there is an update available at http://powerpivotfaq.com/PowerPivot%20Samples/Forms/AllItems.aspx. There is also an update of the DAX functions file. Although I created the new files shortly after downloading the Denali version of PowerPivot, I held off making the files available because, at the time,  I didn’t know whether any additional functions would be added in the final release. I have it on good authority that no additional functionality will be added on top of CTP3, so it’s unlikely that the files will change between now and PowerPivot V2 RTM. Note that the list of functions contains those that are not currently documented, including the very important KEEPFILTERS function, which you can read about in this superb post from Alberto Ferrari. Perhaps this function should have been called KEEPUSERSIGNORANT (okay, just kidding Smile).

If you missed the Notepad++ post, and might be interested in using this excellent tool to author your DAX measures, you can read the original post here.