The Greatest Formula in the World, Part Two

December 7, 2011

image

Our Time Machine From Part 1 Needs a Couple of Fixes

OK, if you recall from part one, we had a “year over year” time machine calculation going on, and it was built against a data set that lacked a “real” calendar table.  Very cool.  Very resourceful.  And very necessary.

But the time machine had two problems.  Let’s fix them.

Problem #1:  Meaningless Grand Total

Let’s start with the easy one:

Year over year in PowerPivot - Meaningless Grand Total

Grand Total is Meaningless for Last Year Sales

It is meaningless to have a “grand total” value for a measure that returns “last year’s sales.”  What year would that BE, actually?  It’s nonsense.  So we use an old trick, one of the many flavors of IF(VALUES()):

=IF(COUNTROWS(VALUES(Periods[Year]))>1,BLANK(),
   CALCULATE([Total Sales],
      ALL(Periods),
      FILTER(ALL(Periods), Periods[Year]=MAX(Periods[Year])-1),
      VALUES(Periods[MerchPeriod])
   )
)

Where the new IF “wrapper” is highlighted and the original formula is in normal font.

For more on that “IF VALUES” technique, please see this post.

The results are as desired, the grand total is now blank:

image

Problem solved.  Moving on…

Problem #2:  2011 isn’t complete but we’re getting all of 2010 sales

Remember, our sales data for 2011 only goes through the first 6 months.  So we do NOT want “last year sales” for months 7-12:

image

That can be solved a number of ways.  The simplest is just to add another IF(), and have the Last Year Sales measure return BLANK() whenever there are no Sales:

=IF(COUNTROWS(VALUES(Periods[Year]))>1,BLANK(),
   IF([Total Sales] = BLANK(), BLANK(),
      CALCULATE([Total Sales],
         ALL(Periods),
         FILTER(ALL(Periods),
            Periods[Year]=MAX(Periods[Year])-1
         ),
         VALUES(Periods[MerchPeriod])
      )
  )
)

That trims the results to just the months desired:

image

But note that the subtotal for 2011 is still too high:

image

“Last Year Sales” for 2011 Should Only be Returning
2010 Sales Through the First Six Months

This one is trickier.  I struggled to find a good answer before coming up with the following:

Step 1:  Add a “Next Year Period Num” Column to the Periods Table

Recall that my Sales table has a “Period Num” column that is the basis for the relationship with the Periods table:

image

And that matches up with a similar column in the Periods table:

image

Note that Period Num does NOT reset to 1 with each new year (unlike MerchPeriod).  So that is the absolute unique ID for a given Month/Year combo.

Well, I created a new column in the Periods table that tells me, for a given Period Num, what the equivalent Period Num will be NEXT year:

image

Now, in my FILTER statements, I can choose to match on THAT column instead.

Bringing it Home

OK, now that I have that column, I can add a new FILTER clause to my measure:

=IF(COUNTROWS(VALUES(Periods[Year]))>1,BLANK(),
   CALCULATE([Total Sales],
      ALL(Periods),
      FILTER(ALL(Periods),
         Periods[Year]=MAX(Periods[Year])-1
      ),
      FILTER(ALL(Periods),
         Periods[NextYear Period]<=
            MAXX(
                VALUES(PeriodSales[Period Num]),
                PeriodSales[Period Num]
                )
      ),
       
      VALUES(Periods[MerchPeriod])
   )
)

OK, what does that do?  Simply put, it further filters the Periods table to NOT go beyond any periods for which we currently have records in the Sales table.

And the results:

Last Year Sales Measure in PowerPivot With a Custom Calendar

Last Year Sales Measure Doing the Right Thing

And now if you want YOY Growth, it’s straightforward:

=([Total Sales] – [LY Sales Finished]) / [Total Sales]

Which yields:

Year over Year / Year on Year Growth Percentage Measure in PowerPivot With a Custom Calendar

Year over Year / Year on Year Growth Percentage
Measure in PowerPivot With a Custom Calendar

OK, I cheated and wrapped another IF(COUNTROWS(VALUES)) around the outside of the formula so that the grand total cell is blank again, because again, that’s a meaningless value to report.  Details.

Isn’t that a lot of work?

Depends on how you look at it.  First of all, it’s a lot harder to grasp the first time than it is as you get used to it, trust me.

But even better, this investment pays off forever.  Normal Excel formulas may be easier to write the first time, but then you have to re-write and adjust them forever, every time your data changes.  And every time your desired report shape changes.

This one is a portable formula.  It goes wherever you want it to go, and eats whatever data you feed it.  Forever.  It’s worth a little extra effort.

OK but do I really have to understand all of that?

No, not really.  Not right away.  Think of it this way:  this is what I do all day, every day.  And I can tell you that figuring this out is not something you want to do in your spare time.

But once someone gives you the pattern?  Wash, rinse, repeat.  Make small adjustments as needed.

Honestly, that whole formula above could be wrapped up in a function provided by Microsoft, and all of the nasty stuff hidden.  I told them as much last time I visited Redmond.  And if it were a function rather than a formula, we wouldn’t care at all how it worked.

So… treat this as a pattern.  Copy/paste and modify to fit your needs.  Seriously.

Unfinished Business

I actually WILL retrace my steps here though and explain a number of things though, so that you CAN understand.  I have a couple of posts in the queue aimed at just that.

Were there alternate ways to write this measure?

Yes, there were, especially in that last step where I needed to get the “Last year sales” measure correct for the 2011 subtotal.  I could have used SUMX to make the year subtotal equal to the sum of its underlying months.  That is worthy of a post for sure.

But I was wondering if there was still another way.  A way that doesn’t use SUMX and doesn’t require a new calc column.

And for that, we need the Italians.  So I’m gonna try out my new toy, the Boot Signal:

image

Calling Marco and Alberto

Get it?  *Boot* Signal?


Off Topic: Halloween Humor

October 31, 2011

I received this Halloween card today, thought it was hilarious, and decided to scan it in:

image

Why exactly is that hilarious, you ask?  It’s a fair question.

Well, it’s funny precisely because that IS my father.  And yes, he owns that suit.  Not making this up.

It’s even funnier because my dad is actually quite a bit “cooler” than me.  Example: a fight broke out during a YMCA basketball game once (he and I were on the same team) and he waded into it to rescue one of our teammates, while I was getting the hell away.  Then about 10 guys piled on my dad…  and I decided to confuse the enemy by running away more.  Darth Vader can fend for himself.

Even today, when we go places together, women who would never even acknowledge me go all giggly around him.  He has that “cool guy” charisma.  And a full replica Darth Vader suit.  Yeah, even the expensive light saber:

image

Six feet, four inches of total goofball.

This does NOT count as Tuesday’s post, so stand by for something more meaningful.


Creating Accurate Percentile Measures in DAX – Part I

September 29, 2011

Guest Post By Colin Banfield

29th September, 2011

Equivalents of Excel’s Percentile, Quartile, and Median functions are perhaps the most significant omissions in Denali’s DAX statistical function library. Quartile and Median are actually special cases of percentile, and in this post, we calculate these special cases. Some of the most insightful BI analysis involves the use of statistics, and PowerPivot provides an excellent environment for developing statistical measures. Perhaps percentile and its cohorts will be considered for inclusion in PowerPivot V3.

Percentile measures are not difficult to create in DAX, but the process is also not trivial. Adding to the complexity is the need to interpolate values if you want to duplicate the accuracy of Excel’s percentile functions. My approach to creating percentile measures in PowerPivot is as follows:

  1. Calculate the ascending rank of values in the dataset under exploration.
  2. Calculate the rank corresponding to required percentile. I use the formula n=P(N-1)/100+1, where n is the rank for a given percentile P, and N is the number of rows in the dataset. This formula provides the rank corresponding to Excel’s PERCENTILE or PERCENTILE.INC function. For a calculation equivalent to Excel’s PERCENTILE.EXC, the rank for a given percentile can be found by using the formula n=P(N+1)/100. In most cases, the rank calculated will not be an integer. Therefore, for a generalized solution, we must use the integer rank below and above the calculated value and interpolate. For example, the first formula in (1) calculates a rank of 11.25 for P=25 (25th percentile) and N=42. We must therefore interpolate between the dataset ranks of 11 and 12.
  3. Find the data values corresponding to the integer ranks derived from step 2. For convenience, let’s call these values PPercentileDown and PPercentileUp.
  4. Calculate the percentile using linear interpolation.

Let’s start with a simple dataset to illustrate the concepts. In figure 1, the original dataset is in the table on the left. The corresponding PowerPivot PivotTable is on the right.

Figure 1- Original Dataset

Step 1 – Calculate the ascending rank of values in the dataset

Fortunately, the newly minted RANK function in Denali makes the calculation a lot easier than the rank formula we had to create in V1. The rank formula is simply:

=RANKX(ALL(Data),[Sum of Value],,1)

Figure 2 – Rank Measure

Step 2 – Calculate the rank corresponding to required percentile

We will first calculate the rank corresponding to the 25th percentile (inclusive) in the dataset. Recalling the formula n=P(N-1)/100+1, this translates into the following DAX formula:

=CALCULATE(

      (COUNTROWS(Data)-1)*25/100+1,


      ALL(Data)

          )

In this case, the calculated rank is 11.25.

Figure 3 – Rank of 25th percentile

Step 3 – Find the data values corresponding to the integer ranks derived from step 2

Our 25th percentile value is somewhere between the values corresponding to the 11th and 12th rank, as shown in figure 3. We need formulas to extract these values as measures. The 25thPercentileDown formula is:

=MAXX(

      FILTER(

             ALL(Data),

             [Rank] = ROUNDDOWN([25PctRank_INC],0)

            ),

      [Sum of Value]

     )

We are returning the Max of Sum of Value, filtered so that [Rank] is equal to the rank of the 25th percentile, rounded down to the nearest integer rank. In other words, we are filtering the table to return only values where [Rank] = 11. In the case of ties, there will be more than one value in the filtered table (all identical). This is the reason for using MAXX. However, MINX, or even AVERAGEX would calculate the same result.

For 25thPercentileUp, you may be inclined to use a similar formula, substituting the ROUNDDOWN function for ROUNDUP. However, in the event of ties for the 25th percentile, the filter will be incorrect. This is so because [Rank] will calculate the same rank for ties (11 in this case), and ROUNDUP([25PctRank_INC],0)will be = 12. We can get around the problem by filtering the table using the TOPN function instead:

=MAXX(

      TOPN(

           ROUNDUP([25PctRank_INC],0),

           ALL(Data),

           [Sum of Value],

     1

          ),

     [Sum of Value]

    )

The formula simply finds the max of the top 12 values in the dataset (i.e. 11.25 rounded up to 12). The result is not affected by ties in the rank. Figure 4 shows the resultant measures.

Figure 4 – Rankup and Rankdown

Step 4 – Calculate the percentile using linear interpolation

Now that we have the values between which we need to interpolate to find the actual percentile, we use these values in a simple interpolation formula:

=[25thPercentileDown]+([25thPercentileUp]-[25thPercentileDown])*([25PctRank_INC]-ROUNDDOWN([25PctRank_INC],0))

= 1891.07 + (2073.29-1891.07)*(11.25-11) = 1891.07 + 182.22*0.25 = 1936.63

In figure 5, I’ve completed the calculations for the 50th percentile and 75th percentile and removed intermediate formulas from the PivotTable.

Figure 5 – Percentiles

Figure 5 illustrates that that values we calculated are identical to values calculated by using Excel’s PERCENTILE.INC function. Note that:

  • The 25th percentile is also the first quartile
  • The 50th percentile is the median (i.e. the same value returned by Excel’s MEDIAN function)
  • The 75th percentile is also the third quartile

Alternative median calculation

If we need to calculate the median without regard for percentiles, we could use the following formula:

=IF(

    MOD(

        COUNTROWS(ALL(Data)),

        2

  )=0,

    AVERAGEX(

             TOPN(

                  2,

                  TOPN(

                       (COUNTROWS(ALL(Data))/2)+1,

                       ALL(Data),

                       [Sum of Value],

                       1

                      ),

                  [Sum of Value]

                 ),

             [Sum of Value]

            ),

   MAXX(

        TOPN(

             COUNTROWS(ALL(Data))/2,

             ALL(Data),

             [Sum of Value],

            ),

        [Sum of Value]

       )

   )

Essentially, we want to return the value in the middle of the range, after sorting the range in ascending order. The calculation we use depends on whether there are an even or odd number of values in our dataset. The first part of the formula, MOD(COUNTROWS(ALL(Data)),2)=0, determines if the dataset has an even number of values. I would have preferred to use ISEVEN (and thus eliminate an extra function call), but alas, Excel’s ISEVEN and ISODD functions didn’t make it into DAX.

If the number of values is even, we need to get the top value in the first half of the dataset, plus the next higher value (because we have to average these two values to get the median). We accomplish this goal in the following part of the formula:

TOPN(

     (COUNTROWS(ALL(Data))/2)+1,

     ALL(Data),

     [Sum of Value],

     1

    )

Based on our sample dataset, the preceding formula will return a table with values where ID=1 to ID=22. Next, we get the top 2 values from this new table (ID=21 and ID=22). To accomplish this task, we wrap another TOPN function around the new table and sort in descending order:

TOPN(
     2,

     TOPN(

          (COUNTROWS(ALL(Data))/2)+1,

          ALL(Data),

          [Sum of Value],

          1

         ),

    [Sum of Value]

   ),

Finally, we average the two values:

AVERAGEX(

         TOPN(

              2,

              TOPN(

                   (COUNTROWS(ALL(Data))/2)+1,

                   ALL(Data),

                   [Sum of Value],

                   1

                  ),

            [Sum of Value]

           ),

        [Sum of Value]

       ),

If the number of values in the dataset is odd, we just return the middle number, using the formula in the final argument of IF:

MAXX(

     TOPN(

          COUNTROWS(ALL(Data))/2,

          ALL(Data),

          [Sum of Value],

         ),

    [Sum of Value]

   )

Note that TOPN rounds a number up or down arithmetically, if the first argument is a decimal value. Dividing an odd number by 2 will therefore always round the number to the next higher value.

Percentile INC vs. EXC

At the beginning of this post, I mentioned that to compute percentiles corresponding to Excel’s PERCENTILE.EXC, you use the formula n=P(N+1)/100 to get the rank for a given percentile. When, therefore, should you use one versus the other? The answer is similar to the use of STDEV.S/STDEV.P or VAR.S/VAR.P. PERCENTILE.EXC, which is a new function in Excel 2010, generally provides better results when we are using a sample from a larger dataset (where it’s unlikely that the sample would include the dataset’s zero and one-hundredth percentile values – and thus these percentiles are excluded). PERCENTILE.INC is better when we’re working with the entire dataset.

My rants of the day

Rant #1

I can’t help but end this post with a couple of rants. I’ll start with my biggest irritation in all of PowerPivot. WE CAN’T HIDE MEASURES. That’s bold, underlined, and highlighted. This situation is quite unfortunate. Imagine if you couldn’t prevent a VBA function from surfacing in the user interface (by declaring the function as private). Imagine if you couldn’t reuse routines in your development work, but had to rewrite them over and over. Imagine if you couldn’t validate complex expressions and improve performance by maintaining intermediate formulas. Without being able to hide measures from client tools, we are faced with all of the preceding issues. We can hide table columns, so why are measures any different? In Analysis Services, when you create a calculated measure, you have the option to have the measure hidden or displayed in client tools. Why does this logic make sense for calculated measures in AS, but not in PowerPivot?

Look at my field list for percentiles in figure 6 (some not mentioned in this post):

Figure 6 – PowerPivot Field List

Of all the measures shown, I want the client to see about half the number of measures in the list. The rest are intermediate results for development purposes only. To be honest, I don’t know which I find more baffling; that the development team can’t see the absolute necessity of this feature, or that I seem to be the only being on the planet that has a problem with this issue. Grrr. Unlike some very brilliant folks, I am unable to construct a half-page worth of a complex DAX formula, without the need to validate intermediate results.

On this matter, I think that I’m in good company. Flipping through the second edition of “The Microsoft Data Warehouse Toolkit,” written by authors from the highly regarded Kimball Group, I came across the following statement:

The best way to create calculations and measures in PowerPivot is to break long, complex formulas up into their component parts and build them incrementally. This allows you to experiment with the data and quickly see the results of your formulas.

Great advice, but incredibly, they don’t address how you are supposed to keep all of those intermediate results from the client’s view. The only solution currently available, and it’s a poor one from a performance and maintenance standpoint, is to merge the intermediate formulas into one mega-formula. Using perspectives don’t help either, since you can’t hide the original list of measures.

Rant #2

I’m glad to see that Denali includes TOPN, as this is a very useful function. But where is TOPNPERCENT? Can you name a product where these two functions don’t appear together, like Gemini or Siamese twins? Excel conditional formatting? Both there. Microsoft Access? Both there. T-SQL? Both there. MDX? Ditto. I find this level of inconsistency irritating. You can create a TOPN percent measure by using a formula like TOPN(COUNTROWS(Data)*N%), which is understandably trivial. Well, you can create a distinct count measure by using COUNT(DISTINCT), which is also trivial, but we now have DISTINCTCOUNT because creating distinct count measures are so common (and because the BI Pros all requested it J).

On the positive side, there is the “issue” of TOPN returning more than N records if there are ties in the Nth record. Microsoft’s products are somewhat varied in behavior. PowerPivot, Excel Conditional Formatting, and Access always return more than N values if there is a tie in the Nth value. MDX in Analysis Services never returns more than N values, so if there are ties in N, values are arbitrarily dropped (in contrast, SAS Analytics provides an option in MDX TopCount/BottomCount, which allows you to specify whether or not to return ties in N). T-SQL’s TOP predicate provides a WITH TIES qualifier to give you control over returning ties in value N. Personally, I see no logic in dropping tied values in the Nth record. Which records do you drop, and which record do you keep? Also, if there are ties in any of the values within the top N, all the ties are returned. Therefore, I think that PowerPivot, Excel and Access provide the correct logic.

Next…

In Part II, I will continue the discussion of percentiles using a slightly more complex example and demonstrate how to use percentile measures to create a box & whisker PivotChart!


A Lesson in A-G-I-L-I-T-Y

May 11, 2011

image

A case study I just have to share

Over the past year, “Agility” has become quite a theme on this blog.  It’s fair to say that I just keep hammering it, over and over.  But I recently had my eyes opened as to what “extreme” PowerPivot agility can look like.

Let’s put it in football terms:  it’s like I’ve been going around telling everyone how they should be less like the 400+ pound Aaron Gibson and more like the nimble, versatile Marshall Faulk.

And then someone comes along like Barry Sanders:  agile in ways that you’d never think to recommend, but once you see it, you recognize it immediately.  And then you watch it over and over in awed slow-motion (seriously, click Barry’s name above for a video). 

That’s what this is like.

Dynamic Data Warehousing!

You don’t truly know what you need to know, until you are delivered what you THOUGHT you needed to know (I sound like Rumsfeld).  That was the theme of this blog post from last year.  That blog post concluded with my clients at the time realizing that the data they truly needed…  wasn’t even being collected.

I think it’s basic human nature to assume that yesterday’s gaps in your understanding are just that (yesterday’s), and that you won’t suffer from that in the future.  Marcellus Wallace recognized this tendency as pride, and rightly took a dim view of its helpfulness. (Video definitely NSFW!)

I’ve sung the praises of those folks (my clients) before, and I’m going to do it again:  Pride didn’t get in the way of their thinking on this issue.  Instead of telling themselves that they’d be better at anticipating their needs in the future, they decided to bake uncertainty into the CORE of their BI planning. 

They built a system that enables the following:  If they decide they are missing a set of data points, they can start collecting them, warehousing them, and analyzing them, end to end, in as little as two to four days’ time (that’s my estimate).

Here’s a diagram to illustrate:

image

Click for larger version

More Detail

Here’s some detail that was hard to fit into the diagram.  Basically these folks have millions of devices out “in the wild,” and those devices are instrumented to collect data about usage patterns.  When I first visited them in the early Fall, those devices were hardwired to collect only fixed data points, and we discovered that they needed to collect new data.

When I had the opportunity to drop back in on them recently, however, they revealed this new system.  Now, the only thing hardwired is flexibility.  The devices all call home once a day and see if there are new instructions awaiting them – brand new script written by their development team.  To make things painless and error-free for the development team, they also have built an internal portal that the developer visits to register the new message type that they are adding to the devices’ instrumentation scripts.  That portal takes care of configuring the data warehouse – new tables, retention policies, aggregation rules, as well as configuring the incoming message ports and mapping them into the right import processes.  Boom.

imageThey are even experimenting with ways to allow automatic generation and/or modification of PowerPivot models based on selections made in the portal.

It’s worth taking a step back and marveling at.  On Monday they can realize they have a blind spot in their radar.  On Tuesday and Wednesday they develop and test new instrumentation code.  On Thursday they roll it out to the devices.  And on Friday, they are literally collecting and analyzing MILLIONS of data points per day that they lacked at the beginning of the week.

It’s not like those new data points are on an island, either.  Via DeviceID, they are linked to multiple lookup/dimension tables and therefore can be integrated into the analysis performed on other fact/data point tables as well.  They can literally write measures that compare the newly-collected data against data they were already collecting – ratios, deltas, etc.  They can put the new metrics side by side with old metrics in a single pivot.  And, in theory, they could use the new data points to generate new lookup/dimension tables by tagging devices that exhibit high or low amounts of the newly-instrumented behaviors (although we did not discuss that on site – it just struck me as a possibility).

And late last year this organization had made zero investment, ZERO, in business intelligence.

Can this work for everyone?

Of course not.  Not everyone has the luxury of reprogramming their production systems at high frequency like this.  Not everyone can afford the risk or performance hit of having their production systems writing back directly to their data warehousing systems either – standard practice is to have your warehousing efforts “spying” on your transactional systems and taking occasional snapshots.  It’s a pull, not a push, which is why the “E” in “ETL” is Extract.

But this is definitely food for thought, for everyone.  “Why not?” is one of my favorite questions, because even if you can’t ultimately do something, examining the “why not” in detail is often very enlightening.

I know one of the standard objections is going to be, essentially, “Data Gone Wild:  No discipline.  Mile-long lists of tables and fields.”  Bah, I say!  Good problems to have!  Storage is cheap, flying blind is expensive.  And when you reach the point of being blinded by too much information, well, that’s an opportunity for a new set of tools and disciplines.

More to come

Last week was my first ever “doubleheader” – two consulting/training clients in a single week.  That can be hard on a blogger, heh heh.  But I look to be home for the next week or two, so you should expect to see a renewed flow of content here.  Got several things rattling around in my head.


Advanced DAX calculation: doing a moving grouped average in PowerPivot

March 24, 2011

Woops I'm sorry, did I change your Filter Context ?

By Kasper de Jonge, original post at PowerPivotblog.nl

I got an excellent question last week on the ask my question page that brought me new understanding of DAX. So finally a new interesting (I hope) blog post on DAX.

Let’s say I have a set of sales per week of a specific brand:

Now I want to have the average of sales per week (slicable by brand) and compare it to the average of the last 3 weekly totals. I prepare  the pivottable :

Read the rest of this entry »


Quick Poll: Do you run PowerPivot for Excel on Win XP?

January 11, 2011

The Tommy Chong Meme Spreads!

January 9, 2011

Just a quick humorous note.  Another tech blogger recently posted about Donald’s big news.

Here’s a quick screenshot from that blog:

image

What’s funny?  Well, that isn’t Donald.  It’s Tommy Chong.  Who is the villain behind this confusion?  Someone horrible, is my guess.

Totally.  Awesome.


AVERAGEX – The 5-point palm, exploding fxn technique revisited

September 10, 2010

The 5-Point palm technique is a beauty to watch, but o so lethal!

By Kasper de Jonge, original post at PowerPivotblog.nl

I recently had a problem that needed The 5-point palm, exploding fxn technique as described by Rob in his blog post.  I have used SUMX and COUNTX with success a few times before but this time I had a hard time getting my head around the AVERAGEX solution to my current problem.

While talking with Rob about it I realized I was thinking way to difficult. In this blog post I will try to make more sense of the X functions and give you a little help in building you own X function.

First let’s set a scenario. I want to compare the salesamount of each year to the average salesamount over the years. Sound easy huh ..

We start by creating a pivottable based on Contoso with CalendarYear on Y and the sum of salesamount as measure

Next we create a measure to get the average using =AVERAGE(FactSales[SalesAmount])

hmm what happens here?  The description of the function Average is: “Returns the average (arithmetic mean) of all the numbers in a column.”  So what is the function will do is use the lowest grain of data and determine the average of these values, in our case this means he uses all the sales amount values from the FactSales table that are in the current context (year) to calculate the average.

So how do we calculate the average over each year? In Excel this is easy:

And in PowerPivot this is essentially the same, we can use the AVERAGEX function to calculate average over a specific table. The description from MSDN is:

Calculates the average (arithmetic mean) of a set of expressions evaluated over a table.

This means we can change the scope of the average, the normal average automatically will take the lowest grain, but now we can supply our own table it has t o calculate the average. We want the function to do the same as the Excel calculation above. Our function should do average over:

2007 € 4.561.940.955 2008 € 4.111.233.535 2009 € 3.740.483.119

And this is exactly how the AVERAGEX should be used:

=AVERAGEX(all(DimDate[CalendarYear]), FactSales[Sum of SalesAmount])

The all(DimDate[CalendarYear]) would return a table of all years available (it even includes years that don’t have sales but this won’t alter the calculation). For each year in this table it returns the sum of salesamount, we need to use a measure here because it automatically calculates the sum no matter what context were are in. Then it uses this values of these sums to calculate an Average over.

The parameters are just like you say it, we want to calculate the average for each CalendarYear of Sum of SalesAmount.

This measure results in the following values:

The key here is the use of the first parameter supplied to the AverageX function, this is the table it will iterate over. Some other samples:
I want to calculate the Average sales amount of the “Contoso Bangkok No.1 Store”, we create a table by using the FILTER function where filter all storenames by one single store.

=AVERAGEX(FILTER(ALL(DimStore[StoreName]),
	DimStore[StoreName] = "Contoso Bangkok No.1 Store"),
	FactSales[Sum of SalesAmount])

I want to calculate the Average sales amount of the “Contoso Bangkok No.1 Store” and “Contoso Seoul Store”:

=AVERAGEX(FILTER(ALL(DimStore[StoreName]),
	DimStore[StoreName] = "Contoso Bangkok No.1 Store"
	|| DimStore[StoreName] = "Contoso Seoul Store"),
	FactSales[Sum of SalesAmount])

I want to calculate the average salesamount of all years leading up to the current year. First it checks if the current year has values, 2011 doesn’t have values but is available in the dimdate table. As table we want to return all years from the dimdate table before the current year. The current year is selected by values(DimDate[CalendarYear]) because we have set CalendarYear on the y-ax.  We need to check if values returns more than one result because when the subtotal is calculated values will return all years.

=IF(FactSales[Sum of SalesAmount] > 0
	,AVERAGEX(
		FILTER(ALL(DimDate[CalendarYear]),
					DimDate[CalendarYear]
					<= IF(countrows(VALUES(DimDate[CalendarYear])) > 1
						,Blank()
						,VALUES(DimDate[CalendarYear]))
				)
		, FactSales[Sum of SalesAmount])
	, Blank())

The AverageX, CountX and SumX functions are very powerful but difficult question, try to visualize what you want to do or use Excel to get the result you want and work from there.


Datamining using PowerPivot and Predixion Insight

August 23, 2010

By Kasper de Jonge, cross post from PowerPivotblog.nl

Since this week the public beta of Predixion Software’s Data mining in the cloud for Excel is available. Those of you who are familiar with the the Microsoft SSAS Data mining Add-ins should be very comfortable with what is inside Predixion Data mining for Excel.  I have done a previous blog post on doing data mining using PowerPivot with the MS data mining add-in where you can see how it currently works .

Predixion Insight for Excel is like a new version of the current SSAS add-in, the Predixion insight team consists of the folks that previously build the Add-in for MS and now started on their own.

The biggest change is that you no longer need an SSAS server installed. All action happens on the Predixion servers in the cloud. Second biggest (for me) is that you can use PowerPivot data as a datasource for you Data mining. Using it in combination with PowerPivot requires nothing more then Excel and a Predixion subscription for data mining. Furthermore the overal UI had been improved to make data mining a more user friendly experience. And it support 64 bits.

From the Predixion site:

Predixion’s intuitive and easy-to-use solution allows users to run predictive analytics in the familiar environments of Microsoft Excel® and PowerPivot. Whether you are an existing SQL Server® Data Mining user, a BI specialist or a newcomer to the arena of Predictive Analytics, Predixion Insight™ will enable you to easily create, manage and run powerful and accurate predictive models without extensive training or specific knowledge of the methodologies currently required to create successful predictive projects.

In this blog post we are going to see what are the key influencer are of the number of items on stock from the Contoso sample database. First we need to install the Predixion Insight for Excel, just run setup and the client will be installed within Excel. Next time you open Excel the client will be there. We have two tabs “Insight analytics”:

and “Insight Now”:

The “Insight analytics” tab is mainly for the advanced data mining,the insight now enables you to get started immediately. Before we can do anything we need to connect to the predixion servers with the account we created on the website:

After logging into the Predixion cloud service we can start data mining. I have loaded information from my datawarehouse in PowerPivot for Excel, i have information about my stock. I have loaded the fact table FactInventory that contains the actual nr of stock, this contains 8 million rows . The fact table is related to a lot of descriptive tables that surround the fact table, called the dimension tables. I have loaded a few of these descriptive tables into PowerPivot as well. What do we know about an item that is on stock:

  • When was it on stock? Year/month/day
  • What Product?
  • What Productcost
  • Aging of a product in inventory
  • The Country of the store it is in.

Of all these properties we want to know what influences the nr of days in stock the most. For this i want to use the “Analyze key influencers”  function. So i click on it.

This gives us a screen where i can select what my source is, Excel or PowerPivot, I select PowerPivot. Now i can select what table i want to analyze, i select the fact table. We could place filters here but i decided to plague the Predixion server all out with my full 8 million rows :) .

Next we can select the column we want to we want to determine the key influencers for:

Of course we don’t need all the columns to be analyzed, we can select the columns we want to include in our analysis:

And this is where we notice something not right. As you can see we can select DateKey, StoreKey, ProductKey. But when we analyze this it would analyze this as a Key value, instead of Year 2009 it would test for the value 1-1-2009 and Store “Amsterdam” it would check as Integer 12. So we need to do something first, we need to prepare our PowerPivot table so that it contains descriptive values.  Luckily for us this is not that hard, just add a column in the PowerPivot field window using the =RELATED function:

Now we can select these columns in the data mining add in:

Now we are good to go, just click Run and the data mining will be started.

The great thing here is that everything happens on the server, i can start multiple operations at the same time. And of course it being in the cloud i can open this up on another machine and immediately access the results.

One thing i noted is that the information is send to the cloud through an encrypted tunnel so no worry your data can be read while sniffing your network.

When i click on Minimize to Task pane you will see a new Predixion pane will show up where you can see all your tasks:

As you can see i ran this demo before so i can use these results to show the result of the mining Predixion did, just click “Results” and the report below appears:

As you can see it is pretty easy to combine the information you have in PowerPivot with the enormous powers of data mining. The new user interface and the availability of the Predixion servers in the cloud really make data mining available for anyone. Just as PowerPivot makes data analytics available for everyone. The Predixion Insight for Excel works with Excel 2007 and Excel 2010 32 AND 64 bit, of course PowerPivot won’t be available with Excel 2007.

Predixion Insight is also working on a on-premise and dedicated off-site cloud solution which leverages SQL Server, SSAS and SharePoint which they call Enterprise Insight.


Combine two data sources (in our case ssas) using PowerPivot

August 11, 2010

By Kasper de Jonge, crosspost from PowerPivotblog.nl

As part of our PowerPivot for the regular Information worker today a blog post on how to use PowerPivot to combine information from two sources into a single information product.

In our case the information is stored inside two analysis services cubes that the BI department had made accessible to us.

We want to analyze the order amount per product, year, month and country we have in one cube with the actual order count we have in another cube.

We start by importing the information from both cubes into PowerPivot, make sure both tables contain the same columns you want to analyze against.

We use the “From database”, “From analysis services” to select the analysis service and cube we want to import from. After that we can select the data to import:

Same for the second cube:

The data from both cubes are loaded into two seperate tables. We want to combine the two tables. To do that we can create a relationship between the tables, for more information read a previous blogpost on relationships. First thing we need to do is make sure that all the columns are identical. As you can see in the screenshots above the month and year columns from both tables do not contain the same values.

We start by creating a new calculated column in which we use DAX to get only the last 4 numbers and create a new year column

The next one is a little harder, the monthscolumn from one cube contain only the names of the months instead of the numbers. To translate the monthname to a monthnumber i created a translation table in excel and loaded it into PowerPivot:

next i created a relationship between the table and the translation table:

Now we are able to use the translated values in our table to create a new column using DAX, using the function RELATED we can get the value from the translation table in each row:

To make sure we can create a relationship we need a unique key in each table to connect a row from one table to a row from the other table. Again for more information check out my previous blogpost on relationships.

A row is uniquely identified by the values of all the columns from a row. SSAS will make sure a row is returned once for each measure because it automatically aggregates rows. So all we need to do now is combine the values from all the columns into a new column and connect the two tables by creating a relationship.

First we create a new column we call Key in both tables using the &-sign or use the Concat function:

Then we create a relationship between the two:

Now we have prepared the the data we can use it into the PowerPivot pivottable, you can see we can use the measure from both cubes in one table:

What is important to notice is that I use columns from the table I indicated as lookup table as filter / slicer / row label. When i would have used the other table we only would have had measure from the orderamount table.

To make it a little easier to use i would recommend to hide the columns from the orderamount table in our pivottable. Go to the table, click design, click hide and unhide.

Here you can hide columns from the pivottable, in our case we only want to keep the measure value in the pivottable:

This will give us only the cleaned up colums in our PowerPivot Field list:

As you can see combining data from two cubes is not that hard, important is identifying  a key column between the two tables. You need to keep this in mind when importing data from cubes to combine them.