Creating Accurate Percentile Measures in DAX – Part II

October 1, 2011

Intro from Rob:  today, Colin finishes in Percentile Measures topic from last week.

Note that 1) this technique requires the new release of PowerPivot that is currently still in beta (the Denali release)…

…and 2) I would rate this one a solid 3 or 4 on the spicy function scale.  Not one of the beginner techniques, in other words Smile

Guest Post by Colin Banfield

October 01, 2011

In Part I, I discussed my approach to creating percentile measures in DAX. In this part, we will add an extra dimension to our dataset and create a box & whisker PivotChart. The expanded dataset is shown in figure 1, with the source table on the left, and the resulting PowerPivot PivotTable on the right.

clip_image001

Figure 1 – Expanded dataset with Set column

In the expanded dataset, I have added a column called Set. The goal is to create percentile measures that are constrained to each set. Following the 4-step approach outlined in Part I, we start with the rank calculation. Recall that our rank formula in Part I was:

=RANKX(ALL(Data),[Sum of Value],,1)

This formula works when we are working with the entire table. However, with the addition of the Set column, we need to calculate values for the sets A, B, & C in a new filter context (i.e. one that excludes the ID column). We need RANKX to rank the values for each set independently. We employ the CALCULATETABLE function to achieve the desired result:

=RANKX(

                  CALCULATETABLE(
                                                       Data,
                                                      ALL(Data[ID])
                                                    ),
                 [Sum of Value],,1
               )

To show the ranking within each set more easily, I’ve sorted the source table by set and value, both in ascending order. Next, I created a calculated column that uses a lookup formula to get the rank measure from the PivotTable.

clip_image002

Figure 2 – Rank measure

With the inclusion of the Set column in the PivotTable, we must also modify other measures used in Part I.

Rank for 25th percentile ([25PctRank_INC])

From (Part I):

=CALCULATE(
                           (COUNTROWS(Data)-1)*25/100+1,
                           ALL(Data)
                         )

To:

=CALCULATE(
                           (COUNTROWS(Data)-1)*25/100+1,
                           ALL(Data[ID])
                         )

PercentileDown

From (Part I):

=MAXX(
                FILTER(
                             ALL(Data),
                             [Rank] = ROUNDDOWN([25PctRank_INC],0)
                           ),
              [Sum of Value]
            )

To:

=MAXX(
                FILTER(
                               ALL(Data[ID]),
                               [Rank] = ROUNDDOWN([25PctRank_INC],0)
                             ),
               [Sum of Value]
            )

PercentileUp

From (Part I):

=MAXX(
                  TOPN(
                               ROUNDUP([25PctRank_INC],0),
                              ALL(Data),
                             [Sum of Value],
                            1
                         ),
               [Sum of Value]
            )

To:

=MAXX(
                TOPN(
                             ROUNDUP([25PctRank_INC],0),
                             CALCULATETABLE(
                                                                   Data,
                                                                   ALL(Data[ID])
                                                                ),
                            [Sum of Value],
                            1
                          ),
               [Sum of Value]
            )

Or:

=CALCULATE(
                            MAXX(
                                          TOPN(
                                                       ROUNDUP([25PctRank_INC],0),
                                                       Data,
                                                       [Sum of Value],
                                                      1
                                                   ),
                                       [Sum of Value]
                                     ),
                         ALL(Data[ID])
                      )

The preceding alternative formulas are provided to illustrate how the same scalar result can be obtained using either CALCULATE or CALCULATETABLE. Similarly, we can use CALCULATETABLE instead of CALCULATE in the rank for the 25th percentile formula. I point this out because using CALCULATETABLE as an alternative to CALCULATE in applicable situations is rarely, if ever discussed.

25th, 50th, and 75th Percentiles

Recall the interpolation formula from Part I:

=[25thPercentileDown]+([25thPercentileUp]-[25thPercentileDown])*([25PctRank_INC]-ROUNDDOWN([25PctRank_INC],0))

After completing the measures for the 50th and 75th percentiles, our table is as shown in figure 3.

clip_image003

Figure 3 – Percentile measures, with comparisons to values returned by Excel’s PERCENTTILE.INC function

Creating a Box & Whisker PivotChart from Percentile Measures

It is difficult, if not impossible to analyze data directly in a table – whether or not the data is summarized. The purpose of a chart is (or should be) to provide us with insights into the data that we cannot derive by staring at the numbers. One such chart that takes advantage of percentile measures is the box & whisker plot (hereafter called a box plot). This chart is commonly used in areas such as statistical process control (SPC) and Six Sigma, but they can be quite useful for general business analysis. This chart is not native to Excel, so it’s not surprising that its use is not explored more often. The box plot visually shows the distribution of values in a dataset. The most common plots show the minimum value, maximum value, the range between the 25th and 75th percentile (known as the interquartile range), the median value and the mean (average value). The “box” portion is the interquartile range, and the two “whisker” parts are the range between the min and the 25th percentile, and the range between the 75th percentile and the max. The box plot is most effective when comparing the distribution of values across several data sets. A few examples of box plots for general business use include:

· Comparing the distribution of call hold times of agents over a period

· Comparing the distribution of price fluctuations across several products, or product parts

· Comparing the distribution of shipping weights for different service levels (e.g. normal, express, overnight). This data could provide useful intelligence for a shipping company

· Comparing the distribution of actual salaries across different salary scales. An excellent example is provided in this article here. The article explains how to interpret the plot and also illustrates a very important concept for making charts more meaningful i.e. adding additional context to a chart by providing reference information.

Basically, any related entities in your business that could benefit from analyzing a distribution of their values could benefit from a box plot.

The first step in creating the box plot PivotChart is to arrange the data appropriately. The actual construction of the plot is beyond the scope of this post, but the technique is explained in this tutorial by Excel charting guru, Jon Peltier. If you’re new to creating box plots in Excel, you should read the tutorial before continuing. Jon’s excellent charting tutorials have been around for a long time, and recently he has been updating them. In addition to his tutorials, Jon also has a few chart utilities for sale (and no, I don’t earn a commission for this plug).

The box plot PivotChart using our dataset is shown in figure 4

clip_image004

Figure 4 – Box plot PivotChart

Notes

1. I created a perspective called BoxPlot Data to contain only the measures required for the box plot

2. The measures are numbered to indicate the order that they must be added to the PivotTable

3. The box plot measure formulas are as follows:

[1-25th Percentile] = [25thPercentile_INC]

[2-50th Percentile] = [50thPercentile_INC]-[25thPercentile_INC]

[3-75th Percentile] = [75thPercentile_INC]-[50thPercentile_INC]

[4-Minimum] = [25thPercentile_INC]- MINX(ALL(Data[ID]),[Sum of Value])

[5-Maximum] = MAXX(ALL(Data[ID]),[Sum of Value])- [75thPercentile_INC]

[6-Mean] = AVERAGEX(ALL(Data[ID]),[Sum of Value])

4. I’ve created a chart template that can be used to create box plots. When creating a new plot, the dimension and measures should be added to the PivotTable before selecting the chart template from the Insert Chart dialog box.

5. After you create the chart from the template, the whiskers will not appear. This occurs because the whiskers are created using error bars, and external ranges are used to specify the data for the error bars. Unlike the rest of the chart, the template doesn’t know which ranges to use for error bars. You must add these manually, using the technique detailed in Jon’s tutorial.

6. The white median lines you see in figure 4 were created using a bit of formatting trickery. I used a white border for the bars. The outside borders don’t show because they are set against a white background.

Downloads

You can download a copy of the workbook & chart template from the following location:

http://powerpivotfaq.com/PowerPivot%20Samples/Forms/AllItems.aspx


Creating Accurate Percentile Measures in DAX – Part I

September 29, 2011

Guest Post By Colin Banfield

29th September, 2011

Equivalents of Excel’s Percentile, Quartile, and Median functions are perhaps the most significant omissions in Denali’s DAX statistical function library. Quartile and Median are actually special cases of percentile, and in this post, we calculate these special cases. Some of the most insightful BI analysis involves the use of statistics, and PowerPivot provides an excellent environment for developing statistical measures. Perhaps percentile and its cohorts will be considered for inclusion in PowerPivot V3.

Percentile measures are not difficult to create in DAX, but the process is also not trivial. Adding to the complexity is the need to interpolate values if you want to duplicate the accuracy of Excel’s percentile functions. My approach to creating percentile measures in PowerPivot is as follows:

  1. Calculate the ascending rank of values in the dataset under exploration.
  2. Calculate the rank corresponding to required percentile. I use the formula n=P(N-1)/100+1, where n is the rank for a given percentile P, and N is the number of rows in the dataset. This formula provides the rank corresponding to Excel’s PERCENTILE or PERCENTILE.INC function. For a calculation equivalent to Excel’s PERCENTILE.EXC, the rank for a given percentile can be found by using the formula n=P(N+1)/100. In most cases, the rank calculated will not be an integer. Therefore, for a generalized solution, we must use the integer rank below and above the calculated value and interpolate. For example, the first formula in (1) calculates a rank of 11.25 for P=25 (25th percentile) and N=42. We must therefore interpolate between the dataset ranks of 11 and 12.
  3. Find the data values corresponding to the integer ranks derived from step 2. For convenience, let’s call these values PPercentileDown and PPercentileUp.
  4. Calculate the percentile using linear interpolation.

Let’s start with a simple dataset to illustrate the concepts. In figure 1, the original dataset is in the table on the left. The corresponding PowerPivot PivotTable is on the right.

Figure 1- Original Dataset

Step 1 – Calculate the ascending rank of values in the dataset

Fortunately, the newly minted RANK function in Denali makes the calculation a lot easier than the rank formula we had to create in V1. The rank formula is simply:

=RANKX(ALL(Data),[Sum of Value],,1)

Figure 2 – Rank Measure

Step 2 – Calculate the rank corresponding to required percentile

We will first calculate the rank corresponding to the 25th percentile (inclusive) in the dataset. Recalling the formula n=P(N-1)/100+1, this translates into the following DAX formula:

=CALCULATE(

      (COUNTROWS(Data)-1)*25/100+1,


      ALL(Data)

          )

In this case, the calculated rank is 11.25.

Figure 3 – Rank of 25th percentile

Step 3 – Find the data values corresponding to the integer ranks derived from step 2

Our 25th percentile value is somewhere between the values corresponding to the 11th and 12th rank, as shown in figure 3. We need formulas to extract these values as measures. The 25thPercentileDown formula is:

=MAXX(

      FILTER(

             ALL(Data),

             [Rank] = ROUNDDOWN([25PctRank_INC],0)

            ),

      [Sum of Value]

     )

We are returning the Max of Sum of Value, filtered so that [Rank] is equal to the rank of the 25th percentile, rounded down to the nearest integer rank. In other words, we are filtering the table to return only values where [Rank] = 11. In the case of ties, there will be more than one value in the filtered table (all identical). This is the reason for using MAXX. However, MINX, or even AVERAGEX would calculate the same result.

For 25thPercentileUp, you may be inclined to use a similar formula, substituting the ROUNDDOWN function for ROUNDUP. However, in the event of ties for the 25th percentile, the filter will be incorrect. This is so because [Rank] will calculate the same rank for ties (11 in this case), and ROUNDUP([25PctRank_INC],0)will be = 12. We can get around the problem by filtering the table using the TOPN function instead:

=MAXX(

      TOPN(

           ROUNDUP([25PctRank_INC],0),

           ALL(Data),

           [Sum of Value],

     1

          ),

     [Sum of Value]

    )

The formula simply finds the max of the top 12 values in the dataset (i.e. 11.25 rounded up to 12). The result is not affected by ties in the rank. Figure 4 shows the resultant measures.

Figure 4 – Rankup and Rankdown

Step 4 – Calculate the percentile using linear interpolation

Now that we have the values between which we need to interpolate to find the actual percentile, we use these values in a simple interpolation formula:

=[25thPercentileDown]+([25thPercentileUp]-[25thPercentileDown])*([25PctRank_INC]-ROUNDDOWN([25PctRank_INC],0))

= 1891.07 + (2073.29-1891.07)*(11.25-11) = 1891.07 + 182.22*0.25 = 1936.63

In figure 5, I’ve completed the calculations for the 50th percentile and 75th percentile and removed intermediate formulas from the PivotTable.

Figure 5 – Percentiles

Figure 5 illustrates that that values we calculated are identical to values calculated by using Excel’s PERCENTILE.INC function. Note that:

  • The 25th percentile is also the first quartile
  • The 50th percentile is the median (i.e. the same value returned by Excel’s MEDIAN function)
  • The 75th percentile is also the third quartile

Alternative median calculation

If we need to calculate the median without regard for percentiles, we could use the following formula:

=IF(

    MOD(

        COUNTROWS(ALL(Data)),

        2

  )=0,

    AVERAGEX(

             TOPN(

                  2,

                  TOPN(

                       (COUNTROWS(ALL(Data))/2)+1,

                       ALL(Data),

                       [Sum of Value],

                       1

                      ),

                  [Sum of Value]

                 ),

             [Sum of Value]

            ),

   MAXX(

        TOPN(

             COUNTROWS(ALL(Data))/2,

             ALL(Data),

             [Sum of Value],

            ),

        [Sum of Value]

       )

   )

Essentially, we want to return the value in the middle of the range, after sorting the range in ascending order. The calculation we use depends on whether there are an even or odd number of values in our dataset. The first part of the formula, MOD(COUNTROWS(ALL(Data)),2)=0, determines if the dataset has an even number of values. I would have preferred to use ISEVEN (and thus eliminate an extra function call), but alas, Excel’s ISEVEN and ISODD functions didn’t make it into DAX.

If the number of values is even, we need to get the top value in the first half of the dataset, plus the next higher value (because we have to average these two values to get the median). We accomplish this goal in the following part of the formula:

TOPN(

     (COUNTROWS(ALL(Data))/2)+1,

     ALL(Data),

     [Sum of Value],

     1

    )

Based on our sample dataset, the preceding formula will return a table with values where ID=1 to ID=22. Next, we get the top 2 values from this new table (ID=21 and ID=22). To accomplish this task, we wrap another TOPN function around the new table and sort in descending order:

TOPN(
     2,

     TOPN(

          (COUNTROWS(ALL(Data))/2)+1,

          ALL(Data),

          [Sum of Value],

          1

         ),

    [Sum of Value]

   ),

Finally, we average the two values:

AVERAGEX(

         TOPN(

              2,

              TOPN(

                   (COUNTROWS(ALL(Data))/2)+1,

                   ALL(Data),

                   [Sum of Value],

                   1

                  ),

            [Sum of Value]

           ),

        [Sum of Value]

       ),

If the number of values in the dataset is odd, we just return the middle number, using the formula in the final argument of IF:

MAXX(

     TOPN(

          COUNTROWS(ALL(Data))/2,

          ALL(Data),

          [Sum of Value],

         ),

    [Sum of Value]

   )

Note that TOPN rounds a number up or down arithmetically, if the first argument is a decimal value. Dividing an odd number by 2 will therefore always round the number to the next higher value.

Percentile INC vs. EXC

At the beginning of this post, I mentioned that to compute percentiles corresponding to Excel’s PERCENTILE.EXC, you use the formula n=P(N+1)/100 to get the rank for a given percentile. When, therefore, should you use one versus the other? The answer is similar to the use of STDEV.S/STDEV.P or VAR.S/VAR.P. PERCENTILE.EXC, which is a new function in Excel 2010, generally provides better results when we are using a sample from a larger dataset (where it’s unlikely that the sample would include the dataset’s zero and one-hundredth percentile values – and thus these percentiles are excluded). PERCENTILE.INC is better when we’re working with the entire dataset.

My rants of the day

Rant #1

I can’t help but end this post with a couple of rants. I’ll start with my biggest irritation in all of PowerPivot. WE CAN’T HIDE MEASURES. That’s bold, underlined, and highlighted. This situation is quite unfortunate. Imagine if you couldn’t prevent a VBA function from surfacing in the user interface (by declaring the function as private). Imagine if you couldn’t reuse routines in your development work, but had to rewrite them over and over. Imagine if you couldn’t validate complex expressions and improve performance by maintaining intermediate formulas. Without being able to hide measures from client tools, we are faced with all of the preceding issues. We can hide table columns, so why are measures any different? In Analysis Services, when you create a calculated measure, you have the option to have the measure hidden or displayed in client tools. Why does this logic make sense for calculated measures in AS, but not in PowerPivot?

Look at my field list for percentiles in figure 6 (some not mentioned in this post):

Figure 6 – PowerPivot Field List

Of all the measures shown, I want the client to see about half the number of measures in the list. The rest are intermediate results for development purposes only. To be honest, I don’t know which I find more baffling; that the development team can’t see the absolute necessity of this feature, or that I seem to be the only being on the planet that has a problem with this issue. Grrr. Unlike some very brilliant folks, I am unable to construct a half-page worth of a complex DAX formula, without the need to validate intermediate results.

On this matter, I think that I’m in good company. Flipping through the second edition of “The Microsoft Data Warehouse Toolkit,” written by authors from the highly regarded Kimball Group, I came across the following statement:

The best way to create calculations and measures in PowerPivot is to break long, complex formulas up into their component parts and build them incrementally. This allows you to experiment with the data and quickly see the results of your formulas.

Great advice, but incredibly, they don’t address how you are supposed to keep all of those intermediate results from the client’s view. The only solution currently available, and it’s a poor one from a performance and maintenance standpoint, is to merge the intermediate formulas into one mega-formula. Using perspectives don’t help either, since you can’t hide the original list of measures.

Rant #2

I’m glad to see that Denali includes TOPN, as this is a very useful function. But where is TOPNPERCENT? Can you name a product where these two functions don’t appear together, like Gemini or Siamese twins? Excel conditional formatting? Both there. Microsoft Access? Both there. T-SQL? Both there. MDX? Ditto. I find this level of inconsistency irritating. You can create a TOPN percent measure by using a formula like TOPN(COUNTROWS(Data)*N%), which is understandably trivial. Well, you can create a distinct count measure by using COUNT(DISTINCT), which is also trivial, but we now have DISTINCTCOUNT because creating distinct count measures are so common (and because the BI Pros all requested it J).

On the positive side, there is the “issue” of TOPN returning more than N records if there are ties in the Nth record. Microsoft’s products are somewhat varied in behavior. PowerPivot, Excel Conditional Formatting, and Access always return more than N values if there is a tie in the Nth value. MDX in Analysis Services never returns more than N values, so if there are ties in N, values are arbitrarily dropped (in contrast, SAS Analytics provides an option in MDX TopCount/BottomCount, which allows you to specify whether or not to return ties in N). T-SQL’s TOP predicate provides a WITH TIES qualifier to give you control over returning ties in value N. Personally, I see no logic in dropping tied values in the Nth record. Which records do you drop, and which record do you keep? Also, if there are ties in any of the values within the top N, all the ties are returned. Therefore, I think that PowerPivot, Excel and Access provide the correct logic.

Next…

In Part II, I will continue the discussion of percentiles using a slightly more complex example and demonstrate how to use percentile measures to create a box & whisker PivotChart!


DAX Autocomplete Stops Working–A Workable “Fix”

September 27, 2011

 
image

“There, good as new.”

By Now, You’ve Probably Seen This

So you’re typing along, writing a DAX measure.  And suddenly, poof!  Autocomplete stops working.

This happens to me all day, every day.  Especially when I use the [Measure]( filter expression ) syntax.  Here’s a quick “before and after” example:

PowerPivot autocomplete is working

Autocomplete is Working for Table/Column Names

PowerPivot autocomplete is NOT working

Same Table Name, But Autocomplete is NOT Working

Grab Your Duct Tape…

A guy named Scott Kaylie showed me this about six months ago and I am just now getting around to posting it.  Yes, my backlog of posts has in fact grown to be at least six months long Sad smile

Anyway, here’s the trick.  Just put a * in between the measure and the open parenthesis, as if you were going to multiply the measure by another expression:

Tricking autocomplete into working with *

OK, so then just go ahead and finish writing the measure, leaving the * in there:

Tricking autocomplete into working with *

And then, at the very end, go back and remove the *…

Tricking autocomplete into working with *

Voila.  Ugly yes, but much better than having no workaround.  Thanks Scott.

Any other questions please consult the following home repair manual:

From GraphJam


Detecting ALL() – Detecting When Not Filtered

September 20, 2011

 
image

 

“I like to carry it, you never know when you’re gonna need it.”

-The  much-missed John Candy as Uncle Buck

 

 

A technique that you may need someday

File this under “you may never need this, but when you do, you’ll know it immediately.”  I’d call this one a solid 4 on the DAX Spicy Scale.

Why I needed this:  I had two different calendar tables, one at the Date level and one that was a custom calendar, Periods table.  Most of the measures in this model are written to be used with one table or the other, and I never have to “cross the streams.”

But then I ran into a case where a measure I had written to be used with the Periods table, suddenly needed to also be used on a pivot that was only filtered by Dates.  And I didn’t want to write a new version of this measure (for reasons that are mostly irrelevant here).

Of course, when I put the Period-focused measure on a Date-focused pivot, and there were no fields from Periods on the pivot, well…  the Period-focused measure returned junk.

What I decided to do, then, was detect if Periods was the aptly-named “Sir Not Appearing in This Film” and then assign a Period in that case.  In other words, detect if the Periods table was not on the pivot, and if not, FORCE a Period value into the evaluation of my measure.

Detecting ALL(), or the Absence of Filter on a Field or Table

Here is the final measure formula I used, color-coded for identify its parts:

IF(COUNTROWS(Periods) = COUNTROWS(ALL(Periods))-1,
   CALCULATE(My Original Measure,
      FILTER(Periods, Periods[Period] = [LatestPeriod])
   ),
  
My Original Measure
)

Let’s go part-by-part:

The Detector

IF(COUNTROWS(Periods) = COUNTROWS(ALL(Periods))-1

The “detector” counts the rows of Periods in the current pivot context, and compares that to the number of rows in the Periods table with all filters removed by ALL().  Pretty straightforward right?

So…  why the –1 at the end?  The reasons for that are slightly academic…  academic enough that I don’t really want to understand in great depth.  Let’s go with  the short version:  in this case, when I counted the rows of ALL(Periods), ALL() was kind enough to include the “blank” row of the Periods table.

What’s that?  You say you don’t HAVE a blank row in your Periods table???  Well, neither do I.  But you MIGHT have some rows in your Sales table that have a blank value for the [PeriodID] column.  Or maybe you have rows in the Sales table that DO have a [PeriodID] value, but that value does not appear in your Periods table.  Either way, you implicitly DO have blanks in your Periods table, and COUNTROWS(ALL()) decides to tell you about it.  So you’ll subtract one and like it, soldier!

(I have not tested, honestly, whether you always need – 1 in this detector.  If your Sales table is perfectly clean, maybe the – 1 is not needed, and maybe it is.  Someone let me know OK?  And I bet three-to-one that the answer comes from Italy.)

The Original Measure

My Original Measure

OK, this part IS straightforward.  Whatever my original measure was named, or perhaps its full original formula, appears here.  Moving on…

The Filter for the “No Periods Selected” Case

FILTER(Periods, Periods[Period] = [LatestPeriod])

OK, in the case that Periods is absent from the pivot, I take the original measure and then use the FILTER function to pretend that the Periods[Period] column IS on the pivot, and filtered to a value matching the most recent period, as calculated by my [LatestPeriod] measure.

But really, this part is going to be VERY different based on the circumstance.  Maybe you want to use a completely different measure, for instance.  Or set Periods to the first period this month.  Or the period corresponding to the current filter context from the Date table.

I merely included the “meat” of what I did here to drive home the intent.

Two Notes

One – note that there is NO difference between “the Periods table is not used on the pivot at all” and “the Periods table IS on the pivot but unfiltered in the current context.”  So if you have Periods on a slicer but nothing is selected, the detector will “go off.”  And if you have Periods on rows, the detector will still “go off” in the grand total cell of the pivot (and maybe in certain subtotal cells as well).

Two – I was messing around in this area when I discovered the need for the long-simmering Precedence Project.  In other words, when you start messing around with overriding filter context like I did with the FILTER() function above, and you’ve got a number of tables and relationships in play, every now and then you see something you don’t expect.  For that reason, I plan to return to the Precedence Project shortly.


The Incredible “Iffer-Blanker-Filter” Measure

September 6, 2011

Replica of Shawshank Postcard

“Not long after the warden deprived us of his company, I got a postcard in the mail.
It was BLANK(), but the postmark said… Fort Hancock, Texas.”

Omitting/Filtering Out Pivot Rows Based on an IF()

I’ve had this on my list to share for a long time, but David Churchward’s recent guest post bubbled it to the top.  At one point in that post, he had a nested IF measure that basically looked like this:

IF( <Non-Trivial Conditional Test>,
[Return an Existing Measure],
BLANK()
)

I find myself doing something like that once a week or so.  You just want the pivot to return certain rows – customers of a certain profit margin in David’s case, but it can be stores whose sales reflect certain characteristics, etc.

If it was just a static filter criteria, like “products that are blue,” then you wouldn’t need to put an IF in your measure at all.  You’d just set a filter in the pivot itself and use normal measures.

The key about these examples is that the filter criteria is dynamic – it takes into account selections made on slicers for instance.  And only measure logic can be dynamic like that.

But it’s Very Tedious to Write a Bunch of Measures Like That

OK, so you’ve written one measure that returns, say, [Sales] if the criteria is met, and BLANK() otherwise.  But your pivot needs to contain 6-7 different measures.  Do you want to go write that ugly IF measure another 6 times?  Doubtful.  And it would really suck if, sometime later, you realized that you needed to change the filter criteria.

David faced this problem in his post, and decided to write a single “flag” measure that returned 0 or 1 based on the conditional test, and then added a “Value” filter to the pivot based on that measure:

Pretty clever.  That way he doesn’t have to write the conditional test over and over in multiple new measures.  And given that it’s a “value” filter, it IS dynamic.

It caught my eye, because I’ve been doing this a different way.

My Approach – A “Flag” Measure that Returns 1 or BLANK()

I start out much the same way, with a single “flag” measure that evaluates the conditional test using an IF(), and that returns 1 if the condition is true.  But when it’s false, it returns BLANK() instead of 0:

[IfferBlanker]=
IF( <Non-Trivial Conditional Test>,
1,
BLANK()
)

Next, with my approach, I DO have to create multiple measures, which is a drawback.  However, each of these measures is very simple, like this:

[Filtered Sales]=
[Sales] * [IfferBlanker]

[Filtered Profit Margin]=
[Profit Margin] * [IfferBlanker]

Those resulting measures then return blanks if the condition isn’t met, and if every measure on a row returns blank, the pivot doesn’t show it (unless you change the default setting for blanks).  So it has the same end result as David’s approach.

Which Method is Superior?

I don’t think my way is better.  I plan to use David’s trick a lot actually – it’s a lot quicker to write one measure and apply a value filter than to write 6 or 7 measures, even if those measures are simple.  But I do think both methods have their usefulness.

For instance, if the report consumers are prone to fiddling, they may remove the value filter, perhaps even accidentally, and not realize that they are now seeing customers (or stores etc.) that do NOT fit the criteria.  And if the consumers of the model are  building their own pivots from scratch, they may not know how you applied the filter.  You may also sometimes want to mix and match filtered and unfiltered measures in a single pivot, which the value filter approach doesn’t support.  And if you’re going to build a LOT of pivots that require this filtering, investing in a handful of filtered measures may pay off in terms of effort, too.

I think I’m going to start using David’s approach about half the time, and stick with “Iffer-Blanker-Filter” measures for the other half, based on the situation.  Always good to have multiple tricks Smile


Guest Post – Greater Than/Less Than/Between Slicers

September 2, 2011

 
I get a reasonable number of questions in email each week, and rarely have time to respond to all of them.  Last week though I received a question that I couldn’t resist – it was just too “close to home” for me, on a theme that I like to call “with DAX, you can make slicers do ANYTHING.”  It reminded me a lot of the technique I used in the A/B Campaign Analysis post, but in a more generalized sense.

Pressed for time, I just dashed off a few hints in a reply and hoped that would be enough.  Well the bright dude on the other end took about 10 minutes to digest the hints and solve his problem, and then asked if there was something he could do to pay me back.  I asked if he’d be willing to write up the whole solution as a guest post, he said yes, and later that same day I received the following.

Slicing Data With Greater Than, Less Than, or Between

Guest Post by David Churchward

With PowerPivot’s ability to digest such huge datasets, it’s easy to get carried away and present dashboards that display too much information for users to digest.  Having fallen into this trap, I also found that as usage increases, users come up with more and more exceptions that need to be handled by the dashboard to present their required view on life.

Here’s a request I get all the time:  “I just want to see customers in my report if they have a margin pct greater than X – I don’t want to see any other customers.”  And of course, that also takes the form of “customers with margin less than Y” and “customers with margin between X and Y.”

In order to solve this issue, I started by formulating “Customer Sets” where I would condition Gross Margin bands, for example, Negative Margin, Margin < 5% and so on. I then realised that whichever sets I came up with, one of our users would inevitably find a requirement for a new one.  As a result, I worked on passing all of the onus onto the users by giving them a series of parameters from which they could select their own datasets.

The outcome was a series of slicers as below:

clip_image002

This gives users the means to select customers that fit within parameters that are within their control.

How It’s Done

Firstly three tables need to be created and then linked through to PowerPivot

clip_image003

The first table, let’s call it “ParameterType” contains three entries each with a numeric code associated to it. This code is simply to make the measures that we will create easier to write, less prone to typo errors and allows us to use “Max” or “Min” functions to ensure that we evaluate to one result and avoid that fantastic error message of “…value cannot be determined in the current context…..”.

The second and third tables, let’s call them “Parameter X” and “Parameter Y”, simply contain a series of GM percentages together with a Percentage Name.  The Percentage Name s are simply used for aesthetic reasons on the slicer as users don’t normally want to see 0.05 to represent 5%.

Once the tables are in place, there’s no need to link them together.  We’re simply going to use them as a means to collect the users parameters.

We now need to create 3 measures to determine which items have been selected by the user and make them available to our ultimate Gross Margin measure.  These are as follows:

GM_Parameter_Type  =MAX(ParameterType[GM_Parameter_Code])

Min_GM_Param_X =MIN(ParameterX[GM_Param_X])

Max_GM_Param_Y =MAX(ParameterY[GM_Param_Y])

This is all straight forward enough, but then I realised that I have to accommodate the fact that users are unpredictable and could select a parameter X that is higher than parameter Y.  In addition, because PowerPivot refreshes for each slicer operation, it will be too easy to enter a backwards range.  Therefore, we simply have to deal with it using two new measures

Selected_Max_PC

= IF(
    [Min_GM_Param_X] > [Max_GM_Param_Y],
    [Min_GM_Param_X],
    [Max_GM_Param_Y]
 
  )

Selected_Min_PC

= IF(
    [Max_GM_Param_Y] < [Min_GM_Param_X],
    [Max_GM_Param_Y],
    [Min_GM_Param_X]
 
  )

We’re now ready to construct our measures to deliver the correct Gross Margin values filtered by the ranges selected in our slicers.  In this example, my Gross Margin transactions are held in a table called FACT_TRANS and I have two measures called Margin and Marg_PCT (Gross Margin %).  I need three measures to evaluate each of the potential parameter types that could be selected – “Between X% and Y%” (measure called GM_BetweenXandY), “Greater Than X%” (measure called GM_GreaterThanX) and “Less Than Y%” (measure called GM_LessThanY).

GM_BetweenXandY

=IF(
[MARG_PCT]>=[Selected_Min_PC] &&
[Marg_PCT]<=[Selected_Max_PC],
[Margin],
BLANK()
  
)

GM_GreaterThanX

=IF(
[Marg_PCT]>=[Min_GM_Param_X],
[Margin],
BLANK()
  
)

GM_LessThanY

=IF(
[Marg_PCT]<=[Max_GM_Param_Y],
[Margin],
BLANK()
   )

We now have our three measures that can be called depending on the users selection in the ParameterType slicer.  The measure that does this is

Marg_XY

=if(
   [GM_Parameter_Type]=1,
   [GM_GreaterThanX],
   If(
    
[GM_Parameter_Type]=2,
     [GM_BetweenXandY],
     If(
       [GM_Parameter_Type]=3,
       [GM_LessThanY],
       BLANK()
       )

     )

   )

My dashboard is using GM% to filter those customers to be shown.  However, I want to show the Sales value as well and I want the filter to apply to those measures as well.  Therefore, I have two choices:

1. Construct further measures similar to Marg_XY but representing the sales measure (that returns BLANK() in cases that I don’t want it to show up) or

2. Create a flag to highlight the records to be shown and then filter on that flag.

I’m going to opt for number 2.  The measure used to create the flag is as follows:

Marg_XY_FLAG

=IF(FACT_TRANS[Marg_XY]=BLANK(),0,1)

I then set a simple value filter for Marg_XY_FLAG = 1 on the pivot table.

clip_image005

The slicers used don’t need cross filtering as they stand in isolation.  For performance sake, don’t forget to deselect the “Visually indicate items with no data” and “Show items with no data last” as shown below

clip_image007

The result when included in my customer sales analysis, subject to a bit of tidying up, is shown below.  I’ve used a random code to disguise Customer Names.

clip_image008

With 1.2million records in my fact table, the refresh rate on each slicer operation is under 1 second.  This would be even quicker if I had a linked customer dimension table but as I’m using random values I didn’t consider this worthwhile.


Distinct Count in PowerPivot v2 – MUCH Faster

July 28, 2011

A quick update, as tonight I managed to sit down and experiment a bit with the V2 Beta for the first time in many days.

(OK, actually, I am standing up…  because I have a fantastic new treadmill desk and have walked nearly 8 miles today while working, but that’s a topic for another post).

I was working with one of our HostedPowerPivot clients today who was observing slower-than-expected slicer click performance in one of their reports.  And after some sleuthing, I found that they had a “distinct count” measure in their model, something like this:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

Now, that’s a pretty useful measure in many situations, like “does product X sell every day?”  Simply taking a count of the date column doesn’t cut it, because if it sells twice on one day and not at all the next day, the count is still 2, and you want it to be 1.

Trouble is, in PowerPivot v1, that operation can be slow.  I won’t bore you with all the details, because frankly, I don’t know all of them, heh heh.  But I do know that PowerPivot v1 does end up creating all kinds of temporary new tables in memory behind the scenes when it is evaluating the measure.

Anyway, I’d heard a long time ago that PowerPivot V2 was going to make distinct counts a lot faster, so I thought I’d try it out.

The Test Case

First I needed to create a test case that was challenging.  I don’t want to compare “fast query vs. fast query” because that often gets skewed by fixed overhead.  Much better to test a “worst case scenario.”

So, I took a 15 million row table, and added a calculated column to it.  Calculated columns in PowerPivot are much less compressed than imported columns, and are therefore more challenging for the measure engine to scan through, looking for duplicates etc. which is required for a distinct count.

And then, to make matters worse, I just used the RAND() function so that there are many unique values:

image

Cutting to the Chase:  The Results

A distinct count measure in PowerPivot v1 took 35 seconds to complete in my test pivot.  This was a big part of why our client was seeing slow perf today.

That same measure, in that same pivot, took less than a second in the V2 beta.

Wow.  This is gonna be nice.

One Last Note:  Two Ways to Write the Formula

PowerPivot v1 did not have a dedicated function for distinct count, you had to do the countrows of distinct thing like what I showed above.

But in PowerPivot v2, there IS a DISTINCTCOUNT() function.  So you can write your distinct count measures in two ways:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

[Measure] = DISTINCTCOUNT(Table[Column])

That new function is nice, makes for a more readable formula.  BUT…  the performance is still the same.  Either way I wrote it, the measure was sub-second fast.  Very nice.


Guest Post: TOPN in PowerPivot V2

July 20, 2011

Well folks I haven’t had much time to play with the V2 Beta (aka CTP3) yet.  A lot going on over the past week:

  1. Authoring a brand new suite of models and reports for our Duane Reade DR-Direct solution
  2. A nasty crashing bug in PowerPivot for SharePoint that we’ve now learned to catch and repair
  3. I submitted a PowerPivot article to CIMA Insight magazine (for August)
  4. Running refresh tests with PowerPivot servers in Chicago pulling data from SQL in Dallas (over Internet VPN – more on this later) for the HostedPowerPivot offering

So in the meantime, David Hager has a short article to share on the new TOPN function.  I haven’t had a chance to try it yet, and I’m a little surprised that it doesn’t have special handling for ties (as he’s discovered).

David Hager on TOPN

Using the New PowerPivot Version 2 TOPN Function: Up, Down and Between

One of the new DAX functions included in the Denali PowerPivot CTP3 release is the TOPN function. This function returns a table which can be aggregated for the top n values. Prior to the release of this function, workarounds were available to achieve the same results, mainly through DAX ranking formulas (and Denali PowerPivot now has native functions for ranking too!). So, a typical DAX formula which uses the TOPN function is:

=SUMX(TOPN(N,SalesTable,SalesTable[SalesTotal]),SalesTable[SalesTotal])

where SalesTable is the table and SalesTotal is the column in that table containing the top n results to be analyzed.

Well, that was easy enough. However, there is a systemic problem here and it’s mainly data-related. In order to understand why the TOPN function don’t work the way you might expect, read the following that comes from the Denali PowerPivot CTP3 help documentation on TOPN.

“If there is a tie, in order_by values, at the Nth row of the table, then all tied rows are returned. Then, when there are ties at the Nth row the function might return more than n rows.”

In English, that means if there are duplicate values in your data, the TOPN function will return the WRONG answer if the Nth value is adjacent to duplicates. In order to get a meaningful result then, the data must be converted so that each data point has a unique value. The question is, how do you do that and retain the integrity of the data? An Excel trick can be used here as a workaround. This method requires that each value be incremented by a very small but unique value. The best implementation of this comes in the form of a calculated column with the following formula

SalesTable[UniqueIncremValue] = 
IF(RANDBETWEEN(-1,1)=-1,-RAND()*2,RAND())*0.000001

The goal of this formula is to create a column of unique numbers add up basically to zero. There is a balance that is required in this formula between affecting data and insuring that all values in the dataset are unique. The RAND function returns a 15-place decimal, which is the limit in Excel. However, if all of the decimal places were used, the value of the number could be as high as 1 or as low as -2. By multiplying the random numbers by a very small number, the numbers will not interfere with the individual values if they themselves are not small numbers. However, if the multiplier is made too small, not all of the random numbers may be unique depending on the size of the dataset. So, if you are going to use this workaround, you need to test [UniqueIncremValue] to see if all of the values are unique. The best way to do this is to compare its distinct count with the distinct count of a primary key column in the table. If they are equal, TOPN will work for any value of N.

Now, the original formula will work fine if the new calculated column is used in place of SalesTable[SalesTotal].

SalesTable[SalesTotalMod]=
SalesTable[SalesTotal]+ SalesTable[UniqueIncremValue]

It should also be noted that the TOPN function can used in a DAX like the one shown below to afford a BOTTONM result. Note that this formula takes advantage of the new DISTICTCOUNT function.

=SUMX(SalesTable,[SalesTotalMod])
-SUMX(
     TOPN(
         DISTINCTCOUNT(SalesTable[SalesTotalMod])- N,
         SalesTable,
         SalesTable[SalesTotalMod]
         ),
     SalesTable[SalesTotalMod]
    
)

I think you can see that the TOPN function is a valuable new item for your DAX toolbox. Have fun!


Running Totals Without a Traditional Calendar Table

June 30, 2011

PowerPivot provides a host of great functions like DATESYTD, DATESMTD, DATESBETWEEN, etc. that are useful for calculating many things, including a running total.

But with the exception of DATESBETWEEN, I seldom get to use any of those “time intelligence” functions, for the simple reason that our clients almost never operate on a traditional calendar.

I very often find myself working with a “calendar” that looks like this:

Non Traditional Calendar

Can’t Use Time Intelligence Functions With a Calendar Like This

And sometimes I even find myself with “calendars” that don’t have any date columns in them at all:

Another Non Traditional Calendar

This Table DOES Truly Represent the Business Calendar But Has No Dates in It!

So what do you do when you still need a “Year to Date” Total?

Silly humans.  They don’t care that the data is structured one way or another under the hood.  They just keep insisting on seeing useful things, like Year to Date totals.  They don’t want to hear how the blender is constructed, they just want their daiquiri.

So, what’s a report designer to do?  Give up?  No way.  We make daiquiris anyway.

I’m going to use that second calendar above, the one that has no dates in it.  In fact this is the same data set I have been using for the Precedence Project.  (Really, this is Part Three in disguise).  So let’s return to a familiar pivot:

image

Familiar Starting Point

It has Year and MonthNum on rows, and the simple measure Total Sales.  I want to end up with this:

image

Desired Result

Getting to that desired result, in my experience, is something you either stumble upon quickly or flail around forever and never find.  In fact, finding myself in a situation like this is what triggered me to start the Precedence Project in the first place.

Cutting to the chase:  this formula works

To keep this post short and sweet, I’m just going to share a working formula.  I hesitate to call it “the” formula, because there are multiple variations that work, and some fit certain biz requirements better than others.

So here is one that works.  It has some quirks that I will iron out in the next post.

[YTD Sales]=
[Total Sales](
                FILTER(
                  ALL(Periods),
                  Periods[PeriodNum]<=MAX(Periods[PeriodNum])
                ),
                VALUES(Periods[Year])
             )

In short, this  measure clears all filters on the entire Periods table, then adds back two filters – restores the current context for Year (using VALUES), and limits the PeriodNum to be less than or equal to the current context (I will explain later why I used MAX and not just VALUES).   Note that I am using PeriodNum (which keeps increasing across years and is unique to a given Month/Year combo), not MonthNum (which is on the pivot and “resets” to 1 for January of each year).

Next Post:  Explaining the Formula, and Showing How NOT to do it

Explaining that relatively simple formula, if I do a thorough job, will consume its own post.  And I also want to cover some other approaches – ones that seem like they should work but do not.  Because the things that DON’T work are even more educational than the things that do.


Precedence Part 3: ALLEXCEPT vs. ALL w/ VALUES

June 22, 2011

 
Snatch - A Damn Fine Movie

 

“However…  you do have ALL() the characteristics of a dog, Gary.  ALLEXCEPT() loyalty…”

 

 

 

This will be a quick one.  I think.

In part two, we left off with the observation that VALUES() can “trump” ALL() even when VALUES() is applied to a column that is not on the pivot:

[ALL then VALUES of field not on pivot] =
     [Total Sales](ALL(Periods),VALUES(Periods[Period Num]))

image_thumb5

One thing I have often puzzled over is this:  how does a combination of ALL(Table), VALUES(Table1[ColumnA]) compare to using ALLEXCEPT(Table1, Table1[ColumnA])?

In essence, those should do the same things right?  In one case, you set the whole table to ALL(), but then “restore” ColumnA to its pivot context using VALUES().  In the other case, you set every field in the table except ColumnA to ALL(), leaving ColumnA in the context set by the pivot.

They should result in the same thing, right?  Let’s check.

ALLEXCEPT() using columns that are on the pivot

[ALLEXCEPT fields on Pivot]=
[Total Sales](ALLEXCEPT(Periods,Periods[Year],
              Periods[FiscalQuarter]))

There are two fields on the pivot from the Periods table – Year and FiscalQuarter, and both of those are excluded from being “all’d” in the measure.  Results:

image

OK, good news.  Those are indeed the same results as the unfiltered base measure, and that’s what we got from ALL() plus VALUES() as well.  So from this, we can say that ALL() plus VALUES() is the same as ALLEXCEPT().

Not so fast though…

ALLEXCEPT() using columns that are NOT on the pivot

Let’s try ALLEXCEPT() using a field that is NOT on the pivot.  Remember, VALUES() trumped ALL() in this case, too, and the results were the same as the unfiltered base measure.  Will ALLEXCEPT follow suit?

[ALLEXCEPT field NOT on Pivot]=
[Total Sales](ALLEXCEPT(Periods,Periods[Period Num]))

image

OK, that was a bit unexpected.  ALLEXCEPT() returned the same results as pure ALL().  So ALLEXCEPT() does *not* preserve the pivot context of columns NOT on the pivot, whereas VALUES() *does* pick those up.

I’m not sure if that’s intentional on the part of the PowerPivot team, or if it’s just a quirk.

Summary

1) ALLEXCEPT() does behave precisely like ALL(), VALUES() as long as the “exempted” columns are columns on the pivot.

2) ALLEXCEPT() does NOT preserve pivot context, however, on columns that are not on the pivot.

Oh, and one more…

No need to use ALLEXCEPT() instead of ALL() when using VALUES()

In the past, whenever I wanted to do an ALL() with a VALUES(), I typically used an ALLEXCEPT() instead of the ALL().

For instance, I would do something like this:

[Total Sales](ALLEXCEPT(Periods,Periods[ColumnA]),
              VALUES(Periods[ColumnA]))

That turns out to have been an overly-careful habit.  As we’ve seen here, and in part two, VALUES() trumps the ALL(), and does not need the “help” of ALLEXCEPT() instead.

Yes, I’m sure many of you will say that’s obvious from everything covered so far, but just in case you find yourself tempted to use ALLEXCEPT() to exempt the column you are preserving with VALUES(), there’s no need.