Creating Accurate Percentile Measures in DAX – Part II

Intro from Rob:  today, Colin finishes in Percentile Measures topic from last week.

Note that 1) this technique requires the new release of PowerPivot that is currently still in beta (the Denali release)…

…and 2) I would rate this one a solid 3 or 4 on the spicy function scale.  Not one of the beginner techniques, in other words Smile

Guest Post by Colin Banfield

October 01, 2011

In Part I, I discussed my approach to creating percentile measures in DAX. In this part, we will add an extra dimension to our dataset and create a box & whisker PivotChart. The expanded dataset is shown in figure 1, with the source table on the left, and the resulting PowerPivot PivotTable on the right.

clip_image001

Figure 1 – Expanded dataset with Set column

In the expanded dataset, I have added a column called Set. The goal is to create percentile measures that are constrained to each set. Following the 4-step approach outlined in Part I, we start with the rank calculation. Recall that our rank formula in Part I was:

=RANKX(ALL(Data),[Sum of Value],,1)

This formula works when we are working with the entire table. However, with the addition of the Set column, we need to calculate values for the sets A, B, & C in a new filter context (i.e. one that excludes the ID column). We need RANKX to rank the values for each set independently. We employ the CALCULATETABLE function to achieve the desired result:

=RANKX(

                  CALCULATETABLE(
                                                       Data,
                                                      ALL(Data[ID])
                                                    ),
                 [Sum of Value],,1
               )

To show the ranking within each set more easily, I’ve sorted the source table by set and value, both in ascending order. Next, I created a calculated column that uses a lookup formula to get the rank measure from the PivotTable.

clip_image002

Figure 2 – Rank measure

With the inclusion of the Set column in the PivotTable, we must also modify other measures used in Part I.

Rank for 25th percentile ([25PctRank_INC])

From (Part I):

=CALCULATE(
                           (COUNTROWS(Data)-1)*25/100+1,
                           ALL(Data)
                         )

To:

=CALCULATE(
                           (COUNTROWS(Data)-1)*25/100+1,
                           ALL(Data[ID])
                         )

PercentileDown

From (Part I):

=MAXX(
                FILTER(
                             ALL(Data),
                             [Rank] = ROUNDDOWN([25PctRank_INC],0)
                           ),
              [Sum of Value]
            )

To:

=MAXX(
                FILTER(
                               ALL(Data[ID]),
                               [Rank] = ROUNDDOWN([25PctRank_INC],0)
                             ),
               [Sum of Value]
            )

PercentileUp

From (Part I):

=MAXX(
                  TOPN(
                               ROUNDUP([25PctRank_INC],0),
                              ALL(Data),
                             [Sum of Value],
                            1
                         ),
               [Sum of Value]
            )

To:

=MAXX(
                TOPN(
                             ROUNDUP([25PctRank_INC],0),
                             CALCULATETABLE(
                                                                   Data,
                                                                   ALL(Data[ID])
                                                                ),
                            [Sum of Value],
                            1
                          ),
               [Sum of Value]
            )

Or:

=CALCULATE(
                            MAXX(
                                          TOPN(
                                                       ROUNDUP([25PctRank_INC],0),
                                                       Data,
                                                       [Sum of Value],
                                                      1
                                                   ),
                                       [Sum of Value]
                                     ),
                         ALL(Data[ID])
                      )

The preceding alternative formulas are provided to illustrate how the same scalar result can be obtained using either CALCULATE or CALCULATETABLE. Similarly, we can use CALCULATETABLE instead of CALCULATE in the rank for the 25th percentile formula. I point this out because using CALCULATETABLE as an alternative to CALCULATE in applicable situations is rarely, if ever discussed.

25th, 50th, and 75th Percentiles

Recall the interpolation formula from Part I:

=[25thPercentileDown]+([25thPercentileUp]-[25thPercentileDown])*([25PctRank_INC]-ROUNDDOWN([25PctRank_INC],0))

After completing the measures for the 50th and 75th percentiles, our table is as shown in figure 3.

clip_image003

Figure 3 – Percentile measures, with comparisons to values returned by Excel’s PERCENTTILE.INC function

Creating a Box & Whisker PivotChart from Percentile Measures

It is difficult, if not impossible to analyze data directly in a table – whether or not the data is summarized. The purpose of a chart is (or should be) to provide us with insights into the data that we cannot derive by staring at the numbers. One such chart that takes advantage of percentile measures is the box & whisker plot (hereafter called a box plot). This chart is commonly used in areas such as statistical process control (SPC) and Six Sigma, but they can be quite useful for general business analysis. This chart is not native to Excel, so it’s not surprising that its use is not explored more often. The box plot visually shows the distribution of values in a dataset. The most common plots show the minimum value, maximum value, the range between the 25th and 75th percentile (known as the interquartile range), the median value and the mean (average value). The “box” portion is the interquartile range, and the two “whisker” parts are the range between the min and the 25th percentile, and the range between the 75th percentile and the max. The box plot is most effective when comparing the distribution of values across several data sets. A few examples of box plots for general business use include:

· Comparing the distribution of call hold times of agents over a period

· Comparing the distribution of price fluctuations across several products, or product parts

· Comparing the distribution of shipping weights for different service levels (e.g. normal, express, overnight). This data could provide useful intelligence for a shipping company

· Comparing the distribution of actual salaries across different salary scales. An excellent example is provided in this article here. The article explains how to interpret the plot and also illustrates a very important concept for making charts more meaningful i.e. adding additional context to a chart by providing reference information.

Basically, any related entities in your business that could benefit from analyzing a distribution of their values could benefit from a box plot.

The first step in creating the box plot PivotChart is to arrange the data appropriately. The actual construction of the plot is beyond the scope of this post, but the technique is explained in this tutorial by Excel charting guru, Jon Peltier. If you’re new to creating box plots in Excel, you should read the tutorial before continuing. Jon’s excellent charting tutorials have been around for a long time, and recently he has been updating them. In addition to his tutorials, Jon also has a few chart utilities for sale (and no, I don’t earn a commission for this plug).

The box plot PivotChart using our dataset is shown in figure 4

clip_image004

Figure 4 – Box plot PivotChart

Notes

1. I created a perspective called BoxPlot Data to contain only the measures required for the box plot

2. The measures are numbered to indicate the order that they must be added to the PivotTable

3. The box plot measure formulas are as follows:

[1-25th Percentile] = [25thPercentile_INC]

[2-50th Percentile] = [50thPercentile_INC]-[25thPercentile_INC]

[3-75th Percentile] = [75thPercentile_INC]-[50thPercentile_INC]

[4-Minimum] = [25thPercentile_INC]- MINX(ALL(Data[ID]),[Sum of Value])

[5-Maximum] = MAXX(ALL(Data[ID]),[Sum of Value])- [75thPercentile_INC]

[6-Mean] = AVERAGEX(ALL(Data[ID]),[Sum of Value])

4. I’ve created a chart template that can be used to create box plots. When creating a new plot, the dimension and measures should be added to the PivotTable before selecting the chart template from the Insert Chart dialog box.

5. After you create the chart from the template, the whiskers will not appear. This occurs because the whiskers are created using error bars, and external ranges are used to specify the data for the error bars. Unlike the rest of the chart, the template doesn’t know which ranges to use for error bars. You must add these manually, using the technique detailed in Jon’s tutorial.

6. The white median lines you see in figure 4 were created using a bit of formatting trickery. I used a white border for the bars. The outside borders don’t show because they are set against a white background.

Downloads

You can download a copy of the workbook & chart template from the following location:

http://powerpivotfaq.com/PowerPivot%20Samples/Forms/AllItems.aspx

14 Responses to Creating Accurate Percentile Measures in DAX – Part II

  1. Kyle says:

    I am working with products that have many tied ranks and are broken down into about 100 different sets (we call them departments). Since so many products have tied ranks, instead of using the COUNTROWS() function in calculating the 25thPctRank_INC I am thinking it would be wiser for me to simply find the max value from the rank column, within each item’s respective set.

    I receive the following error message when I replace COUNTROWS() with MAX: “Cannot find column ‘Rank’ in table ‘FactSales’.”

    Here is what I currently have as my formula:

    25thPctRank_INC
    =CALCULATE(
    (MAX( FactSales[Rank] ) -1*25/100+1,
    ALL(DimItem[SKU] )
    )

    How can I replace the portion of your formula for 25thPctRank_INC that counts the rows in a table listing all products with a formula that simply returns the max Rank value calculated for each product, while still respecting each product’s set/department?

    Any help you could offer, or simply a link to additional resources to get me going in the right direction would be much appreciated.

    -Kyle

    • Colin Banfield says:

      Kyle,

      Ties doesn’t make a difference to the results. However, there’s an error in one of my calculations, which is discussed here: http://www.powerpivotpro.com/2012/05/percentile-measures-in-dax-errata/

      [Rank] is not a column in the fact table – it is a measure. That’s why you’re getting the error message. In my sample dataset, I do show a rank column, but that’s there for illustrative purposes only – it’s not part of the “fact” table used in PowerPivot.

      Let me know if I can help further.

      • Kyle says:

        Colin,

        Thanks for your help, and for this excellent series of posts on percentile ranking as well.

        What I am trying to create is a measure that calculates each product’s rank (by units sold right now… eventually I’ll want to do the same to rank by gross margin earned, total sales $, and # of sale transactions) within its set, using a moving annual total of units sold. I intend to use this measure to ultimately illustrate the distribution of our investments in inventory over time (% of $ invested in “A” items, “B” items, etc. by week, for example).

        I am, however, getting different results from you in the Rank values because I have included the optional “Dense” enumeration to my RANKX() function. This way, after each tie, the next rank value is [previous rank] + 1. For example, if 5 values are tied with a rank of 8, the next value will receive a rank of 9 (instead of 13). I could be wrong, but I think that this is why the COUNTROWS() function in your 25thPctRank_INC formula is giving me some trouble.

        Broken down, here are my formulas:

        Moving Annual Total Sales Units (“MATSalesUnits”) =
        CALCULATE( SUM (FactSales[UnitsSold] ),
        DATESBETWEEN( DimDate[DateKey],
        NEXTDAY(
        SAMEPERIODLASTYEAR(
        LASTDATE( DimDate[DateKey] ) ) ),
        LASTDATE( DimDate[DateKey] ) ) )

        My MATSalesUnits formula seems to be behaving properly. Now onto my formula for Rank:

        RANK=
        RANKX(
        CALCULATETABLE( DimItem, ALLEXCEPT( DimItem, DimItem[Set] ) ),
        [MATSalesUnits],,1,Dense)

        I think my Rank measure behaves correctly as well. When I show all SKUs and this measure in a PowerPivot table and slice by any value in my date table, I am getting each product’s rank within its respective department using moving annual sales units totals.

        What I want to show next for each item is whether its rank falls in the 25th, 50th, 75th, etc. quartile of the product’s department (set). My logic here is to try to find the rank that corresponds to each breakpoint by finding the highest ranked product for each set, and then if the item in question’s rank is equal to or less than .25 * the max rank within the set, then the product falls in the low percentile bucket (a “D” item).

        This (I believe) is where my process breaks down. How can I find the max rank for each item’s set and use that value in a formula to calculate the set’s 25PctRank_INC? COUNTROWS() is yielding inaccurate breakpoints for me because I am using the optional “Dense” enumeration in my RANKX() function, which I feel is better suited for my project.

        Thanks again for taking the time to post excellent material. This blog has been a tremendous resource to me over the last year.

        Kyle

  2. Colin Banfield says:

    If you need to show a dense rank measure, it should be calculated separately from the rank measure used for the percentile calculation. The dense option cannot be used in the rank measure for calculating percentiles – you’re guaranteed to get incorrect results.

    • Kyle says:

      Thanks, Colin. I see that now.

      Is there any way to bake the context of each product’s “set” into the rank measure itself? The Rank values change once I remove the Set dimension from the powerpivot table. I’d like for the measure to return the same values regardless of whether the “set” dimension is present in the query.

      Sorry for all the questions, but this is a REALLY big help.

  3. Colin Banfield says:

    If I understand you correctly, you want to maintain the set rankings in the PivotTable even though the set dimension is not present to provide context for these rankings?

  4. Kyle says:

    Hi Colin,

    Yes, that’s what I am trying to do.

    Thanks.

    Kyle

  5. Colin Banfield says:

    Without the “set” dimension to add context to the rankings, how would the user know what the rankings relate to, when the rankings are displayed in the PovotTable?

    • Kyle says:

      The user won’t actually see the rank for each item individually. I really just want to show, for example, the value of our store’s inventory at replacement cost over time, with items grouped by their percentile rank. I’m trying to graphically answer questions such as “what percentage of our investments were in “A” items in March? How does that compare to the following month?”.

      I guess you could say that I am trying to generate a fact table containing each product’s percentile rank for each day. If I were to literally generate a fact table and store each product’s rank for 365 dates, my .xlsx file would be enormous (I have about 65,000 products). I am hoping to be able to avoid having to create such a table by using a measure, but perhaps it is not actually possible to do what I am after.

      Thanks.

  6. Colin Banfield says:

    Measures are entirely dependent on data from the source, so they cannot store historical values. It’s a typical function of a data warehouse to store historical inventory balances at a frequency dependent on the grain required for analysis. In the absence of a formal data warehouse, you will need some tool that can copy data from your transactional database into a historical data store. Even so, trying to analyze the ranks for 65K products every day of the year would be quite overwhelming. What action would you take if, for example, a product ranked 60000th today and 60001th tomorrow, and then 59999th the day after? It would be easier if you can analyze your product ranks monthly, and easier still if in addition, you can focus on your top x and botton x products. I understand though that in your scenario, neither of these options might be possible.

    • Nobody says:

      I don’t know, imagine in a online shop, all the laser printers are past the 15kth position in the ranking. Maybe they would wish to know that to make a decision whether to keep offering them or not

  7. brianhadley says:

    Hi, realize this is a fairly old article, hopefully someone can help – I am looking to do similar to the above in terms of providing a rank when an additional dimension level is added (i.e. “Set” above), but I also want to be able to provide the rank of the individual row against the entire population (at a certain grouping level).

  8. Gillis says:

    You saved my life with that rankX(Calculatetable(….))) trick. I have been researching how to do this for 2 days!

    Thanks :)

  9. Leonel says:

    Thank you so much for the rankx(calculatetable(…))) formula. Just what I needed to finish a project. Thank you, thank you!

Leave a Comment or Question