Out of 407 Total Combinations of Subcategory and Region, These 8 Stand Out This Month
(These 8 Combinations Differed GREATLY from their Respective 12-month Averages)
A Post Long in the Thinking: Formulas that “Scan”
(What’s this, you say? An actual “how-to” post from Rob? Well YEAH, it was about time!)
DAX is fantastic at calculating numbers for us, and producing metrics that drop jaws all over the world. It is, quite simply, The Best Thing Ever. You want an all-up metric that properly accounts for seasonality and price changes simultaneously across all divisions of your organization? That thing you’ve always talked about like it was The Holy grail? Well, it’s really no problem.
But I’ve now had a long obsession with making DAX do more than that for us. I want it to not just produce the metrics, but also to then robo-scan said metrics on our behalf and report back on interesting things that it found. I don’t necessarily get too many chance to indulge this desire, and that’s a shame, so today I’m gonna use the old reliable AdventureWorks fake data.
Not Just “Off the Average,” But “Well Outside Normal Variation”
The basic idea here is “alert me if something has changed dramatically.” If there’s a corner of my business that has spiked or crashed in a big way, I want to know. If something has dramatically improved in a particular region, I may want to dive into that and see if it’s something we can replicate elsewhere. And if something has fallen off a cliff, well, I need to know that for obvious reasons too. And both kinds of dramatic change, positive and negative, can easily be obscured by overall aggregate values (so in some sense this is a similar theme to “Sara Problem?”)
So the first inclination is to evaluate distance from average performance. And maybe that would be fine with high-volume situations, but when we’re subdividing our business into hundreds or perhaps thousands of micro-segments, we end up looking at smaller and smaller sample sizes, and “normal” variation can be legitimately more random than we expect.
Consider these two series of numbers:
12, 2, 12, 2, 12, 2, 15
7, 7, 7, 7, 7, 7, 15
Both series have an average of 7 over the first six data points, and both end with 15 as the seventh data point. If we just go by the average, that final value (15) is just as “much” of an outlier in the first series as the second: in both cases, our most recent number is 8 units higher than our average.
But our eyes can tell that 15 is far more “outlandish” in the second series than the first, right? We were just cruising along, hyper-consistent at 7, and suddenly, BAM, huge spike to 15. I can definitely imagine wanting to be notified of the second series’ spike and NOT caring about the first series’ spike.
So we bring Standard Deviation into the picture – not to replace “distance from average,” but to complement it. Series one has a standard deviation of 5 and series 2 has a standard deviation of 0. In some sense our eyes are “seeing” the standard deviation difference between these two series of numbers – series one can’t make up its mind, is super hot and cold, whereas series 2 is like a metronome. Standard deviation is, in other words, a measure of how “choppy” a data set is, or what is considered “normal variation.”
Now, bringing Standard Deviation and “Distance from Average” together… if we compare distance from average to standard deviation, we get a sense if this most recent change is a bigger than normal change!
In series one, 15 differs from the average by 2 units more than its standard deviation, or a 60% spike outside of normal variation: (15 – 7 – 5) / 5 = 60%
In series two, 15 differs from the average by 8 units more than its standard deviation, or a 160% spike outside of normal variation: (15 – 7 – 0) / 5 = 160%
The sensitivity thresholds for each biz are going to be different. Hey, 60% may be plenty to catch your eye (probably not though). But 160% is getting closer, and 300% is definitely attention-getting.
Let’s Write Some (Measure) Formulas!
[Most recent 12 months sales avg] :=
) / 12
I’m choosing to use the “greatest formula in the world” flavor of calendar navigation here rather than the built-in functions like DATEADD because we can generally make them work in ALL biz scenarios, but also because in this case I’m just more comfortable being “hands-on” with the logic rather than trusting the magic functions to do the right thing.
This relies on a MonthID column that in my Calendar table:
Subtract the First Year in Your Data Set (2001 in my case) from CalendarYear, Then Multiply by 12 and add MonthNumberOfYear – This Yields a MonthID that You Can Navigate with Arithmetic
Moving on to more formulas…
[Most recent month sales] :=
[Absolute Delta from Average] :=
ABS([Most recent month sales]-[Most recent 12 months sales avg])
I’m taking the Absolute Value here because I want to treat “up” and “down” spike equivalently, AND because STDEV never returns negative numbers.
[STDEV Sales over Last 12 Months] :=
[Total Sales] )
And finally, the “detector” measure:
[Am I Big Change] :=
IF([Abs Delta From Average] >
(20*[STDEV Sales Over Last 12 Months]),
I’ve highlighted that 20 because that’s nuts – YOU WON’T USE 20 THERE! That 20 means I’m saying “let me know when the latest number is TWO THOUSAND PERCENT the normal standard deviation, whereas earlier in this post, I mentioned that 300% is a lot.
Seriously, you need to tune your formula to a sensitivity that fits your situation. And you do that by changing that 20 to other values. Start with 3, because that theoretically means you’ll catch only the most-outlying 0.3% .
Well remember, I’m working with fake data here, and I had to crank the factor up to 20 before I got “small” results (like the 8 combinations of Subcategory and Region displayed in the pivot at the top of the post). Hopefully, you can use something more reasonable, like 3, but play around with it and find out. (You could also use a disconnected slicer and make it a variable that the user can control of course.)
Furthermore, my fake data was loaded with lots of silly cases where Subcat/Region pairs exhibited ZERO standard deviation, which is just not going to happen in real life. So I filtered out any cases where that happened, too – just to make the results look more realistic.
Because My Fake Data Was Loaded with LOTS of Zero-Deviation Examples, I Added the Highlighted Test
(You Probably/Hopefully Don’t Need That Part)
And Now… the Pivot Itself
First I created a Flattened Pivot…
Then added my Subcategory and Territory fields, plus the measures:
But I don’t want to see Subtotals here, so I turn those off…
And then I filter the “inner” field (Region in this case) to only show me [Am I Big Change] = 1…
And then, because my Calendar table “overruns” my Sales table (it contains dates that “happen” after my last Sales row of data), I add a HasData report filter:
Which itself is a calculated column in my Calendar table:
That prevents me from EVER using a “bad” month (one that contains no data) as my most recent month.
First off, we can add a Month-Year Slicer and go look at outliers in past months as well:
We are NOT Locked into the Strictly Most-Recent Month (July 2004) –
We Can Slice Back and Look at Other Months
(And we see there are more outlier combinations in May than in April)
Furthermore this is NOT “locked in” to Subcategories and Regions specifically. We can try out other combinations on Rows of the pivot, as long as we remember to re-set that [Am I Big Change]=1 filter on the innermost field…
Customer Education Level and Subcategory Work Too – No Formula Changes Required
(Although you might need to tune the sensitivity differently)
Let Us Know If You Use This (or Not)!
We’d be very interested in real-world applications of this technique, because such things simply make us happy when we hear about them. So please circle back here if you have success… but also if you don’t.
Also, any ideas how this can be improved? Flaws in my thinking? Throw those at us too.