“Normalize” Your Measures for Fun and Profit!

 
[chart image]

Can YOU Spot the Green or the Red Line?  (Bonus Q:  Can You Tell
That the Green Line is 50x higher than the Red Line?)

Look Ma!  Two Posts!

After failing to maintain my Tuesday & Thursday posting schedule over the past two weeks (one post total, out of four scheduled), I’m BACK folks.  I mean this week didn’t stand a chance.  BAM!  Nailed it.

OK, enough self-congratulation for meeting the minimum standards I set for myself.  Moving on!

Three Different Scales

The three lines plotted on the chart above are “sourced” from these numbers:

[image: source data table]

These Three Measures Are VERY Different in their Relative “Sizes,”
Varying From 2 Digits to 6 Digits.

Data like that results in crappy charts.  Let’s fix it with some formula magic:

[chart image: the same three measures, normalized to a common scale]

Ah, Formulas Make Everything Better.  Yep, it’s the same data, just “normalized.”

A Word from the Charting Pit of Derision!

Yeah, I hear that chittering out there – the mandibles of the Demonspawn Chart Fiends are clacking out a sound that resembles “Secondary Axis!”

Well I need THREE axes this time.  Is there a Tertiary Axis feature, oh creatures of the dark?  I mean, I seriously don’t even know.  There ARE only two sides to the chart, so it would make sense I guess to NOT offer a tertiary axis.  You’d have to start “stacking” scales side by side and that would probably make Tufte cry.

But I’m intentionally NOT checking whether there is such a feature.  Because honestly I don’t even like the secondary axis feature that much. 

“Oh, did I break your concentration?” sayeth Jules.

Numbers are Numbers.  Visuals are Visuals.  Everyone has a preference.

[image: Vincent Vega]

There’s actually a deleted scene from Pulp Fiction where Mia asks Vincent if he’s a Beatles Man or an Elvis Man.  The principle behind the question is that no one likes both – you either like one or the other.  (Vincent, of course, is an Elvis Man.)

I have a similar theory – that you are either a Numbers Person or a Visuals Person.  When you first get some new data, is your first instinct “I need to crunch this data with some formulas” or is it “I need to get this on a chart?”

Now, of course, a Numbers Person still uses charts, and a Visuals Person still sometimes needs to write some formulas.  But which one is your first instinct?  That determines which one you are.

In fact let’s just poll that out right now.

Anyway, I’m a Numbers Guy.  Charts are very much a Last Step in the Process, if that, in my world.  Heck, give me some conditional formatting in a pivot and I am usually set.

(Tellingly, though, I always NEED conditional formatting before I am happy.  See, even a Numbers Guy can leverage visuals – it’s just that I am more on the numbers side of things.)

So, even in a case with TWO different measures, I am tempted to correct with formulas rather than track down the Secondary Axis feature hiding in its camouflaged lair.

I’m just not that comfortable with charts, really.  I struggle to make them do what I want.  And – I want to work with numbers damnit!  So this is really all probably just personal preference on my part – I just camouflage it by saying things like “charts suck, gimme a formula and the numbers any day.”

Get on with it!

Goodness gracious I’ve been working on this post for 90 minutes and have not shown a single formula.  I mean, polls and doctored photos of Vincent Vega on the throne don’t just make themselves you know!  But geesh, a Numbers Guy should be more…  Numeric.  So it’s formula time.

The Formulas

Basically, we divide each measure by its own maximum weekly value, which puts every measure on a 0-100% scale.  Here’s one of them:

[Sales Indexed to Max Week] =

[Total Sales Measure] /
MAXX(ALL(Calendar[WeekNumberOfYear]), [Total Sales Measure])

Note that Calendar[WeekNumberOfYear] is what I have on rows of my pivot (which “powers” the horizontal axis of my pivot chart).  If you change the field on rows, you need to change that part of the formula.

The other two measures are exactly the same pattern, just substituting their respective base measures for [Total Sales Measure].
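For instance, assuming the other two base measures are named [Active Customers] and [Transaction Size] (substitute whatever your base measures are actually called), the same pattern would look like this:

[Active Customers Indexed to Max Week] =

[Active Customers] /
MAXX(ALL(Calendar[WeekNumberOfYear]), [Active Customers])

[Trans Size Indexed to Max Week] =

[Transaction Size] /
MAXX(ALL(Calendar[WeekNumberOfYear]), [Transaction Size])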

Here’s the pivot:

[pivot image: the three indexed measures, each between 0 and 100%]

All three have been “normalized” to be between 0 and 100% – this yields the useful chart.

Alternative Formulas!  Use the Average!

Maybe 0-100% is too restrictive for you.  Maybe you want to divide by the average instead of the max.

OK, here goes:

[Sales Indexed to Average Week] =

[Total Sales Measure] /
CALCULATE([Total Sales Measure] /
            DISTINCTCOUNT(Calendar[WeekNumberOfYear]),
            ALL(Calendar[WeekNumberOfYear])
          )

Again, if you have something else on Rows/Axis, you need to replace Calendar[WeekNumberOfYear].

This yields the following, slightly different (better?) chart shape:

[chart image: the three measures indexed to their average week]

The Chart is Still Quite Readable, But the Fact that Active Customers Has Wider Variation
Than the Other Measures is no Longer Hidden by Being “Squashed” Into 0-100%

Of course, you could also just use AVERAGEX instead of those CALCULATE shenanigans in the denominator.  In fact, I *did* do that for Transaction Size:

[Trans Size Indexed to Average Week] =

[Transaction Size] /
AVERAGEX(ALL(Calendar[WeekNumberOfYear]), [Transaction Size])

It doesn’t make a huge difference in most cases.  AVERAGEX doesn’t “care” whether certain weeks had higher sales volumes than others – all weeks will be averaged as equals, whereas the fancy CALCULATE approach above computes the average as a grand total ratio, which is inherently weighted.
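And if you wanted the simple, unweighted average for Sales as well, the same AVERAGEX pattern applies – a quick sketch, with a hypothetical measure name:

[Sales Indexed to Average Week (Simple)] =

[Total Sales Measure] /
AVERAGEX(ALL(Calendar[WeekNumberOfYear]), [Total Sales Measure])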

Anyway, Numbers and Visuals people.  Til next time.

18 Responses to “Normalize” Your Measures for Fun and Profit!

  1. Jon Peltier says:

    You could divide by initial value, so everything starts at 100% and goes up or down from there.
    Better, divide by initial value and subtract one, so everything starts at zero and you’re plotting % change.
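
    A rough DAX sketch of that “divide by initial value” idea, using the post’s Calendar[WeekNumberOfYear] setup (measure names just illustrative):

    [Sales Indexed to First Week] =

    [Total Sales Measure] /
    CALCULATE([Total Sales Measure],
              FILTER(ALL(Calendar[WeekNumberOfYear]),
                     Calendar[WeekNumberOfYear] =
                       MINX(ALL(Calendar[WeekNumberOfYear]),
                            Calendar[WeekNumberOfYear])))

    Subtract 1 from that ratio to get the “% change from the first week” version.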

  2. Dave says:

    First off….excellent topic. A dual axis chart can get you into trouble. And with 3+ measures of different magnitude, you are doomed. Normalized (or “Indexed”) charts are the way to go.

    So….I agree with Jon relating to where the index should start. Traditionally (in my small world), the index begins at the beginning of the time period observed. That way, all measures start at “1” (I prefer this over zero….sorry Jon). The best examples of these are Google Finance charts (which, of course, start the index at zero): http://www.google.com/finance?chdnp=1&chfdeh=0&chdet=1380225600000&chddm=97750&cmpto=INDEXDJX:.DJI;INDEXSP:.INX;INDEXNASDAQ:.IXIC&cmptdms=0;0;0&q=INDEXDJX:.DJI,INDEXSP:.INX,INDEXNASDAQ:.IXIC&ntsp=0&ei=R45EUuitHcebqwG0ZQ

  3. LairBob says:

    So, on the one hand, this is a clever approach — I really like the idea of independently normalizing all the data _first_, and then plotting that, rather than letting Excel decide what mins, maxes, etc. to use. (I especially like the idea of being able to explicitly define my own mins and maxes algorithmically. I’ve seen approaches that use VB to control them, but with this approach, you can just define them through formulas.)

    OTOH, this approach does have one huge qualification: It’s only useful when you can afford to ignore the exact values that you’re plotting. For the comparative analyses you’re doing here, it’s perfectly appropriate, but for any application where you’re focused on the precise data points — such as temperatures in an experiment, or stock prices — you still really need your axes (and axis _lines_) to reflect “true” values, rather than these scaled proxy values.

    • powerpivotpro says:

      TOTALLY. And in fact, I was headed that way with the Numbers People vs. Visuals People thing but it started running too long.

      Consider a pivot table with conditional formatting (data bars or color scales in particular) vs. a pivot chart with a “real” axis.

      The PT is “numbers first with a side of visuals.” The PC is “visuals first with a side of numbers.” Each is optimized to show one thing – numbers or visuals – but as a secondary function, can also deliver the other component (just not as well as it delivers the first).

      The punchline of this theory is that I suspect that in cases where the numbers themselves are important, we may find that a PT with conditional formatting is better than a chart with a “real” axis. Not all of the time, but a surprising amount of the time.

    • LairBob says:

      And as a geeky technical aside — while I haven’t tried this specific approach before, I have dealt a fair amount with customizing axis scales, etc., and there’s one useful trick I’ve discovered when it comes to defining mins and maxes.

      For whatever reason, charts tend to “read” better (or, at least, I think they do) when the axes begin and end on logical, modular intervals. (For example, if the min and max values in your chart are “31.6%” and “72.9%”, then have your axis scale start at “30%” and go up to “75%”.)

      There are too many possible combinations for me to anticipate and define the formulas here, but it basically always comes down to an exercise in rounding up and down, using modular arithmetic. For example, using my hypothetical data above…

      MinActual = 31.6%
      MaxActual = 72.9%
      Modulus = 5

      you could define those axis values with the following formulas…

      MinScale =INT(MinActual/Modulus)*Modulus
      MaxScale =(INT(MaxActual/Modulus)+1)*Modulus

      The integer quotient of 31.6 divided by 5 is 6, so when you scale that back up by 5 again, you’ve basically just shaved 31.6 down to the nearest integer that’s evenly divisible by 5.
      To define the max, you do the same thing, but you add 1 to the quotient before you multiply it back up, so instead of rounding 72.9 down to 70, you’re padding it up to 75.

      For any internal intervals — like tick marks, or tracking quantiles — if you want them to just fall on your modulus, then calculating them is pretty easy. You just keep adding the modulus to your MinScale. If you wanted to use a different interval than your modulus (for example, if you always wanted to have 10 intervals, no matter what the max and min), then you would use something like “IntervalCalc = (MaxScale – MinScale)/10”.

      • jonpeltier says:

        LairBob -

        I wrote up a couple more general techniques for calculating axis scale parameters. They calculate the tick spacing too, which is then used like your modulus to calculate the upper and lower limits.

        Calculate Nice Axis Scales in Your Excel Worksheet
        Calculate Nice Axis Scales in Excel VBA

        • powerpivotpro says:

          Jon – after seeing this, the first thing I did was go look at the Axis Options, hoping that Min and Max were refedit controls.

          And… they are not. So unlike the “normalization” stuff, the min and max axis values have to be manually entered, no matter how sophisticated the calculations are.

          So Jon, how hard (if at all) have you pushed MS, over the years, to turn those text boxes into refedits? Seems like a natural advancement.

          Gah, does this mean I’m starting to care about advanced chart scenarios?? Shit :)

          • jonpeltier says:

            Rob -

            It’s taken 15 or 20 years to finally get data point labels linked to cells. We’ve been asking to link axis scale parameters to cells for longer than I’ve been an MVP, so since before 2000. We’ve been asking for some kind of NULL() or BLANK() function so we can get a gap in a line chart.

            They do keep tweaking the dialogs, but they seem to get harder to use efficiently. More and more clicks to do routine things, so maybe they’re trying to get us to forget about real enhancements.

            But it’s possible to use VBA to update chart axes when the worksheet recalcs, as I described in an ancient tutorial, Link Chart Axis Scale Parameters to Values in Cells (VBA).

  4. Drew says:

    Does it make sense to express each series as a z-score and confine every measure to one axis of deviations from each series’ mean (+3, +2, +1, 0, -1, -2, -3)?

    • powerpivotpro says:

      That sounds fancy. It goes over my head, though :). Maybe Jon is still monitoring this comment thread and will weigh in, too.

      • Drew says:

        It’s close to what you were doing… take the series and find its average and standard deviation. For each element, subtract the average and divide by the SD, compressing the series into deviations from the mean. The mean is expressed as zero on the y axis, so each series can be evaluated by how far away a score is from its mean value in SDs (deviations from the mean), not the score’s absolute value. (It hurts my brain to see 250% :-)
        Absolute values could be displayed too, so the z score is the line chart and the data labels show absolute values, but that might be confusing…
        Jon’s classes taught me everything I know about charts, best training money I ever spent.
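
        For reference, a rough DAX sketch of that z-score idea, reusing the post’s Calendar[WeekNumberOfYear] field (measure name illustrative; swap STDEVX.P for STDEVX.S if you prefer the sample standard deviation):

        [Sales Z-Score by Week] =

        ( [Total Sales Measure] -
          AVERAGEX(ALL(Calendar[WeekNumberOfYear]), [Total Sales Measure]) ) /
        STDEVX.P(ALL(Calendar[WeekNumberOfYear]), [Total Sales Measure])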

    • jonpeltier says:

      Rob: Yeah, I’m getting notifications.
      Drew: This is a viable approach for science and engineering, but I don’t think I’d use it for financial data, because most finance people aren’t familiar enough with statistics terminology.
      Instead I’d divide by initial value and plot as a percentage, or divide by initial value, subtract 1, and plot as percentage change.
      And thanks for the plug. When/where did you take my class?

      • I use z-scores when disparate magnitudes need to be displayed on the same axis, e.g., total emergency room visits (big number) displayed by practice size and the percent of a practice’s patients actually going to the ER (tiny number) to find proportionately high ER utilization.
        My instinct is that you’re correct in that
        1) small samples make standard deviations wonky
        2) introducing that level of abstraction (how far above or below average is this data point instead of the difference in dollars) may be a bridge too far for the C-layer.
        I attended the dashboard class you gave with Alex Kerin in Westborough MA on June 2, 2010, a year after I saw Tufte present in New York on July 30, 2009. These two experiences redefined my orientation to data analysis. I learned to love Excel (I am a SQL Server resource) because you showed us so much more than dashboards. You teach ‘meta charting’: reusable techniques that directly apply to our own work, not isolated examples. Indeed, your tutorials are nonpareil. Tufte informed my taste and aesthetic, but you showed me how to do it, and for that I thank you.

  5. Drew says:

    Thanks, I think.
    And please allow me to return the compliment!

  6. Janet says:

    I like it too, and I’d love it if somebody would post a formula for “divide by initial value”.
