Standard Deviation Demystified in Power Pivot

 
Anakin asked questions about Power Pivot's STDEVX.P

“Only a master of evil formulas, Darth!”
”…hmm, which kinda DOES make you the master doesn’t it?  Crap, I give up, just
chop me in half and get it over with, but make sure Luke sees you do it ok?”

Anakin’s Second Line of Questioning

Two weeks ago I posted about a particular student’s knack for making me learn new stuff, specifically the “bubble up ranks” method.  (Ooh, I forgot to circle back and cover the “bubble up exceptions” part – making a note of that now).

Anyway, he also pressed me to teach STDDEVX().  That’s a first.  “No worries,” I said, “it’s just another X function – but only you will know if the numbers are correct, because I don’t know what Standard Deviation truly means.”

Embarrassing isn’t it?  To be a numbers guy and not know what Standard Deviation means?  I was on the varsity calculus team in high school.  I even have a Math major (to go along with my Computer Science and Philosophy majors – jack of all trades, master of none).  How did I slip through the cracks on Standard Deviation?

OK yeah, it measures deviation, variability.  I get that from the name.  And yeah I even “learned” it in Statistics class in college, to the extent I ever showed up at 8 am.  But I didn’t grok it.  Didn’t breathe it in deeply.  Until now. 

It’s all fun and games until the Football Project Returns!


Standard Deviation / STDDEV in Power Pivot

Standard Deviation (in Red) Only “Makes Sense” When Compared to the Average (Green)

Turns out, Standard Deviation is a simple concept!  It has real world tangibility!  No one should be scared of it, at all.

Before I wax philosophical, here are the formulas displayed above:

[Yards per Week]=

  [Yards Measure] / DISTINCTCOUNT(Schedule[WeekNumber])

      In English:  “take the total number of yards this player earned over the season, and divide by the
      number of weeks in the season.”

[StdDev Yards per Week]=

  STDEVX.P(VALUES(Schedule[WeekNumber]), [Yards Measure])

      In English:  “step through each week of the schedule and see how many yards this player gained in
      each week.  That results in a set of numbers.  Then take the standard deviation of that set of numbers.

Simple huh?

LaDainian Tomlinson had a Standard Deviation of 65.3 in 2003 – What Does that MEAN in the Real World?  Can we TOUCH It?

This is the cool part.  YES, we can absolutely touch it!  Two crucial points “unlocked” the mystery for me forever:

  1. Standard Deviation is “measured” in the same units as the average.  In our example, our average is measured in yards per week.  (He averaged 139.4 yards per week).  We can also say that Tomlinson had a standard deviation of 65.3 yards per week!
  2. We should expect “normal” occurrences to fall in a range that is plus or minus 1 standard deviation from the average.  In our example, in a “normal” game from 2003, we should expect Tomlinson to gain somewhere between (139.4 – 65.3 = 74.2) and (139.4 + 65.3 = 204.7) yards!

An Aside on the Value of Education…

dollarGreenAnakin, my student, taught me those two things in about 15 seconds.  Why (and how) was that simple explanation kept from me all these years?

Let’s put that in perspective.  My four-year private school college tuition from 1992-1996 was valued at $72k (I was on scholarship & didn’t pay a dime, no way my family could have afforded that).  Today that same university charges about $180k in tuition for 4 years.

And I estimate that I learned no more than 12 useful lessons over those four years. 

So in today’s dollars, my student taught me a lesson worth $15k.  Let’s not tell him that.  (OK he’s probably reading this).

Let’s Put That In Practice:  Upper and Lower Bounds

Let’s take point #2 from above and “implement” it in two new measures:

[Yards per Week Expected Upper Bound] =

  [Yards per Week] + [StdDev Yards per Week]

[Yards per Week Expected Lower Bound] =

  [Yards per Week] – [StdDev Yards per Week]

STDDEV Applied:  Dare We Say This Now Gives Us Some Primitive Amount of Predictive Power in Power Pivot?

Tomlinson’s High Average is Built on Big Games, But If You Had to Stake your Life on Someone Gaining at Least 85 yards in a week, Priest Holmes is Your Guy.

Hrm, this smells like another technique that’s relevant to charting.  More on this later.

Important:  Average and STD DEV Must Have the Same “Granularity”

One last point.  In this example, I was taking an average per week.  And my std dev measure stepped through individual weeks.  That is important:

  [Yards Measure] / DISTINCTCOUNT(Schedule[WeekNumber])

  STDEVX.P(VALUES(Schedule[WeekNumber]), [Yards Measure])

If I wanted to measure variability of yards per season, or yards per play, I would need to change both of these metrics to that same “per thing” methodology before I could use them together.

In other words, standard deviation is just as “tied” to a particular granularity as an average would be.  It is not a “universal” measurement.

26 Responses to Standard Deviation Demystified in Power Pivot

  1. Niraj Krishna says:

    Very cool article, but there is a common problem made here, I think. If you are going by top running backs based on yards/week in 2003 then where is Clinton Portis? If I remember right, he should be number 1 or 2.

  2. Jeroen says:

    That’s for your clear and simple explanation. I learned a lot about this usefull example.

  3. Averages with a high std dev suggest a lack of control. Whether you’re looking at a pro sports player or class grades or chemical analyses, a high std dev relative to the average should immediately make the question “why?” pop into your head. In your numbers above you might conclude that Holmes is a more consistent player than Tomlinson. A student with erratic grades may be smart, but lacking in focus. A chemical analysis that has a high deviation may need better sampling methodology or more sensitive test equipment to become more reliable. Standard Deviation is such a wonderful tool! Obi-wan definitely used it, probably enhanced by the force, when training Luke. It’s all about control!

  4. Niraj Krishna says:

    You have to be careful here. The divide by distinct count of schedule weeks is not right. Some players do not play every week due to injury, PED suspension, etc. PowerPivot helped me prove every metric wrong at my company because they used this same logic. This is also why you are missing Clinton Portis. He averaged more yards per game than any of these guys that year, but only played 13 games. That kind of insight is pretty valuable. As an operations analyst, I found that managers knew this and would only hire people during the very first week of the month. If they did not, it would bring their monthly performance/employee/day metrics down unfairly. I’ll try and post the correct logic whenever I get to a PC.

    • powerpivotpro says:

      Agreed, I just didn’t want to overcomplicate the post with a COUNTX(…ISBLANK()) sort of denominator. Of course, in some “business” senses, a blank SHOULD count against the player. A player who misses games due to injury or suspension is inherently less consistent, etc.

      Regardless, the overall point is that you always have to be conscious of those denominators and what they mean. We are in agreement there.

      But the reason Portis doesn’t appear in this list is simpler – I just did a TOP 5 by Total Yards, so he wouldn’t show up regardless.

      • Niraj Krishna says:

        Ahhh I see. I thought you were doing this like a fantasy fb analyst! But I suppose you are right, probably better to keep it simple. And no way would I tell you to use a different STDEV… I was clueless how to use it until this article! Now you have shown me the direction to do some incredible segment pricing analysis with PP. I am also thinking of ways to detect fraudulent numbers because of this article…. You have no idea what you have created.

        • Niraj Krishna says:

          And COUNTX? Maybe you need to cover that one. I use DAX queries. With that said, I am sure I have screwed up the syntax severely, but this is done freehanded (A strange hobby of mine)

          Yards Per Game:=
          SUMX([Yards Measure]/[Games Played], ADDCOLUMNS(
          SUMMARIZE(
          ‘Stats Fact Table’,
          ‘Stats Fact Table’[Player]
          ),
          “Games Played”,CALCULATE(
          DISTINCTCOUNT(
          Schedule[WeekNumber]
          ),
          FILTER(
          ‘Stats Fact Table’,[Yards Measure] >= 1
          ),
          )
          ))

        • Brad Mendelson says:

          Hey Niraj I am trying to use this with the same application – detecting fraudulent numbers. I’ve read the article but am struggling still with how to apply it correctly. Have you had success? Could you post an example?

          • Niraj Krishna says:

            Hey Brad, I ended up leaving the company I was going to do this with. But here is how I was going to apply it:

            We picked up waste material at our customers, but often times the rep that picks it up may not be truthful in how much he picked up. So I was going to determine an amount/service average along with a standard deviation for each customer. I was then going to check to see how much deviation occurred when a different rep picked up the same customer. If I found anything out of the ordinary, I would see if this increased deviation occurred for the same rep across several customers.

            It is hard to cover up fraudulent behavior when someone else has to continue the same behavior.

            I hope that makes sense. Maybe if you can describe your situation, we can lay out the roadmap for you to get started.

          • Brad says:

            Thanks Niraj! Hope you ended up in a better position. I work in insurance and am trying to use it to determine outliers in our network of doctors. Essentially what I want to know is the average cost per member per doctor. From there I want to see the standard deviation of the network and see if some of our doctors have a much higher or much lower cost than their peers. Where I am struggling is figuring out what to put in the formula. In the example above it was weeks of the season, I tried putting doctor over average cost per member, but on my pivot table it appeared to give a deviation for each provider. I am sure this is an easy fix, any help is greatly appreciated!

  5. jonpeltier says:

    Two small points.
    1. The standard deviation can be seen as an estimate of the average deviation between the actual values and the average.
    2. In most cases (e.g. Statistical Process Control), statisticians use ±3 SD as the bounds of normal behavior, which means that more than 99% of the expected deviation is included. With a sample size of 16 games, ±3 SD is enormous.

    • powerpivotpro says:

      Excellent, thanks Jon. In that light I think we can see that football game performance by individual players is a terrible example for me to have chosen. No matter how many games of data are used, players are always capable of 0-yard output. In other words, we’re never getting to that 99% threshold, and plus/minus 3 SD will always be massive.

      Need another data set.

      Also – how the heck did you put the plus/minus symbol in your comment? :)

    • jonpeltier says:

      A nice way to plot this would be to plot markers for the average values, and ± error bars for the standard deviation:

      When viewed this way, you can see that statistically the five running backs here are indistinguishable.

  6. lwkochis says:

    Since we are on statistics, is there a way to calculate Pearson Correlation or Simple Linear Regression?

  7. Matt says:

    A useful post to note, thanks. We recall that std dev is (really) important but shy away from using it because we forget how to apply it – to make it tangible… Your example, along with @jonpeltier’s and the other comments, show how readily that can be done. Thanks again.

  8. vic says:

    im dying trying to figure out what is going on with this whole dax thing. i have tried writing this data so many different types of ways. i keep getting a message saying things like data is to large or too small. usually when i have a problem i google them but its not working. im trying to figure out the Loadfactor of certain flights on specific days. but some days dont have load factors on them because the id changed . will this effect my data? and why cant i get the SD? any help will be greatly apreciated

    heres my error

    ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (13, 62) An argument of function ‘SQRT’ has the wrong data type or the result is too large or too small.

    heres the calculation i tried

    CALCULATE(SQRT((SUM([BLF])-(COUNT([BLF])*((SUM([BLF])/COUNT([BLF]))^2/COUNT([BLF])-1)))))

  9. Vic says:

    heres te second try
    =(SQRT((SUM([BLF])-(COUNTROWS(([BLF])*((SUM([BLF])/COUNTROWS(([BLF]))^2/COUNTROWS(([BLF])-1))))))))

    gave me
    ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (13, 81) The value for ‘BLF’ cannot be determined. Either ‘BLF’ doesn’t exist, or there is no current row for a column named ‘BLF’.

    • powerpivotpro says:

      Let’s start here: COUNTROWS takes a table name, and SUM takes a column.

      It’s also SUPER good form, in my experience, to always include the table name when referencing a column.

      So try references like: COUNTROWS(Table) and SUM(Table[Column]) and see if that gives you a different result.

  10. VIC says:

    so the its still not working. do you mind doing me a favor and writing it for me to copy? the table is called “vic-file june” and the colum that i want the data is “BLF”

  11. Matthew says:

    Not Sure if you would do it this way, but replace F:F with any column that will have a value in it…
    =(SQRT((SUM(F:F)-(SUMPRODUCT(–(F:F””))*((SUM(F:F)/SUMPRODUCT(–(F:F””))^2/SUMPRODUCT(–(F:F””))-1))))))

  12. Seth Murray says:

    What if I am trying to use the value I get for standard deviation in the context of a filter. I am trying to use standard deviation to filter out the outliers that show up in my graph. I currently only have a 2 column table. One with the transaction date of a service receipt, the other being the unit cost for that particular transaction. Depending on QTY ordered, occasionally unit cost is more then 3 standard deviations away. I only want the value to show in my table if it is within 1 or 2 standard deviations. Is something like this possible? My current attempt at the formula simply flattens my entire table down to 0 rows.

    =CALCULATE([Unit Cost],FILTER(SERVICE_RECEIPT,[Unit Cost]>[STD Unit Cost]))

  13. Claire says:

    Happy to see that some old post are still visited. Also a bit stuck today trying to calculate a standard dev in a project of calculating monthly High-Low sales based on the past 3 years’ data & by doing so, check whether the current monthly forecast is outside the historic range (Hi or Lo). Very interesting project, but I’ll be happy when I’ll find the solution :-) One side comment overall, i read all over the website (which I follow since a year or so) that PowerPivot is “magical, fast and easy”. I do agree overall, but please, do not underestimate the learning curve and that it often takes a bit more than “15 minutes” (ref to this post and today’s post) to make things perfect and work great. Cheers to all and by the way, I am interested by a PowerPivot group in Belgium (around Brussels) if s.o. reads this… thx

Leave a Comment or Question