Percentile Measures in DAX – Errata

May 22, 2012

Guest Post by Colin Banfield [LinkedIn]

In September of last year, I posted two articles on creating percentile measures in DAX.  See Creating Accurate Percentile Measures in DAX – Part I and Creating Accurate Percentile Measures in DAX – Part II. About three months after I posted Part I, Richard Mintz left a comment indicating that he was having trouble getting correct results when his data sets had a wide range of values and many duplicates. I haven’t been receiving notifications when comments are left, so it’s purely by chance that I saw Richard’s comment recently.

During the process of building the measures, I did do some testing with duplicates, but the testing was minimal and involved only duplicates at the 25th or 50th percentile mark.

To check out the reported issue, I built a new data set with many duplicate values. Figure 1 shows the results of the percentile calculations in this scenario:

Percentile Measures Update Figure 1

Figure 1

Huh? What’s going on here? The 50th and 75th percentile calculations are correct, but the 25th percentile calculation is totally and utterly incorrect! I created several intermediate measures along the way to the final result, so it was trivial to track down and correct the problem.

Recall for a moment how we calculate the percentile:

1) First we rank the data (Sum of Value) in ascending order.

2) We then determine the rank for given percentile, using the formula n=P(N-1)/100+1 (for percentiles inclusive), where n is the rank for a given percentile P, and N is the number of rows in the dataset. Let’s call this measure generically, PctRank.

3) The calculation in step 2 could result in a rank that’s a decimal value. Therefore, we find the data values corresponding the integer ranks below and above the calculated number from step 2. Let’s call these measures generically, PercentileDown and PercentileUp

4) We then interpolate to get the percentile numbers shown in Figure 1.

Figure 2 shows all of the intermediate formulas created for the 25th percentile, plus the Rank measure.

Percentile Measures Update Figure 2

Figure 2

From Figure 2 you can see:

1) The 25th Percent rank is 11.25.

2) The 25thPercentileDown measure is empty!

The formula I used for PercentileDown was

=MAXX(

      FILTER(

          ALL(Data),

          [Rank] = ROUNDDOWN( [25PctRank_INC], 0)

        ),

     [Sum of Value]

   )

The formula is filtering the data for [Rank] = 11. However, in Figure 2, there is no rank corresponding to 11, so the calculation returns blank. In Figure 1, we got correct calculations for the 50th and 75th percentile only because there were ranks corresponding to the 50thPercentileDown and 75thPercentileDown calculations. In other words, the correct calculations were a coincidence based on the specific data set – not the type of thing to inspire confidence in the technique.

What about the PercentileUp measures? They were correct in all cases because I used a different formula:

=MAXX(

      TOPN(

          ROUNDUP([25PctRank_INC],0),

          ALL(Data),

         [Sum of Value],1

       ),

     [Sum of Value]

   )

The situation is quite ironic. In my original post (Part I), I made the following statement:

“For 25thPercentileUp, you may be inclined to use a similar formula, substituting the ROUNDDOWN function for ROUNDUP [Ed. Here I’m referring to the formula I used for PercentileDown]. However, in the event of ties for the 25th percentile, the filter will be incorrect. This is so because [Rank] will calculate the same rank for ties (11 in this case), and ROUNDUP([25PctRank_INC],0)will be = 12. We can get around the problem by filtering the table using the TOPN function instead…”

And this was the extent of my pathetic testing with duplicates (ties). I completely overlooked the obvious situation shown in Figure 2, where you can have significant skips in the rank numbering, and depending on the data set, the rank number that the formula is looking for may not be present. This oversight is phenomenally glaring. 

Anyway, the solution to the PercentileDown is to simply use a formula similar to the one we for PercentileUp. So the 25thPercentileDown formula now becomes:

=MAXX(

      TOPN(

         ROUNDDOWN([25PctRank_INC],0),

          ALL(Data),

          [Sum of Value],1

        ),

     [Sum of Value]

   )

And similar corrections must be made for the 50thPercentileDowm and 75thPercentileDown measures.

TOPN returns the top 11 values (the result of ROUNDDOWN(11.25,0), with ties. In the data set shown in Figures 1 & 2, the top 11 values include all of the data from the 1st to the 7th rank (inclusive). MAXX returns the largest value in the TOPN set, which in this case is 960.46.  The corrected calculations are shown in Figure 3:

Percentile Measures Update Figure 3

Figure 3

Thanks again to Richard for pointing out the problem.

Note: Another commenter, Trevor Carnahan, suggested using ALLSELECTED() instead of ALL(), which I’ve used in my formulas. ALLSELECTED() is particularly useful if you have slicers, and want to take the slicers’ filter contexts into consideration, while removing the filter contexts from the PivotTable rows and columns.


Commission Calculations in PowerPivot Part 2

May 15, 2012

Guest post by David Churchward [Twitter]

Team and Manager Commissions Report

You may recall in my last post, COMMISSION CALCULATIONS IN POWERPIVOT, we got to the point where we could dynamically calculate the sales value and attributable commission rate that should be applied based on time, value and team parameters, reading from a Rates table.

In this post, we’ll complete the commission calculation, providing a different value for individuals and the team manager.

Where are we?

Just to recap, we got to the point in my last post where we had calculated [Sales_Value] and [Comm_Rate] as below:

Read the rest of this entry »


Commission Calculations in PowerPivot

May 9, 2012

Guest post by David Churchward [Twitter]

Commissions Report

Firstly, I have to be clear that I’m not presenting a “one-size-fits-all” approach to sales commission calculations here.  That wouldn’t be possible because commission schemes vary extensively and, in my experience, some schemes aren’t even based on a logic that can derive a mathematical answer!  However, I’m presenting an approach here that will hopefully provide a template approach that can be modified to a number of different scenarios.

The Target Outcome

The scheme that I’m using in this example operates as follows:

  1. Commission is paid monthly based on the achievement in that month
  2. As a salesperson sells more, then accelerators trigger.  That is to say, for example, a salesperson may receive 1% of sales up to £10,000 and 2% of sales between £10,000 and £20,000.  The 2% is payable on the whole value.  Therefore, if a salesperson sells £11,000 then they would receive 2% of the full £11,000
  3. Percentages and bands can change monthly at management discretion
  4. This is a monthly threshold so each salesperson is reset to zero at the start of each month.
  5. Each product group carries a separate set of bands and rates
  6. Managers receive commission at a different rate to the sales team based on the total sales for their team.

Read the rest of this entry »


Mini-Post 1 of 2: A “Better” Version of TODAY()

May 8, 2012

 
image

Two mini-posts today:  this one and then an updated ROI Quadrant.

This one falls under the “quick tip” category (and the “MacGyver” category were I to have one).

How “Up to Date” Is Your Data Source?

Sometimes you’ll find yourself wanting to know the absolute latest date in your data, so that you can calculate things like “how much of the current month has already elapsed” for instance.

And it’s tempting to use the TODAY() function to do that, but oftentimes your data “lags” behind today – maybe by a day, maybe more.  So TODAY() becomes pretty worthless for that.

So, what do you do?  I’ve recently started using a simple trick that I like better than my past approaches, so I thought I’d share it.

Read the rest of this entry »


Weighted Averages: Another Use of SUMX()

May 1, 2012

Another Question from the Mr. Excel Forums

Got a question on the forums the other day.  It took some extra twists and turns but the simplest version of the question is worth covering here:  how do I perform a weighted average?

Let’s say you have some data about ZIP (postal) codes in the United States:

image

And you build a simple pivot that shows total population and median age for each ZIP, grouped by that last column, which is how fast the population of that ZIP is growing:

image

Simple SUM Measure for Population, Simple AVERAGE for Median Age

But that “Average Median Age” measure is just:

[AVERAGE MEDIAN AGE]=
   AVERAGE(Zips[Med Age])

which treats all ZIP codes as equal, even if they have wildly different populations:

Read the rest of this entry »


Aged Debtors–Dynamic Banding in DAX

April 17, 2012

Guest post by David Churchward [Twitter]

Aged Debtors Report

Having hit the P&L and Cash Flow in previous posts, it seems only reasonable to move on to Balance Sheet aspects.  The die-hard “non-accountant” Excel Pros and programmers amongst you are probably experiencing a sudden bout of Narcolepsy, but let me assure you that this gives us the perfect opportunity to explore dynamic banding in DAX so please prop those matchsticks in place for the minute!

In this post, I’ll construct a dynamic Aged Debtors report.  This will calculate debt ageing values, showing outstanding debts by customer at user defined points in time.

The Accounting Terminology Bit

Let’s get this done quickly.  Debtors are those customers (normally) who owe the company money.  This is often legitimate as it is normal to offer a customer a period of time to process your invoice and pay.  However, it’s not unusual for this period to extend for a number of reasons.  I won’t go into this now because I could waffle on too long with experiences in this area and I’ve never even worked in Credit Control!

In short, an aged debtor report details all outstanding debts, by customer, categorised into timeframe buckets to show how old the debt is.

Historically, it wasn’t unusual for an aged debtor report to be something that could only be run as at “now”.  This means that accountants everywhere had a tiny window of opportunity to capture this information and preserve it in their archives with a wealth of information lost in hidden folders everywhere.  PowerPivot gives us the opportunity to derive a report that holds all of this information with historical ageing available based on user selection and the ability to profile customers to see who tends to take their time to pay and maybe highlight some reasons why.

The Dataset

The transaction process probably looks something like this:

  1. Invoice is created at a point in time
  2. Payment is received from the customer and a payment transaction is created
  3. Payment is “allocated” to the invoice

Allocations Fact Table

Aged Debtors Allocations Fact Table

My core table here is an Allocations table.  This table details all relevant transactions including invoices (and credit notes), payments and allocations.  These transactions might look like this:

Aged Debtors Invoice

An invoice is raised for £10k on 15th March 2012 with an invoice reference (or invoice number if you like) of 1100.

Aged Debtors Payment

Payment is received one month later with a reference of PAY1

Aged Debtors Allocation

The invoice and payment are allocated against each other meaning that the system has linked the payment PAY1 with the invoice 1100 and the outstanding debt is therefore zero and the payment is fully cleared down.

Customers Table

My customers table simply holds customer code and name, but it might hold other information about the customer such as customer contact and credit limit information.  Unfortunately, I can’t display this table here as it holds real customer information and a scrambled name that I ultimately show on this report…sorry! (it sort of defeats the object if I give that away in a screenshot!!)

Bands Table

Outstanding debts need to be categorised into distinct bands.  This is normally representative of months, but it can be anything.  I’ve created a linked table in Excel which provides a Name together with From and To parameters.  For example, 0-30 represents debts that are 0 (From) to 30 (To) days old.

Aged Debtors Bands Table

Dates Tables

I use 2 dates tables.  One is to link the allocation date to and the other is linked to the invoice date.  I’m using V1 for this so I need both tables, but in V2 you can hold one dates table and then use the V2 function USERELATIONSHIP to determine which field to link through to the Dates table on.

These tables are simply a list of sequential dates covering the timeframe of my analysis.  The invoice dates table holds an additional column called Month_End_Date simply to use as a logical slicer to put on my report.

Table Relationships

My relationships look like this:

Aged Debtors Relationships

You’ll notice that the Alloc_Fact table links through to Customers and both of the Dates tables but there is NO relationship to Bands.  Bands is a stand alone table that is used for report headings and parameters without any defined relationship.

Onto the DAX

My report is going to hold a slicer for Month_End_Date.  This is an “effective” run date.  That is to say that the user is going to select a time point up to which transactions and allocations will be included, but any transactions after those dates should be ignored.

Therefore, I create a measure called Debtor_Value which aggregates underlying transactions and allocations up to that date.

Debtor_Value

=CALCULATE(

SUM(Alloc_Fact[Value]),

FILTER(ALL(Alloc_Dates),

COUNTROWS

(

FILTER(Alloc_Dates,

EARLIER(Alloc_Dates[Month_End_Date])<=MAX(Alloc_Dates[Month_End_Date]))

)>0

  )

)

This measure uses a FILTER() on the Alloc_Dates table using COUNTROWS() to specify those dates that fit a criteria of being prior to the date selected on the slicer.

Why use FILTER() and COUNTROWS()?

It’s a reasonable question to ask why I can’t use a simple Month_End_Date filter as opposed to using FILTER() and COUNTROWS().  I have to hold my hands up and put a call out to Rob and The Italians for a technical reason, but in my mind, I know that I’m using the same field (Month_End_Date) in the evaluation, once from the slicer and once from my row set.  To define between the two, I need to use EARLIER() to reference my row set and MAX to call the value from my slicer.  To do so, I have to use a filter context that creates a table expression to identify all of the dates that ARE to be used in my measure.

Incidentally, if you want more on EARLIER(), Rob’s Rat Analytics sums it up.

Thanks to Alberto Ferrari here as this filter expression is something I picked up from him on the Mr Excel forum.

Aged Debtors UnAged

Irrespective of the technical explanation, I know that these values are correct, but the allocations to specific ageing buckets is obviously NOT.

Ageing Debtors

In order to allocate invoices to the correct ageing buckets, we need to calculate how old the invoice is and then calculate which bucket this relates to.  The age of the invoice has to reference the selected Month_End_Date.

Aged_Debtor

=IF(COUNTROWS(Bands)=1,

CALCULATE([Debtor_Value],

FILTER(Inv_Dates,

MAXX(Inv_Dates,

MAX(Alloc_Dates[Month_End_Date])-EARLIER(Inv_Dates[Date])

)>=MAX(Bands[From])

&&MAXX(Inv_Dates,

MAX(Alloc_Dates[Month_End_Date])-EARLIER(Inv_Dates[Date])

)<=MAX(Bands[To])

)

),

[Debtor_Value]

    )

The COUNTROWS() evaluation determines that we have one band to work with.  This eliminates any confusion over the aggregation level that we’re working at and ensures that we will get one result for the FROM and TO elements of the band.

CALCULATE() then takes over to use our previous measure [Debtor_Value] within a filter context that determines the invoices (or rather invoice dates) to use.

The FILTER() element of this expression is where things get tricky.  If I’m honest, in the first iteration of this measure, I tried to filter invoices.  My rationale was that each invoice carried a specific date.  From this date, I could determine how old the invoice was at any given point in time.  And, you’ll be pleased to know that it worked.  But there was a problem!  Because I’m carrying over 300,000 invoices in my dataset (which isn’t actually that many), any SUMX or MAXX is going to iterate over 300,000 records and therefore may not be very quick.  In addition, as my invoice count increases, so my performance is going to continue to degrade.

In an enlightened moment, I realised that I only had roughly 1,500 dates and the number of dates increases at a much slower rate than the number of invoices.  Could I use the same approach purely on dates and ignore the invoice number?  Ehhh………YES!

Essentially, MAXX is iterating over the table Inv_Dates and working out if the difference, in days, between the selected Month_End_Date on the slicer and Inv_Dates[Date] is greater than or equal to the Bands[From] AND also less than or equal to the Bands[To].  Where this is the case, this measure uses the value from [Debtor_Value] subject to the FILTER().  Where this is NOT the case, the measure uses the value from [Debtor_Value] but DOESN’T apply the FILTER() expression.

And, there you have it!

Aged Debtors Report

What’s Next

With these measures in place, wouldn’t it be great to mash this up with an invoicing profile and maybe even a dissection by product or industry sector to give a true profile on a customer?  I’ll be back with a load of mashups on this as soon as I can.

Incidentally, as I alluded to earlier, this analysis is a real dataset so I unfortunately can’t release the Excel.  Sorry.


Movie Grosses and Inflation: Methodology

April 5, 2012

 
In the voting from the last post, the majority of respondents found the third ranking to be most realistic:

image

I agree – this is the one that I personally found most consistent with reality.  “The Exorcist” surprises me – I did not expect to see it in the top ten – but it was actually higher in the other two rankings.

Mashup of Movie Grosses with Average Historical Ticket Prices

Data Set #1:  Raw Movie Grosses

image

Raw Movie Grosses
(Source:  The-Numbers.com – Click Image for Source Data)

Data Set #2:  Average Historical Ticket Prices

image

Average Historical Ticket Prices
(Source:  BoxOfficeMojo – Click Image for Source Data)

Relate both of these to the Years table:

image

Related the Two Mashup Sets to My Common “Years” Table
(Yes, I Have Many Data Sets Now)

The Measures

Yes, it’s overkill in this case to use measures.  I could just use a calc column.  In fact I probably should.  But I have a thing for measures, they just feel more “portable” over the long haul.

[US Movie Gross] =
   MAX(MovieGrossesUS[Total Box Office])

[Avg Movie Ticket Price] =
   AVERAGE(MovieTicketPrices[Avg  Price])

[Tickets Sold Approx] =
   [US Movie Gross]/[Avg Movie Ticket Price]

Flattened Pivot

One of my favorite techniques, the flattened pivot!

image

Yields…

image

Turn Off Subtotals

To get a truly flat pivot, you then have to turn off subtotals:

image

And that yields:

image

Download the Workbook!

Go ahead and grab the whole thing here:

http://www.powerpivotpro.com/wp-content/uploads/2012/04/ShareableMovies.xlsx 

(This only contains movies and ticket prices though – none of my other goodies like Inflation etc.)

What were the other two rankings?

They were both based on differing measures of inflation.  If I have time tomorrow, I may do a bonus post.


The Correct Usage of EARLIER()

March 8, 2012

 
image

Egg on My Face For Improper Use of EARLIER()

Tonight I was looking at one of my recent posts, the one about “fuzzy” time relationships in calculated columns, and I discovered that I had used the EARLIER() function in places that I did not need to.

OK, well, it’s not a LOT of egg.  I mean, my formulas were all returning the correct answers.  They were just needlessly complex.

Here was the calculated column formula in question:

=CALCULATE(MAX(Events[EventType]),
   FILTER(Events,
          Events[RatID]=EARLIER(Sniff[RatID]) &&
          Events[SessionID]=EARLIER(Sniff[SessionID]) &&
          Events[PostTimeID]>=EARLIER(Sniff[TimeID]) &&
          Events[PreTimeID]<=EARLIER(Sniff[TimeID])
         )
)

Every one of those highlighted EARLIER() functions is 100% unneeded.

If I remove all of the EARLIER()’s, the formula returns the same result.

=CALCULATE(MAX(Events[EventType]),
   FILTER(Events,
          Events[RatID]=Sniff[RatID] &&
          Events[SessionID]=Sniff[SessionID] &&
          Events[PostTimeID]>=Sniff[TimeID] &&
          Events[PreTimeID]<=Sniff[TimeID]
         )
)

And here’s the proof that the EARLIER()’s were NOT needed:

image

Old Formula, Fixed Formula,
and Comparison Column That Proves They Are Equivalent

Why Weren’t They Needed?

Let’s look at the FILTER() in my formula, and focus on just one of the comparisons:

FILTER(Events,
   Events[RatID]=Sniff[RatID]
   …

I am filtering the Events table to rows where the RatID is the same RatID in the Sniff table.

And this calculated column is in the Sniff table.  And there is not relationship between the Events and Sniff tables.

That FILTER() is quite straightforward then – it looks at the RatID in the current row of the Sniff table (since this is a calc column in the Sniff table) and then goes and finds rows in the Events table that have the same ID.

EARLIER() is only needed when you are “jumping back out” of something.  And there isn’t anything to jump back out of here.  Let’s revisit an example of where EARLIER() IS needed:

Flashback:   Simple Use of the EARLIER Function

This is an excerpt from a post last month, “reprinted” here for convenience.

Say I have the following VERY simple table like this:

image

And I want to add a third column that is the total for each customer:

image

The calc column formula for that third column is this:

=CALCULATE(SUM([Amt]),
           FILTER(ALL(‘Table’),
             ‘Table’[Customer]=EARLIER(‘Table’[Customer])

                 )
          )

Note the highlighted part:  in that formula we are filtering on ALL(Table) rather than just the “raw” Table.  Here’s the crux:

When I say FILTER(ALL(Table)), all of my references to columns in Table will have “forgotten” all notion of “current row” and will instead be references to the entire column.  That is because of the ALL().

So the EARLIER() function is my escape hatch that allows me to go back and inspect the current row’s value.

This line of the formula:

  ‘Table’[Customer]=EARLIER(‘Table’[Customer])

Can be understood as:

  ‘Table’[Customer]=CurrentRow(‘Table’[Customer])

EARLIER is Probably Best Understood as CURRENTROW

In fact that’s a better name for EARLIER 99% of the time.  Just think of it as a CURRENTROW function, useful only in calculated columns, and only when you are performing FILTERS on ALL(Table), when you need to “jump back out” of the ALL and fetch a value from the current row.

Yes, it IS useful in other cases.  But I suspect that those other cases are rare enough that rather than a general purpose function like EARLIER, we should have been given a dedicated, easy to understand version named CURRENTROW.  Or maybe we should have both.  Yeah, both.  That would be good.

Why My Formula Didn’t Need EARLIER

Quite simply, I had no ALL() to “undo.” 

FILTER(Events,
   Events
[RatID]=EARLIER(Sniff[RatID])
   …

I was just filtering on the Events table.  Not ALL(Events). 

And even if I had been filtering on ALL(Events), that wouldn’t have impacted the Sniff table, WHICH IS WHERE THIS CALC COLUMN LIVES.  So even ALL(Events) would NOT have required me to use EARLIER(Sniff[RatID]) to get the current row from Sniff – a simple Sniff[RatID] is sufficient.

Whew.  Glad I caught this before anyone noticed. 

But I suspect some people did, and were just polite Smile


Subtotals and Grand Totals That Add Up “Correctly”

March 1, 2012

 
Busy week here at the MVP Summit in Redmond.  As usual, I can’t tell you anything I have learned this week – it’s strictly under NDA.  But I have seen some really exciting things – at one point today I screamed out loud “YES!” in a packed room, and a little while later I said “I don’t think we can be stopped if we had something like this.”  Where the “we” means us – me and you.

Enough of that.  Here’s one from my archived list of “topics to cover on the blog at some point.”

Sales per Day Measure

Check out this relatively simple pivot:

image

Note that the subtotals for [Sales per Day] do not equal the sum of their parts.  2002’s total is $17,891 but if I add Mountain Bikes plus Road Bikes myself, I get a number closer to $19,000:

PowerPivot totals don't add up how do I fix it?

Totals don’t add up!

The formula for [Sales per Day] is:

[Sales per Day] =
[Sales] / COUNTROWS(DISTINCT(Sales[OrderDate]))

Which, in English, equates to “My Sales measure divided by the number of days for which I have sales transactions.”

Why Doesn’t it Add Up?

Well, if the “number of days for which I have sales transactions” is different for Mountain Bikes vs. Road Bikes, that will cause this problem.  I can illustrate by adding that as a separate measure:

[Days I have Transactions] =
COUNTROWS(DISTINCT(Sales[OrderDate]))

image

The Reason Why Sales per Day Doesn’t “Add Up”

So I only sold Mountain Bikes on 96 days in 2001, and Road Bikes on 180 days, but on 181 days, I sold at least one bike of any sort.  This is why it doesn’t add up.

“Shouldn’t you divide by Calendar[Date] instead of Sales[OrderDate]?”

Some of you will no doubt have this question already:  Isn’t it more accurate for [Sales per Day] to be based on the number of days in the calendar rather than the number of days on which it sold?”

And my answer is “probably, but it depends, AND this is a good example of a problem you WILL hit sooner or later in completely legitimate cases.”

For instance, if I opened my store in July 2001 and didn’t offer Mountain Bikes at all until September, I don’t want to divide either of my Sales amounts by 365, and I certainly don’t want to divide Mountain Bike sales by 180.

Even then though, I should probably have a separate table like Inventory or maybe [Start Date] and [End Date] columns in my Products table, and use those to create a measure named [Days Offered] measure, and use THAT as my denominator instead.  That would be the fairest/most accurate approach. 

But it would STILL have this same “doesn’t add up” problem.  So let’s move on to a fix.

Forcing the Totals to Add Up

Let’s write a new measure:

[Sales per Day FIXED] =
IF(COUNTROWS(VALUES(Category[Name]))=1,
   [Sales per Day],
   SUMX(VALUES(Category[Name]), [Sales per Day])
)

In English, this says

“If I am in a pivot cell that corresponds to a single product category like Mountain Bikes, then just use the normal [Sales per Day] measure.  But if the pivot cell corresponds to more than 1 category, that means I am on a total, and then I want to have the total be the sum of all of the individual categories beneath it.”

For details on the whole “IF COUNTROWS” thing, see this post.

For details on SUMX, see this post.

Does it work?  You bet:

PowerPivot totals fixed to add up

The Second Measure Works

Pretty slick.  There are shortcomings to this of course.  If I put something other than Category on the pivot, the measure won’t work right.  It is “tied” to the Category field.  So it isn’t quite as portable as most measures, but it still is amazingly useful when you need it.


Correlating “Fuzzy Time” Events in One Table with Precise Measurements in Another Table

February 26, 2012

 
Precisely-Timed Measurements vs. Imprecisely-Timed Events in PowerPivot - How Do We Relate Across Ranges of Time?

Precisely-Timed Measurements vs. Imprecisely-Timed Events: 
How Do We Correlate/Relate Them?

The Streak is Broken…  and a new streak begins.

Well it was a full six-month run of posting every Tuesday and Thursday, Cal Ripken-like consistency.  Then last week it ended.  I was just drained.  Oh well, time to start a new streak!  Back to Tuesday and Thursday.  So let’s dive in…

Calculated columns.  I am continually realizing how much there is to know about them – stuff I haven’t really been forced to learn since I rely so heavily on Pivotstream’s database team.  My recent work with scientific and medical data sets, starting with the really simple fake data set and then moving into the rat sniffing data, has really driven this home.

So let’s do a quick treatment of fetching data from one table into another, using calculated columns, and in different situations.

When you have a relationship:  =RELATED()

OK, I’ve covered this one before.  When you have two tables and a relationship between them, you can fetch data from the “lookup” or “reference” table into your data table using the RELATED() function.

But there are really only a few places where you SHOULD do this.  You can already include the “lookup” column in your pivots without fetching it into the data table, so there’s little benefit of fetching it.  And doing so will just make your file bigger AND your pivots slower.

So really, I just do this when I’m debugging or exploring my data set.  When I’m done, I delete the column.

When you don’t have a relationship…

OK, back to the rat data Smile.  This will parallel many other real-world examples, such as marketing campaigns etc.  So bear with me even if you don’t find scientific measurements to be interesting.

I have one table called “Sniffs” which records how often (and how deeply) the rats were inhaling:

image

Sniffs Table:  Coded by RatID, SessionID, and TimeID
(Where Each TimeID Represents 1/100 of a Second)

And then I have an Events table.  “Events,” in this case, are things that the rats are doing.  Like…  sniffing each other.  In various places.

image

Events Data – Entered by Humans Watching Time-Coded Video of the Rats
(Note that Time is Recorded in Seconds, not Hundredths of Seconds As in the Sniffs Table)

Event type 4, for instance, is “Rearing” – the rat raising up on its hind legs and sniffing its surroundings.  This event is recorded by someone in the lab watching time-coded video.

THE GOAL:  We want to see how breathing frequency and intensity (data from the Sniffs table) changes during such events – does a rat sniff more or less when Rearing than normal?  (And in Retail, the parallel would be something like “do we get decent lift from our ad campaigns” or “how do spending habits change around a particular event like the Super Bowl?”)

Well, you can’t create a relationship between these two tables, for a few reasons.

Problem 1:  Events are Coded at the Second Level, Sniffs at 0.01 Second

In a single second of elapsed time, there will be 100 rows of data in the Sniffs table, and at most 1 row in the Events table.  Right there, we have a problem.

Then again, each 0.01 second DOES map to a “full” second, in much the same way as a day maps to a year.  The real problem here is…

Problem 2:  Both Tables Are Non-Unique in Time Column

Even if we lined up each 0.01 second from Sniffs with a full second from Events, well, I can have multiple rows in Events that all have the same value for TimeInS, because each row is only unique when you consider RatID and SessionID.

Hmmm…  so how about I just concatenates RatID, SessionID, and TimeInS to form a unique composite column in Events, and then do something similar in the Sniffs table, and THEN I create my relationship?

There are two problems with that – one minor, and one fatal.

The minor problem:  this will make my file quite a bit larger, by adding a column to my largest table (Sniffs).  And it’s the worst kind of column to boot:  a calculated column with a lot of unique values.  Plus, it will form the basis for a relationship, which means my pivot performance will potentially be slow.

The fatal problem?

Problem 3:  The TimeInS Value in the Events Data is Imprecise!

First of all, the video of the rats’ behavior will never be precisely aligned with the scientific equipment that is measuring the rats’ breathing patterns.  There is some “wiggle” in that synchronization, maybe as much as half a second.

Secondly, this is compounded by human error and variability – the act of a rat rearing up to sniff its environment might cover a full two seconds elapsed, but the human being coding the video must pick a single second to flag the behavior.

Taken together, we can see that there’s really a fuzzy “range” defining when the event occurs.

The Solution Part One:  Adding “Time Window” Columns to Events

Let’s say I define the “Event” as lasting one second – 0.5 second before and after the time recorded in the Events table.

Then I add calc columns to my Events table that reflect that:

image

Where the *100 is to convert to the “centiseconds” granularity that I use in my other table (the Sniff table).  PostTimeID is the same, except +0.5 rather than –0.5

The Solution Part Two:  CALCULATE?  In a Calc Column?

Then I can add a calc column to my Sniffs table that “fetches” a corresponding event ID from the Events table whenever there was an event “in progress” for the current sniff row:

=CALCULATE(MAX(Events[EventType]),
   FILTER(Events,
          Events[RatID]=EARLIER(Sniff[RatID]) &&
          Events[SessionID]=EARLIER(Sniff[SessionID]) &&
          Events[PostTimeID]>=EARLIER(Sniff[TimeID]) &&
          Events[PreTimeID]<=EARLIER(Sniff[TimeID])
         )
)

***UPDATE:  I realized that EARLIER() was NOT needed in this formula.  See this post for an explanation.

Notes on this formula:

  1. I had to choose some aggregation function, and I chose MAX, but I could have chosen MIN just as easily.  SUM and AVERAGE would NOT have worked out for me though.  More on this later.
  2. Everything after that is a FILTER expression – with four clauses all “anded” together with the && operator
  3. The first two filter clauses basically just say “only grab rows from Events where the Session and Rat match the Session and Rat on the current row in Sniff”
  4. The function documentation for EARLIER() has a very abstract definition, but in this case you can think of EARLIER() as just meaning “grab the value from the current row in Sniff.”
  5. The last two filter clauses make sure we only grab rows from Events whose time window includes the current row of Sniff.  In other words, if an Event “contains” the time of the current Sniff, it counts
  6. Going back to the MAX() at the beginning of the formula, the MAX essentially breaks ties in favor of events with larger ID’s.  That is of course completely arbitrary.  But in truth there are very, very few overlapping events so it hardly matters.  If I cared more, I could narrow the window to less than +/- 0.5 seconds, or I could pay more attention to my Event ID’s to make sure that higher numbered ID’s truly were the most significant events.

That gives me a result in my Sniff table that basically looks like I had a relationship between the two tables and used =RELATED()

image

OK folks.  I’m tired and about to drop.  But hopefully this inspires some thinking, or at least some questions Smile