Modeling Viral Growth and Marketing in PowerPivot

January 18, 2013

A Tale of Two Charts

Let’s say you operate a business that relies heavily on “word of mouth” – customers recommending your product/service to their friends and colleagues. Or at least, you THINK it relies heavily on that sort of thing.

You need to decide how much to spend on traditional advertising – to supplement the social/viral marketing that your customers do on your behalf.  Take a look at each of these two charts – the captions for each attempt to capture the knee-jerk conclusions you might draw:

 
Modeling Viral Growth versus Traditional Direct Advertising in PowerPivot

“Advertising?  We Don’t Need No Stinking Advertising!
That is SO Yesterday!  We’re Viral Baby!”

Modeling Viral Growth versus Traditional Direct Advertising in PowerPivot

“All These Youngsters and Their ‘Viral This’ and ‘Social Media That’ – That’s All Just Fancy Excuses to Be Lazy – You Clearly Need to BRING Your Message to the Customer”

If chart 1 reflected reality, you might opt to spend very little on traditional advertising.  But in a chart 2 world, you’d be silly to rely on viral growth.  So which one (if either of them) describes your situation?

Back in October, Rahul Vohra (CEO of Rapportive) wrote a two-part blog series on this topic, posted here on LinkedIn.  I took a note, at the time, to revisit his work and “convert” it to PowerPivot.

It’s a very different kind of problem from what I normally do in PowerPivot – this isn’t about analyzing data I already have, but about calculating future outcomes based on a handful of parameters.  And that leads to some different kinds of thinking, as you will see.
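To give a flavor of what “a handful of parameters” means here: the heart of any viral-growth model is a simple period-over-period recurrence, something along these lines (my rough sketch, not necessarily Rahul’s exact formulation):

NewCustomers[week] = Customers[week-1] * InvitesPerCustomer * InviteConversionRate
                     + CustomersFromAds[week]
Customers[week]    = Customers[week-1] + NewCustomers[week]

Tweak those two viral parameters even slightly and you swing between the chart 1 world and the chart 2 world above – which is exactly why it’s worth modeling rather than guessing.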

 



Conditional Formatting with Different Thresholds per Test, per Product, per Store, per Division, etc.

May 29, 2012

 
After a long hiatus, David Hager has returned with a new guest post.  He has a clever Excel trick/formula for applying different conditional formatting “acceptable ranges” depending on the context of the current row.  In his work, different Tests have different acceptable ranges of values that qualify as Pass/Fail/Warning.

image

Each Test Has a Different “Pass Range” and “Safe Range” –
David Hager’s Technique Translates This Table Quickly into CF Rules

I think this technique can be extended to basically anything:  an acceptable sales growth figure for Store A may differ from that for Store B (or Product A vs. Product B, etc.).
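I haven’t reproduced David’s exact formulas here (that’s what the full post is for), but the general shape of the trick is a conditional formatting rule formula that looks up the current row’s thresholds instead of hardcoding them – something like this, where $A2 holds the Test name, $C2 the measured value, and ThresholdTable is a (hypothetical) lookup range of low/high limits per Test:

=AND($C2 >= VLOOKUP($A2, ThresholdTable, 2, FALSE),
     $C2 <= VLOOKUP($A2, ThresholdTable, 3, FALSE))

You’d use one such rule per format (Pass/Warning/Fail), each pointing at the appropriate pair of limit columns.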

His post also got me thinking about the new “KPI” feature in PowerPivot V2, so I will return later today with a brief follow-on post.



DataMarket Revisited: The Truth is Out There

April 10, 2012

image

How many discoveries are right under our noses,
if only we cross-referenced the right data sets?

Convergence of Multiple “Thought Streams”

Yeah, I love quoting movies.  And TV shows.  And song lyrics.  But it’s not the quoting that I enjoy – it’s the connection.  Taking something technical, for instance, and spotting an intrinsic similarity in something completely unrelated like a movie – I get a huge kick out of that.

That tendency to make connections kinda flows through my whole life – sometimes, it’s even productive and not just entertaining :)

Anyway, I think I am approaching one of those aha/convergence moments.  It’s actually a convergence moment “squared,” because it’s a convergence moment about…  convergence.  Here are the streams that are coming together in my head:

1) “Expert” thinking is too often Narrow thinking

I’ve read a number of compelling articles and anecdotes about this in my life, most recently this one in the New York Times.  Particularly in science and medicine, you have to develop so many credentials just to get in the door that it tends to breed a rigid and less creative environment.

And the tragedy is this:  a conundrum that stumps a molecular cancer scientist might be solvable, at a glance, by the epidemiologist or the mathematician in the building next door.  Similarly, the molecular scientist might breeze over a crucial clue that would literally leap off the page at a graph theorist like my former professor Jeremy Spinrad.

2) Community cross-referencing of data/problems is a longstanding need

Flowing straight out of problem #1 above is this, need #2.  And it’s been a recognized need for a long time, by many people.

Swivel and ManyEyes Both Were Attempts at this Problem

I remember being captivated, back in 2006-2007, by a website called Swivel.com.  It’s gone now – and I highly recommend reading this “postmortem” interview with its two founders – but the idea was solid:  provide a place for various data sets to “meet,” and to harness the power of community to spot trends and relationships that would never be found otherwise.  (Apparently IBM did something similar with a project called ManyEyes, but it’s gone now, too).

There is, of course, even a more mundane use than “community research mashups” – our normal business data would benefit a lot by being “mashed up” with demographics and weather data (just to point out the two most obvious).

I’ve been wanting something like this forever.  As far back as 2001, when we were working on Office 2003, I was trying to launch a “data market” type of service for Office users.  (An idea that never really got off the drawing board – our VP killed it.  And, at the time, I think that was the right call).

3) Mistake:  Swivel was a BI tool and not just a data marketplace

When I discovered that Swivel was gone, before I read the postmortem, I forced myself to think of reasons why they might have failed.  And my first thought was this:  Swivel forced you to use THEIR analysis tools.  They weren’t just a place where data met.  They were also a BI tool.

And as we know, BI tools take a lot of work.  They are not something that you just casually add to your business model.

In the interview, the founders acknowledge this, but their choice of words is almost completely wrong in my opinion:

image

Check out the two sections I highlighted.  The interface is not that important.  And people prefer to use what they already have.  That gets me leaning forward in my chair.

YES!  People prefer to use the analysis/mashup toolset they already use.  They didn’t want to learn Swivel’s new tools, or compensate for the features it lacked.  I agree 100%.

But to then utter the words “the interface is not that important” seems completely wrong to me.  The interface, the toolset, is CRITICAL!  What they should have said in this interview, I think, is “we should not have tried to introduce a new interface, because interface is critical and the users already made up their mind.”

4) PowerPivot is SCARY good at mashups

I’m still surprised at how simple and magical it feels to cross-reference one data set against another in PowerPivot.  I never anticipated this when I was working on PowerPivot v1 back at Microsoft.  The features that “power” mashups – relationships and formulas – are pretty…  mundane.  But in practice there’s just something about it.  It’s simple enough that you just DO it.  You WANT to do it.

Remember this?

OK, it’s pretty funny.  But it IS real data.  And it DOES tell us something surprising – I did NOT know, going in, that I would find anything when I mashed up UFO sightings with drug use.  And it was super, super, super easy to do.

When you can test theories easily, you actually test them.  If it was even, say, 50% more work to mash this up than it actually was, I probably never would have done it.  And I think that’s the important point…

PowerPivot’s mashup capability passes the critical human threshold test of “quick enough that I invest the time,” whereas other tools, even if just a little bit harder, do not.  If something is even slightly too time-consuming, humans prioritize it right off the list.

Which, in my experience, is basically the same difference as HAVING a capability versus having NO CAPABILITY whatsoever.  I honestly think PowerPivot might be the only data mashup tool worth talking about.  Yeah, in the entire world.  Not kidding.

5) “Export to Excel” is not to be ignored

Another thing favoring PowerPivot as the world’s only practically-useful mashup tool:  it’s Excel.

I recently posted about how every data application in the world has an Export to Excel button, and why that’s telling.

Let’s go back to that quote from one of the Swivel founders, and examine one more portion that I think reflects a mistake:

image

Can I get a “WTF” from the congregation???  R and SAS but NO mention of Excel!  Even just taking the Excel Pro, pivot-using subset of the Excel audience (the people who are reading this blog), Excel CRUSHES those two tools, combined, in audience.  Crushes them.

Yeah, the mundane little spreadsheet gets no respect.  But PowerPivot closes that last critical gap, in a way that continues to surprise even me.  Better at business than anything else.  Heck, better at science too.  Ignore it at your peril.

6) But Getting the Data Needs to be Just as Simple!

So here we go.  Even in the UFO example, I had to be handed the data.  Literally.  Our CEO already HAD the datasets, both the UFO sightings and the drug use data.  He gave them to me and said “see if you can do something with this.”

There is no way I EVER would have scoured the web for these data sets, but once they were conveniently available to me, I fired up my convenient mashup tool and found something interesting.

7) DataMarket will “soon” close that last gap

In a post last year I said that Azure DataMarket was falling well short of its potential, and I meant it.  That was, and is, a function of its vast potential much more so than the “falling short” part.  Essentially, just a few usability gaps need to be plugged before it really lights things up.

On one of my recent trips to Redmond, I had the opportunity to meet with some of the folks behind the scenes.

Without giving away any secrets, let me say this:  these folks are very impressive.  I love, love, LOVE the directions in which they are thinking.  I’m not sure how long it’s going to take for us to see the results of their current thinking.

But when we do, yet another “last mile” problem will be solved, and the network effect of combining “simple access to vast arrays of useful data sets” with “simple mashup tool” will be transformative.  (Note that I am not prone to hyperbole except when I am saying negative things, so statements like this are rare from me.)

In the meantime…

While we wait for the DataMarket team’s brainstorms to reach fruition, I am doing a few things.

1) I’ve added a new category to the blog for Real-World Data Mashups.  Just click here.

2) I’m going to share some workbooks that make consumption of DataMarket simple.  Starting Thursday I will be providing some workbooks that are pre-configured to grab interesting data sets from DataMarket.  Stay tuned.

3) I’m likely to run some contests and/or solicit guest posts on DataMarket mashups.

4) I’m toying with the idea of Pivotstream offering some free access to certain DataMarket data sets in our Hosted PowerPivot offering.

See you Thursday :)


Correlating “Fuzzy Time” Events in One Table with Precise Measurements in Another Table

February 26, 2012

 
Precisely-Timed Measurements vs. Imprecisely-Timed Events in PowerPivot - How Do We Relate Across Ranges of Time?

Precisely-Timed Measurements vs. Imprecisely-Timed Events: 
How Do We Correlate/Relate Them?

The Streak is Broken…  and a new streak begins.

Well, it was a full six-month run of posting every Tuesday and Thursday – Cal Ripken-like consistency.  Then last week it ended.  I was just drained.  Oh well, time to start a new streak!  Back to Tuesday and Thursday.  So let’s dive in…

Calculated columns.  I am continually realizing how much there is to know about them – stuff I haven’t really been forced to learn since I rely so heavily on Pivotstream’s database team.  My recent work with scientific and medical data sets, starting with the really simple fake data set and then moving into the rat sniffing data, has really driven this home.

So let’s do a quick treatment of fetching data from one table into another, using calculated columns, and in different situations.

When you have a relationship:  =RELATED()

OK, I’ve covered this one before.  When you have two tables and a relationship between them, you can fetch data from the “lookup” or “reference” table into your data table using the RELATED() function.
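For example (table and column names here are purely illustrative), a calc column like this on the data table:

=RELATED(Stores[Region])

…pulls each row’s Region down from the related Stores lookup table.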

But there are really only a few places where you SHOULD do this.  You can already include the “lookup” column in your pivots without fetching it into the data table, so there’s little benefit of fetching it.  And doing so will just make your file bigger AND your pivots slower.

So really, I just do this when I’m debugging or exploring my data set.  When I’m done, I delete the column.

When you don’t have a relationship…

OK, back to the rat data :).  This will parallel many other real-world examples, such as marketing campaigns.  So bear with me even if you don’t find scientific measurements to be interesting.

I have one table called “Sniffs” which records how often (and how deeply) the rats were inhaling:

image

Sniffs Table:  Coded by RatID, SessionID, and TimeID
(Where Each TimeID Represents 1/100 of a Second)

And then I have an Events table.  “Events,” in this case, are things that the rats are doing.  Like…  sniffing each other.  In various places.

image

Events Data – Entered by Humans Watching Time-Coded Video of the Rats
(Note that Time is Recorded in Seconds, not Hundredths of Seconds As in the Sniffs Table)

Event type 4, for instance, is “Rearing” – the rat raising up on its hind legs and sniffing its surroundings.  This event is recorded by someone in the lab watching time-coded video.

THE GOAL:  We want to see how breathing frequency and intensity (data from the Sniffs table) changes during such events – does a rat sniff more or less when Rearing than normal?  (And in Retail, the parallel would be something like “do we get decent lift from our ad campaigns” or “how do spending habits change around a particular event like the Super Bowl?”)

Well, you can’t create a relationship between these two tables, for a few reasons.

Problem 1:  Events are Coded at the Second Level, Sniffs at 0.01 Second

In a single second of elapsed time, there will be 100 rows of data in the Sniffs table, and at most 1 row in the Events table.  Right there, we have a problem.

Then again, each 0.01 second DOES map to a “full” second, in much the same way as a day maps to a year.  The real problem here is…

Problem 2:  Both Tables Are Non-Unique in Time Column

Even if we lined up each 0.01 second from Sniffs with a full second from Events, well, I can have multiple rows in Events that all have the same value for TimeInS, because each row is only unique when you consider RatID and SessionID.

Hmmm…  so how about I just concatenate RatID, SessionID, and TimeInS to form a unique composite column in Events, then do something similar in the Sniffs table, and THEN create my relationship?

There are two problems with that – one minor, and one fatal.

The minor problem:  this will make my file quite a bit larger, by adding a column to my largest table (Sniffs).  And it’s the worst kind of column to boot:  a calculated column with a lot of unique values.  Plus, it will form the basis for a relationship, which means my pivot performance will potentially be slow.

The fatal problem?

Problem 3:  The TimeInS Value in the Events Data is Imprecise!

First of all, the video of the rats’ behavior will never be precisely aligned with the scientific equipment that is measuring the rats’ breathing patterns.  There is some “wiggle” in that synchronization, maybe as much as half a second.

Secondly, this is compounded by human error and variability – the act of a rat rearing up to sniff its environment might cover a full two seconds elapsed, but the human being coding the video must pick a single second to flag the behavior.

Taken together, we can see that there’s really a fuzzy “range” defining when the event occurs.

The Solution Part One:  Adding “Time Window” Columns to Events

Let’s say I define the “Event” as lasting one second – 0.5 second before and after the time recorded in the Events table.

Then I add calc columns to my Events table that reflect that:

image

Where the *100 is to convert to the “centiseconds” granularity that I use in my other table (the Sniffs table).  PostTimeID is the same, except +0.5 rather than -0.5.
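In case the screenshot doesn’t come through, the two calc columns are along these lines (a sketch, using the column names from my tables above):

PreTimeID   =([TimeInS] - 0.5) * 100
PostTimeID  =([TimeInS] + 0.5) * 100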

The Solution Part Two:  CALCULATE?  In a Calc Column?

Then I can add a calc column to my Sniffs table that “fetches” a corresponding event ID from the Events table whenever there was an event “in progress” for the current sniff row:

=CALCULATE(MAX(Events[EventType]),
   FILTER(Events,
          Events[RatID]=EARLIER(Sniff[RatID]) &&
          Events[SessionID]=EARLIER(Sniff[SessionID]) &&
          Events[PostTimeID]>=EARLIER(Sniff[TimeID]) &&
          Events[PreTimeID]<=EARLIER(Sniff[TimeID])
         )
)

***UPDATE:  I realized that EARLIER() was NOT needed in this formula.  See this post for an explanation.

Notes on this formula:

  1. I had to choose some aggregation function, and I chose MAX, but I could have chosen MIN just as easily.  SUM and AVERAGE would NOT have worked out for me though.  More on this later.
  2. Everything after that is a FILTER expression – with four clauses all “anded” together with the && operator
  3. The first two filter clauses basically just say “only grab rows from Events where the Session and Rat match the Session and Rat on the current row in Sniff”
  4. The function documentation for EARLIER() has a very abstract definition, but in this case you can think of EARLIER() as just meaning “grab the value from the current row in Sniff.”
  5. The last two filter clauses make sure we only grab rows from Events whose time window includes the current row of Sniff.  In other words, if an Event “contains” the time of the current Sniff, it counts
  6. Going back to the MAX() at the beginning of the formula, the MAX essentially breaks ties in favor of events with larger IDs.  That is of course completely arbitrary.  But in truth there are very, very few overlapping events so it hardly matters.  If I cared more, I could narrow the window to less than +/- 0.5 seconds, or I could pay more attention to my Event IDs to make sure that higher-numbered IDs truly were the most significant events.

That gives me a result in my Sniff table that basically looks like I had a relationship between the two tables and used =RELATED()

image

OK folks.  I’m tired and about to drop.  But hopefully this inspires some thinking, or at least some questions :)


Converting from Peaks to Frequencies

February 16, 2012

image

“Sniff something? Did ya, rat boy?”

Continuity!

Folks, I just can’t resist a quick follow-up on the Rat Sniffing Project.  I…  just… can’t.  Plus I am absolutely worn down and don’t have the energy tonight for anything that isn’t inherently entertaining.

All of those calculated columns from the last post, remember, were just the setup so we could start doing some REAL observational stuff.

You know, something like this:

Measuring Event Frequency in Hz in PowerPivot

This wasn’t difficult at all.

Remember that I now had a column that flagged each row as “1” if it represented the peak of an inhalation:

image

Each Row is 0.01 second, Flagged as 1 if
it’s a Peak Inhale, aka a “Sniff”

So now I really just need some measures.

The First Measure I’ve Ever Written in the Unit of Hertz!

(Hmmm.  There’s a joke in here somewhere, the first line of which is something like “what do you do if your unit Hertz?”  Listen, I told you I was tired.)

Since I have a 1 in the RobPeak column, summing it will yield the number of peaks in an interval – I don’t need to do a CALCULATE(COUNTROWS()) with a filter set to 1.

So the measure formula for Hz (events per second) is:

(SUM(Data[RobPeak]) / COUNTROWS(Data)) * 100

Why times 100?  Because each row is actually 0.01 second.
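Written out as a named measure (a sketch – name it whatever you like), that’s:

Sniff Hz := (SUM(Data[RobPeak]) / COUNTROWS(Data)) * 100

The longer CALCULATE(COUNTROWS()) equivalent I avoided would have used CALCULATE(COUNTROWS(Data), Data[RobPeak]=1) in the numerator instead.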

Now THIS is a *Time* Table

This is another first:  creating a separate time table that is NOT measured in days.  In this case…  hundredths of a second:

A Time Table in PowerPivot - Not Dates.  Seconds.

Not a Calendar Table – a Time Table.  In  0.01 Second Increments

OK, it’s not a Calendar table but it is VERY similar.  A Calendar table has columns like DayOfWeek, DayOfMonth, CalendarYear, etc. – columns that represent properties of dates.  Well, the columns I have here, like Second and FiveSecond, are the same sort of thing.

And since I have 6 rats in the experiment, each measured over the same time points, it is wasteful to duplicate those properties in my Data table.  Plus there’s that whole speed thing.

So I created this separate table and related it to my Data table.
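If you’re building one of these yourself, the “property” columns are just arithmetic on the TimeID key – a sketch, assuming TimeID is in hundredths of a second:

Second      =INT([TimeID] / 100)
FiveSecond  =INT([TimeID] / 500) * 5
Minute      =INT([TimeID] / 6000)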

And this table then “powers” my slicers like these:

Slicers Based on a Time Table Measured in Minutes and Seconds

These slicers represent a navigation method more than they do a “filter” in the classic sense.  I say that because fields from the time table are on the axis of the chart as well:

image

The net effect is that I can use my slicers to quickly move around the data and examine “windows” that are interesting or relevant.  Couple of quick clicks and I am now looking at minute 4 instead of minute 3, and just the first 10 seconds of that minute:

image

Quickly “Jumped” to the 4:00-4:10 Interval

It’s ALMOST like the slider control on a YouTube video:

image

YouTube Has a Time Slicer Control Too!

Wouldn’t it be neat if Excel gave us a YouTube-style slicer for time?  We can dream.

Side Note:  Rats Breathe FAST!

Check those charts out.  Between 5 and 9 Hz???  Really?  Those little rascals inhale between 5 and 9 times per second, on a SUSTAINED basis?  Wow.  I’m easily amused I guess, but wow.


Peak Detection: A Surprising Usage of PowerPivot

February 14, 2012

 
image

“Who are YOU, who are so WISE in the ways of SCIENCE?”

It’s all about following through…

In a recent post I covered a very simple scientific scenario.  It was an interesting diversion from the normal biz-style scenarios but it left me feeling hollow in two ways.  First, it was too simple and didn’t account for the possibility of multiple different treatment types, so I pinged the Italians.  (They responded, as they always do, and I owe a post on their reply, probably Thursday).

But the other problem was that, while claiming to be a scientific scenario, it was manufactured by me, and hey, I’m no scientist.  Then it hit me…

Hey!  My Neighbor is a REAL Scientist!

Yeah, Dan Wesson is a “he-runs-his-own-research-lab-at-the-university, is-published-in-prestigious-science-journals” kind of scientist.  So I decided to bounce the question off of him – was my example scenario at all useful, and if not, what WOULD be useful?  Boy, am I glad I asked.

Perhaps My Favorite PowerPivot Visualization Yet

PowerPivot Used for Peak Detection - My New Dining Room Mural


First of all, HOW COOL IS THAT?  It looks like the double helix or something.  I joked with my wife that I was gonna have it turned into an eight-foot mural for the dining room.  (Yeah, I was joking sweetie.  Totally joking, I would never do something like that).

What do the Values Mean?

image

These are measurements of respiration taken from rats in Dan’s lab.  Negative values indicate “inhale” and positive indicate “exhale.”

Each value represents one one-hundredth of a second – how’s that for time intelligence? :)

What Does “Peak Detection” Mean?

In order to make sense of the data, Dan needs to measure the frequency of how often the rats “sniff.”  Sniffing is an indicator of all sorts of things apparently – a sign of interest, a sign of cognition, a sign of dominance…  I had no idea.

Anyway, he needs to measure the number of “sniffs” the rats take in a given time frame, and in order to do that, he needs to count the negative peaks in the breathing data:

image

PowerPivot Can Do This?  Yes, Better Than the Professional Tools!

I went into this mostly expecting that PowerPivot was not going to be a good tool for his needs.  And as usual, PowerPivot surprised me.

Not only can PowerPivot do this, it outperforms expensive specialized software.  Software that is designed to do precisely this task.  Let that sink in for a moment:

PowerPivot Outperforms Expensive and Specialized Scientific Software

$10,000 Specialized Scientific Software Often “Misses” the Peaks,
PowerPivot is Dead On

How Did I Do It?

It actually wasn’t that hard.

Given a data set like this with 180 thousand rows:

Six Rats’ Breathing Measured 100 Times per Second for Five Minutes

I needed to generate a 1/0, True/False column identifying whether a given row was a “sniff” peak.

I ended up creating a TimeID column so that I could work in integers:

image

Multiply by 100 So I Can Work in Integer Time Increments
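The TimeID column itself is nothing fancier than this (a sketch – assuming the raw time column is named [Time] and is measured in seconds):

=ROUND([Time] * 100, 0)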

And then I wrote the following calc column formula:

[IsPeakCandidate]

=IF(OR([TimeID]=0,[value]=0),0,
IF([value] > 0,0,   
    IF(
       AND(
           CALCULATE
             (MIN([value]),
             FILTER(ALL(Data), Data[Rat]=EARLIER(Data[Rat])),
             FILTER(ALL(Data),
                Data[TimeID]=EARLIER(Data[TimeID])-1)
             )
           >[value]
           ,

           CALCULATE
             (MIN([value]),
             FILTER(ALL(Data), Data[Rat]=EARLIER(Data[Rat])),
             FILTER(ALL(Data),
                Data[TimeID]=EARLIER(Data[TimeID])+1)
             )
           >[value]
          )
       ,1
       ,0
      )
   )

)

What is that formula doing?

OK, that looks complex.  But all it’s doing is asking three questions:

  1. Is the CURRENT row’s value less than zero?
  2. Was the PRIOR row’s value greater than this row’s value?
  3. Is the NEXT row’s value also greater than this row’s value?

If all three are “yes,” then this row represents a negative peak.

Flaw:  A “Hiccup” Becomes a Peak

Unsurprisingly, it turns out that formula wasn’t quite good enough:

image

In Basketball This is Called a “Pump Fake”
(In American Football, a “Juke,” in Hockey, a “Deke…”)

So I added another calculated column:

[MinOverInterval]

=CALCULATE(MIN(Data[value]),
           ALL(Data),
           FILTER(ALL(Data),
                  Data[Rat]=EARLIER(Data[Rat])
           ),
           FILTER(ALL(Data),
                  Data[TimeID] <= EARLIER(Data[TimeID]) +5 &&
                  Data[TimeID] >= EARLIER(Data[TimeID]) -5
           )
)

For each row in my data, that calculates the minimum value over an 11-row interval – five rows before, five rows after, and the row itself:

image

MinOverInterval:  The Smallest Value in the 11-Row Window

One Last Column!

Then, tying it all together, the last calc column is this:

[IsRealPeak]

=IF(
   AND(
        Data[IsPeakCandidate]=1,
       Data[MinOverInterval]=Data[Value]
      )
   ,1,0
)

This column says “if this row was flagged as a peak candidate already, AND it is the most negative row in its 11-row interval, then we bless it as a REAL peak row.”

And if I use that column instead, that fixes the false peaks:

image

Some Final Notes

First, I suspect that I can simplify my calc columns a bit.  There was some exploratory trial and error in this process and it might be that the “MinOverInterval” test is ALL I need, since there’s no way a row can be its minimum in the interval and NOT be more negative than its immediate neighbors.
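If that hunch is right, the IsRealPeak test might collapse to just this (keeping the MinOverInterval column) – a sketch I have NOT verified against the data:

=IF(AND([value] < 0, [value] = [MinOverInterval]), 1, 0)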

But hey, I wanted to get this post out ok? :)

Second, yes, these calc columns are something I’d ideally want to have done in a database.  But since I want to give Dan a completely self-contained toolset, I had to do it in calc columns.

Lastly, this “peak detection” is NOT the end of the road!  Far from it.  It’s just the beginning.  Now that we have a reliable “peak flag” column in our data, we can start doing the normal PowerPivot thing – measures of frequency, slicing by properties of the rat itself (age, sex, etc.), properties of the trial… 

So I expect Pivotstream will be doing some real work for Dan’s lab, helping support his Alzheimer’s research.  All starting from a quick conversation in Dan’s backyard over the weekend.  How cool is that?


Tracking Performance After an Event or Treatment

February 2, 2012

 
Jackie Brown - A Tarantino Movie I Need to Revisit Soon

“…then they start thinking that, ‘where there’s smoke there’s fire’ logic…”

Continuing the Impromptu Series of Simple Real World Examples

The Mr. Excel PowerPivot Forum has inspired me to change gears a little bit and focus more time on simple techniques that don’t break the DAX Spicy Scale while still being very useful, everyday stuff. I sometimes take this stuff for granted and end up looking for topics that are much more “clever” when in reality we all can use a good dose (or two, or ten) of basic useful examples.

Medical Treatments – Are They Effective?  Counterproductive?

A user over at the forum named “Mirknin” posted the following question:

I have a relatively simple data set where each row represents a week, each week has a number of data columns. Occasionally, however, we have treatments and I set up Excel with a column where the treatment is recorded as 1 – i.e. a treatment occurred that week, otherwise the cells are left blank.

I AddColumn in PowerPivot to generate a unique identifier (through the usual ampersand formula) to represent whether a treatment occurred in a specific week. Using the Slicer, I can PivotTable to see the data from the row where the treatment took place because all this info is related to that week, but seeing the data for following four weeks would let me see whether the treatment had made an effect.

I guess my query is whether the Slicer can be modified in such a way as to report the row it’s associated with plus subsequent rows…

Data Set:  Let’s Use Trees Rather than People :)

When I was manufacturing sample data for this I got a bit squeamish about it.  Not for the obvious reasons though – I’ll explain at the end of the post.  But for now, just know that we’re going to work with Trees, like in a Botany lab, and some experimental treatment they are receiving.

We have a calendar of treatments – a list of dates and Yes/No for each:

image

Calendar of Treatments

And then a table of health readings taken for various trees in the lab on specific dates:

image

Tree Health Measurements

Normal Slicer Function Across Relationships

Notice that the two tables have the little “relationship” icons on the Date column?  These two tables are related on that column.  So if I write an “average health” measure and put the Yes/No treatment field on a slicer, I get:

image

image
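(The “average health” measure itself is as plain as it sounds – a sketch, with the table and column names being whatever you used on import:

Average Health := AVERAGE(TreeHealth[HealthScore])

Nothing clever there – the interesting part is what the slicer does to it.)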

But if I put Date on rows, we’ll see that all I am “getting” in my health scores is the health of the tree on the day the treatment was applied:

image

But I Want the Health Scores Over the Two Days AFTER Treatment!

I want to track a range of dates after a treatment is applied, so I need to do something different here.

What I need is a new column in the TreatmentCalendar table:

image

If I use THAT column as a slicer, or even better, put it on rows of the pivot, I get what I want:

image

Hey, check it out!  Trees are doing a little bit better in the two days after they receive treatment!  (That’s lucky, since this sample data was created using RANDBETWEEN).

But how did I build that “Recently Treated” Calculated Column?

Did I say up front that this was going to be LOW on the DAX Spicy Scale?  Hmm.

First let me repeat a longstanding recommendation:  if you are using a database as your data source for PowerPivot, you should seriously consider having this calculated column generated in the database and NOT in PowerPivot.  There are multiple benefits of that, as long as it’s an option.

If it’s not an option, you still have two choices.  For something relatively quick and dirty, you can do the calc column in Excel and just paste an entirely new TreatmentCalendar table into PowerPivot.

But for larger data sets, or cases where you can’t have that manual intervention step every time you get new data, you’re gonna have to write the calc column in PowerPivot.

And since PowerPivot lacks A1-style references, and this calc column has to look at rows other than just the current row in order to get its answer, you have to use the dreaded EARLIER function.

Primer:   Simple Use of the EARLIER Function

Before we do something advanced with EARLIER, let’s cover the basic usage first:

Say I have a VERY simple table like this:

image

And I want to add a third column that is the total for each customer:

image

How do I do that?

The calc column formula for that third column is this:

=CALCULATE(SUM([Amt]),
           FILTER(ALL('Table'),
             'Table'[Customer]=EARLIER('Table'[Customer])
                 )
          )

Explaining EARLIER() in depth may be its own post.  Yeah.  This was probably the LAST function I learned to use.  It’s an ongoing joke between me and the Italians actually.

Here’s the Formula for the RecentTreatment Yes/No Column

With that background in mind, here is the formula for the Yes/No column.  I actually did it in two columns:

image

That 1/0 column is the tricky one.  The Yes/No column is just a “cosmetic” column built using IF – I’ll sketch it right after the formula below.

Here’s the 1/0 formula:

=CALCULATE(COUNTROWS(TreatmentCalendar),
           TreatmentCalendar[IsTreatmentDay]="Yes",
           FILTER(ALL(TreatmentCalendar),
                  TreatmentCalendar[Date] <
                       EARLIER(TreatmentCalendar[Date])
                       &&
                  TreatmentCalendar[Date] >=
                       EARLIER(TreatmentCalendar[Date])-2))
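And here’s that cosmetic Yes/No column – a sketch, where [RecentTreatmentCount] stands in for whatever you named the 1/0 column:

=IF([RecentTreatmentCount] >= 1, "Yes", "No")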

Easy right? :)

I Promise to Come Back and Explain This!

EARLIER isn’t all that bad really, and neither is the formula above.  But it definitely feels a lot harder than an Excel pro wants it to be.  Given the length of this post, however, I think I’ll save that full explanation for next time.

Calling Marco and Alberto!

There are a few obvious questions here that my simplified example just outright skipped:

  1. What if I have more than one kind of treatment?
  2. What if not all “patients” receive the same treatment, and/or on different days?
  3. Can I use a slicer to control how “wide” the range of dates is rather than hardcoding 2 into my formulas?

The fact that I was dodging those problems in this post is why I was squeamish about using people as the example rather than trees.  And all three of those make this problem a lot harder.

And what do we do when we hit a problem that goes beyond the powers of mortal Excel Pros?  We throw up the Boot Signal of course:

image