Gantt Reworked with ADDCOLUMNS, FILTER, GENERATE and SUMMARIZE

Guest post by David Churchward [Twitter]

DAX Studio, ADDCOLUMNS, FILTER, GENERATE and SUMMARIZE

Gantt Hours measure debugging with DAX Studio – isn’t it pretty!

At the end of my last post in the Gantt Chart series, GANTT CHART WITH RESOURCE LOADING, I said that I would return to explain how the Hours measure worked.  I took a brief diversion, almost foray, into the world of CRM PIPELINE FUNNEL CHARTS but I’m now back to explain the workings of that Hours measure.  You can revisit PART1 and PART2 of this series to recap.

In the interim, I’ve been working on some “fine tuning” of that Hours measure.  I’ll come onto the final solution in due course, but I’ll work through the workings of the Hours measure as we left it initially to explain why it needed tuning up!

For those amongst you with an appreciation of SQL, you might think of this like creating a DAX Equivalent LEFT JOIN between tables with a BETWEEN thrown in for good measure.

First Up, DAX Studio

Sometimes I think I live a charmed life.  I end up with a measure that I’m struggling to optimise and then, almost out of nowhere, up pops DAX Studio to solve all of my problems.  At this point, a huge thanks has to go out to Marco Russo, Darren Gosbell, Paul te Braak and Grant Paisley for developing such a life saving tool and also to Javier Guillén for bringing it to my attention on his excellent BLOG.

It’s probably worth a few words to explain what DAX Studio does, except I’m going to do so in my language rather than perhaps more correct BI speak! Open-mouthed smile

If you’ve ever wondered what those functions such as VALUES(), FILTER(), CALCULATETABLE() (and others besides) are doing, here’s your answer!  Try advancing to SUMARRIZE(), ADDCOLUMNS() and GENERATE() and this tool becomes a must have weapon!

DAX uses a lot of in-memory table filtering and what I refer to as table massaging to get to an adjusted dataset from which to run calculations.  DAX holds these tables almost behind the scenes so you can’t see them and therefore, you’re left guessing what they’re up to.

DAX Studio allows you to run these intermediary tables to see what results they’re returning and this is how I got to the bottom of the Hours measure.

DAX Studio is in Beta, but go and get it – it’s free and quite brilliant!  You can DOWNLOAD IT FROM HERE.

The Hours Measure

Let’s just refresh our own (in-)memory with the Hours measure from before:

Hours

=SUMX

(

SUMMARIZE

(

FILTER

(

TASKS,

COUNTROWS

(

FILTER

(

Dates,Dates[Include]=1

&&Dates[Date]>=FIRSTDATE(Tasks[Start])

&&Dates[Date]<=LASTDATE(Tasks[End])

)

)=1

),

Tasks[Responsible],

Tasks[Description],

“Effort_Value”,

CALCULATE(MAX(Tasks[Effort_Per_Working_Day]))

),

AVERAGE([Effort_Value])

)

It’s certainly a bit spicy.  I’ll explain my rationale and then we’ll put DAX Studio through it’s paces to explain why I’ve ended up changing it.

The problem we have is that the Tasks table holds records with Start and End dates as below.

Gantt Chart Tasks List

All valid (ie not weekends) dates between those start and end dates need to record a resource loading equivalent to one days work on that task.  I can’t link through to a dates table because what do I link through on?  Start Date? No.  End Date?  Ehhh No!

The reality is that I have a one-to-many relationship requirement and that’s the problem.  DAX works on a strict many-to-one relationship basis.  I asked the question of myself whether I should switch the tables around and link Dates to Tasks thereby reversing the relationship and creating a many-to-one relationship.  Ehhh no again!  Multiple tasks could cover the same date ranges and I would therefore be in to a many-to-many situation.

Who was it who said “One one-to-many is one too many!”?  It was no doubt RC

The Rationale behind the Hours Measure

This was my rationale:

  1. Limit the Dates table, in context of each task’s start and end date
  2. Apply that limited dates table to each row in my tasks table, thereby applying the start and end date context of each task
  3. Summarise (or for the American and DAX version SUMMARIZE) that amalgamation of tasks and dates into another in-memory table
  4. Iterate over that in-memory table using SUMX to return results for each task and date combination

I already had two concerns:

  1. I’m running a filtered list of dates for every task.  That’s probably not a problem for ordinary projects with maybe less than 100 tasks.  Naturally, I was quickly informed that someone had over 4,000 tasks on one project!  That said, 4,000 isn’t a big number so we should be able to deal with it easily.
  2. My final SUMX evaluation required an AVERAGE of my resulting [Effort_Value] to return the correct result and I couldn’t understand why.

To dissect this, let’s look at what’s going on behind each element of my rationale.

Limit the Dates table, in context of each task’s start and end date

Filter Expression in DAX Studio

You’ll notice that the FILTER returns a list of records from my Dates table based on a criteria that says that the date should be “included” and falls between the earliest date provided (ie FIRSTDATE) and the last date provided (ie LASTDATE).

Apply that limited dates table to each row in my tasks table, thereby applying the start and end date context of each task

Second Filter Expression

At this point, I got confused.  I’m aware that ROW and COLUMN context needs to be considered because this evaluation will be run for each element on my final pivot, but interpretation of these results isn’t easy.  What I had envisaged was a table of every task repeated multiple times over the dates that were relevant to that task.  I realised that I would probably only get one task result set as I don’t have any other context in this example, but I wasn’t expecting this result.

To summarise the final elements of my rationale, I get the following result set which further emphasises the confusion.

SUMMARIZE function in DAX Studio

I had hit a brick wall.  I had the right answer but it occurred to me that I didn’t know exactly why.  What’s more, I was acutely aware that this approach wasn’t optimised.  With my new DAX Studio in hand, I got underway with producing a revised, more optimised version.

The Result of That Hard Work together with DAX Studio

So here it is.  The new, revised, optimised (nearly – I hope) solution.

Hours2

=SUMX

(

ADDCOLUMNS

(

FILTER

(

GENERATE

(

SUMMARIZE

(

Dates,

Dates[Date],

Dates[IncludeText]

),

SUMMARIZE

(

Tasks,

Tasks[Responsible],

Tasks[Description],

Tasks[Start],

Tasks[End]       

)

),

Dates[Date]>=Tasks[Start]

&&Dates[Date]<=Tasks[End]

&&Dates[IncludeText] = “Include”

),

“Effort_Value”,

CALCULATE(MAX(Tasks[Effort_Per_Working_Day]))

),

[Effort_Value]

)

It’s a beauty!  Let’s start to interrogate.

The Art of SUMMARIZE()

The way that I like to think of the SUMMARIZE() function is that it is a group by.  If you consider any distinct combination of field elements on each row, that’s what you’ll get with the SUMMARIZE() function.

The official syntax is:

SUMMARIZE

(

<table>, <groupBy_columnName>[, <groupBy_columnName>]…[, <name>, <expression>]…

)

Essentially, define the table that you want to summarise, provide the column names that you want to summarise (or group) by and then (optionally) provide names and expressions of any calculations that you want to run on the resulting dataset (ie SUM(VALUE) or anything for that matter that represents a DAX aggregation).

Having spent some time digesting Marco Russo’s best practice guidance on this matter, conducting the expression valuation within the SUMMARIZE function may not be the best thing to do.  Instead, ADDCOLUMNS provides an alternative better approach that I’ll come on to.

Let’s quickly check what these two SUMMARIZE functions are doing in isolation:

SUMMARIZE in DAX Studio

In short, I’ve created an in-memory version of the critical columns that I need from my tasks table.

Summarize Dates Table

Again, a simple extract of two key fields from the Dates table.

Now, imagine if I could smash these two tables together and assemble them into something meaningful!  Bring on GENERATE()!

The Art of GENERATE()

The official syntax of GENERATE() is

GENERATE(<table1>, <table2>)

GENERATE() is designed to create the Cartesian product of two tables.  An alternative way of wording this is to Cross Join the two tables.  One further way of wording it is to simply smash them together!

The concept is that every row in Table1 will be repeated for every row in Table2.  As a result, we’ll get an extended “joint” table which we then need to filter down to something that makes sense in terms of our dataset.

GENERATE in DAX Studio

 

I can’t show enough of the dataset to prove this, but I essentially have a table with Task records repeated for every date in my Dates table.  This is useful in the sense that I’ve managed to create a one-to-many relationship, but it isn’t really a relationship just yet.  Bring on FILTER().

FILTER a GENERATE(d) SUMMARIZE

Weird title to this section, but that’s what we’re going to do.  Jumping straight in:

Filter the Generate

You’ll notice that we’ve now filtered the previous in-memory table that GENERATE() generated Smile for us.  Take for instance the first 7 records in this dataset.  The task runs from 16 Sep 2012 to 25 Sep 2012 which is 10 days.  16 Sep 2012, 22 Sep 2012 and 23 Sep 2012 are all weekends and have therefore been excluded using Dates[IncludeText] = “Include” in our FILTER() function.  In addition, the dates have been filtered to only appear where they feature between the start and end date of the specific task.

Cool!  LEFT JOIN in SQL – eat your heart out!  The only thing I would say is that SQL is much easier in this respect.

The Art of ADDCOLUMNS()

Previous steps have provided a list of relevant task and date combinations, but I’m not seeing any values to use for the hours regarded as work effort on that day.  As per my interpretation of Marco’s best practice, we can now us ADDCOLUMNS to populate some values.

The official syntax of ADDCOLUMNS is

ADDCOLUMNS(<table>, <name>, <expression>[, <name>, <expression>]…)

We therefore provide a table and then set name and expression syntax for the additional columns that we want to add.  Our FILTER (GENERATE (SUMMARIZE x SUMMARIZE)) provides the table.  We then need to provide the additional columns.

ADDCOLUMNS in DAX Studio

Now we seem to have some semblance of order with values that we can use.  Add one important fact to that – the time taken to run this query was milliseconds and more efficient than each of the evaluations for every task and date combination in my original measure.  As a result, even as the number of tasks starts to climb, performance shouldn’t be impacted too heavily.

Our previous in-memory table has been supplemented with a value called Effort_Value where that value is the MAX(Tasks[Effort_Per_Working_Day]) .  I’ve wrapped this in a CALCULATE and I’m trying to understand the circumstances when you do and don’t need to do so.  It isn’t clear to me right now, but is sufficient to say at that this point that you can get different answers depending on whether you wrap with CALCULATE or not.  In this case, I get the right answers by using the CALCULATE wrap.

Finish It Off

With our in-memory tables secure and in order, it’s simply a matter of exposing them.  I do this by wrapping my in-memory table with SUMX.

Hours2

=SUMX

(

ADDCOLUMNS

(

FILTER

(

GENERATE

(

SUMMARIZE

(

Dates,

Dates[Date],

Dates[IncludeText]

),

SUMMARIZE

(

Tasks,

Tasks[Responsible],

Tasks[Description],

Tasks[Start],

Tasks[End]       

)

),

Dates[Date]>=Tasks[Start]

&&Dates[Date]<=Tasks[End]

&&Dates[IncludeText] = “Include”

),

“Effort_Value”,

CALCULATE(MAX(Tasks[Effort_Per_Working_Day]))

),

[Effort_Value]

)

And the result is a measure that appears to be pretty well optimised.

Final Gantt Chart with Optimised Resource Loading Measure

In my opinion, the conclusions of all of this are as follows:

  1. Treat your in-memory “hidden” stuff seriously and get to know it really well.
  2. Get DAX Studio to help you de-bug and get a proper view on what’s going on behind the scenes
  3. Become friends with SUMMARIZE(), GENERATE() and ADDCOLUMNS().  They’re seriously good quality.
  4. You can proclaim to have Rationale, but not necessarily be Rational Confused smile
  5. This is a seriously long post (apologies)
  6. We’ve still got loads to expose about these huge concepts.

If you want to download the latest workbook, YOU CAN DO SO HERE.

2 Responses to Gantt Reworked with ADDCOLUMNS, FILTER, GENERATE and SUMMARIZE

  1. David Hager says:

    This took a lot of work, David. Thanks for sharing!

    • David Churchward says:

      Thanks David. Much appreciated. I wanted to investigate these functions closer for the in-memory tables and with DAX Studio I had the opportunity. It just so happens that Gantt charts got the treatment!

      Cheers
      David

Leave a Comment or Question