Lt. Weinberg: Doing what? Kaffee will have this finished in about four days.
Capt. Whitaker: Doing various administrative things. Backup. Whatever.
Lt. Weinberg: In other words I have no responsibilities whatsoever.
Capt. Whitaker: Right.
Collectively we’ve spent a lot of digital ink on the topic of the relative roles of IT and BI professionals on one hand and the Excel pros on the other – here and here for instance.
In those comment threads, some IT and DB pros were wondering whether PowerPivot marginalizes their roles, and the consensus has been “absolutely not, you might even be more important now.”
If you have any doubts about that, check out this fiasco.
In the last football post, I created a spiffy new measure for Rush Yards. Since then, I’ve created similar measures for Passing and Receiving Yards. And now I want to create measures for Touchdowns – Rush, Pass, and Rec. Trouble is…
Problem #1: No Touchdown Column
Yep, whereas my Plays table contains a [Yards] column, it does not contain any columns remotely related to touchdowns.
Awesome. So Where’s the touchdown column??? Well, in another table (FactPlays), I find a [Result] column, and one of the values of that column is “Touchdown” – Neat.
I filter to [Result] to “Touchdown” – and the [PointScored] column in the same table shows me nothing but 6’s, which is good news since 6 is the number of points in a touchdown
Problem #2: Not all 6-pt plays are touchdowns?
But then to make sure, I clear the filter on [Result] and then filter [PointScored] = 6, and guess what? Sometimes when [PointScored] = 6, [Result] = <Blank>…
…which is NOT good because the only way to score 6 pts is to score a touchdown. So that [Result] column is basically worthless.
But hey, no big deal, I can work around this by just using [PointScored] = 6 as my “Is Touchdown” test, either by creating a new calc column or just doing this directly in a measure.
All I need to do is relate this FactPlays table to my Plays table…
Problem #3: These two tables speak completely different languages
What columns can I use to link these two tables? Given that the relevant entity here is a Play, I start looking for things related to PlayID… and I find a trainwreck.
The Plays table contains [GameCode], [DriveId], [Period], [PlayUniqueId],[SeasonNumber], and [PlayGUID] – the last of which is a calc column I created awhile back by concatenating [PlayUniqueID] and [SeasonNumber] – I had determined that combination to be globally unique (the db lacked this to begin with – ugh).
The FactPlays table contains [GameDate], [GameCode], [Time], and [Quarter].
Hey look – NOTHING MATCHES UP. FactPlays contains nothing resembling a Play ID. It doesn’t even have a GameCode. Nope, games in FactPlays are uniquely identified by [GameDate] combined with [GameNumber], and I assume I can identify Plays by somehow combining those two columns with [Time], which is the time on the game clock. Oh wait, that’s not quite right. I have to then combine that with [Quarter], since the clock starts over at 15:00 in each quarter.
This is gonna take me a little while.
Griping with a Purpose
I was actually working on a much sexier blog post when I ran into this problem. I stopped to share it because this is real-world data. EXPENSIVE data at that! STATS charges six figures (or more) for subscriptions to this stuff.
IT/BI/DB Pros: Please don’t do this to us
OK, that’s obvious. And I am pretty sure that no IT/BI/DB pros reading this would ever design a db like this in a million years. But things like this manage to happen anyway – after 2 months of following a number of SQL MVP’s on Twitter, I can say that one common theme is “oh my god I just got a new project and this is the worst db I have ever seen – how am I supposed to clean this up???”
I think we know how it happens. Without being told, I can say with certainty that it happened like this: When STATS got started as a company, their value proposition was in the collection of data, data that was far more detailed than anywhere else. How the data got stored and shared was an afterthought. And boom, they had customers immediately, clamoring for that data. “Here, take our money! Send us the data however you’ve got it, we need it!”
Collect data, get paid huge sums of money, repeat. No normal human being would interrupt that process to clean up the storage format. I know I wouldn’t. Years later, they surely realize that this format is terrible. But customers’ apps are now tightly bound with this format, so any investment in a new format would have minimal impact, at first anyway.
Price of Entry, or Welcome Forcing Function?
Hey, it’s one thing when you’re selling data to, say, ESPN.com, and the cost of understanding the data is just a small piece of the application that is being built. But with PowerPivot, bad schema is at best an expensive time sink, and at worst, a source of user error.
So… IT/BI/DB pros, I’d like to hear from you – how far do we have to go in order to be ready for broad PowerPivot adoption? And somewhat alternatively, is PowerPivot perhaps a reason why data quality will be taken more seriously in your organization? I (mostly) come from the Excel side of things and would love to know what your experience tells you.
Next Football Post: Solving the Touchdowns Problem >>
Good post, reflective of my PowerPivot experiences so far.
I would love to think that PowerPivot will make people think more carefully about the quality of their data, but fear that it will be sold and adopted without recourse to the central IT function, generally the owners of a company’s data marts and warehouses.
In this case a business area and its PowerPivot vendor (much cheaper than ‘traditional’ BI vendors !) will get to where you are before realising that there is a problem Houston…now I would hope that at that point they would put their hands up and ask for IT help, but from experience out could come the auxiliary linking spreadsheets to paper over the cracks.
Hi Rob. I’ve been enjoying your blog , and am excited about PowerPivot.
I agree that BI professionals should not be worried that PowerPivot will make them superfluous, as your example in this post well shows. However, PowerPivot makes a profound statement about the tools, techniques and jargon that currently accompany “BI”, and it’s not complimentary. If you have a big stake in these, you should be very worried.
PowerPivot will make clear to a very wide audience what has been well-known for a very long time – that the best logical way to organize data for reporting and analysis is the relational database model, and the best physical way to store it is as an inverted database, or column-store. Cubes, dimensional modeling, star schemas… all of this is completely unnecessary and complicates things for no benefit.
Another way of saying this is that a well-designed SQL Server database is the perfect logical representation for reporting and analysis, nothing else is needed. The only problem is that it is an inefficient row-store, and that SQL is an insufficient query language. PowerPivot solves this problem.
As your example in this post shows, there will always be a need to clean up databases, and to gather data from multiple sources into a coherent relational database model suitable for PowerPivot; there is no escaping the timeless fundamentals of Date and Codd. But PowerPivot makes a vast array of books, products, and technology instantly obsolete.