Big Data is Just Data, Why Excel “Sucks”, and 1,000 Miles of Data


***UPDATE:  Slides uploaded here.


One of My Slides From Last Night – Equally Relevant to Excel, BI, DB, and Big Data Pros

Had a great time last night at the NYC MSBIgData group.  I’ve never spoken to a group quite like last night’s, but I struggle to explain how they were different.  It’s easier to explain what they were not.  Even though the user group is a Big Data and BI user group, they were not a Hive/Hadoop crowd, which shouldn’t have surprised me – there aren’t enough Hive/Hadoop people in the world to really have crowds of them lying around, at least not yet.

But there also wasn’t a critical mass of any other discipline either – not BI, not Excel, not DBA, not SharePoint, not programmer.  There were some people from each of those backgrounds, but no more than 10% from any one of them.

I think my best assessment is that they were simply a group of people who DO things.  A very pragmatic collection of flexible people.  People who happily use different tools to solve different problems.  I find that fascinating all on its own.

(If you were at last night’s talk, please replace every instance of the word “they” above with the word “you.”)  :)

Big Data is Just Data, and Hadoop is Just a Way to Store Lots of It

I think I am coming around to the conclusion that Hadoop is not something that most readers of this blog really need to know about.  (Pause for sigh of relief.)

If a company has tons of data streaming in, doesn’t know what to do with it, and doesn’t want to throw it all away, but the data doesn’t fit cheaply into databases because it’s unstructured stuff like tweets…  well…  Hadoop, there it is (as Denny Lee says).  It’s a great way to store that data.

A Row of Data.  Forever and Ever.  Amen.

But when it comes time to analyze that data, guess what?  It’s time for good old-fashioned rows of data.  (“Records,” if you want to be all highfalutin about it.)  And Hadoop is happy to serve up rows for us!

Hadoop Data in PowerPivot

This is Hadoop Data, Loaded into PowerPivot.  Note How Different it Looks.
(Oh Wait, it Looks Exactly Like, um, Data.)

Now it’s true that I can use Hadoop to turn unstructured text data (like millions of tweets) into rows, and then filter those rows on import so that I only get tweets containing the term “PowerPivot,” for instance.  Sounds like a database.

Today I might want each row to represent a single tweet, with columns like “Date,” “Country,” and “SentimentScore,” and then tomorrow I might want each row to represent an aggregated average sentiment score for each Country/Date pair.  Hadoop can do all of that.  Again, sounds like a database.
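To make those two shapes concrete, here’s a minimal sketch in Python/pandas – the column names are the hypothetical ones from above, and pandas is just standing in for whatever Hadoop job would actually produce the rows:

```python
import pandas as pd

# Grain 1: one row per tweet (hypothetical columns from the example above)
tweets = pd.DataFrame({
    "Date":           ["2012-05-01", "2012-05-01", "2012-05-02"],
    "Country":        ["US", "UK", "US"],
    "SentimentScore": [0.8, 0.4, 0.6],
})

# Grain 2: one row per Country/Date pair, with the average sentiment
avg_sentiment = (
    tweets.groupby(["Country", "Date"], as_index=False)["SentimentScore"]
          .mean()
)
print(avg_sentiment)  # either way, it's just... rows of data
```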

Excel pros who want to perform this kind of analysis will of course require some expert back-end assistance with it, to store the data properly and then expose the right kinds of queries (so that we can fetch the right kinds of rows!).

Again, that’s just like a database.  So let’s just call it that – it’s a new kind of database.  You may encounter it in the future and you may not.  But from our perspective, again it’s all about fetching the right rows.

Really, I think that’s all we (Excel Pros specifically) need to know about Hadoop.  And I didn’t get shouted down for being an idiot, so clearly I’m correct.  :P

1,000 Miles of Data

Here’s a screenshot from one of my PowerPivot workbooks:


161 Million Rows – How Long Would This Table Be if I Printed it Out?

My seatmate on the flight to NYC was a hedge fund guy – an Excel user for sure, but not really a pivot guy.  Well, he got a demo anyway, muhaha.  Actually he was quite a willing victim and a very thoughtful fellow – he asked me how long this table would be if I printed it out (he knew I was going to give a talk and thought it might be an intriguing figure to share, depending on the answer).

At first I kinda dismissed the idea – I guess I’ve done metaphors like that too many times in the past and have grown weary of them – but when he pressed me on it, I was shocked by the answer.

On average, 47 rows fit on a single sheet of printer paper.  Let’s round that up to 50.

That means this table would consume more than 3 million sheets of paper.  Round the 11.5-inch page length up to an even foot, and that becomes 3 million feet of paper.  A million feet is about 200 miles.

This single table would stretch 600 miles (about 1,000 km) if printed out!  That’s farther than my flight to NYC!

And there’s another table in the same workbook with 140 million rows:


So that’s another 500 miles or so.  And then the Sales table has 12 million rows, or about another 50 miles.  In total there’s more than 1,000 miles of data in this workbook, approaching 2,000 km!
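If you want to check my math, here’s the whole back-of-the-envelope calculation as a short Python sketch, using the same rounded assumptions as above (50 rows per page, one foot per page):

```python
# Back-of-the-envelope: how many miles of paper per table?
ROWS_PER_PAGE = 50      # ~47 rows fit per sheet, rounded up
FEET_PER_PAGE = 1.0     # 11.5-inch page length, rounded up to a foot
FEET_PER_MILE = 5280

def miles_of_paper(rows):
    return rows / ROWS_PER_PAGE * FEET_PER_PAGE / FEET_PER_MILE

tables = {"big table #1": 161_000_000,
          "big table #2": 140_000_000,
          "Sales":         12_000_000}

for name, rows in tables.items():
    print(f"{name}: {miles_of_paper(rows):,.0f} miles")

total_miles = miles_of_paper(sum(tables.values()))
print(f"total: {total_miles:,.0f} miles (~{total_miles * 1.609:,.0f} km)")
```

That works out to roughly 610 + 530 + 45 miles – just under 1,200 miles in all.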

Imagine the data strung out across the United States as a single, narrow green line of printouts:


The Data in My Workbook, if Printed, Would Stretch from Boston to Atlanta!

And I demonstrate that data set all the time on my 4GB HP Folio 13 Ultrabook:


All of that Data on a Tiny Little Computer

That Ultrabook is so small, it doesn’t even have a VGA-out port on it.  Unbelievable.

Why Excel “Sucks”


Another Slide From Last Night

First off, I apologize – there’s a lot of jargon on that slide.  I wasn’t speaking to an Excel Pro audience last night (in fact I never am – you folks just don’t congregate; please start congregating).  I was speaking more to a tech/IT audience, and the lingo above is their lingo.  Know your audience.  I’ll translate below, don’t worry.

Secondly, you know that I do NOT think Excel sucks.  “Ich bin ein Exceller,” ok?  But as someone who has used the product for a long time, AND worked on developing it at Microsoft, I’ve spent a lot more time than is healthy thinking about ways in which it is lacking.

Excel does have a reputation in tech/IT circles.  That reputation is 50% vicious slander (I’m kidding) and 50% earned.  So this slide covers the “earned” part.

As I progressed through the talk, showing off PowerPivot’s features, I marked items off of that list – the things that PowerPivot fixes about Excel.  Here’s what the slide looked like at the end:


Excel Plus PowerPivot (Including PowerPivot Server):
Solves All But One Traditional Excel Shortcoming

All right, so what do those 8 points mean?

Explaining the 8 Weaknesses of Traditional Excel

1) One Million Row Capacity – obvious enough.  But the other point – that Excel gets slow even when you are using much smaller data volumes – may not seem correct.  A calculated column in a 100,000-row normal Excel table actually calculates faster than the same calc column in PowerPivot.  Specifically, what I mean here is that Excel gets slow when you use functions like VLOOKUP() with tens of thousands of rows, as the sketch below illustrates.  So really, the “slow” thing is an outgrowth of the next problem…
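Here’s a rough illustration of why, with plain Python standing in for Excel – purely a sketch of the principle, not Excel’s actual internals.  A VLOOKUP-style formula rescans the lookup table from the top for every row, so the work grows with rows × rows, while a keyed, relationship-style lookup handles each row in constant time:

```python
import time

N = 10_000  # "tens of thousands of rows"
lookup_table = [(f"key{i}", i * 1.5) for i in range(N)]
lookup_index = dict(lookup_table)

# VLOOKUP-style: linear scan of the table for every single row
start = time.perf_counter()
for i in range(N):
    value = next(v for k, v in lookup_table if k == f"key{i}")
linear_secs = time.perf_counter() - start

# Relationship-style: one keyed (hashed) lookup per row
start = time.perf_counter()
for i in range(N):
    value = lookup_index[f"key{i}"]
keyed_secs = time.perf_counter() - start

print(f"linear scans: {linear_secs:.2f}s   keyed lookups: {keyed_secs:.4f}s")
```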

2) Normal Excel only works on a single flat table – again, pretty obvious.  VLOOKUP() and similar functions are the typical way you solve this, which is where the slow thing comes in.  “Dimensional Modeling” is a fancy way of saying “working with separate but related tables,” perhaps best exemplified by the budget vs. actuals problem that PowerPivot handles so well (see the sketch below).
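As a tiny example of that budget vs. actuals pattern – made-up numbers, with a pandas merge merely standing in for a PowerPivot relationship – budget lives in one table at the Category grain, actuals live in another at the transaction grain, and relating them on the shared key replaces the wall of VLOOKUP()s:

```python
import pandas as pd

# Two separate-but-related tables at different grains (hypothetical data)
budget = pd.DataFrame({"Category": ["Bikes", "Helmets"],
                       "Budget":   [100_000, 20_000]})
actuals = pd.DataFrame({"Category": ["Bikes", "Bikes", "Helmets"],
                        "Amount":   [40_000, 55_000, 23_000]})

# Roll actuals up to the budget grain, then relate on the shared key
actuals_by_cat = actuals.groupby("Category", as_index=False)["Amount"].sum()
report = budget.merge(actuals_by_cat, on="Category", how="left")
report["Variance"] = report["Amount"] - report["Budget"]
print(report)
```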

3) Tempting Place to Perform ETL, and No Autorefresh – PowerPivot workbooks, when placed on a PowerPivot Server, can be scheduled to refresh themselves automatically, with no human intervention required.  This is Pure Magic for an Excel Pro.  ETL stands for “Extract, Transform, and Load” and is really just a fancy way of saying “reshaping and prepping data.”  Excel is a natural place to do this sort of thing when you have no readily-available alternative, such as assistance from a database pro.  But PowerPivot’s autorefresh capability, combined with the read-only PowerPivot window, strongly incents you to pursue a better way of doing things.  I could write many pages on this.

4) Files are the Source Projects, with No Blessed Version – remember, Excel is a cleverly-packaged programming language.  And “real” programming tools offer something called “source control” – a single place where you store your work safely and securely.  There is never any ambiguity about where the most current program code lives – it’s in the source control system.  In the Excel world, only the most disciplined teams can say they always know where the “known good/blessed” version of a workbook is at all times.

PowerPivot Server changes all of that.

Furthermore, teams of programmers collaborating on a single application can split that application into many, MANY separate files to reduce the instances of overwriting each other’s work.

In Excel/PowerPivot terms, imagine breaking a single workbook into one file that contains all the PowerPivot measures, one file for each worksheet, and maybe even separate files controlling all the formatting – wouldn’t that be convenient if there were five of you working on the same workbook?  Probably overkill for most situations though – we are lone wolves, us Excel Pros.  Anyway, PowerPivot does not do anything to rectify that, except for the whole thin workbooks thing…  which I guess actually IS a pretty significant advance.

5) Files are the distribution method – if you think of Excel workbooks as applications (which they are), and more specifically as applications that are updated every day, suddenly the process of emailing them around looks ridiculous.  If I were emailing around, say, Angry Birds, and you could copy it to your desktop, no one would EVER be running the same version.  In fact, a single person would have five copies lying around in various places.  But Angry Birds isn’t updated every day, or even every week.  Workbooks frequently are.

With PowerPivot Server, a URL is the “distribution” method.  So it’s more like Facebook or Amazon.  You can’t exactly copy those applications to your desktop.  What version of Amazon.com are you running?  Yeah, that’s right.  The latest.

6) The Runtime and Dev Environment are the Same.  You sit down and build an application (workbook) in Excel.  Then you send it to someone else and it opens…  in Excel!  They can see all of your source data, all of your logic.  They can then modify it if they want.  And the less savvy folks are just outright intimidated by it – all of that Excel stuff around the edges feels like work.  It doesn’t feel like an app.

Again, PowerPivot Server solves that, by making the workbook a web app.

7) Self-obfuscating (A1-style references).  With PowerPivot, gone are the days when you have to reference your data by column letter.

8) No separation of presentation and logic, no “portability.”  I’ve talked about this before at length on the official Excel blog, so I’ll just link you over to the explanation of why PowerPivot feels like being freed from quicksand.

Whew.  Long post.  Gonna stop for today.  :)

5 Responses to Big Data is Just Data, Why Excel “Sucks”, and 1,000 Miles of Data

  1. schm0e says:

    Well, that’s what New Yorkers are: “simply a group of people who DO things”!

  2. schm0e says:

    FYI, when working with IBM’s “Big Insights” Hadoop front end, guess what you use to analyze all that data? A spreadsheet. “BigSheets,” to be exact. So why learn BigSheets if you can get Excel to handle it?

  3. Meta Brown says:

    OMG, how did I miss your reasons why Excel sucks all this time?

    Another important reason why Excel sucks (the most important, in my view) is that users do a lot of things from scratch (whether they need to or not), creating loads of opportunities for error. Creating a formula is an opportunity for error, selecting a range is another, and adding even one data point is another. Since one thing feeds another, these errors propagate. It’s ugly!

    A resource for loads of excellent information from serious research on spreadsheet errors is “What We Know About Spreadsheet Errors” (http://bit.ly/sserrors) by Raymond Panko of the University of Hawaii.

    • powerpivotpro says:

      :)

      Once you start using Power Pivot for analysis, the “portable formulas” benefit kicks in. What used to require 50 formulas copy/pasted/filled-down into thousands of cells, with arcane and unreadable references, now becomes 5-6 formulas that exist only one time *each* – no copies at all. Oh, and those formulas adapt to each new report/analysis without rewriting them.

      The surface area for random errors (that “one error per 100 actions” stat) drops dramatically, while the chances of an error being able to “hide” are simultaneously decreased many times over. You are inherently testing every Power Pivot formula over and over as you manipulate pivots.

      I am comfortable saying that a well-constructed Power Pivot model is in the ballpark of 100x less error prone than an optimally constructed Excel model for the same purpose, assuming the problem is of reasonable complexity. Power Pivot is not suited for all tasks of course, but for reporting and analysis, it virtually eliminates the errors that EUSPRIG etc. have documented.

      I’d love to see Dr. Panko do some followup research on Power Pivot.
