(The use of this word as a verb was my favorite part of those movies -
use it sparingly to liven up informal meetings!)
What Counts as Big Data?
We recently received some advice, from some very smart and well-connected people, that we should be using the term “Big Data” in our marketing. After all, it’s a hot term. But sophisticated sources of advice like the one in question aren’t slaves to what’s hot. They are shrewd followers of what everyone ELSE thinks is hot. And when these people say “use this term,” they don’t do so lightly.
Now, some of the PowerPivot models we build at Pivotstream have hundreds of millions of rows of data in them. By any reasonable metric, that is big. And when you consider that PowerPivot is a free extension to Excel, the world’s most popular analytics and reporting tool, PowerPivot may already be the world’s most ubiquitous Big Data tool.
So I think we qualify to use the term. But most of our Cloud PowerPivot customers, at least so far, aren’t using nearly that volume of data. Tens of thousands of rows is much more the norm, which is something Excel has always “done.”
So while a marketing mind sees “hot term” and immediately begins throwing it around like candy, a more analytical type of guy like me wants to make his peace with it first.
Some of my conclusions lead me to believe that YOU may in fact already be doing a flavor of Big Data, and could fairly lay claim to the term. Join me on my journey.
The “Official” Definition – The Three (Four) V’s
Ask someone “in the know” about Big Data, and they will likely tell you about the Three (or Four) V’s. And right away we begin to see disputes about the definition.
1) Volume. You know, like hundreds of millions of data points. Or billions. Or even trillions. No one has dared specify a particular minimum number, because of course it’s subjective. But if you have large data volumes, feel free to call yourself Big. Try it out, it’s fun.
2) Velocity. I’ve seen two definitions of this one that disagree quite a bit:
2a) The speed at which you must produce results – you need answers now, not tomorrow or in the Fall. Well, not many things can match PowerPivot here. Not a BI project, for sure, and certainly nothing that uses the word “Implementation” or “Project.”
Heck, even traditional (non-PowerPivot) Excel is high velocity compared to those alternatives.
Humorous true story: When I worked on Bing at Microsoft, the query interface we used to analyze users’ search patterns required you to learn a hybrid of Perl and SQL, as well as an arcane cmd interface, to submit your queries. 24 hours later, you’d be notified that you had a syntax error. That’s probably improved by now, but man, was that funny. Top Ten Ridiculous Moments at Microsoft – perhaps a future post.
2b) The speed at which the data is coming in – imagine a data set that is sampling hundreds or thousands of times per second. In some sense that just indicates Volume, but there are legitimate differences in data sources like this.
For one, it may be perfectly natural to throw away most of the data and just keep a regular or random sample of it for analysis. PowerPivot doesn’t help you with this. Maybe SQL can, maybe you need to use something else. I haven’t done much of this.
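Since I said I haven’t done much of this myself, take the following as a rough sketch rather than a recommendation. It shows, in plain Python, the two kinds of sampling mentioned above: a “regular” sample (keep every Nth reading) and a “random” sample (reservoir sampling, which works even when you don’t know in advance how many readings will arrive). The function names and the stand-in `readings` stream are made up for illustration.

```python
import random
from itertools import islice


def regular_sample(stream, step):
    """Keep every `step`-th reading, starting with the first one."""
    return islice(stream, 0, None, step)


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k readings from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k readings.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in for a high-velocity feed: 10,000 numbered readings.
readings = range(10_000)

every_100th = list(regular_sample(readings, 100))
random_50 = reservoir_sample(readings, 50, seed=1)
```

Either result is small enough to load into a workbook, which is the whole point: the sampling happens before the data ever reaches the analysis tool.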
Another example: you may need to “pre-analyze” the data, because only certain data points count. For instance, the respiration data from lab rats is only interesting when it “peaks,” indicating a single breath. And yes, PowerPivot CAN do that, and surprisingly more accurately than software specialized for that purpose!
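To make “only the peaks count” concrete, here is a minimal Python sketch. To be clear, this is not how the lab rat model was built in PowerPivot; it is just the simplest version of the idea: a reading counts as a peak if it clears some minimum height and is higher than the reading before it and at least as high as the one after it. The `find_peaks` name, the `min_height` threshold, and the sample trace are all invented for illustration.

```python
def find_peaks(samples, min_height):
    """Return the indices of readings that are local maxima of at least `min_height`."""
    peaks = []
    for i in range(1, len(samples) - 1):
        rises_into = samples[i - 1] < samples[i]
        does_not_rise_after = samples[i] >= samples[i + 1]
        if samples[i] >= min_height and rises_into and does_not_rise_after:
            peaks.append(i)
    return peaks


# Made-up respiration trace: two real breaths and one small wobble.
trace = [0.1, 0.4, 0.9, 0.5, 0.2, 0.3, 1.1, 0.8, 0.2, 0.25, 0.1]

breaths = find_peaks(trace, min_height=0.7)
print(breaths)  # [2, 6]
```

Eleven readings go in and two “breaths” come out. The small wobble near the end is ignored because it never reaches the threshold.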
3) Variety. Put simply, data of varying shapes and formats. But again, definitions diverge quickly here.
Some people mean Variety to indicate things like video, documents, images, etc.
Others mean a variety of sources – pulling from a retail transactional system, for instance, and then pulling weather data from DataMarket, and pulling demographic data from another source, would qualify. Nothing mashes up data like PowerPivot. So we get a “checkmark” here for sure.
Still others interpret variety to mean data “records” of semi-structured format. Web log information is a great example of this. It requires a reasonable amount of text parsing just to turn it into something that’s friendly to analysis, but even then, there are many other pre-processing steps you can take to clarify semantic intent and classification.
The medical world is another example of this – a physician’s report on a patient contains some structured information (temperature, age, weight), but then also some text like the prescription, and then some even less-structured text such as the physician’s notes.
PowerPivot needs some help from other systems in order to handle stuff like semi-structured web logs and physician reports. Hadoop is one example, and it is PowerPivot friendly. If you have a project like this and want some help, let me know. We’d like to explore this but currently lack a good project.
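For a sense of what that “text parsing” step looks like in the web log case, here is a small Python sketch. It assumes the log lines are in the Apache Common Log Format, which is only one of many formats you might run into; the pattern and the `parse_line` helper are illustrative, not part of any particular tool.

```python
import re
from datetime import datetime

# Assumed layout (Apache Common Log Format):
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one raw log line into a dict of typed fields, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
row = parse_line(sample)
```

Once every line is a row with named, typed columns, it is ordinary tabular data. The further steps mentioned above, like classifying pages or inferring intent, would be layered on top of this.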
As for the fourth V: Forrester added this one after the first three V’s had been around for a while. Others pushed back and said that it was already covered by Variety. I personally think the fourth V doesn’t add much to the story, primarily because everything I’ve heard about it was something I already included in my picture of Variety. So whether it adds anything depends on how broad your definition of Variety is.
In Truth, The Market Has Co-Opted the Term
I remember, back around 2002 or so, that “web services” were the hot new tech. There was a reasonably clear definition of a web service – a programming construct that you could call over internet protocols. An API. Or a function, to think of it in Excel terms.
There were some squishier characteristics too, like “simple” rather than complex, document payload oriented rather than session-based, etc., but those quickly got watered down in practice. The Analysis Services team at Microsoft defined a web service protocol, in conjunction with other companies, that was about as non-simple as it could get (XML/A).
But those were small dilutions of the definition and spirit. The big dilution came when companies like SalesForce and Expedia started describing their web SITES as web services. This was before the term Software as a Service was invented, so I guess it was fair. But true web services, according to the original meaning, NEVER had a user interface.
Big Data Now Also Means “We Have Lots of Data to Analyze”
The same thing is happening today with Big Data. The general need for analysis and reporting is exploding, thanks to the exploding availability of data as well as the rising need for efficiency brought on by the economic downturn.
In other words, people who were never in the business of analyzing ANYTHING are now confronted with an undeniable need for it. People who were not BI pros, and were not Excel pros either. And those people are now scrambling for ways to analyze their critical data.
When they hear the term “Big Data,” it resonates with them. They latch onto it much more readily than they latch onto terms like BI, or even Self-Service BI. They are not Excel pros either (or maybe they are and have not yet discovered PowerPivot).
So Big Data sure sounds like the answer they’ve been looking for.
Remember my diagram from the Insight as a Service post? The one that showed what Excel Pros look like to the rest of their teams? They look like magic:
The data on the left is undeniably BIG. Even if it’s a few thousand rows, it’s far too much for a human brain to make sense of. The magic comes from taking the “big” and making it “small.”
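If you want to see “big made small” outside of a pivot table, here is a tiny Python sketch of the same move. The data is randomly generated and the region names are invented; the point is only that 5,000 rows nobody can read collapse into four lines anybody can.

```python
import random
from collections import defaultdict

rng = random.Random(0)
regions = ["North", "South", "East", "West"]

# Hypothetical transaction rows: (region, sale amount in whole dollars).
rows = [(rng.choice(regions), rng.randint(1, 500)) for _ in range(5_000)]

# The "magic": roll thousands of rows up into one total per region.
totals = defaultdict(int)
for region, amount in rows:
    totals[region] += amount

for region in sorted(totals):
    print(f"{region:<6} {totals[region]:>10,}")
```

This is the same thing a pivot does when you drag a field to Rows and a measure to Values, just spelled out by hand.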
Humans Need Small Data!
One of my favorite (and most popular) posts of all time is Analysis in the Three Seconds of Now. It explains that humans have a biological three-second window in which something feels like “now,” and everything else feels slow.
Well, there’s another human limit that comes into play every day. We basically can only make sense of things that we can see all at once!
In other words, if “data” doesn’t fit on a single screen, it’s Too Big for Humans!
So I encourage you to think of Big Data as meaning two things. One is the “official” definition, as squishy as it is. And PowerPivot does play a role in multiple aspects of that official “domain” of Big Data.
But the other meaning is more of a consumer-ish meaning. People who are new to analysis hear the term “Big Data” and often identify with it even if they only have thousands of data points, which they need to turn into something smaller, more readily perceivable, and actionable. Oh, and there are many more people in this camp than there are experts in the “official” version of Big Data. Their needs are real and they are willing to pay for assistance.