Is Big Data a Bubble?
In case you’re in a hurry: Of course it is. And that is good.
Quentin Hardy in The New York Times Bits blog summarises the state of play regarding the business world’s interest in and utilisation of big data. As he recognises, big data is really about “the benefits we will gain by cleverly sifting through it to find and exploit new patterns and relationships.” Big data makes the need for analytics impossible to ignore: its often-invoked qualities of volume, velocity, and variety mean that we can’t pretend to analyse it without statistical and machine learning methods.
Big data analytics at the present moment, however, is characterised by uncertainty. What is big data, exactly? What’s its value? How is it analysed? Which technologies should be used? Which standards will prevail? Where to invest? There is, Hardy finds:
a common problem in the Big Data proposition: Often people won’t know exactly what hidden pattern they are looking for, or what the value they extract may be, and therefore it will be impossible to know how much to invest in the technology. Odds are that the initial benefits, as it was with Google’s Adwords algorithm, will lead to a frenzy of investments and marketing pitches, until we find the logical limits of the technology. It will be the place just before everybody lost their shirts.
This is a common characteristic of technology that its champions do not like to talk about, but it is why we have so many bubbles in this industry. Technologists build or discover something great, like railroads or radio or the Internet. The change is so important, often world-changing, that it is hard to value, so people overshoot toward the infinite. When it turns out to be merely huge, there is a crash, in railroad bonds, or RCA stock, or Pets.com. Perhaps Big Data is next, on its way to changing the world.
Such is technology. Hence the level of uncertainty, and also reasons for optimism:
There are an uncountable number of data-mining start-ups in the field: MapReduce and NoSQL for managing the stuff; and the open-source R statistical programming language, for making predictions about what is likely to happen next, based on what has happened before. Established companies in the business, like SAS Institute or SAP, will probably purchase or make alliances with a lot of these smaller companies.
Expect to see a lot more before it all gets sorted out.
TDWI’s recent Big Data Analytics report by Philip Russom, based on 325 sets of responses to a May 2011 survey, echoes this and provides a far more comprehensive overview of how businesses are coping with the realities of big data. One measure of how unsettled the field is right now is that, in response to the question ‘Which of the following best characterizes your familiarity with big data analytics and how you name it?’, 65% of the survey’s respondents replied, ‘I know what you mean, but I don’t have a formal name for it.’
The report is recommended in full, but in case you’re still in a hurry, its margin summaries can be skimmed through in a matter of minutes.
Philip Russom at TDWI observes that the data needs of analysts aren’t best addressed using the standard practices of data warehousing:
On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), metadata and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.
On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full brace of ETL, data quality, metadata, and modeling we associate with data from an EDW.
In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing their analytic value.
The article identifies three properties of data suited for advanced analytics which distinguish it from data prepared for BI and data warehousing:
- More granular: The most granular data in an EDW is typically summarised. Analysts want the full flexibility and information potential that comes from the lowest level of detail.
- Less treated: The “overly studied, calculated, modeled, and aggregated data of an EDW” is the result of layer upon layer of assumptions applied by non-analysts. But as Russom points out, some kinds of analytic applications, such as fraud detection, “actually thrive on questionable data in poor condition.” EDW preparation also removes data from its source schema context. But analysts sometimes want exactly this context, because it is the clearest reflection of the underlying process that created the data. Consequently, to BI professionals, “the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.”
- Indifferent to reporting purposes: ETL, DQ, and data modelling processes for EDW operate on the assumption that the data they create will at some point be seen by a wide range of business consumers and potentially the public. The downstream value of data being used for advanced analytics, by contrast, is fundamentally unknowable at the time of the analysis.
In other words, analysts want data closer to raw form whereas business consumers want it closer to information. The implication for data providers is that the data needs of analysts are in fact easier to serve than those of business consumers.
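To make the distinction concrete, here is a minimal sketch of what “just extracts and table joins” might look like next to a report-ready aggregate. The file names, columns, and pandas-based approach are illustrative assumptions, not anything prescribed by Russom or TDWI.

```python
import pandas as pd

# "Extracts and table joins" only: pull the raw transactional extract and join in
# the source-system customer attributes, keeping every row and every column.
transactions = pd.read_csv("transactions_extract.csv")   # hypothetical extract
customers = pd.read_csv("customers_extract.csv")         # hypothetical extract
analytic_dataset = transactions.merge(customers, on="customer_id", how="left")
# No de-duplication, no recoding, no conformed dimensions: the analyst keeps
# the lowest level of detail, close to the source schema.

# By contrast, a report-ready view summarises the same data for business consumers.
report_view = (
    analytic_dataset
    .groupby(["region", "product_line"], as_index=False)
    .agg(revenue=("amount", "sum"), orders=("order_id", "nunique"))
)
```

The first dataset is closer to raw form; the second is closer to information, and it is the second that carries the heavier preparation burden.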
Philip Russom at TDWI warns against the ‘analytic cul-de-sac’, “where the epiphanies of advanced analytics never get off a dead-end street to be fully leveraged elsewhere in the enterprise.” He is concerned about analysts who love to chase insights but “can’t be bothered with operationalizing their epiphanies”:
In other words, once you discover the new form of churn, analytic models, metrics, reports, warehouse data, and so on need to be updated, so the appropriate managers can easily spot the churn and do something about it quickly, if it returns. Likewise, hidden costs, once revealed, should be operationalized in analytics (and possibly reports and warehouses), so managers can better track and study costs over time, to keep them down.
I find this interesting primarily for its overarching assumption: that an insight suggests its own operationalisation. Russom’s examples involve insights that close the gap between data and decision, insights that are “actionable”. This has not always been my experience. On the contrary, insights more often than not turn the simple and understandable into the complex and uncertain. In response, the appropriate course of action may be unclear. It may be that nothing can be done. It may be that nothing should be done, because doing anything would make things worse. It certainly can’t be assumed that managers will know what to do, let alone quickly.
Insights can contribute to sensemaking and situational awareness without carrying implications for decision support and action. This makes them valuable, but not always welcome. Sometimes new information is disruptive.
Philip Russom at the TDWI Blog:
The current hype and hubbub around big data analytics has shifted our focus on what’s usually called “advanced analytics.” That’s an umbrella term for analytic techniques and tool types based on data mining, statistical analysis, or complex SQL – sometimes natural language processing and artificial intelligence, as well.
The term has been around since the late 1990s, so you’d think I’d get used to it. But I have to admit that the term “advanced analytics” rubs me the wrong way for two reasons:
First, it’s not a good description of what users are doing or what the technology does. Instead of “advanced analytics,” a better term would be “discovery analytics,” because that’s what users are doing. Or we could call it “exploratory analytics.” In other words, the user is typically a business analyst who is exploring data broadly to discover new business facts that no one in the enterprise knew before. These facts can then be turned into an analytic model or some equivalent for tracking over time.
Second, the thing that chaffs me most is that the way the term “advanced analytics” has been applied for fifteen years excludes online analytic processing (OLAP). Huh!? Does that mean that OLAP is “primitive analytics”? Is OLAP somehow incapable of being advanced?
His answer is that OLAP can indeed be advanced, but not in the same way as advanced analytics. There are important differences:
In my mind, advanced analytics is very much about open-ended exploration and discovery in large volumes of fairly raw source data. But OLAP is about a more controlled discovery of combinations of carefully prepared dimensional datasets. The way I see it: a cube is a closed system that enables combinatorial analytics. Given the richness of cubes users are designing nowadays, there’s a gargantuan number of combinations for a wide range of users to explore.
This is a useful distinction. Although exploratory, OLAP is a controlled and self-contained environment. Someone has decided which dimensions to include and which to exclude, how they should be structured, how deep they should go, how they should be summarised, and so on. Large cubes will indeed offer users a “gargantuan number of combinations”, but this does not necessarily make exploration a richer activity. It may in fact make it ineffective. Manually slicing and dicing twenty dimensions certainly risks being inefficient.
Here it’s helpful to draw a further distinction between OLAP and advanced analytics. OLAP makes multidimensional data exploration about as fast and intuitive as it can be when a human is doing the driving. This means being able to arrange on screen, in two dimensions (perhaps taking advantage of colour and shape to visualise a third and fourth), relatively small subsets and arithmetical summaries of data. Advanced analytics, however, automates exploration. Only data mining methods can look at all dimensions simultaneously, at all levels, in combination. And they can do this in unsupervised (looking for natural structure in the data) or supervised (inferring input-outcome relationships) modes.
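A small sketch may help here, assuming a hypothetical granular customer extract with a churn outcome column; the scikit-learn estimators are illustrative choices only, not anything specific to Russom’s argument.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("customer_detail.csv")                   # hypothetical granular extract
outcome = data["churned"]                                    # hypothetical outcome column
features = pd.get_dummies(data.drop(columns=["churned"]))    # every dimension, encoded numerically

# Unsupervised mode: look for natural structure across all dimensions at once.
segments = KMeans(n_clusters=5, n_init=10).fit_predict(
    StandardScaler().fit_transform(features)
)

# Supervised mode: infer input-outcome relationships from the same dimensions.
model = RandomForestClassifier(n_estimators=200).fit(features, outcome)
importance = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)
```

Neither step requires anyone to have decided in advance which dimensions matter, which is precisely the contrast with a cube.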
Consider also the role of assumptions. OLAP is, as a data exploration vehicle, fairly assumption-laden. Each cube-based analysis reflects, at a minimum, the assumptions of the cube’s user on top of those of its architect. Data mining methods, by contrast, are naïve by design, deliberately insulating exploration from human biases.
Russom argues convincingly against the notion that advanced analytics will render OLAP obsolete:
In defense of OLAP, it’s by far the most common form of analytics in BI today, and for good reasons. Once you get used to multidimensional thinking, OLAP is very natural, because most business questions are themselves multidimensional. For example, “What are western region sales revenues in Q4 2010?” intersects dimensions for geography, function, money, and time.
This is a good illustration of my contention that BI provides context: what happened, where, as described by existing measures and dimensions. What OLAP can’t do is address more causal and complex business questions like:
- What increases sales?
- What predicts loyalty?
- Who is most likely to be loyal in future?
- What is the loyalty profile of the western region?
- How does it compare to other regions?
Such questions are best tackled by advanced analytics in symbiosis with BI.
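As a rough sketch of that symbiosis, assuming hypothetical sales and customer tables: the descriptive question is a filter and an aggregate over existing dimensions, while the predictive question needs a fitted model. The column names and the logistic regression are illustrative assumptions only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

sales = pd.read_csv("sales.csv")            # hypothetical fact table
customers = pd.read_csv("customers.csv")    # hypothetical customer attributes

# BI/OLAP-style question: "What are western region sales revenues in Q4 2010?"
q4_west_revenue = sales.loc[
    (sales["region"] == "West") & (sales["quarter"] == "2010Q4"), "revenue"
].sum()

# Advanced-analytics-style question: "Who is most likely to be loyal in future?"
X = pd.get_dummies(customers.drop(columns=["customer_id", "is_loyal"]))
model = LogisticRegression(max_iter=1000).fit(X, customers["is_loyal"])
loyalty_scores = pd.Series(model.predict_proba(X)[:, 1], index=customers["customer_id"])
most_likely_loyal = loyalty_scores.sort_values(ascending=False).head(10)
```

The first answer could be read straight off a cube; the second cannot, and that is where advanced analytics takes over from BI.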