Most of the literature on why BI projects fail tends to put some variant of ‘a lack of communication between IT and the business’ near the top of the list of culprits. This implies a relationship between two parties prone to friction. Either IT or ‘the business’ can own the BI initiative, but each needs the other in order for it to progress. In crude terms, the business has information needs; IT has data and the electronic infrastructure required to access it, transform it, and publish it as information. The two parties don’t speak the same language so each has to second-guess the other. Each is also required to generalise the multiple views of its members into a unified position for representation and negotiation purposes. The result is an equilibrium conception of the information needs of business consumers which may only poorly approximate each individual user’s actual needs.
The data needs of analysts are, by comparison, easier to serve than the information needs of business consumers. Analysts want data closer to its raw form, and are less reliant on others for its transformation into packaged information. There are fewer representation and negotiation steps between demand and fulfilment.
Redman proposes four steps:
- Admit you have a data quality problem. Just like any twelve-step program, admitting the problem is key. From there, follow the next three steps for data improvement.
- Focus on the data you expose to customers, regulators, and others outside your organization. Take a careful look at your system of controls. Is it up to snuff? Make sure — not only that the right controls are in place, but that you’re actually using them. Every time.
- Define and implement an advanced data quality program. Making sure data leaves the door correctly may be a viable short-term alternative, but there is already too much data, and the quantities are growing. Just as manufacturers found that they had to “prevent errors at their sources,” so too with data. You need a quality program that does so.
- Take a hard look at the way you treat data more generally. Almost everyone readily acknowledges that “data are among our most important assets.” But they don’t manage them that way. Indeed, data are almost invisible. And the top person responsible for data may be an architect buried deep in the bowels of IT. If this description rings true, you need to get an aggressive data program, with real talent, budget, and teeth in place.
Steps 1 and 4 get to the heart of how data is understood, and misunderstood, within organisations. They’re also about the status of data.
Data has a number of curious properties that limit the explanatory reach of commonly used metaphors:
- Data is a resource in that it’s quantifiable and useful and can be transformed into something of higher value through extraction and processing. But it’s not a commodity. At the granular level, unlike physical resources, every piece is different; and in the aggregate its value is not a function of its volume.
- Data mining makes sense as a value extraction analogy, but it also fails in the sense that physical mining processes are well defined, highly automated, repetitive, and relatively predictable. Data mining, not so. Its exploratory, bespoke and synergistic nature makes it more analogous to prospecting than mining.
- Data as an asset has its limits too. As Redman notes, this is a deceptively simple idea. Everyone touches and interacts with data differently. But unlike physical assets, data persists after reuse and can be endlessly recycled and repurposed, in parallel, and not always in foreseeable ways.
Step 1 – admitting to data quality problems – reflects my contention that the default assumption should always be of problematic data.
Step 4 addresses the status of data. I’ve argued before that data is a spellword. It’s easily and often dismissed as technical and geeky, and therefore not worthy of executive concern. As Redman argues, however, its importance shouldn’t be underestimated, nor should the responsibility for understanding and managing it be deprioritised, outsourced, or consigned to the “bowels of IT.”
Steps 2 and 3—the action steps—are of particular interest to Analytics practitioners (as distinct from Business Intelligence practitioners) given the different data needs of analysts compared with business consumers. An advanced data quality program designed to prepare data for exposure to customers, regulators and other outsiders—if it takes a ‘one size fits all’ approach—will most likely interfere with analytics. As Redman contends, “the tendrils of a data program will affect everyone,” and they will do so differently. Such a program needs to recognise that the exploratory work of analysts is a separate stream of activity from the ETL-DQ-MDM-EDW-BI stream, with its own distinct set of needs.
Philip Russom at TDWI observes that the data needs of analysts aren’t best addressed using the standard practices of data warehousing:
On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), meta and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.
On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full brace of ETL, data quality, metadata, and modeling we associate with data from an EDW.
In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing their analytic value.
The article identifies three properties of data suited for advanced analytics which distinguish it from data prepared for BI and data warehousing:
- More granular: The most granular data in an EDW is typically summarised. Analysts want the full flexibility and information potential that comes from the lowest level of detail.
- Less treated: The “overly studied, calculated, modeled, and aggregated data of an EDW” is the result of a number of layers of applied assumption (by non-analysts). But as Russom points out, some kinds of analytic applications, such as fraud detection, “actually thrive on questionable data in poor condition.” EDW preparation also removes data from its source schema context. But Analysts sometimes want exactly this context because it’s the clearest reflection of the underlying process which created the data. Consequently, to BI professionals, “the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.”
- Indifferent to reporting purposes: ETL, DQ, and data modelling processes for EDW operate on the assumption that the data they create will at some point be seen by a wide range of business consumers and potentially the public. The downstream value of data being used for advanced analytics, by contrast, is fundamentally unknowable at the time of the analysis.
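The contrast between EDW-style preparation and an analytics extract can be sketched in a few lines. The records, field names, and cleaning rules below are hypothetical illustrations, not taken from Russom's article:

```python
# Hypothetical raw transaction extract, straight from the source system.
raw = [
    ("A", 12.0), ("A", 9500.0),             # 9500.0: an outlier an analyst may want
    ("B", 15.0), ("B", 14.0), ("B", -3.0),  # -3.0: a reversal, also a signal
]

# EDW-style preparation: filter out "questionable" rows, then aggregate
# to a summary grain for reporting.
clean = [(acct, amt) for acct, amt in raw if 0 < amt < 1000]
edw = {}
for acct, amt in clean:
    edw[acct] = edw.get(acct, 0.0) + amt

# The BI consumer sees tidy per-account totals...
print(edw)  # {'A': 12.0, 'B': 29.0}

# ...while the analytics extract keeps every row at the lowest grain,
# preserving exactly the anomalies a fraud model would feed on.
analytics_extract = list(raw)
```

The summary is "closer to perfection" for a report, but the spike and the reversal — precisely the rows a fraud detection model thrives on — have been cleaned and aggregated away.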
In other words, analysts want data closer to raw form whereas business consumers want it closer to information. The implication for data providers is that the data needs of analysts are in fact easier to serve than those of business consumers.
“We’ve got good data.”
It’s common to hear people make this claim in the context of a nascent or proposed Business Analytics initiative. Sometimes a rationale is offered: there’s a data warehouse, or a system migration project was recently completed. However, if by “good” we mean data having at least all of the following properties…
- Fitness for purpose
…then a better starting assumption would be: “We don’t have all the data we want, and what we have is problematic.”
There are some basic reasons for this expectation:
- In BI you’re generally surfacing existing data that has been historically under-accessible. You should expect data that has not been routinely scrutinised to be bad because there’s been no incentive for it to be good.
- In Analytics you’re frequently uncovering new insights. You should expect these to raise hitherto unanticipated questions. Answering these is typically going to require at least some data you haven’t thought to collect, source, acquire, or derive before.
The prudent approach is to assume data problems. This shouldn’t derail your Business Analytics initiative. Far from it. Acquiring a comprehensive understanding of your data should be one of your initial, primary and explicit objectives – not an incidental series of in-project obstacles.
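That up-front understanding can start with something as simple as profiling each field before any modelling begins. The checks below are a generic sketch (volume, missing values, distinct values), not a prescribed method, and the sample records are invented:

```python
from collections import Counter

def profile(rows, field):
    """Basic profile of one field: row count, missing values, distinct values.

    A minimal sketch of explicit, up-front data inspection; a real
    profiling pass would add type, range, and pattern checks.
    """
    values = [row.get(field) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top": Counter(present).most_common(3),
    }

rows = [
    {"customer_id": "c1", "state": "NSW"},
    {"customer_id": "c2", "state": ""},     # missing value
    {"customer_id": "c2", "state": "NSW"},  # duplicate id
    {"customer_id": "c3", "state": "VIC"},
]
print(profile(rows, "state"))
# {'rows': 4, 'missing': 1, 'distinct': 2, 'top': [('NSW', 2), ('VIC', 1)]}
```

Running a pass like this over every candidate source makes "understand the data" an explicit deliverable rather than a series of surprises mid-project.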
This post examines data as an economic resource and a source of value, and provides a new functional definition of Analytics on that basis.
Data is a fascinating resource, with a number of characteristics that distinguish it from other things that we think of as “resources”.
The first point is that data is not a commodity. The inherent value of data is that every piece tells you something new, and the pieces vary dramatically in their meaning, importance, value and reliability. While it is intuitively appealing to think of data in aggregate, and to consider its volume in tera-, peta- or zettabytes, it pays to note that the value in such data is not a function of volume in any reliable sense. This is at odds with commodities such as iron ore, gold or crude oil, which can be measured directly in dollars per kilogram.
The value in data, while very real, is not easy to quantify. In this way, data is more like human resources and less like gold or oil. It is inherently heterogeneous in nature, and its value is not proportional to its mass.
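The point that volume is a poor proxy for value can be made concrete: duplicating a dataset doubles its bytes while adding nothing new to learn. Counting distinct records, as below, is a deliberately crude illustrative stand-in for information content, not a proposed valuation method:

```python
# Illustrative only: distinct records as a crude stand-in for information
# content, contrasted with raw volume.
records = ["sale:A:12", "sale:B:15", "sale:B:14"]

doubled = records * 2  # twice the "tonnage"...
assert len(doubled) == 2 * len(records)

# ...but no new information: the set of distinct records is unchanged.
assert len(set(doubled)) == len(set(records))

# Unlike a second kilogram of iron ore, the second copy of a record is
# worth nothing.
```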
Nor is the value extraction process homogeneous. While the way you extract value from one lump of iron ore is no different from the way you extract it from another, the same does not hold true for data. Further, the method of value extraction may well be unknown, and require further exploration. Even if a value extraction method has been identified, and indeed proven to work, there may well be additional value in the data. Finally, the value of data may well only prove itself in concert with other data. This is a synergistic, almost alchemical effect for which there is no good metaphor in the world of commodities.
Where the commodity resource analogy does hold is that data holds value locked within it, and effort must be applied to locate and extract it.
Analytics, particularly in its “data mining” incarnation, is best described as investment in the extraction of value from this curious, heterogeneous, synergistic resource. The tools, skills, techniques and processes of Analytics are all in the service of this investment enterprise.
The act of investing in data is best considered by way of analogy. Investing can be a hands-off activity based on a simple transaction – money is put in, and value is generated by an external process. This is how stock investment works. The agent investing in the stock merely puts up the money.
Now consider investing in one’s own education, or a health and fitness program. In these cases, considerable effort by the individual is required: they cannot outsource their intimate involvement in the process they are investing in.
Analytics is like the latter process for any business leader investing in data. If they are not intimately involved in the process, then the process is most likely a failure, and almost certainly a waste.
The “mining” analogy is thus used at one’s peril. So is an over-reliance on specific tools, processes or preconceptions about what is in the data. Regular mining is not a highly specialised form of labour, and much mining activity is rapidly automated.
Data mining is very different, and the mining analogy fails spectacularly if it envisages a repetitive, well-defined activity, where large machines extract value in a predictable, reliable way, supported by interchangeable people performing repetitive tasks. Indeed, following the mining analogy, real data miners are less like miners and more like prospectors and geologists, performing difficult, ever-changing and highly skilled work to detect value when and how it may arise.
Where they do identify repeatably extractable value (e.g. credit scoring models, churn models, fraud detection algorithms), these nuggets of (temporary) value can indeed be automated and a production process applied. But this is no longer Analytics; rather, it is the IT instantiation of Analytics results. The data miner has moved on to discover value somewhere else.
Thus data is a very unusual value store, with equally unusual, heterogeneous value extraction methods, and a reliance on adaptive, exploratory, highly skilled professionals to make them happen.
Data has another unusual economic property: it remains. As an economic asset, it can be preserved at negligible cost, while other assets – premises, machines, staff, funds and lines of credit – may diminish in tough economic times. The potential value of data is hard to measure, as discussed above, but whatever it may be, it grows relative to other, shrinking assets. Thus, in hard times, the relative value of data grows, and with it the business case for Analytics, for investing in data.
Data itself may grow in volume, scope and complexity with relative ease and at low cost, so the business case for Analytics grows with it.
Data may take many forms, with structured, electronic enterprise data – the form most familiar to analysts – being just one of many. Unstructured data is anywhere from 4 to 20 times more voluminous than the structured kind. Survey data may be collected to answer those questions that enterprise data fails to answer.
A Postscript – Tacit Data Mining
The definition of Analytics above takes a very broad view of what counts as data.
The most interesting, readily available, strategically relevant and poorly understood form of data is tacit data: the information contained in the brains of staff, board, shareholders and anyone else who would see the organisation do well. Tacit data is seen as the most difficult to identify, extract, value and leverage.
The definition of Analytics presented above includes investment in the extraction of value from all data, including tacit. How is tacit data mined? The most effective and powerful way is by use of collective intelligence and forecasting techniques, such as prediction markets. This is a topic for another day.
About us
Analyst First is a new approach to analytics, in which tools take a far less important place than the people who perform, manage, request and envision analytics. Analytics is seen as a non-repetitive, exploratory and creative process whose outcome is not known at the start, and in which only a fraction of efforts are expected to result in success. This is in contrast with the common perception of analytics as IT and process.