Philip Russom at TDWI observes that the data needs of analysts aren’t best addressed using the standard practices of data warehousing:
On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), meta and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.
On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full brace of ETL, data quality, metadata, and modeling we associate with data from an EDW.
In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing its analytic value.
The article identifies three properties of data suited for advanced analytics which distinguish it from data prepared for BI and data warehousing:
- More granular: Even the most granular data in an EDW has typically already been summarised. Analysts want the full flexibility and information potential that comes from the lowest level of detail.
- Less treated: The “overly studied, calculated, modeled, and aggregated data of an EDW” is the product of several layers of assumptions applied by non-analysts. But as Russom points out, some kinds of analytic applications, such as fraud detection, “actually thrive on questionable data in poor condition.” EDW preparation also strips data of its source schema context, yet analysts sometimes want exactly this context because it is the clearest reflection of the underlying process that created the data. Consequently, to BI professionals, “the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.”
- Indifferent to reporting purposes: ETL, DQ, and data modelling processes for EDW operate on the assumption that the data they create will at some point be seen by a wide range of business consumers and potentially the public. The downstream value of data being used for advanced analytics, by contrast, is fundamentally unknowable at the time of the analysis.
In other words, analysts want data closer to raw form whereas business consumers want it closer to information. The implication for data providers is that the data needs of analysts are in fact easier to serve than those of business consumers.
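The granularity point can be made concrete with a small sketch. The transaction records, the field layout, and the "ten times the median" rule below are hypothetical, chosen only for illustration; they do not come from Russom's article. The sketch shows how a daily summary of the kind an EDW might hold can hide a single unusual record that stays visible in the transaction-level data.

```python
import statistics
from collections import defaultdict

# Hypothetical transaction-level records: (customer_id, day, amount).
transactions = [
    ("c1", "2024-01-01", 20.0),
    ("c1", "2024-01-01", 25.0),
    ("c2", "2024-01-01", 22.0),
    ("c2", "2024-01-01", 900.0),  # one unusually large payment
    ("c3", "2024-01-01", 23.0),
]


def summarise_by_day(rows):
    """Summarised view: total amount and transaction count per day."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for _customer, day, amount in rows:
        totals[day]["total"] += amount
        totals[day]["count"] += 1
    return dict(totals)


def flag_unusual(rows, factor=10.0):
    """Granular view: rows whose amount exceeds factor x the median amount.

    The threshold rule is an illustrative assumption, not a recommended
    fraud-detection method.
    """
    median = statistics.median(amount for _, _, amount in rows)
    return [row for row in rows if row[2] > factor * median]


summary = summarise_by_day(transactions)
flagged = flag_unusual(transactions)

print(summary)  # one total per day; the large payment is folded into it
print(flagged)  # the individual record is still identifiable
```

Once only the daily total is kept, the question "which payment was unusual?" can no longer be asked of the data, which is the kind of lost analytic value the article warns about.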