Philip Russom at TDWI observes that the data needs of analysts aren’t best addressed using the standard practices of data warehousing:

On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), meta and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.

On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full brace of ETL, data quality, metadata, and modeling we associate with data from an EDW.

In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing its analytic value.

The article identifies three properties of data suited for advanced analytics which distinguish it from data prepared for BI and data warehousing:

  1. More granular: Even the most granular data in an EDW is typically summarised. Analysts want the full flexibility and information potential that come from the lowest level of detail.
  2. Less treated: The “overly studied, calculated, modeled, and aggregated data of an EDW” is the result of several layers of assumptions applied by non-analysts. But as Russom points out, some kinds of analytic applications, such as fraud detection, “actually thrive on questionable data in poor condition.” EDW preparation also removes data from its source schema context, yet analysts sometimes want exactly this context because it is the clearest reflection of the underlying process that created the data. Consequently, to BI professionals, “the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.”
  3. Indifferent to reporting purposes: ETL, DQ, and data modelling processes for EDW operate on the assumption that the data they create will at some point be seen by a wide range of business consumers and potentially the public. The downstream value of data being used for advanced analytics, by contrast, is fundamentally unknowable at the time of the analysis.
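As a rough illustration of the granularity point, the sketch below compares a summarised view with a transaction-level view of the same data. The records, account names, and both helper functions are hypothetical and come from neither Russom's article nor any particular EDW product; the only claim is that two accounts can look identical once aggregated while differing sharply at the lowest level of detail.

```python
from collections import defaultdict

# Hypothetical transaction-level records: (account, day, amount).
transactions = [
    ("A", "2024-01-01", 20.0),
    ("A", "2024-01-01", 25.0),
    ("A", "2024-01-01", 15.0),
    ("B", "2024-01-01", 1.0),
    ("B", "2024-01-01", 1.0),
    ("B", "2024-01-01", 58.0),
]


def daily_totals(rows):
    """Summarised view: one total per (account, day)."""
    totals = defaultdict(float)
    for account, day, amount in rows:
        totals[(account, day)] += amount
    return dict(totals)


def largest_share(rows):
    """Granular view: largest single transaction as a share of the daily total."""
    totals = daily_totals(rows)
    shares = {}
    for account, day, amount in rows:
        key = (account, day)
        shares[key] = max(shares.get(key, 0.0), amount / totals[key])
    return shares


summary = daily_totals(transactions)   # both accounts show the same daily total
shares = largest_share(transactions)   # only the detail shows B's single large payment

for key in sorted(summary):
    print(key, summary[key], round(shares[key], 2))
```

In the summarised view the two accounts are indistinguishable, whereas the granular view shows that almost all of account B's total comes from one transaction, which is the kind of signal an analyst looking for fraud or risk would want to keep.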

In other words, analysts want data closer to raw form whereas business consumers want it closer to information. The implication for data providers is that the data needs of analysts are in fact easier to serve than those of business consumers.
