EMC Greenplum’s Mark Burnard and I sat down late last year to talk about the future of data warehousing, big data, and analytics…
SS: How did you fall into data warehousing?
MB: I built spreadsheet systems during the early to mid-1990s while working for Rio Tinto and then Telstra. Then I was working for Corporate Express around 1995 when someone dropped a folder on my desk that said “Cognos”, so I picked it up and kind of ran with an embryonic data warehouse project at Corporate Express, built in SQL Server. So DTS, transformations, dumps out of a mainframe into text files, suck them up every night, build a cube in Cognos. And we took that live, and I thought okay this is really cool, I’d like to do more of this, so I jumped into consulting. That would have been 1999. Since then I’ve moved from BI report development to requirement definitions through to solution architecture and then more strategic consulting, and now to EMC Greenplum. That summarises about 20 years in IT.
SS: In data warehousing there’s a “warehouse first, marts second” approach associated with Inmon and a “marts first, radiate out” approach associated with Kimball. Do you have a view on those design philosophies?
MB: I think Inmon, as the father of data warehousing, came up with the concept of the corporate information factory. The philosophy as I understand it is that you want to build this machine that can answer any question that anyone could possibly ask about the business. So to do that you start off with the data architect, who maps out the entities that are of interest to the organisation—what matters and then what the attributes of those entities are at a high level—and then comes up with a conceptual data model. The goal is that you encapsulate in this model everything that matters to the organisation. You turn it into a logical data model, then into a physical data model, then you start building the data warehouse, and eventually you can answer any question out of it. Data marts come along as a secondary spin-off. The underlying warehouse is big and complex, and it’s assumed that the analysts who want to work with its information are 1) only interested in a specific subset: sales, marketing, finance, HR…whatever it happens to be; and 2) that there is complexity in the model that you don’t want them to see as they’ll get really confused, or they’ll drag and drop the wrong columns into a report and start reporting numbers that are faulty. I think that approach works fantastically if the world stays still long enough for you to complete your data model and get the data warehouse built, but the reality is that that never happens. You could probably safely say that the world was a bit slower 30 years ago when organisations were following this model. The early sites tended to be companies that were so big that a project like that was able to return a business benefit simply because of the size of the organisation. The Bank of America, Citibank, AIG—some of the big guys who built the early data warehouses—benefited a lot from that approach. But I think now the game’s changed, and a 6 month turnaround time for a BI project is too slow.
Marketing teams want to be able to spin off new products, sometimes one or two a month. Telcos, for example, are spinning off new plans, new product bundles, new marketing messages and solutions that require a lot of agility in the billing system and the provisioning system. Their core systems have to be a lot more agile and the data warehouse has to keep up with that. So if we’re structuring a new product bundle every month and taking it to market, the billing system may be able to keep up with that, the provisioning system may be able to keep up with that, even multiple provisioning systems, but when that all hits the data warehouse—in some of the situations I’ve seen—it becomes a screaming mess. Nobody can keep up, the warehouse can’t keep up, and then the folk who’ve rolled out the new bundle want new reports within the first week of having it in the market. They want to know “How many are we selling?”, “What’s the profitability?”, “What is the profile of the market segment that is purchasing this new product?”, and the warehouse says “We can give you that information in 6 weeks, but right now you’ve just broken all my reports with your new constructs and we’ve got to do some extra work on this data mart over here, and then we’ve got to tweak the ETL, and then we’ve got to do regression testing, impact analysis, etc.” So these days, there needs to be a way of meeting the business need for both agility and reporting that keeps up with the pace of change. From what I’ve seen, the classic traditional data warehouse model is not able to do that.
SS: So business is becoming faster and organisations are becoming more complex, and people with an appetite for analysis are demanding more in terms of information, feedback, and measurement?
MB: I think it’s fair to say that the pace of business change has accelerated. There was a time when the core business of a bank might have been savings accounts and mortgages, but within the last 20 or 30 years that has diversified into online trading, reselling insurance products, having superannuation funds to manage. So the diversity of offerings that a bank is now bringing to the market—they might be offering 50 to 100 products where in the past they offered 10 or 12. Same thing with a telco. Back in Telstra’s early days it was fixed lines, and the only things to worry about were local calls, STD calls and then international calls; there was nothing else. Now you’ve got mobile, data over mobile, over 3G, all the different product bundles that people have, caps—you’ve got telcos churning out new caps every 6 months—new phones, new bundles, bundling fixed with mobile with home internet and even cable TV—and needing to track when someone cancels one service but is still getting the bundle discount, for example. All that kind of complexity never existed even 10 years ago. So I think the pace of change and the complexity of doing business has changed, and the classic model that we’ve had for data warehousing has not significantly changed. In fact it used to be a line we would drop to reassure the client when positioning a data warehousing project to board level executives, to say “Data warehousing has not changed in 30 years”. In other words, you can be confident that this is a well-worn space, an area of technology that was not invented yesterday, and therefore that all the lessons have been learned, all the pain has been experienced, all of the lessons have been written up, and so building a data warehouse is a pretty safe endeavour.
SS: So presumably you don’t present that message today?
MB: No. Because I think the message isn’t accurate anymore. I see it diverging in two directions. I see your classic data warehouse which has the stringency around it. It has the discipline around things like being able to identify your data lineage, having your metadata available, being able to explain that this number on this report came from this table in the data mart, which came from this table in the core data model, which came through these transformations, from this source system, and that says 100, and this says 100, and we know that the numbers are correct. That’s been the standard approach to data warehousing, and everybody spends huge amounts of energy, time and money trying to keep the warehouse up to date so that you can confidently say to the board, or the ASX, or ASIC, or APRA, or whoever it happens to be when you put a report down in front of them, that you’ve got your traceability, your auditability, your metadata, your data lineage, and so on. The CFO, for example, has to be able to sign off on those numbers. The problem now is that we’ve taken that disciplined model—which is fantastic for reporting on financial numbers to regulators—and we’ve extended it into the domains of HR, and marketing, and across the entire enterprise data model. We wanted to fit the entire enterprise into the data warehouse. What I think we’re seeing already happen is that the locus of subjects that fit into the traditional data warehouse is shrinking to include only those where that level of rigour is required for reporting and analysis—areas such as finance, risk, and other things that go to regulators. All the other stuff, which is the other 95% of the business, will probably end up in a much more flexible, dynamic platform which we’re starting to call an analytical warehouse or analytic warehouse. It’s not your traditional data warehouse. Usually in terms of data volume it’s a lot larger, a lot more atomic, a lot more nimble, agile, and almost ad-hoc in a way. 
To be honest, people have always done this—they’ve just done it under the table: take a data set, throw it into an Access database, take a bigger data set, throw it into SQL Server, do something on a laptop, do something on a server that they bought on the corporate credit card down the road because the data warehouse wasn’t able to give them the data they required, or because running the query would slow down all the standard reporting. People have always done analytic warehouses; they were just under the radar.
SS: So they’re coming out of the closet in a sense?
MB: That’s right. Because the pain of the traditional approach is now such that it’s unsustainable, and so what everybody’s been doing all along covertly, organisations are now beginning to look at and say, “We have to cater for that model.” They have to have a way of managing information that caters for ad-hoc analysis, for marketing guys running quick, nimble little models to come up with some market segment so they can generate a campaign to a particular demographic. Traditionally their only official option was the data warehouse, but they couldn’t throw enough information in, or the queries ran too long, or the joins were too complex given some third normal form industry data model. They’ve been doing it under the table, so let’s elevate that and bring it back into the fold of IT but on a platform that can handle that kind of ad-hoc dynamic: throw some web clicks in here, throw some Twitter feeds in there, mix in some data from the Australian Bureau of Statistics, get some smart people to have a look for correlations, integrate the results from the latest campaign for marketing a platinum credit card to a certain demographic, and let’s do some follow up on that. All of this dynamic information needs a home—well, it’s always had an unofficial home, on someone’s hard drive—but what I see happening is that information moving back into a more recognised, structured place, without the imposed controls that had it running away from home in the first place.
SS: So it sounds like we’ve got the emergence of two domains. There’s a domain of stability—the traditional data warehouse which fundamentally hasn’t changed much in 20 to 30 years. Its set of disciplines maps well to data and business processes that also fit that description. Something like financial reporting, where the accounting standards are pretty stable and where there are widely accepted processes which are professionally constrained for getting data into that shape. Then there’s a more uncertain domain—the data’s uncertain, the analysis is uncertain, the queries are uncertain, the perspective is uncertain, and it sounds like that’s broken in a way too. You’re saying that people have been doing it under the table for a long time, but it sounds like the scale is now such that it doesn’t fit under the table anymore. The tools are breaking.
MB: That’s exactly right. The data volumes that we are beginning to deal with won’t fit on a little server under the desk anymore. You’re talking about data volumes that will explode out past many terabytes eventually and you’re not going to fit it on your laptop, and if you did your single query would take the whole weekend. Maybe a real world example would be useful. One of EMC Greenplum’s reference sites is T-Mobile in the US. They have their classic traditional data warehouse. It’s 100 terabytes of data and it has the discipline and the control and the industry standard data model. Billing information comes in, provisioning information comes in, and they can report to the market out of the data warehouse their precise number of subscribers and their exact financials every month. All of that controlled information that you’ve just described. Alongside that they have an analytical warehouse which happens to be a petabyte. It’s about 10 times the size of the traditional data warehouse. Into that flows a lot—if not all—of the same data. But it also contains 900 terabytes of other data, which is market sentiment analysis, unstructured data, and all kinds of data that is structured but which is too hard to fit into the industry standard data model. The other thing that’s probably worth mentioning is that within the 100 terabyte traditional data warehouse you have a lot of summarisation of call data records—for example, after 2 or 3 months you’ve finished with your standard reports and you summarise them to one row per customer per month instead of one row per phone call or one row per text message. This means you can still report accurately on all of your regulatory requirements, but you can’t do analysis anymore at the atomic level of detail. In the analytic warehouse, on the other hand, you might have that history for 3 or 4 years. Why would T-Mobile need that? Well, they were looking for drivers of churn, which is one of the biggest expenses for a telco.
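The trade-off MB describes—collapsing call data records to one row per customer per month once the standard reports are done—can be sketched roughly as follows. The record layout and field names here are purely illustrative, not T-Mobile’s actual schema.

```python
from collections import defaultdict

def summarise_cdrs(cdrs):
    """Collapse atomic call-detail records to one row per customer per month.

    Each input record is a dict with illustrative fields:
    'customer_id', 'month' (e.g. "2011-07"), and 'duration_sec'.
    Note that the atomic detail (who called whom, and when) is
    irreversibly lost in the output.
    """
    summary = defaultdict(lambda: {"calls": 0, "total_sec": 0})
    for record in cdrs:
        key = (record["customer_id"], record["month"])
        summary[key]["calls"] += 1
        summary[key]["total_sec"] += record["duration_sec"]
    return {key: dict(totals) for key, totals in summary.items()}
```

Regulatory and billing reporting can still be driven off the monthly rows, but any analysis that needs the per-call social graph becomes impossible once only the summary is retained—which is exactly the limitation the analytic warehouse works around.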
It takes 10 or 20 times as much money to win back a customer as it does to keep them and not have them leave. So they kicked off a project to try and identify the most significant indicators of a customer who was about to churn. To start with, they did some modelling around complaints to the call centre. Was there a pattern whereby someone who calls the call centre 3 or 4 times within 3 months was more likely to churn? That couldn’t be done on the existing data warehouse platform. Some of the data was there, but not all of it, and there wasn’t a business case to put it in there on the basis of speculation. And even if there was, had they started to run such an analysis it would have slowed down all of the business-as-usual reporting that the warehouse needed to support. Hence the analytic warehouse. First they did analysis on call centre logs looking to relate complaints to the churn indications that came from the warehouse. No correlation. Then they did a similar exercise looking at call dropout rates and call termination codes. Were dropped calls triggering churn? Again, no clear correlation. Note that they needed the unsummarised call records for this, but the data warehouse had already summarised them. So then they let a data scientist loose in the analytical workspace, unconstrained by the data model, unconstrained by performance concerns related to running obtuse queries. It turned out that the most significant precursor to a customer churning was whether or not any of the people a given customer called the most churned. It makes sense because now I’m calling them telling them I’m with a new telco. The correlation was so strong that if I churn today, the people I call the most are seven times more likely than the average customer to churn in the next 3 months. So they took that insight and operationalised it.
The action they developed was that as soon as an individual churned, anybody they frequently called was targeted with an attractive offer to re-contract. They estimated the monetary benefit to the organisation from this was $70 million. Rob Strickland, the CTO of T-Mobile at the time, still tells this story. He says “Here is my data warehouse. I’m spending millions and millions of dollars on maintaining this thing, and what it gives me is a bunch of standard reports. And yet over here, from my sandpit that cost me one million, comes the insight that was of $70 million benefit to the organisation.” They’re spending tens of millions of dollars on the data warehouse, but it’s the one million dollars on the analytic warehouse that’s differentiating them in the market.
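The operational rule described here—when a customer churns, target their most frequent contacts with a retention offer—could be sketched like this. The data shapes and the top-N cutoff are assumptions for illustration, not details from the T-Mobile implementation.

```python
from collections import Counter

def retention_targets(churned_customer, cdr_pairs, top_n=5):
    """Return the people a churned customer called most often.

    cdr_pairs: iterable of (caller, callee) pairs taken from atomic
    call-detail records. The returned frequent contacts are the
    retention-offer candidates, since (per the finding in the
    interview) they are several times more likely than the average
    customer to churn in the following months.
    """
    contact_counts = Counter(
        callee for caller, callee in cdr_pairs
        if caller == churned_customer
    )
    return [callee for callee, _ in contact_counts.most_common(top_n)]
```

This only works if the per-call records still exist—a summarised warehouse holding one row per customer per month cannot answer “who does this customer call most?” at all.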
SS: There’s a distinction worth recognising there between the process that generated the insight and then the subsequent operationalisation of that insight. It may have cost more than a million dollars to operationalise the insight—to get it out to call centres, to marketing and product teams, to come up with and execute the right campaigns and performance manage the whole thing—all of those things cost money too. But the discovery process to get to the insight was itself relatively cheap. Interesting too that many of the instinctively explanatory metrics didn’t turn out to matter in this case: complaints to the call centre, dropped calls, network, phone, plan, demographics. It turned out to be the social network.
MB: As measured by the call data records. One of the outcomes of that learning was the benefit of keeping atomic level call data for months or years rather than summarising it—which you have to do in a traditional warehouse because of the cost of storage.
SS: Because you don’t know from what perspective you’re going to analyse it?
MB: Correct. You may come up with a question in 3 months’ time that requires atomic level data over 2, 3, 5 years. If that’s been summarised you can’t get it back. This is part of the big data story. Organisations are starting to realise that there’s real gold in this atomic level data if they can find a way to keep it long enough, cheaply enough, to enable analytics to be run.
SS: That in some sense addresses the ‘volume’ component of big data. What would you say about the ‘velocity’ and ‘variety’ components?
MB: Storage and computational power are both becoming cheaper, and technologies like Hadoop are making those two factors a lot easier to wrestle with. I won’t say that Hadoop is easy, because it’s not, it’s a very embryonic technology. But the vision is to be able to take large data sets that previously were ignored—that come under the label of ‘exhaust data’—application logs, machine logs, detailed stuff that you would never bother collecting, because why would you? Organisations are now beginning to see that it’s not actually that expensive to keep this data. And not only that, really valuable information can be harvested from it. One organisation we work with, for example, captures all of the application logs from a certain application it uses and mines them for indicators of internal fraud. Are any employees or other operators going into areas of the system they shouldn’t be? There was a time when that log file information wouldn’t be collected because it was just too expensive to store, let alone analyse. The internal audit approach in such a regime relies on occasional random sampling. But now you can actually have definitive indications from the application logs. So you’re saving time, money, and the imprecision of having someone look randomly, because as soon as an alert comes up it can be managed by exception. It frees up the team who used to spend all day looking at random samples to be more productive.
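An exception-based check of the kind described—flagging operators who access areas of the system outside their role, rather than random-sampling an audit trail—might look something like this. The log field names and the role-to-module mapping are illustrative assumptions.

```python
def flag_out_of_role_access(log_entries, allowed_modules):
    """Flag log entries where a user accessed a module outside their role.

    log_entries: iterable of dicts with illustrative 'user' and
    'module' keys, as might be parsed from application logs.
    allowed_modules: dict mapping user -> set of permitted modules.
    Returns the entries to raise as alerts, so audit can work by
    exception instead of by random sampling.
    """
    return [
        entry for entry in log_entries
        if entry["module"] not in allowed_modules.get(entry["user"], set())
    ]
```

A job like this over the full log history is exactly the kind of workload that was once too expensive to contemplate but is cheap on commodity hardware.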
SS: What are the barriers to an uptake of exploratory analytics as an accepted activity? Cost is one, and you’ve addressed that by pointing out that cheaper solutions are now available.
MB: I think that the IT architect in particular is going to have to embrace the new paradigm. A traditional IT architect is all about structure, reusability, and building things in a certain way so that they benefit the entire enterprise. This drives things like single enterprise views of architecture and standardisation on as few technologies as possible. It tends to weigh against speculative, exploratory and unconventional ways of doing things. So if the business approaches IT and says, “Look, I don’t know exactly what we’re going to do but we think there are insights in our data and we know we need a few terabytes to play with. Can you give me space?”, they’d be very lucky to get funding. That’s not considered a business case. At the same time, analysts probably need to be a bit more aware of how to put their requirements into the right language for architects: What level of security does it need? What level of availability does it need? What level of backup and recovery does it need, if any? How much data volume? What are the impacts on the network going to be? What will be the flow-on effects to production applications? There are 6 or 7 of those standard application/solution non-functional requirements—security, availability, backup, recovery—and they’re the kind of things that an architect needs to be able to tick the box on.
SS: Finally, what attracted you to your current role?
MB: I see Greenplum as probably the most innovative and ambitious of the new breed of analytical database technologies. At EMC, Pat Gelsinger in particular looked at the other technologies out there and chose Greenplum because it was able to run on commodity hardware, it had flexibility within the data model to handle unstructured, semi-structured, and structured data, row based and column based storage in the same table—a whole load of stuff that everybody else is trying to catch up with at the moment. So I see it as early days in the market for an innovative, disruptive technology.