This article highlights the communication challenge for accredited A1 professionals.
We all recognise that Analytics is about using information better than competitors, so that we are: 1. doing things better, and 2. doing better things, than our competitors/relevant comparators. But like so much of the coverage of our sector, the article focusses solely on the former area, Operational Analytics, and not on the latter area, Strategic Analytics.
Secondly, the article fails to recognise that speed is only one part of the equation.
Taking the author’s example of the retail sector: sure, real-time analytics can detect an early decline in sales for a particular product, controlling for some extraneous factors. But a retailer’s promotional response (who they target and how) doesn’t necessarily require real-time analytics: they can apply, in real time, the outputs of models created last week, with little risk of degradation.
The most important questions for the retailer’s shareholders require Strategic Analytic capability: How should pricing across the entire product portfolio be optimised? What products should we be ordering now for next season (or the season after)? How should the physical network and supply chain be optimised? These strategic questions demand the right answer, not necessarily the fastest answer.
Any experienced industry professional gets that making sense of data is our primary role. But interpreting data to the best of our ability clearly flies in the face of throwing away information (e.g. because inconsistencies in the available data make the task more cognitively complex). No one would advocate storing and processing data which possesses no incremental information value, but information value can be measured, so that shouldn’t be an issue.
Critically, this article fails to recognise that many of the barriers Australian companies face in effectively using their data relate to data quality, not to their data storage and processing capacity.
Finally, there is no explicit recognition of the talent required to use data more effectively than your competitors.
From an A1 perspective we should welcome the growing focus on our sector, but we need to better articulate the more nuanced (and interesting) story of Analytics in an A1 Practice. It would be easy to criticise the journalist for being naive in swallowing the line of vendors and other vested interests, but the responsibility is ours to better explain the reality.
Eugene is totally right that we need to stand with a united voice. From today, NTF will publicly back A1 in all our proposals and marketing collateral. I regret not taking this action sooner.
Continuing with the “big data meets big hype” theme:
So you want to get into Business Analytics/Big Data/Predictive Analytics.
What areas, skills, tools and data should you focus on first?
There are three rather big questions that you need to ask yourself:
1. How well do I really understand the problem(s) that I want Analytics to solve, and the role(s) that Analytics would play?
2. How well do I understand my data?
3. What data do I actually have, or can I get?
Each question explores a continuum. Together they represent a three-dimensional space of possibilities. There is no “magic quadrant” here; each part of the space is a legitimate place to be, with its own solutions, risks and benefits.
Let’s go through them.
1. The range of possibilities looks something like this:
A. Having built preliminary offline random forest models and created some prototypes, I want to extend the existing customer acquisition and retention models we have to our international markets, and operationalise them for real-time, event-based activity, provided this is seen to yield significant further value. We will need an industrial-strength, scalable, and reliable tool, probably a commercial vendor tool, and possibly a Hadoop-based MapReduce solution.
B. My CEO just attended a lavish conference where he saw a slide presentation mentioning the Davenport HBR article from 2006, and now he wants us to “get into analytics”.
Most people are somewhere in between. But you get the idea. And there are far too many initiatives that are precisely at B. The ideal vendor customer is precisely at A. Unfortunately, there are not enough As around (we call them “Educated Buyers”), so some vendors must sell to people who look more like Bs.
Naturally, Analyst First does not advise Bs to get into Big Data, to buy expensive vendor tools, or ever to believe anyone who claims there is such a thing as “a solution for getting started in Analytics”, especially when said solution is no more than a bunch of software and maybe a few relatively junior technical consultants for a few months.
Indeed, we advise the Bs of this world to invest in learning, exploring and gaining experience, while managing their sponsors’ expectations, growing their personal investment and participation in the new Analytics enterprise (yep, it’s an Enterprise, with all the Lean Startup thinking that entails), and eliciting from said sponsors their real, and realistically achievable, needs.
This is a crucial time to invest in smarts, experience, talent, learning and plenty of Lean Startup.
If this approach is not feasible, I do not have high hopes for the future of the function, which will, at best, become a showpiece trophy of high tech adding no value, and will more likely be shut down, “restructured” and restarted again, hopefully with a more sensible approach.
And what of the As?
I was speaking recently to an A, indeed one of the best As I know. He noted that his team had kicked some great business goals, having implemented a very necessary (and expensive) vendor tool after trying R and finding that it was not up to the big data / big crunch job they had to do. He noted that this was necessary, even though he agreed with A1, and acknowledged that it was not in line with A1’s preference for open source tools.
“Not at all,” I replied. “This is exactly A1: you were the quintessential Educated Buyer! A1 is not against vendor tools. We are against people spending money on what they do not understand in the hope of a magic solution. You don’t fall into that category.”
Hopefully, the anonymous A in question will write a more detailed post on this blog, outlining his success story in more detail.
So, our advice to As is… you don’t really need our advice, until you want to do something new again. In which case, chances are you are following A1 principles already, explicitly or not – otherwise how did you get to A in the first place, anyway?
Most people are somewhere in between, and usually closer to B than to A.
Answering the “what the heck are we going to do?” question involves exploration on a number of axes, including stakeholders’ needs, your own capability, available resources (human and electronic), any impediments or constraints (hello, IT!) and data, the subject of questions 2 and 3. The actual hidden contents of the data, the “gold” of the data “mining” metaphor, is a huge exploratory subject in its own right, and must be considered in the context of the others.
This is not a very easy target to hit, and it needs defining before that can happen!
So, to all the Bs and almost-Bs out there: invest in learning, your own and your sponsors’. Invest in getting your sponsor invested, supporting and covering you, letting you explore and grow. Invest, above all, in exploration, and invest in managing expectations and delivering intermediate results to allow all this to happen. Buy your analytics function a chance to grow, learn, explore and breathe free of unreasonable pressures and constraints.
The other two questions will be covered in upcoming posts.
A few thoughts on Big Data, and how it matters, as well as where the hype does more harm than good. Also some thoughts on what I call “Big Crunch”, Big Data’s Siamese twin, and one I am well acquainted with, more so than Big Data proper.
First of all, the hype. A colleague in the industry spoke to me recently regarding an all too common scenario.
The colleague is effectively the analytics manager of the business. The trouble is, the business does not really have much of an idea about what data analytics is, apart from the fact that it has something to do with big brand software, which the company has in abundance. There is even less of an idea that it requires direct engagement and understanding from management; that, most importantly, it requires clean, accessible and appropriate data before anything can happen; that producing such data requires a clear, concerted effort; and that this effort usually consumes the bulk of the investment in time, money and political capital. Certainly, there is no realisation that what the business needs done is easily served by open source and commodity software, but requires investment in good staff, and appropriate, coordinated effort from IT and management.
So far, so typical A1 case study, could really be anyone, and probably an adequate description of at least 100 cases in the city of Sydney alone. The Big Data twist is that a senior manager has been sending the colleague material on Big Data, and how important it is. Again, this is an increasingly common occurrence.
Naturally, the colleague is exasperated. First of all, the business is not really one that collects or benefits from terabytes per day of data. Secondly, they have not even got their small data right: it is neither clean, nor reliable, nor ready for analysis. Finally, the manager in question has done little to get appropriately educated or involved in analytics, but has bought deeply into the Big Data hype, seemingly without understanding it much either.
And here endeth yet another, all too typical cautionary tale.
But Analyst First is more about “Thou Might Want To” than about “Thou Shalt Not”.
So here are the “Thou Shalt” Analyst First commandments on Big Data:
Hear ye, O senior manager/sponsor of Analytics!
Before you worry about Big Data, or dare speak its name:
1. Get your Small Data right first. Organised, cleaned, and analysed.
2. Be engaged in analytics. Investing in analytics is like investing in a gym membership, or an education – you don’t just spend the money and make it someone else’s problem. It only works if you are directly engaged.
3. Invest in the best Human Infrastructure you can.
4. Squeeze all the value you can out of your small data first.
5. Use Big Crunch on small data before moving to Big Data.
So point 5 brings me to Big Crunch.
OK, admittedly, Big Crunch is actually a feature of an upcoming post, but the intro and teaser are here (just as A1 itself started life as a teaser at an R presentation to MelbURN almost 2 years ago).
So, what is Big Crunch?
Big Crunch is my name for the very cool stuff you can do with small data given heaps of computation. Much of my recent work has made extensive use of tools such as resampling, and its specific manifestation in Random Forests, k-fold cross-validation, and other advanced predictive modelling techniques.
I have over the last couple of years developed a number of new, tortuously computationally complex (but deliciously parallelisable) techniques for the measurement of such beasts as robust predictive variable importance, generalised nonlinear correlation and forecast performance. I use these to squeeze information out of data sets big and, more importantly, small.
Incidentally, in practice “small” includes huge data sets with tiny classes, where the task is classification, and the business problem may be customer retention, acquisition or fraud detection.
Big Crunch applies extreme statistical sophistication to small data sets: precisely what you need to extract the limited amount of information they contain.
These methods really do squeeze the informational juice out of data in a way that simple, mean-estimating linear models and Pearson correlation measures cannot ever hope to.
Indeed, much of my recent work in areas as diverse as geophysical analysis, retail price elasticity modelling and customer acquisition in entertainment can be described as Big Crunch. In all cases, we see dramatic improvements in robustness, performance and reliability. The biggest challenge is to explain the difference between less reliable, classical measures such as R-squared on training data and far more reliable, seemingly magical OOB error on random forests, and how these serve as building blocks for far more sophisticated solutions.
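For the curious, here is a minimal sketch of that contrast, written in Python with scikit-learn on invented data, so it is purely illustrative of the idea rather than of any actual client work: a linear model’s R-squared on its own training data flatters it, while a random forest’s out-of-bag score comes essentially for free from the resampling it already does, and gives a far more honest read.

```python
# Illustrative only: invented data, not a real client data set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n, p = 100, 40                      # "small data": few rows, many candidate predictors
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)   # only two real drivers

# Linear model: R-squared on the training data is flattering; cross-validation is honest.
lm = LinearRegression().fit(X, y)
print("linear model, R^2 on training data:", round(lm.score(X, y), 2))
print("linear model, 5-fold CV R^2       :", round(cross_val_score(lm, X, y, cv=5).mean(), 2))

# Random forest: the out-of-bag (OOB) score is an honest estimate produced as a
# by-product of resampling -- each tree is scored on the rows it never saw.
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=42).fit(X, y)
print("random forest, OOB R^2            :", round(rf.oob_score_, 2))
```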
Finally, the Big Crunch mini-FAQ.
Is “Big Crunch” actually just “Big Data”? Yes, if you like, in a way. Similar principles usually apply.
Can/should you Hadoop/MapReduce it? Sure. Highly parallelisable as a rule.
Can you deploy Big Crunch on commercial systems? Sure, why not. But do you really need to? (In fact, we just deployed a Big Crunch solution in SAS, after prototyping in R.)
Is Big Crunch difficult? It shouldn’t be. Not if you have the right people. They are around. And your people can become the right people with a bit of mentoring. A stats degree helps.
Is Big Crunch expensive? In people, moderately. In systems and software: hey, Hadoop is open source. And so is R. And the same applies to Big Data too. So no.
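To make “highly parallelisable as a rule” a little more concrete, here is a toy sketch (Python standard library plus NumPy, invented data): the work divides into thousands of independent resamples, which is exactly the shape of problem that spreads happily across cores today and across a Hadoop/MapReduce-style cluster tomorrow.

```python
# Toy illustration: a bootstrap interval for the median, with each resample
# run as an independent task. All data here is invented.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def bootstrap_median(args):
    """One independent unit of work: resample with replacement, return the median."""
    data, seed = args
    rng = np.random.default_rng(seed)
    return float(np.median(rng.choice(data, size=len(data), replace=True)))

if __name__ == "__main__":
    data = np.random.default_rng(0).lognormal(mean=3.0, sigma=1.0, size=500)  # small, skewed sample
    tasks = [(data, seed) for seed in range(10_000)]           # 10,000 resamples = 10,000 tasks
    with ProcessPoolExecutor() as pool:                        # the "map" step
        medians = list(pool.map(bootstrap_median, tasks, chunksize=250))
    lo, hi = np.percentile(medians, [2.5, 97.5])               # the "reduce" step
    print(f"bootstrap 95% interval for the median: [{lo:.1f}, {hi:.1f}]")
```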
The buzzword of the year seems to be “Big Data”. There is a massive wave of promoters of the term, and there are inevitable detractors. There is also the issue of exactly how to define it. What follows is the A1 view on Big Data.
It is real, it is a game changer, and it is here to stay. It is no one thing, and its definition, both quantitative and qualitative, is rather fluid. Nevertheless, some basic truths apply: Big Data is not a brand name. Neither is Big Data a tool, a business process or a solution. It isn’t even an idea as such. In fact, Big Data is best understood as a problem. Not a problem as in “trouble”, but a problem in the sense of a challenge or puzzle, or more precisely a growing family of problems that we are increasingly forced to grapple with. It’s a problem that does not come with an automatic solution, although there are a growing number of tools to help roll it around.
The A1 angle on this is: you cannot outsource your investment in Big Data any more than you can outsource your own education, or exercise, or being a patient in a surgery theatre. In this sense, what is true of Big Data is also true of Analytics.
Getting Big Data right means getting Small Data even righter. The sort of business that can get value out of Big Data will be one already getting value out of Small Data. Without the business fundamentals in place, Big Data will produce only Big Nonsense. Alternatively, if the logic is there, then Big Data will enhance an existing value-adding framework.
So: small data first, then big data. And before small data, tacit data, which you can always get your hands on, even if you have trouble wrangling the electronic stuff. And before all of those: logic, and human infrastructure. A well understood, well defined business model with well defined intelligence objectives. And incentives, with staff capable of navigating such an environment, managed by a sponsor possessed of the A1 “holy trinity” of adequate influence, appropriate motivation and sufficient understanding of the value, role, and needs of Analytics under their command. Is this too much to ask for?
I should also probably mention tools. Maybe. Last. Do they matter? Of course. So does oxygen. But it is ubiquitous, effectively free, and we take it for granted…
EMC Greenplum’s Mark Burnard and I sat down late last year to talk about the future of data warehousing, big data, and analytics…
SS: How did you fall into data warehousing?
MB: I built spreadsheet systems during the early to mid-1990s while working for Rio Tinto and then Telstra. Then I was working for Corporate Express around 1995 when someone dropped a folder on my desk that said “Cognos”, so I picked it up and kind of ran with an embryonic data warehouse project at Corporate Express, built in SQL Server. So DTS, transformations, dumps out of a mainframe into text files, suck them up every night, build a cube in Cognos. And we took that live, and I thought, okay, this is really cool, I’d like to do more of this, so I jumped into consulting. That would have been 1999. Since then I’ve moved from BI report development to requirement definitions through to solution architecture and then more strategic consulting, and now to EMC Greenplum. That summarises about 20 years in IT.
SS: In data warehousing there’s a “warehouse first, marts second” approach associated with Inmon and a “marts first, radiate out” approach associated with Kimball. Do you have a view on those design philosophies?
MB: I think Inmon, as the father of data warehousing, came up with the concept of the corporate information factory. The philosophy as I understand it is that you want to build this machine that can answer any question that anyone could possibly ask about the business. So to do that you start off with the data architect and he maps out the entities that are of interest to the organisation—what matters and then what the attributes of those entities are at a high level, then comes up with a conceptual data model. The goal is that you encapsulate in this model everything that matters to the organisation. You turn it into a logical data model, then into a physical data model, then you start building the data warehouse, and eventually you can answer any question out of it. Data marts come along as a secondary spin-off. The underlying warehouse is big and complex, and it’s assumed that the analysts who want to work with its information are 1) only interested in a specific subset: sales, marketing, finance, HR…whatever it happens to be; and 2) that there is complexity in the model that you don’t want them to see as they’ll get really confused, or they’ll drag and drop the wrong columns into a report and start reporting numbers that are faulty. I think that approach works fantastically if the world stays still long enough for you to complete your data model and get the data warehouse built, but the reality is that that never happens. You could probably safely say that world was a bit slower 30 years ago when organisations were doing this model. The early sites tended to be companies that were so big that a project like that was able to return a business benefit simply because of the size of the organisation. The Bank of America, Citibank, AIG—some of the big guys who built the early data warehouses—benefited a lot from that approach. But I think now the game’s changed, and a 6 month turnaround time for a BI project is too slow. Marketing teams want to be able to spin off new products, sometimes one or two a month. Telcos, for example, are spinning off new plans, new product bundles, new marketing messages and solutions that require a lot of agility in the billing system and the provisioning system. Their core systems have to be a lot more agile and the data warehouse has to keep up with that. So if we’re structuring a new product bundle every month and taking it to market, the billing system may be able to keep up with that, the provisioning system may be able to keep up with that, even multiple provisioning systems, but when that all hits the data warehouse—in some of the situations I’ve seen—it becomes a screaming mess. Nobody can keep up, the warehouse can’t keep up, and then the folk who’ve rolled out the new bundle want new reports within the first week of having it in the market. They want to know “How many are we selling?”, “What’s the profitability?”, “What is the profile of the market segment that is purchasing this new product?”, and the warehouse says “We can give you that information in 6 weeks, but right now you’ve just broken all my reports with your new constructs and we’ve got to do some extra work on this data mart over here, and then we’ve got to tweak the ETL, and then we’ve got to do regression testing, impact analysis, etc. So these days, there needs to be a way of meeting the business need for both agility and reporting that keeps up with the pace of change. From what I’ve seen, the classic traditional data warehouse model is not able to do that.
SS: So business is becoming faster and organisations are becoming more complex, and people with an appetite for analysis are demanding more in terms of information, feedback, and measurement?
MB: I think it’s fair to say that the pace of business change has accelerated. There was a time when the core business of a bank might have been savings accounts and mortgages, but within the last 20 or 30 years that has diversified into online trading, reselling insurance products, having superannuation funds to manage. So the diversity of offerings that a bank is now bringing to the market—they might be offering 50 to 100 products where in the past they offered 10 or 12. Same thing with a telco. Back in Telstra’s early days it was fixed lines, and the only thing to worry about was local calls, STD calls and then international calls, there was nothing else. Now you’ve got mobile, data over mobile, over 3G, all the different product bundles that people have, caps—you’ve got telcos churning out new caps every 6 months—new phones, new bundles, bundling fixed with mobile with home internet and even cable TV—and needing to track when someone cancels one service but is still getting the bundle discount, for example. All that kind of complexity never existed even 10 years ago. So I think the pace of change and the complexity of doing business has changed, and the classic model that we’ve had for data warehousing has not significantly changed. In fact it used to be a line we would drop to reassure the client when positioning a data warehousing project to board level executives, to say “Data warehousing has not changed in 30 years”. In other words, you can be confident that this is a well-worn space, an area of technology that was not invented yesterday, and therefore that all the lessons have been learned, all the pain has been experienced, all of the lessons have been written up, and so building a data warehouse is a pretty safe endeavour.
SS: So presumably you don’t present that message today?
MB: No. Because I think the message isn’t accurate anymore. I see it diverging in two directions. I see your classic data warehouse which has the stringency around it. It has the discipline around things like being able to identify your data lineage, having your metadata available, being able to explain that this number on this report came from this table in the data mart, which came from this table in the core data model, which came through these transformations, from this source system, and that says 100, and this says 100, and we know that the numbers are correct. That’s been the standard approach to data warehousing, and everybody spends huge amounts of energy, time and money trying to keep the warehouse up to date so that you can confidently say to the board, or the ASX, or ASIC, or APRA, or whoever it happens to be when you put a report down in front of them, that you’ve got your traceability, your auditability, your metadata, your data lineage, and so on. The CFO, for example, has to be able to sign off on those numbers. The problem now is that we’ve taken that disciplined model—which is fantastic for reporting on financial numbers to regulators—and we’ve extended it into the domains of HR, and marketing, and across the entire enterprise data model. We wanted to fit the entire enterprise into the data warehouse. What I think we’re seeing already happen is that the locus of subjects that fit into the traditional data warehouse is shrinking to include only those where that level of rigour is required for reporting and analysis—areas such as finance, risk, and other things that go to regulators. All the other stuff, which is the other 95% of the business, will probably end up in a much more flexible, dynamic platform which we’re starting to call an analytical warehouse or analytic warehouse. It’s not your traditional data warehouse. Usually in terms of data volume it’s a lot larger, a lot more atomic, a lot more nimble, agile, and almost ad-hoc in a way. To be honest, people have always done this—they’ve just done it under the table: take a data set, throw it into an Access database, take a bigger data set, throw it into SQL Server, do something on a laptop, do something on a server that they bought on the corporate credit card down the road because the data warehouse wasn’t able to give them the data they required, or because running the query would slow down all the standard reporting. People have always done analytic warehouses; they were just under the radar.
SS: So they’re coming out of the closet in a sense?
MB: That’s right. Because the pain of the traditional approach is now such that it’s unsustainable, and so what everybody’s been doing all along covertly, organisations are now beginning to look at and say, “We have to cater for that model.” They have to have a way of managing information that caters for ad-hoc analysis, for marketing guys running quick, nimble little models to come up with some market segment so they can generate a campaign to a particular demographic. Traditionally their only official option was the data warehouse, but they couldn’t throw enough information in, or the queries ran too long, or the joins were too complex given some third normal form industry data model. They’ve been doing it under the table, so let’s elevate that and bring it back into the fold of IT but on a platform that can handle that kind of ad-hoc dynamic: throw some web clicks in here, throw some Twitter feeds in there, mix in some data from the Australian Bureau of Statistics, get some smart people to have a look for correlations, integrate the results from the latest campaign for marketing a platinum credit card to a certain demographic, and let’s do some follow up on that. All of this dynamic information needs a home—well, it’s always had an unofficial home, on someone’s hard drive—but what I see happening is it moving back into a more recognised, structured place, without the imposed controls that had it running away from home in the first place.
SS: So it sounds like we’ve got the emergence of two domains. There’s a domain of stability—the traditional data warehouse which fundamentally hasn’t changed much in 20 to 30 years. Its set of disciplines maps well to data and business processes that also fit that description. Something like financial reporting, where the accounting standards are pretty stable and where there are widely accepted processes which are professionally constrained for getting data into that shape. Then there’s a more uncertain domain—the data’s uncertain, the analysis is uncertain, the queries are uncertain, the perspective is uncertain, and it sounds like that’s broken in a way too. You’re saying that people have been doing it under the table for a long time, but it sounds like the scale is now such that it doesn’t fit under the table anymore. The tools are breaking.
MB: That’s exactly right. The data volumes that we are beginning to deal with won’t fit on a little server under the desk anymore. You’re talking about data volumes that will explode out past many terabytes eventually and you’re not going to fit it on your laptop, and if you did your single query would take the whole weekend. Maybe a real world example would be useful. One of EMC Greenplum’s reference sites is T-Mobile in the US. They have their classic traditional data warehouse. It’s 100 terabytes of data and it has the discipline and the control and the industry standard data model. Billing information comes in, provisioning information come in, and they can report to the market out of the data warehouse their precise number of subscribers and their exact financials every month. All of that controlled information that you’ve just described. Alongside that they have an analytical warehouse which happens to be a petabyte. It’s about 10 times the size of the traditional data warehouse. Into that flows a lot—if not all—of the same data. But it also contains 900 terabytes of other data, which is market sentiment analysis, unstructured data, and all kinds of data that is structured but which is too hard to fit into the industry standard data model. The other thing that’s probably worth mentioning is that within the 100 terabyte traditional data warehouse you have a lot of summarisation of call data records—for example, after 2 or 3 months you’ve finished with your standard reports and you summarise them to one row per customer per month instead of one row per phone call or one row per text message. This means you can still report accurately on all of your regulatory requirements, but you can’t do analysis anymore at the atomic level of detail. In the analytic warehouse, on the other hand, you might have that history for 3 or 4 years. Why would T-Mobile need that? Well, they were looking for drivers of churn, which is one of the biggest expenses for a telco. It takes 10 or 20 times the amount of money to win back a customer than it does to keep them and not have them leave. So they kicked off a project to try and identify the most significant indicators of a customer who was about to churn. To start with, they did some modelling around complaints to the call centre. Was there a pattern whereby someone who calls the call centre 3 or 4 times within 3 months was more likely to churn? That couldn’t be done on the existing data warehouse platform. Some of the data was there, but not all of it, and there wasn’t a business case to put it in there on the basis of speculation. And even if there was, had they started to run such an analysis it would have slowed down all of the business-as-usual reporting that the warehouse needed to support. Hence the analytic warehouse. First they did analysis on call centre logs looking to relate complaints to the churn indications that came from the warehouse. No correlation. Then they did a similar exercise looking at call dropout rates and call termination codes. Were dropped calls triggering churn? Again, no clear correlation. Note that they needed the unsummarised call records for this, but the data warehouse had already summarised them. So then they let a data scientist loose in the analytical workspace, unconstrained by the data model, unconstrained by performance concerns related to running obtuse queries. It turned out that the most significant precursor to a customer churning was whether or not any of the people a given customer called the most churned. 
It makes sense because now I’m calling them, telling them I’m with a new telco. The correlation was so strong that if I churn today, the people I call the most are seven times more likely than the average customer to churn in the next 3 months. So they took that insight and operationalised it. The action they developed was that as soon as an individual churned, anybody they frequently called was targeted with an attractive offer to re-contract. They estimated the monetary benefit to the organisation from this was $70 million. Rob Strickland, the CTO of T-Mobile at the time, still tells this story. He says, “Here is my data warehouse. I’m spending millions and millions of dollars on maintaining this thing, and what it gives me is a bunch of standard reports. And yet over here, from my sandpit that cost me one million, comes the insight that was of $70 million benefit to the organisation.” They’re spending tens of millions of dollars on the data warehouse, but it’s the one million dollars on the analytic warehouse that’s differentiating them in the market.
SS: There’s a distinction worth recognising there between the process that generated the insight and then the subsequent operationalisation of that insight. It may have cost more than a million dollars to operationalise the insight—to get it out to call centres, to marketing and product teams, to come up with and execute the right campaigns and performance manage the whole thing—all of those things cost money too. But the discovery process to get to the insight was itself relatively cheap. Interesting too that many of the instinctively explanatory metrics didn’t turn out to matter in this case: complaints to the call centre, dropped calls, network, phone, plan, demographics. It turned out to be the social network.
MB: As measured by the call data records. One of the outcomes of that learning was the benefit of keeping atomic level call data for months or years rather than summarising it—which you have to do in a traditional warehouse because of the cost of storage.
SS: Because you don’t know from what perspective you’re going to analyse it?
MB: Correct. You may come up with a question in 3 months’ time that requires atomic level data over 2, 3, 5 years. If that’s been summarised you can’t get it back. This is part of the big data story. Organisations are starting to realise that there’s real gold in this atomic level data if they can find a way to keep it long enough, cheaply enough, in order to enable analytics to be run.
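As an aside, the analysis Mark describes is straightforward to express once the atomic records are retained. The sketch below (Python/pandas, with invented file and column names, so purely illustrative) finds the contacts each churned customer called most often over the last 90 days; a table summarised to one row per customer per month simply cannot answer that question.

```python
# Purely illustrative: file names and column names are invented.
import pandas as pd

# cdr: one row per call (caller_id, callee_id, call_date) -- the atomic records
# churn: customers flagged as churned this week (customer_id)
cdr = pd.read_parquet("call_detail_records.parquet")     # hypothetical source
churn = pd.read_parquet("churned_this_week.parquet")     # hypothetical source

# Restrict to the last 90 days of calls.
recent = cdr[cdr["call_date"] >= cdr["call_date"].max() - pd.Timedelta(days=90)]

# How often did each churned customer call each contact?
calls = (recent.merge(churn[["customer_id"]], left_on="caller_id", right_on="customer_id")
               .groupby(["caller_id", "callee_id"]).size()
               .rename("n_calls").reset_index())

# Each churner's three most frequently called contacts: the retention-offer targets.
targets = (calls.sort_values("n_calls", ascending=False)
                .groupby("caller_id").head(3))
print(targets)
```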
SS: That in some sense addresses the ‘volume’ component of big data. What would you say about the ‘velocity’ and ‘variety’ components?
MB: Storage and computational power are both becoming cheaper, and technologies like Hadoop are making those two factors a lot easier to wrestle with. I won’t say that Hadoop is easy, because it’s not; it’s a very embryonic technology. But the vision is to be able to take large data sets that previously were ignored—that come under the label of ‘exhaust data’—application logs, machine logs, detailed stuff that you would never bother collecting, because why would you? Organisations are now beginning to see that it’s not actually that expensive to keep this data. And not only that, really valuable information can be harvested from it. One organisation we work with, for example, captures all of the application logs from a certain application it uses and mines them for indicators of internal fraud. Are any employees or other operators going into areas of the system they shouldn’t be? There was a time when that log file information wouldn’t be collected because it was just too expensive to store, let alone analyse. The internal audit approach in such a regime relies on occasional random sampling. But now you can actually have definitive indications from the application logs. So you’re saving time, money, and the imprecision of having someone look randomly, because as soon as an alert comes up it can be managed by exception. It frees up the team who used to spend all day looking at random samples to be more productive.
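The exception-based pattern described above can likewise be sketched in a few lines (again Python/pandas, with invented file names, columns and entitlement model): join the raw access log to the permitted role/area combinations and flag everything that does not match, so the audit team reviews only the exceptions.

```python
# Purely illustrative: file names, columns and the entitlement model are invented.
import pandas as pd

logs = pd.read_csv("application_access_log.csv")      # hypothetical: timestamp, user_id, role, area
allowed = pd.read_csv("role_entitlements.csv")         # hypothetical: role, area (permitted pairs)

# Left-join the log to the permitted role/area pairs; rows with no match are exceptions.
checked = logs.merge(allowed, on=["role", "area"], how="left", indicator=True)
alerts = checked[checked["_merge"] == "left_only"]

# Only the exceptions go to the audit team, instead of random samples of everything.
print(alerts[["timestamp", "user_id", "role", "area"]].sort_values("timestamp"))
```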
SS: What are the barriers to an uptake of exploratory analytics as an accepted activity? Cost is one, and you’ve addressed that by pointing out that cheaper solutions are now available.
MB: I think that the IT architect in particular is going to have to embrace the new paradigm. A traditional IT architect is all about structure, reusability, and building things in a certain way so that they benefit the entire enterprise. This drives things like single enterprise views of architecture and standardisation on as few technologies as possible. It tends to weigh against speculative, exploratory and unconventional ways of doing things. So if the business approaches IT and says, “Look, I don’t know exactly what we’re going to do but we think there are insights in our data and we know we need a few terabytes to play with. Can you give me space?”, they’d be very lucky to get funding. That’s not considered a business case. At the same time, analysts probably need to be a bit more aware of how to put their requirements into the right language for architects: What level of security does it need? What level of availability does it need? What level of backup and recovery does it need if any? How much data volume? What are the impacts on the network going to be? What will be the flow on effects to production applications? Standard non-functional requirements—security, availability, backup, recovery. There are 6 or 7 of those that are your standard application/solution non-functional requirements and they’re the kind of things that an architect needs to be able to tick the box on.
SS: Finally, what attracted you to your current role?
MB: I see Greenplum as probably the most innovative and ambitious of the new breed of analytical database technologies. At EMC, Pat Gelsinger in particular looked at the other technologies out there and chose Greenplum because it was able to run on commodity hardware, it had flexibility within the data model to handle unstructured, semi-structured, and structured data, row based and column based storage in the same table—a whole load of stuff that everybody else is trying to catch up with at the moment. So I see it as early days in the market for an innovative, disruptive technology.
Is Big Data a Bubble?
In case you’re in a hurry: Of course it is. And that is good.
Quentin Hardy in The New York Times Bits blog summarises the state of play regarding the business world’s interest in and utilisation of big data. As he recognises, big data is really about “the benefits we will gain by cleverly sifting through it to find and exploit new patterns and relationships.” Big data drives up the obviousness of the need for analytics. Its often invoked qualities of volume, velocity, and variety mean that we can’t pretend to analyse it without statistical and machine learning methods.
Big data analytics at the present moment, however, is characterised by uncertainty. What is big data, exactly? What’s its value? How is it analysed? Which technologies should be used? Which standards will prevail? Where to invest? There is, Hardy finds:
a common problem in the Big Data proposition: Often people won’t know exactly what hidden pattern they are looking for, or what the value they extract may be, and therefore it will be impossible to know how much to invest in the technology. Odds are that the initial benefits, as it was with Google’s Adwords algorithm, will lead to a frenzy of investments and marketing pitches, until we find the logical limits of the technology. It will be the place just before everybody lost their shirts.
This is a common characteristic of technology that its champions do not like to talk about, but it is why we have so many bubbles in this industry. Technologists build or discover something great, like railroads or radio or the Internet. The change is so important, often world-changing, that it is hard to value, so people overshoot toward the infinite. When it turns out to be merely huge, there is a crash, in railroad bonds, or RCA stock, or Pets.com. Perhaps Big Data is next, on its way to changing the world.
Such is technology. Hence the level of uncertainty, and also reasons for optimism:
There are an uncountable number of data-mining start-ups in the field: MapReduce and NoSQL for managing the stuff; and the open-source R statistical programming language, for making predictions about what is likely to happen next, based on what has happened before. Established companies in the business, like SAS Institute or SAP, will probably purchase or make alliances with a lot of these smaller companies.
Expect to see a lot more before it all gets sorted out.
TDWI’s recent Big Data Analytics report by Philip Russom, based on 325 sets of responses to a May 2011 survey, echoes this and provides a far more comprehensive overview of how businesses are coping with the realities of big data. One measure of how unsettled the field is right now is that, in response to the question ‘Which of the following best characterizes your familiarity with big data analytics and how you name it?’, 65% of the survey’s respondents replied, ‘I know what you mean, but I don’t have a formal name for it.’
The report is recommended in full, but in case you’re still in a hurry, its margin summaries can be skimmed through in a matter of minutes.
The recent IAPA discussion panel on ‘Aligning IT and Analytics to deliver sustainable innovation’, plus a later conversation with fellow panellist, EMC-Greenplum’s James Horton, prompted me to sketch some thoughts on what an Analytics Lab ought to do. The lab is the natural home for Analysts engaged in the narrower definition of Analytics:
The Analytics Lab is an innovation factory which constantly evaluates data, quantitative methods and tools looking for sources of competitive advantage.
- Data: structured and unstructured, sourced from both inside and outside the organisation, established and new.
- Methods: data transformation, and then data mining, machine learning, statistical, mathematical, and other analytical methods.
- Tools: as appropriate to method, from programming languages through to GUI applications, from commodity and open source through to commercial tools.
- Analysts: the lab enables the organisation to evaluate the technical abilities and innovative propensities of its analysts, as well as those on offer from external service providers, without many of the interfering factors present in operationally hardened IT environments.
Its outputs are:
- BI prototypes
- Instantiation candidates
- Identification of data and knowledge gaps: Analysing data and generating insights brings to light new data needs and exposes gaps in knowledge which may impact the business. Additional data may need to be sourced, gathered through survey, collected by tweaking an existing business process, or purchased from a third party. Additional analyses and subject matter expertise may be required to close knowledge gaps.
- Resolution of disharmonies: All businesses struggle with ‘different views of the truth’, and it’s often the crunching of data which brings these to light. Disharmonies might be within or between data sets, or between conventional wisdom and the drivers of a model. They could relate to anything from actual observations to tacit assumptions. Resolving such disharmonies—harmonisation—involves identifying, scoping, validating, and correcting them.
These last two are not the core business of Analytics, but they’re important activities, and doing Analytics naturally leads to them. Most organisations don’t explicitly provision for them, but arguably they should. The lab is as good a home for them as any other.
The Analytics Lab services all levels of business, but in different ways:
- Senior Management: through the provision of strategic insights.
- Middle Management and Knowledge Workers: through one-off and/or prototyped BI analyses.
- Frontline Workers: through the identification of instantiation candidates, i.e. deployable operational analytics.
Many analyses typically need to be tried before those which merit instantiation are discovered. Furthermore, “instantiation” doesn’t necessarily mean a repeatable process. It could simply mean the communication of a one-off insight, e.g. “revenue growth is unmistakeably slowing in all but one customer segment” or “the most reliable predictor of a customer’s propensity to churn is their social network membership.” Such insights are typically complex, valuable, but not “actionable” in any deterministic, automatable way.
Other findings are suited to more regularised delivery, for example as managerial decision support through business intelligence.
Some analytical results, in order to be fully leveraged, need to be integrated into frontline business processes. Models which predict customer acquisition or churn, for example, might require integration into sales, marketing, call centre, channel management and customer support processes.
The lab conducts disciplined, exploratory analyses which repeatedly cycle through the following sorts of questions:
- Is there structure in the data (patterns, trends, relationships, networks, segments, clusters, indicators, drivers, outliers, anomalies)?
- Are there new insights in the data?
- Which models are viable?
- Which variables are important?
- Which variables do we control?
- What are the implications for revenue, cost, risk?
- What data do we want that we don’t have? How could we get it?
- What are the implications of this insight?
- Who is our internal customer for this insight?
- Would this analysis be valuable if provided on an ongoing basis? To whom?
- Into which existing or envisioned business processes should this insight be instantiated?
- Where are there disharmonies in tacit or explicit data and assumptions?
- Which projects, processes and decisions are affected by these disharmonies?
- How do we validate and resolve these disharmonies?
Infrastructure can usefully be separated into the ‘electronic infrastructure’ of hardware and software and the ‘human infrastructure’ of people, relationships, management and incentives.
- Secure, off-network ‘sandpit area’
- Big storage, big memory, scalable to big data
- Eclectic analytical toolset: commodity, open source, commercial, experimental, in-house
- Snapshots, copies, feeds of all manner of available data sources: pre-ETL, pre-warehouse, post-warehouse, external, web, social media, unstructured. In the context of the lab, the data warehouse is just another source system.
- De-emphasis on repeatable technical processes and compliance with production IT architecture
- Insulated from IT Service Level Agreements and other production / core system / business-as-usual constraints
- Human Resources:
- Analysts: Data scientists
- Management: Validate analysis objectives, ensure that analysts remain focused, performance manage the innovation process.
- Sponsorship from Executive
- Cross-functional relationships with business units: both ‘push’ (business unit as customer) and ‘pull’ (business unit as subject matter expert)
- Close relationship with Strategy function
- ‘Caveat utilitor’ relationship with IT for data provision and tool support
- Various relationships with service providers: vendors, consultants, training and mentoring providers, industry expertise, academia if appropriate
- Performance Management:
- Innovation / Research metrics
- Risk metrics
- Sentiment metrics
- Dimensions of opportunity: Internal, Competitor, Market, Customer, Product, Channel
Related Analyst First posts:
- *Aligning IT and Analytics to deliver sustainable innovation*
- Needles, Haystacks, and Category Errors, or, Where Does Operational Analytics Fit?
- Systemising skepticism
- Assume bad data
- The Economics of Data – Analytics Is… Investing in Data
- Decision support versus decision automation
Ted Cuzzillo, writing at TDWI and citing Blake Johnson of Stanford, identifies 6 conditions for [or barriers to] the rise of business analysts:
1. The best analysts are skilled in three areas: First, they engage stakeholders and have an eye for business opportunity. Second, they inspire stakeholders’ trust with consistently excellent analysis. Third, “big data” requires skill with data management and software engineering.
This paints a similar but not identical picture to Drew Conway’s Data Scientist Venn Diagram. The key point of difference is that Conway places more weight on mathematical and statistical training, which is not the same thing as “consistently excellent analysis”, but is more important than is often assumed in enabling it.
2. Each analyst’s skills should be about 80 percent in data management and about 20 percent in business and analytics — but Johnson expects that to change over the next five or 10 years as tools make data management easier. Eventually the mix of skills will be the opposite: 20 percent data management and 80 percent business.
I have no strong view on this, but my intuition is that data wrangling will always consume far more time and effort than analysis. Analysis is a feedback loop and a read-write activity. Standardisation and automation continue to consolidate efficiencies but these tend to raise the analytical bar. That said, I’d be happy for future tools to prove me wrong.
3. Gaining a foothold within an organization is best done in small bites with an entrepreneurial approach. Forget trying for a “big bang,” he says. Instead, find a need and fill it quickly, then move on to others. Identify and solve one business problem after another — always making sure to keep your methods scalable.
This agrees with Analyst First’s contention, seconded by others, that the monolithic IT project approach doesn’t work, and that—within an existing organisation—a bottom-up Lean Startup approach is your best bet. The only exceptions to this are analytic-centric online startups and quantitative hedge funds.
4. Location of analysts’ workspace matters. They should work in a cluster for critical mass, which encourages sharing of best practices and support. If they sit within business teams, their work becomes more visible.
This makes sense. Isolated analysts are a problem whether they’re isolated from each other or from management oversight and direction. Generally speaking, senior executives need to be broadened while analysts need to be narrowed. Middle managers need to be skilled up to bridge between the two.
5. It’s an adjustment for everyone — on the business side but especially on the IT side. It means fundamental changes in the way data is organized and managed, and accessed and used, with both new technologies and skill sets.
6. Many IT pros deny access to data based on obsolete knowledge. Johnson reports that many don’t know about modern load-balancing and other technology that make such access safer.
Certainly true. I’ve written before about the data needs of analysts as distinct from traditional business intelligence consumers, and also observed that big data is at once driving up the need for advanced analytics and rendering traditional data warehousing approaches obsolete. But the odd part about the commonly invoked ‘IT vs business’ balance of power is the acceptance of IT as a ‘stakeholder’ as opposed to an enabler. It’s unquestionably the case that analytics doesn’t happen without software, but that’s just as true of accounting, graphic design, and most other activities conducted in front of computers in today’s workplace. It simply doesn’t follow that IT deserves, so to speak, a seat on the Security Council.
Cuzzillo closes well aware of both the future possibilities for Business Analytics, and the status quo political realities standing in its way:
You would think that both sides would sign up for the bargain the new middlemen [i.e. analysts] seem to offer. IT would cede control and concentrate on what it does best, managing the back end. Meanwhile, business stakeholders would get insights from these newly empowered, eager specialists. Analysts would be newly ready to answer business questions, conjure up new questions, and offer strategic options.
Analysts would colonize what had been the no-man’s-land between IT and business. Trouble is, the analysts may end up ruining the neighborhood for them. If the strategies Johnson suggests work, IT and business would find a new power growing alongside them. Analysts — simply from the position they would find themselves in, not from any wish to rule the world — would be indispensible, powerful, and well funded.
Who wouldn’t want that?
Related Analyst First posts:
- Analytics Education and Recruitment – Builders vs Finders
- Analysis is read-write
- Forrester on the need for agility
- Analytics Is… A Lean Startup Enterprise
- *Why Software Is Eating The World*
- The data needs of analysts
- Big data as an advanced analytics driver
- *Building for Yesterday’s Future*
- *The Elusive Definition of Agile Analytics*
Week 3, Day 2 of the CORTEX MBAnalytics program includes TDWI’s best practices report, ‘Strategies for Managing Spreadmarts – Migrating to a Managed BI Environment’, by Wayne W. Eckerson and Richard P. Sherman, from 2008 and based in part on a survey conducted in 2007. It’s an excellent document—one of the best articulations of a problem with which Business Analytics practitioners and interested parties ought to be familiar: the nature and causes of ‘spreadmarts’, their strengths, weaknesses and limitations, what to do about them, and the risks involved. Given how well the report covers these topics, I commend it in full. But I’m also going to address where it falls short. It’s right in its reasoning and its conclusions, but potentially misleading in its emphasis and what it omits.
By way of definition:
A spreadmart is a reporting or analysis system running on a desktop database (e.g., spreadsheet, Access database, or dashboard) that is created and maintained by an individual or group that performs all the tasks normally done by a data mart or data warehouse, such as extracting, transforming, and formatting data as well as defining metrics, submitting queries, and formatting and publishing reports to others. Also known as data shadow systems, human data warehouses, or IT shadow systems.
Finance generates the most spreadmarts by a wide margin, followed by marketing, operations, and sales… Finance departments are particularly vulnerable to spreadmarts because they must create complex financial reports for internal and external reporting as well as develop detailed financial plans, budgets, and forecasts on an ad hoc basis. As a result, they are savvy users of spreadsheets, which excel at this kind of analysis.
Spreadmarts are categorised into three types:
- One-off reports. Business people use spreadsheets to filter and transform data, create graphs, and format them into reports that they present to their management, customers, suppliers, or partners. With this type of spreadmart, people are using data they already have and the power of Excel to present it. There’s no business justification—or even time—for IT to get involved.
- Ad hoc analysis. Business analysts create spreadmarts to perform exploratory, ad hoc analysis for which they don’t have a standard report. For instance, they may want to explore how new business conditions might affect product sales or perform what-if scenarios for potential business changes. They use the spreadmart to probe around, not even sure what they’re looking for, and they often bring in supplemental data that may not be available in the data warehouse. This exploration can also be time-sensitive and urgent.
- Business systems. Most spreadmarts start out as one-off reports or ad hoc analysis, then morph into full-fledged business systems to support ongoing processes like budgeting, planning, and forecasting. It’s usually not the goal to create such a system, but after a power user creates the first one, she’s asked by the business to keep producing the report until, eventually, it becomes an application itself. This type of spreadmart is called a “data shadow system.”
Of these, it’s ‘business systems’ which are the report’s focus. The ‘one-off report’ and ‘ad hoc analysis’ categories of spreadmart, the report fittingly concludes, are inappropriate for systemisation.
But defining spreadmarts in terms of them being ‘shadow systems performing all the tasks normally done by a data mart or data warehouse’ gets things somewhat backwards. Presupposing that all spreadmarts are appropriate for systemisation—in effect viewing them as ‘data marts in waiting’—is misleading when their one-off and ad hoc uses are recognised and their wide proliferation and coverage taken into consideration. One might more accurately define data marts and data warehouses as ‘scaled-up systems which perform some of the tasks normally done by a spreadmart’.
The report’s framing of spreadmarts is a product of its BI/DW view of the world. In that worldview, data lives in source systems and needs to be ETL-ed into a relational repository in order to be published en masse to business users whose analysis requirements consist of read-only slice and dice. These needs do of course exist, but analysis entails much more. This narrower BI/DW view is reflected in the TDWI survey’s design. For example, respondents are asked to rank the top five reasons why spreadmarts exist in their group. “Quick fix to integrating data” ranks second overall, “Inability of the BI team to move quickly” third, ”This is the way it’s always been done” fifth, and “Desire to protect one’s turf” seventh. These options are hardly worded neutrally. They’re phrased so as to norm systemisation and to cast ad hoc and one-off uses of spreadmarts as the products of sloppiness, frustration, ignorance, and narrow self-interest. The section on the benefits of spreadsheets is similarly biased in its framing. “Ideal for one-time analysis” and “Good for prototyping a permanent system” are listed, but not ‘ideal for exploring data, creating scenarios, capturing assumptions, and enriching existing data’. Interpretation also suffers. For example:
Ironically, organizations with a “low” degree of [BI] standardization have the lowest median number of spreadmarts (17.5), and only 31% haven’t counted them. The proper conclusion here is that standardization increases awareness of spreadmarts.
It may be a convenient conclusion, but it certainly isn’t the only one. Perhaps the 70 to 80 percent failure rate of BI standardisation projects is driving business users back to spreadsheets. Excel integration with BI platforms is also filtered through a read-only lens:
BI vendors are starting to offer more robust integration between their platforms and Microsoft Office tools. Today, the best integration occurs between Excel and OLAP databases, where users get all the benefits of Excel without compromising data integrity or consistency, since data and logic are stored centrally.
This tends to be how BI vendors understand Excel integration. They recognise that users enjoy Excel as a query and reporting interface, but they understate its importance as a data and logic creation tool. The BI/DW worldview understands all data as the product of business processes which write it to ‘source systems’. (The one exception to this is the sub-discipline of enterprise budgeting and planning, which is in effect BI with people as the source systems.) What is underappreciated is that analysis itself is a business process—one which can’t help but create data.
These biases are understandable. The TDWI report is sponsored by a collection of BI software vendors, and 60% of those surveyed are IT professionals. What’s missing, then, is a more nuanced understanding of what analysis entails. Such an understanding needs to recognise:
- The centrality of data transformation and enrichment by individual analysts
- The value of tacit data
- The importance of presentation
Simply put, analysis is a read-write activity. Routine analytical tasks I find myself doing, for example, include:
- Entering hitherto tacit data
- Codifying business knowledge
- Finding and synthesising data from outside sources
- Creating dummy and randomised data
- Capturing novel assumptions
- Imposing new categories on existing categorical data
- Enriching existing data by deriving or devising on-the-fly metadata
- Building scenarios and constructing counterfactuals
- Drafting and adding commentary, interpretation, and notes
- Formalising and detailing new questions and follow-on analyses
All of these activities involve me creating new data, and I would submit that neither I nor any BI/DW requirement gathering cycle would be able to anticipate that data ahead of time. These are creative, reflective, results-contingent activities. As the report puts it:
[M]ost BI vendors have recognized that a large portion of customers are using their BI tools simply to extract data from corporate servers and dump them into Excel, Word, or PowerPoint, where they perform their “real work.”
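To make the read-write point concrete, here is a purely illustrative sketch of a handful of the tasks listed above, written in Python/pandas rather than a spreadsheet; every name, category, and figure is invented for the example.

```python
# Illustrative only: a few of the read-write analytical tasks listed above,
# sketched in Python/pandas rather than Excel. All column names, categories
# and figures are invented for the example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# "Creating dummy and randomised data": a stand-in sales extract.
sales = pd.DataFrame({
    "store": rng.choice(["Alpha", "Beta", "Gamma"], size=12),
    "product": rng.choice(["widget", "gadget"], size=12),
    "units": rng.integers(5, 50, size=12),
})

# "Imposing new categories on existing categorical data": a grouping that
# exists only in the analyst's head until this moment.
region_map = {"Alpha": "Metro", "Beta": "Metro", "Gamma": "Regional"}
sales["region"] = sales["store"].map(region_map)

# "Capturing novel assumptions" / "codifying business knowledge": a tacit
# rule of thumb made explicit as data (assumed, not sourced from any system).
assumed_margin = {"widget": 0.35, "gadget": 0.22}
sales["est_margin"] = sales["product"].map(assumed_margin)

# "Building scenarios and constructing counterfactuals": an on-the-fly uplift.
sales["units_promo_scenario"] = (sales["units"] * 1.15).round()

# "Drafting and adding commentary": interpretation travels with the numbers.
summary = sales.groupby("region")[["units", "units_promo_scenario"]].sum()
summary["note"] = "Scenario assumes 15% promo uplift; margins are assumed."
print(summary)
```

None of the derived columns exist in any source system; they are created by the act of analysis itself, which is precisely the data a BI/DW requirements-gathering cycle cannot anticipate.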
If we additionally overlay the question of data value, one of Analyst First’s key contentions is that the most valuable information in organisations lives in people’s heads. It’s tacit, and spreadsheets are one of the best tools for eliciting it and making it explicit:
Spreadsheets are the most pervasive and effective decision support tools. No organisation doesn’t use them, and it’s a safe bet that this will always be the case. No amount of data warehousing will ever be able to provide decision makers with all the information they need. To the extent that it can, those decisions can be automated. Decisions invariably require new data. That new data will be either unanticipatable, or tacit, or both. Spreadsheets are unbeatable for ad hoc data analysis and turning tacit data into explicit data.
But spreadsheets aren’t the only tools available for tacit data mining. Nor, for some types of data, are they the best tools. As the post ‘The Economics of Data – Analytics is… Investing in Data’ argued:
The most interesting, readily available, strategically relevant and poorly understood form of data is tacit data: the information contained in the brains of staff, board, shareholders and anyone else who would see the organisation do well… How is tacit data mined? The most effective and powerful way is by use of collective intelligence and forecasting techniques, such as prediction markets.
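That post doesn’t prescribe a particular market design, but a minimal sketch can make the mechanics less abstract. The one below uses Hanson’s logarithmic market scoring rule (LMSR), a standard prediction-market mechanism; the outcomes, liquidity parameter, and trade are invented for the example.

```python
# Illustrative only: the mechanics of one common prediction-market design,
# Hanson's logarithmic market scoring rule (LMSR). Outcomes, liquidity and
# trades below are invented for the example.
import math

def lmsr_cost(quantities, b):
    """Cost function C(q) = b * ln(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(q / b) for q in quantities))

def lmsr_prices(quantities, b):
    """Instantaneous prices, i.e. the market's implied probabilities."""
    weights = [math.exp(q / b) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]

# Two-outcome market: "project ships on time" vs "it doesn't".
b = 100.0                 # liquidity parameter (assumed)
q = [0.0, 0.0]            # shares sold so far for each outcome

print(lmsr_prices(q, b))  # [0.5, 0.5] -- no information in the market yet

# A staff member who believes in the deadline buys 40 "on time" shares.
new_q = [q[0] + 40.0, q[1]]
trade_cost = lmsr_cost(new_q, b) - lmsr_cost(q, b)
q = new_q

print(round(trade_cost, 2))  # what the trader pays for the position
print(lmsr_prices(q, b))     # probability of "on time" now sits above 0.5
```

The implied probabilities move only when someone is prepared to back a view with a trade, which is what makes the mechanism a tool for surfacing tacit data rather than just another survey.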
Finally—the analysis process and ad hoc, one-off needs aside—decision support systems need to be more than portals for publishing structured data as tables, charts, and indicators. As the TDWI survey picks up:
While Excel is the most popular tool for building spreadmarts, business analysts also use Microsoft Access, PowerPoint, and SAS statistical tools.
SAS and PowerPoint are telling inclusions: SAS contains a great deal of statistical and modelling functionality that BI stacks don’t (or certainly didn’t in 2007); PowerPoint is able to flexibly integrate unstructured commentary with the more structured outputs of BI platforms—as is Word. The TDWI report itself is an example of this: most of it is unstructured text, then there are graphics and other design elements, and finally the charts and tables. Very few high value analyses don’t contain narrative, diagrams, and other unstructured presentation elements.
The TDWI report does in fact acknowledge all of the above:
[T]here is often no acceptable alternative to spreadmarts. For example, the data that people need to do their jobs might not exist in a data warehouse or data mart, so individuals need to source, enter, and combine the data themselves to get the information. The organization’s BI tools may not support the types of complex analysis, forecasting, or modeling that business analysts need to perform, or they may not display data in the format that executives desire.
The report’s biases are in its view of analysis as fundamentally amenable to structure, and of spreadmarts (as one of analysis’s enablers) as precursors of systems. Scalability, repeatability and automation certainly have their place, but a more realistic view would recognise that the analytical activities which a data warehouse can viably support are only a subset of the whole. This has important implications for BI/DW practices now that the Business Analytics domain incorporates them alongside advanced analytics and the emerging field of big data.