EMC Greenplum’s Mark Burnard and I sat down late last year to talk about the future of data warehousing, big data, and analytics…

SS: How did you fall into data warehousing?

MB: I built spreadsheet systems during the early to mid-1990s while working for Rio Tinto and then Telstra. Then I was working for Corporate Express around 1995 when someone dropped a folder on my desk that said “Cognos”, so I picked it up and kind of ran with an embryonic data warehouse project at Corporate Express, built in SQL Server. So DTS, transformations, dumps out of a mainframe into text files, suck them up every night, build a cube in Cognos. And we took that live, and I thought okay this is really cool, I’d like to do more of this, so I jumped into consulting. That would have been 1999. Since then I’ve moved from BI report development to requirement definitions through to solution architecture and then more strategic consulting, and now to EMC Greenplum. That summarises about 20 years in IT.

SS: In data warehousing there’s a “warehouse first, marts second” approach associated with Inmon and a “marts first, radiate out” approach associated with Kimball. Do you have a view on those design philosophies?

MB: I think Inmon, as the father of data warehousing, came up with the concept of the corporate information factory. The philosophy as I understand it is that you want to build this machine that can answer any question that anyone could possibly ask about the business. So to do that you start off with the data architect and he maps out the entities that are of interest to the organisation (what matters, and then what the attributes of those entities are at a high level), then comes up with a conceptual data model. The goal is that you encapsulate in this model everything that matters to the organisation. You turn it into a logical data model, then into a physical data model, then you start building the data warehouse, and eventually you can answer any question out of it. Data marts come along as a secondary spin-off. The underlying warehouse is big and complex, and it’s assumed that the analysts who want to work with its information are 1) only interested in a specific subset: sales, marketing, finance, HR…whatever it happens to be; and 2) that there is complexity in the model that you don’t want them to see as they’ll get really confused, or they’ll drag and drop the wrong columns into a report and start reporting numbers that are faulty. I think that approach works fantastically if the world stays still long enough for you to complete your data model and get the data warehouse built, but the reality is that that never happens. You could probably safely say that the world was a bit slower 30 years ago when organisations were doing this model. The early sites tended to be companies that were so big that a project like that was able to return a business benefit simply because of the size of the organisation. The Bank of America, Citibank, AIG (some of the big guys who built the early data warehouses) benefited a lot from that approach. But I think now the game’s changed, and a 6-month turnaround time for a BI project is too slow.
Marketing teams want to be able to spin off new products, sometimes one or two a month. Telcos, for example, are spinning off new plans, new product bundles, new marketing messages and solutions that require a lot of agility in the billing system and the provisioning system. Their core systems have to be a lot more agile and the data warehouse has to keep up with that. So if we’re structuring a new product bundle every month and taking it to market, the billing system may be able to keep up with that, the provisioning system may be able to keep up with that, even multiple provisioning systems, but when that all hits the data warehouse, in some of the situations I’ve seen, it becomes a screaming mess. Nobody can keep up, the warehouse can’t keep up, and then the folk who’ve rolled out the new bundle want new reports within the first week of having it in the market. They want to know “How many are we selling?”, “What’s the profitability?”, “What is the profile of the market segment that is purchasing this new product?”, and the warehouse says “We can give you that information in 6 weeks, but right now you’ve just broken all my reports with your new constructs and we’ve got to do some extra work on this data mart over here, and then we’ve got to tweak the ETL, and then we’ve got to do regression testing, impact analysis, etc.” So these days, there needs to be a way of meeting the business need for both agility and reporting that keeps up with the pace of change. From what I’ve seen, the classic traditional data warehouse model is not able to do that.

SS: So business is becoming faster and organisations are becoming more complex, and people with an appetite for analysis are demanding more in terms of information, feedback, and measurement?

MB: I think it’s fair to say that the pace of business change has accelerated.  There was a time when the core business of a bank might have been savings accounts and mortgages, but within the last 20 or 30 years that has diversified into online trading, reselling insurance products, having superannuation funds to manage. So the diversity of offerings that a bank is now bringing to the market—they might be offering 50 to 100 products where in the past they offered 10 or 12. Same thing with a telco. Back in Telstra’s early days it was fixed lines, and the only thing to worry about was local calls, STD calls and then international calls, there was nothing else. Now you’ve got mobile, data over mobile, over 3G, all the different product bundles that people have, caps—you’ve got telcos churning out new caps every 6 months—new phones, new bundles, bundling fixed with mobile with home internet and even cable TV—and needing to track when someone cancels one service but is still getting the bundle discount, for example. All that kind of complexity never existed even 10 years ago. So I think the pace of change and the complexity of doing business has changed, and the classic model that we’ve had for data warehousing has not significantly changed. In fact it used to be a line we would drop to reassure the client when positioning a data warehousing project to board level executives, to say “Data warehousing has not changed in 30 years”.  In other words, you can be confident that this is a well-worn space, an area of technology that was not invented yesterday, and therefore that all the lessons have been learned, all the pain has been experienced, all of the lessons have been written up, and so building a data warehouse is a pretty safe endeavour.

SS: So presumably you don’t present that message today?

MB: No. Because I think the message isn’t accurate anymore. I see it diverging in two directions. I see your classic data warehouse which has the stringency around it. It has the discipline around things like being able to identify your data lineage, having your metadata available, being able to explain that this number on this report came from this table in the data mart, which came from this table in the core data model, which came through these transformations, from this source system, and that says 100, and this says 100, and we know that the numbers are correct. That’s been the standard approach to data warehousing, and everybody spends huge amounts of energy, time and money trying to keep the warehouse up to date so that you can confidently say to the board, or the ASX, or ASIC, or APRA, or whoever it happens to be when you put a report down in front of them, that you’ve got your traceability, your auditability, your metadata, your data lineage, and so on. The CFO, for example, has to be able to sign off on those numbers. The problem now is that we’ve taken that disciplined model—which is fantastic for reporting on financial numbers to regulators—and we’ve extended it into the domains of HR, and marketing, and across the entire enterprise data model. We wanted to fit the entire enterprise into the data warehouse. What I think we’re seeing already happen is that the locus of subjects that fit into the traditional data warehouse is shrinking to include only those where that level of rigour is required for reporting and analysis—areas such as finance, risk, and other things that go to regulators. All the other stuff, which is the other 95% of the business, will probably end up in a much more flexible, dynamic platform which we’re starting to call an analytical warehouse or analytic warehouse. It’s not your traditional data warehouse. Usually in terms of data volume it’s a lot larger, a lot more atomic, a lot more nimble, agile, and almost ad-hoc in a way. 
To be honest, people have always done this—they’ve just done it under the table: take a data set, throw it into an Access database, take a bigger data set, throw it into SQL Server, do something on a laptop, do something on a server that they bought on the corporate credit card down the road because the data warehouse wasn’t able to give them the data they required, or because running the query would slow down all the standard reporting. People have always done analytic warehouses; they were just under the radar.

SS: So they’re coming out of the closet in a sense?

MB: That’s right. Because the pain of the traditional approach is now such that it’s unsustainable, and so what everybody’s been doing all along covertly, organisations are now beginning to look at and say, “We have to cater for that model.” They have to have a way of managing information that caters for ad-hoc analysis, for marketing guys running quick, nimble little models to come up with some market segment so they can generate a campaign to a particular demographic. Traditionally their only official option was the data warehouse, but they couldn’t throw enough information in, or the queries ran too long, or the joins were too complex given some third normal form industry data model. They’ve been doing it under the table, so let’s elevate that and bring it back into the fold of IT but on a platform that can handle that kind of ad-hoc dynamic: throw some web clicks in here, throw some Twitter feeds in there, mix in some data from the Australian Bureau of Statistics, get some smart people to have a look for correlations, integrate the results from the latest campaign for marketing a platinum credit card to a certain demographic, and let’s do some follow up on that. All of this dynamic information needs a home (well, it’s always had an unofficial home, on someone’s hard drive), but what I see happening is that it’s moving back into a more recognised, structured place, without the imposed controls that had it running away from home in the first place.

SS: So it sounds like we’ve got the emergence of two domains. There’s a domain of stability: the traditional data warehouse, which fundamentally hasn’t changed much in 20 to 30 years. Its set of disciplines maps well to data and business processes that also fit that description. Something like financial reporting, where the accounting standards are pretty stable and where there are widely accepted processes which are professionally constrained for getting data into that shape. Then there’s a more uncertain domain: the data’s uncertain, the analysis is uncertain, the queries are uncertain, the perspective is uncertain, and it sounds like that’s broken in a way too. You’re saying that people have been doing it under the table for a long time, but it sounds like the scale is now such that it doesn’t fit under the table anymore. The tools are breaking.

MB: That’s exactly right. The data volumes that we are beginning to deal with won’t fit on a little server under the desk anymore. You’re talking about data volumes that will explode out past many terabytes eventually and you’re not going to fit it on your laptop, and if you did your single query would take the whole weekend. Maybe a real world example would be useful. One of EMC Greenplum’s reference sites is T-Mobile in the US. They have their classic traditional data warehouse. It’s 100 terabytes of data and it has the discipline and the control and the industry standard data model. Billing information comes in, provisioning information comes in, and they can report to the market out of the data warehouse their precise number of subscribers and their exact financials every month. All of that controlled information that you’ve just described. Alongside that they have an analytical warehouse which happens to be a petabyte. It’s about 10 times the size of the traditional data warehouse. Into that flows a lot, if not all, of the same data. But it also contains 900 terabytes of other data, which is market sentiment analysis, unstructured data, and all kinds of data that is structured but which is too hard to fit into the industry standard data model. The other thing that’s probably worth mentioning is that within the 100 terabyte traditional data warehouse you have a lot of summarisation of call data records. For example, after 2 or 3 months you’ve finished with your standard reports and you summarise them to one row per customer per month instead of one row per phone call or one row per text message. This means you can still report accurately on all of your regulatory requirements, but you can’t do analysis anymore at the atomic level of detail. In the analytic warehouse, on the other hand, you might have that history for 3 or 4 years. Why would T-Mobile need that? Well, they were looking for drivers of churn, which is one of the biggest expenses for a telco.
It takes 10 or 20 times the amount of money to win back a customer than it does to keep them and not have them leave. So they kicked off a project to try and identify the most significant indicators of a customer who was about to churn. To start with, they did some modelling around complaints to the call centre. Was there a pattern whereby someone who calls the call centre 3 or 4 times within 3 months was more likely to churn? That couldn’t be done on the existing data warehouse platform. Some of the data was there, but not all of it, and there wasn’t a business case to put it in there on the basis of speculation. And even if there was, had they started to run such an analysis it would have slowed down all of the business-as-usual reporting that the warehouse needed to support. Hence the analytic warehouse. First they did analysis on call centre logs looking to relate complaints to the churn indications that came from the warehouse. No correlation. Then they did a similar exercise looking at call dropout rates and call termination codes. Were dropped calls triggering churn? Again, no clear correlation. Note that they needed the unsummarised call records for this, but the data warehouse had already summarised them. So then they let a data scientist loose in the analytical workspace, unconstrained by the data model, unconstrained by performance concerns related to running obtuse queries. It turned out that the most significant precursor to a customer churning was whether or not any of the people a given customer called the most churned. It makes sense because now I’m calling them telling them I’m with a new telco. The correlation was so strong that if I churn today, the people I call the most are seven times more likely than the average customer to churn in the next 3 months. So they took that insight and operationalised it. 
The action they developed was that as soon as an individual churned, anybody they frequently called was targeted with an attractive offer to re-contract. They estimated the monetary benefit to the organisation from this was $70 million. Rob Strickland, the CTO of T-Mobile at the time, still tells this story. He says, “Here is my data warehouse. I’m spending millions and millions of dollars on maintaining this thing, and what it gives me is a bunch of standard reports. And yet over here, from my sandpit that cost me one million, comes the insight that was of $70 million benefit to the organisation.” They’re spending tens of millions of dollars on the data warehouse, but it’s the one million dollars on the analytic warehouse that’s differentiating them in the market.
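
As an editorial aside, the contact-based churn signal in this story can be sketched in a few lines of Python. This is a toy illustration only, not T-Mobile's actual method: the customer names, the call records, the two churn windows, and the helper functions `top_contacts` and `churn_lift` are all hypothetical, and the numbers are made up to show the mechanics rather than to reproduce any reported result.

```python
from collections import Counter, defaultdict

# Hypothetical customer base and atomic call records (caller, callee), one per call.
customers = ["ann", "bob", "cat", "dan", "eve", "fay", "gus", "hal"]
calls = [
    ("ann", "bob"), ("ann", "bob"), ("ann", "bob"), ("ann", "cat"),
    ("bob", "ann"), ("cat", "dan"), ("dan", "eve"), ("eve", "fay"),
    ("fay", "gus"), ("gus", "hal"), ("hal", "eve"),
]
churned_first = {"ann"}          # assumed: churned in the first window
churned_later = {"bob", "gus"}   # assumed: churned in the following window


def top_contacts(calls, n=1):
    """Map each caller to the n people they call most often."""
    counts = defaultdict(Counter)
    for caller, callee in calls:
        counts[caller][callee] += 1
    return {caller: [callee for callee, _ in tally.most_common(n)]
            for caller, tally in counts.items()}


def churn_lift(customers, calls, churned_first, churned_later, n=1):
    """Later churn rate among top contacts of early churners,
    relative to the later churn rate of all remaining customers."""
    contacts = top_contacts(calls, n)
    remaining = set(customers) - set(churned_first)
    exposed = {p for c in churned_first for p in contacts.get(c, [])} & remaining
    exposed_churn = len(exposed & set(churned_later))
    total_churn = len(remaining & set(churned_later))
    if not exposed or total_churn == 0:
        return None  # not enough data to form a ratio
    return (exposed_churn * len(remaining)) / (len(exposed) * total_churn)


lift = churn_lift(customers, calls, churned_first, churned_later)
print(lift)
```

Note that `top_contacts` needs one row per call. Once the records have been summarised to one row per customer per month, the caller-to-callee pairs are gone and this calculation is no longer possible.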

SS: There’s a distinction worth recognising there between the process that generated the insight and then the subsequent operationalisation of that insight. It may have cost more than a million dollars to operationalise the insight—to get it out to call centres, to marketing and product teams, to come up with and execute the right campaigns and performance manage the whole thing—all of those things cost money too. But the discovery process to get to the insight was itself relatively cheap. Interesting too that many of the instinctively explanatory metrics didn’t turn out to matter in this case: complaints to the call centre, dropped calls, network, phone, plan, demographics. It turned out to be the social network.

MB: As measured by the call data records. One of the outcomes of that learning was the benefit of keeping atomic level call data for months or years rather than summarising it—which you have to do in a traditional warehouse because of the cost of storage.

SS: Because you don’t know from what perspective you’re going to analyse it?

MB: Correct. You may come up with a question in 3 months’ time that requires atomic level data over 2, 3, 5 years. If that’s been summarised you can’t get it back. This is part of the big data story. Organisations are starting to realise that there’s real gold in this atomic level data if they can find a way to keep it long enough, cheaply enough, in order to enable analytics to be run.
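
To make the "you can’t get it back" point concrete, here is a minimal sketch of the kind of monthly roll-up described in the interview. The record layout (customer, month, callee, dropped flag) and the `summarise` helper are assumptions for illustration, not any warehouse's actual schema.

```python
from collections import defaultdict

# Hypothetical atomic call data records: (customer, month, callee, dropped)
cdrs = [
    ("ann", "2011-01", "bob", False),
    ("ann", "2011-01", "bob", True),
    ("ann", "2011-01", "cat", False),
    ("ann", "2011-02", "bob", False),
    ("dan", "2011-01", "eve", False),
]


def summarise(cdrs):
    """Collapse atomic records to one row per customer per month."""
    summary = defaultdict(lambda: {"calls": 0, "dropped": 0})
    for customer, month, _callee, dropped in cdrs:
        row = summary[(customer, month)]
        row["calls"] += 1
        row["dropped"] += int(dropped)
    return dict(summary)


monthly = summarise(cdrs)
print(monthly[("ann", "2011-01")])  # {'calls': 3, 'dropped': 1}
```

The summary still supports counts per customer per month, but the callee column has been discarded, so a later question such as "who did this customer call most?" cannot be answered from `monthly` alone.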

SS: That in some sense addresses the ‘volume’ component of big data. What would you say about the ‘velocity’ and ‘variety’ components?

MB: Storage and computational power are both becoming cheaper, and technologies like Hadoop are making those two factors a lot easier to wrestle with. I won’t say that Hadoop is easy, because it’s not, it’s a very embryonic technology. But the vision is to be able to take large data sets that previously were ignored, that come under the label of ‘exhaust data’: application logs, machine logs, detailed stuff that you would never bother collecting, because why would you? Organisations are now beginning to see that it’s not actually that expensive to keep this data. And not only that, really valuable information can be harvested from it. One organisation we work with, for example, captures all of the application logs from a certain application it uses and mines them for indicators of internal fraud. Are any employees or other operators going into areas of the system they shouldn’t be? There was a time when that log file information wouldn’t be collected because it was just too expensive to store, let alone analyse. The internal audit approach in such a regime relies on occasional random sampling. But now you can actually have definitive indications from the application logs. So you’re saving time, money, and the imprecision of having someone look randomly, because as soon as an alert comes up it can be managed by exception. It frees up the team who used to spend all day looking at random samples to be more productive.
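
A rough sketch of what "managing by exception" from application logs might look like follows. Everything here is hypothetical: the log line format, the user names, the `ALLOWED` entitlement table, and the `alerts` function are invented for illustration and do not describe the organisation mentioned above.

```python
import re

# Assumed log format: "<timestamp> user=<name> area=<area> action=<action>"
LINE = re.compile(
    r"^(?P<ts>\S+) user=(?P<user>\S+) area=(?P<area>\S+) action=(?P<action>\S+)$"
)

# Hypothetical entitlements: which system areas each user may enter.
ALLOWED = {"alice": {"billing", "reports"}, "bob": {"reports"}}


def alerts(lines, allowed):
    """Yield (timestamp, user, area) for every access outside a user's entitlements."""
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        user, area = m["user"], m["area"]
        if area not in allowed.get(user, set()):
            yield m["ts"], user, area


log = [
    "2012-03-01T09:00:00 user=alice area=billing action=view",
    "2012-03-01T09:05:00 user=bob area=payroll action=view",
    "2012-03-01T09:06:00 user=bob area=reports action=export",
    "2012-03-01T09:07:00 user=carol area=reports action=view",
    "malformed line",
]
found = list(alerts(log, ALLOWED))
for ts, user, area in found:
    print(f"ALERT {ts}: {user} entered {area}")
```

Every line is checked rather than a random sample, and only the exceptions reach a person. In this sketch an unknown user (`carol`) is treated as having no entitlements, which is a design choice an audit team would want to decide for itself.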

SS: What are the barriers to an uptake of exploratory analytics as an accepted activity? Cost is one, and you’ve addressed that by pointing out that cheaper solutions are now available.

MB: I think that the IT architect in particular is going to have to embrace the new paradigm. A traditional IT architect is all about structure, reusability, and building things in a certain way so that they benefit the entire enterprise. This drives things like single enterprise views of architecture and standardisation on as few technologies as possible. It tends to weigh against speculative, exploratory and unconventional ways of doing things. So if the business approaches IT and says, “Look, I don’t know exactly what we’re going to do but we think there are insights in our data and we know we need a few terabytes to play with. Can you give me space?”, they’d be very lucky to get funding. That’s not considered a business case. At the same time, analysts probably need to be a bit more aware of how to put their requirements into the right language for architects: What level of security does it need? What level of availability does it need? What level of backup and recovery does it need if any? How much data volume? What are the impacts on the network going to be? What will be the flow on effects to production applications? Standard non-functional requirements—security, availability, backup, recovery. There are 6 or 7 of those that are your standard application/solution non-functional requirements and they’re the kind of things that an architect needs to be able to tick the box on.

SS: Finally, what attracted you to your current role?

MB: I see Greenplum as probably the most innovative and ambitious of the new breed of analytical database technologies. At EMC, Pat Gelsinger in particular looked at the other technologies out there and chose Greenplum because it was able to run on commodity hardware, it had flexibility within the data model to handle unstructured, semi-structured, and structured data, row based and column based storage in the same table—a whole load of stuff that everybody else is trying to catch up with at the moment.  So I see it as early days in the market for an innovative, disruptive technology.

The recent IAPA discussion panel on ‘Aligning IT and Analytics to deliver sustainable innovation’, plus a later conversation with fellow panellist, EMC-Greenplum’s James Horton, prompted me to sketch some thoughts on what an Analytics Lab ought to do. The lab is the natural home for Analysts engaged in the narrower definition of Analytics:

Purpose

The Analytics Lab is an innovation factory which constantly evaluates data, quantitative methods and tools looking for sources of competitive advantage.

It evaluates:

  • Data: structured and unstructured, sourced from both inside and outside the organisation, established and new.
  • Methods: data transformation, and then data mining, machine learning, statistical, mathematical, and other analytical methods.
  • Tools: as appropriate to method, from programming languages through to GUI applications, from commodity and open source through to commercial tools.
  • Analysts: the lab enables the organisation to evaluate the technical abilities and innovative propensities of its analysts, as well as those on offer from external service providers, without many of the interfering factors present in operationally hardened IT environments.

Its outputs are:

  • Insights
  • BI prototypes
  • Instantiation candidates

It also:

  • Identifies data and knowledge gaps: Analysing data and generating insights brings to light new data needs and exposes gaps in knowledge which may impact the business. Additional data may need to be sourced, gathered through survey, collected by tweaking an existing business process, or purchased from a third party. Additional analyses and subject matter expertise may be required to close knowledge gaps.
  • Resolves disharmonies: All businesses struggle with ‘different views of the truth’, and it’s often the crunching of data which brings these to light. Disharmonies might be within or between data sets, or between conventional wisdom and the drivers of a model. They could relate to anything from actual observations to tacit assumptions. Resolving such disharmonies—harmonisation—involves identifying, scoping, validating, and correcting them.

These last two are not the core business of Analytics, but they’re important activities, and doing Analytics naturally leads to them. Most organisations don’t explicitly provision for them, but arguably they should. The lab is as good a home for them as any other.

Beneficiaries

The Analytics Lab services all levels of business, but in different ways:

  • Senior Management: through the provision of strategic insights.
  • Middle Management and Knowledge Workers: through one-off and/or prototyped BI analyses.
  • Frontline Workers: through the identification of instantiation candidates, i.e. deployable operational analytics.

Context

Many analyses typically need to be tried before those which merit instantiation are discovered. Furthermore, “instantiation” doesn’t necessarily mean a repeatable process. It could simply mean the communication of a one-off insight, e.g. “revenue growth is unmistakeably slowing in all but one customer segment” or “the most reliable predictor of a customer’s propensity to churn is their social network membership.” Such insights are typically complex, valuable, but not “actionable” in any deterministic, automatable way.

Other findings are suited to more regularised delivery, for example as managerial decision support through business intelligence.

Some analytical results, in order to be fully leveraged, need to be integrated into frontline business processes. Predictive models which predict customer acquisition or churn, for example, might require integration in sales, marketing, call centre, channel management and customer support processes.

Approach

Conduct disciplined, exploratory analyses which repeatedly cycle through the following sorts of questions:

Data questions:

  • Is there structure in the data (patterns, trends, relationships, networks, segments, clusters, indicators, drivers, outliers, anomalies)?
  • Are there new insights in the data?
  • Which models are viable?
  • Which variables are important?
  • Which variables do we control?
  • What are the implications for revenue, cost, risk?
  • What data do we want that we don’t have? How could we get it?

Deployment questions:

  • What are the implications of this insight?
  • Who is our internal customer for this insight?
  • Would this analysis be valuable if provided on an ongoing basis? To whom?
  • Into which existing or envisioned business processes should this insight be instantiated?

Harmonisation questions:

  • Where are there disharmonies in tacit or explicit data and assumptions?
  • Which projects, processes and decisions are affected by these disharmonies?
  • How do we validate and resolve these disharmonies?

Key infrastructure

Infrastructure can usefully be separated into the ‘electronic infrastructure’ of hardware and software and the ‘human infrastructure’ of people, relationships, management and incentives.

Electronic infrastructure

  • Secure, off-network ‘sandpit area’
  • Big storage, big memory, scalable to big data
  • Eclectic analytical toolset: commodity, open source, commercial, experimental, in-house
  • Snapshots, copies, feeds of all manner of available data sources: pre-ETL, pre-warehouse, post-warehouse, external, web, social media, unstructured. In the context of the lab, the data warehouse is just another source system.
  • De-emphasis on repeatable technical processes and compliance with production IT architecture
  • Insulated from IT Service Level Agreements and other production / core system / business-as-usual constraints

Human infrastructure

  • Human Resources:
    • Analysts: Data scientists
    • Management: Validate analysis objectives, ensure that analysts remain focused, performance manage the innovation process.
  • Relationships:
    • Sponsorship from Executive
    • Cross-functional relationships with business units: both ‘push’ (business unit as customer) and ‘pull’ (business unit as subject matter expert)
    • Close relationship with Strategy function
    • ‘Caveat utilitor’ relationship with IT for data provision and tool support
    • Various relationships with service providers: vendors, consultants, training and mentoring providers, industry expertise, academia if appropriate
  • Performance Management:
    • Innovation / Research metrics
    • Risk metrics
    • Sentiment metrics
    • Dimensions of opportunity: Internal, Competitor, Market, Customer, Product, Channel

The relationship between IT departments and analytics teams has at times been hostile. Why? Both have a strong focus on enabling business and have responsibilities for data and its use. There is an apparently obvious requirement to achieve alignment yet many organisations and departments have struggled to do so. In this panel discussion we will explore the basis for these difficulties, and hear some practical approaches aimed at overcoming them.

That was the subject of an Institute of Analytics Professionals of Australia (IAPA) panel discussion last night in Canberra, facilitated by ACT Chapter Head, Peter O’Hanlon. Alongside me on the panel were:

  • Murray Alston – Enterprise and Solution Architect (Australian Customs and Border Protection Service)
  • James Horton – Director Solutions and Strategy (EMC Greenplum)
  • Warwick Graco – Senior Director Operational Analytics (Australian Taxation Office)

In preparing for the discussion I made the following notes:

For what does Analytics require IT support?

  • Provision of analytics infrastructure—hardware and software
  • Provision of data

What sort of Analytics are we talking about?

  • Finding: Exploratory or “lab” analytics
  • Building: Operational analytics, IT instantiations of the outcomes of analytics

Why the hostility?

I see two main causes:

  • Misunderstanding: adopting the wrong risk management framework
  • Misincentive: adopting a siloed, self-interested risk management framework

Both of these result in suboptimal trade-offs between risks and rewards. From the IT point of view it’s easier to imagine downsides than upsides when it comes to Analytics. Giving analysts access to large volumes of organisational data with powerful tools to ask complex questions raises all sorts of concerns about privacy, data security, network availability, application stability, and so on. These can be largely mitigated through matching the right electronic infrastructure to analytical activities—particularly, recognising that ‘Finding’ activities (innovations) can take place in an off-network Analytics Lab or ‘sandpit environment’ and are separate from ‘Building’ activities (instantiations). Then there are the risks of misinterpreting data, making mistakes in the analysis process, and misinforming decisions. These are worth worrying about, but it’s also important to ask about the unseen upsides. What about the risks of not doing Analytics? There are trade-offs, but I’d prefer the IT department that gives its analysts unfettered access to data and tools in a sandpit environment—with the caveat that it takes no responsibility for misuse or misinterpretation of that data—to the IT department which won’t release data to analysts for fear of blame in case they make a mess.

These problems aren’t unique to Analytics but it is the most pathological case. The problems scale according to the degree of reliance on electronic infrastructure and data, which is why Finance functions recognise many of them. Resolving them—that’s to say, ‘aligning IT and Analytics’—suggests an enterprise risk management function which sits above both functions and can weigh and trade off risks and rewards. Perhaps this is ultimately the CEO’s role.

Where does Analytics belong?

There are two dimensions here:

  • Technical, in which sense Analytics sits “between IT and the business”
  • Functional, in which Analytics sits “not between IT and the business”

It’s common to infer the functional from the technical: because Analytics uses lots of sophisticated software and analysts are technically savvy, it must be IT. This is a mistake. Analytics is done by analysts, but we never say that Analytics should therefore sit “between HR and the business”. Where it should sit in a functional sense will depend mostly on organisational culture and context. Examples include:

  • Business Intelligence / Decision Support
  • Intelligence and sensemaking
  • Strategy
  • Innovation
  • Research & Development
  • Knowledge Management
  • but not IT


So far from making us more profligate with information, perhaps the Goddess of ‘big data’ will spur us to be smarter in data selection, and ensure more intelligence is embedded within our data extraction, transformation and reporting processes.

Greg Taylor’s comments on the ‘Knowing what you’re missing’ post are spot on. One of the clear implications of the big data explosion, technical challenges aside, is that manual analysis methods simply can’t scale to the volume and velocity at which potentially relevant data is being generated. As such, analytics (particularly of the machine learning variety) is ever more vital. One of my rules of thumb in consulting is that any OLAP cube is a standing business case for predictive modelling. As I put it in the ‘Advanced analytics and OLAP’ post:

OLAP makes multidimensional data exploration about as fast and intuitive as it can be when a human is doing the driving. This means being able to arrange on screen, in two dimensions (perhaps taking advantage of colour and shape to visualise a third and fourth), relatively small subsets and arithmetical summaries of data. Advanced analytics, however, automates exploration. Only data mining methods can look at all dimensions simultaneously, at all levels, in combination. And they can do this in unsupervised (looking for natural structure in the data) or supervised (inferring input-outcome relationships) modes.

Greg elaborates:

  • So as Analysts search more broadly for relevant data to meet the decision making requirements of management, perhaps they need to increasingly ask themselves: how will this piece of information fit within the network of predictive functions which explains the business?
  • How might Analysts apply Occam’s razor to ensure only information which contributes predictive understanding is included, given the exponential growth in the potential data sources that could be used? One logical approach is for Analysts to undertake more experimental testing of variables (and transformations) for their explanatory power with respect to business outcomes.
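As a rough illustration of the "experimental testing of variables (and transformations)" that Greg describes, here is a minimal sketch that ranks candidate variables by how much of a business outcome each explains on its own. The variable names, the toy figures and the choice of squared Pearson correlation as the measure of explanatory power are all assumptions made for the example, not anything taken from Greg's comments; a real screen would use held-out data and multivariate models.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable explains nothing
    return cov / math.sqrt(var_x * var_y)


def rank_by_explanatory_power(candidates, outcome):
    """Return (name, r_squared) pairs, strongest single-variable fit first."""
    scores = {
        name: pearson_r(values, outcome) ** 2
        for name, values in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one outcome and three candidate inputs.
sales = [10.0, 20.0, 30.0, 40.0, 50.0]
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]

candidates = {
    "ad_spend": ad_spend,
    "log_ad_spend": [math.log(v) for v in ad_spend],  # a transformation
    "store_id": [3.0, 1.0, 4.0, 1.0, 5.0],            # an arbitrary label
}

ranking = rank_by_explanatory_power(candidates, sales)
for name, r_squared in ranking:
    print(f"{name}: {r_squared:.3f}")
```

Variables (or transformations) that land at the bottom of such a ranking are candidates for Occam's razor. Those at the top are candidates for a proper predictive model.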

As the earlier post reported, status quo electronic infrastructures aren’t ready for big data, and new technologies and disciplines are evolving rapidly to close the gap. But even more substantive changes are required of organisations’ human infrastructures. The key transition that business users of data need to make in the big data context is from the default of consuming more data to the practice of consuming data of higher value. This means becoming analytically literate and learning how to trust and leverage analysts. The key role that analysts must play in supporting decision makers is to understand what constitutes higher value, and to seek it out and communicate it. The key role for IT functions and BI managers is to enable analysts to enable decision makers.


MADlib, the open source in-database analytics package, has gone beta.

Like Analyst First, MADlib benefits from the support of the good folk at EMC-Greenplum. In particular, EMC-Greenplum’s Australian team is responsible for developing the support vector machines, Latent Dirichlet Allocation (a technique for doing topic discovery in unstructured text) and sparse vectors modules.

The business case for the Analyst First model in the big-data space has thus taken a leap.


The Canberra Chapter of Analyst First is off and running as of three hours ago. Thanks go to EMC Greenplum for its sponsorship; to Graham Williams, the most famous data miner in Canberra and now head of the Canberra chapter of A1, for his valuable time and energy; and to the 61 people who attended and made it one of the largest Analytics events in Canberra to date.

Slides here, and a video of the talks will follow soon.

We look forward to active participation in A1 from Canberra’s Analytics community.
