Meanwhile, also from Rexer:
R continued its rise this year and is now being used by close to half
of all data miners (47%). R users report preferring it for being free, open
source, and having a wide variety of algorithms. Many people also cited R’s
flexibility and the strength of the user community.
A few thoughts on Big Data: how it matters, and where the hype does more harm than good. Also some thoughts on what I call “Big Crunch”, Big Data’s Siamese twin, and the one of the pair I am far better acquainted with.
First of all, the hype. A colleague in the industry spoke to me recently regarding an all too common scenario.
The colleague is effectively the analytics manager of the business. The trouble is, the business does not really have much idea of what data analytics is, beyond the fact that it has something to do with big-brand software, which the company has in abundance. There is even less awareness that analytics requires direct engagement and understanding from management; that, most importantly, it requires clean, accessible and appropriate data before anything can happen; that getting such data requires a clear, concerted effort; and that this effort usually consumes the bulk of the investment in time, money and political capital. Certainly, there is no realisation that what the business needs done is easily served by open source and commodity software, but requires investment in good staff and appropriate, coordinated effort from IT and management.
So far, this is a typical A1 case study: it could really be anyone, and is probably an adequate description of at least 100 cases in the city of Sydney alone. The Big Data twist is that a senior manager has been sending the colleague material on Big Data, and on how important it is. Again, this is an increasingly common occurrence.
Naturally, the colleague is exasperated. First of all, the business is not really one that collects, or benefits from, terabytes of data per day. Secondly, they have not even got their small data right: it is neither clean, nor reliable, nor ready for analysis. Finally, the manager in question has done little to get appropriately educated or involved in analytics, but has bought deeply into the Big Data hype, seemingly without understanding it much either.
And here endeth yet another, all too typical cautionary tale.
But Analyst First is more about “Thou Might Want To” than about “Thou Shalt Not”.
So here are the “Thou Shalt” Analyst First commandments on Big Data:
Hear ye, O senior manager/sponsor of Analytics!
Before you worry about Big Data, or dare speak its name:
1. Get your Small Data right first. Organised, cleaned, and analysed.
2. Be engaged in analytics. Investing in analytics is like investing in a gym membership, or an education – you don’t just spend the money and make it someone else’s problem. It only works if you are directly engaged.
3. Invest in the best Human Infrastructure you can.
4. Squeeze all the value you can out of your small data first.
5. Use Big Crunch on small data before moving to Big Data.
So point 5 brings me to Big Crunch.
OK, admittedly, Big Crunch is really the feature of an upcoming post, but the intro and teaser is here (just as A1 itself started life as a teaser at an R presentation to MelbURN almost two years ago).
So, what is Big Crunch?
Big Crunch is my name for the very cool stuff you can do with small data given heaps of computation. Much of my recent work has made extensive use of tools such as resampling (including its specific manifestations in Random Forests and k-fold cross-validation) and other advanced predictive modelling techniques.
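To make the resampling idea concrete, here is a minimal sketch of k-fold cross-validation in base R. The data set (built-in mtcars) and the model are illustrative choices of mine, not from the original work:

```r
# Minimal k-fold cross-validation: buy an honest out-of-sample error
# estimate with computation alone.
set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold labels

cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)             # refit on k-1 folds
  sqrt(mean((test$mpg - predict(fit, test))^2))        # score on held-out fold
})
mean(cv_rmse)   # cross-validated RMSE
```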
Over the last couple of years I have developed a number of new, tortuously computationally complex (but deliciously parallelisable) techniques for the measurement of such beasts as robust predictive variable importance, generalised nonlinear correlation and forecast performance. I use these to squeeze information out of data sets big and, more importantly, small.
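The techniques themselves are not published in this post, so as a hedged stand-in, the permutation importance built into the randomForest package gives the general flavour of computation-heavy variable importance:

```r
# Permutation-based variable importance via randomForest: a stand-in
# for the unpublished robust importance measures described above.
library(randomForest)
set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)
importance(rf, type = 1)   # %IncMSE: accuracy lost when each predictor is permuted
```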
Incidentally, in practice “small” includes huge data sets with tiny classes, where the task is classification, and the business problem may be customer retention, acquisition or fraud detection.
Big Crunch applies extreme statistical sophistication to small data sets: just the thing you need to extract the limited amount of information they contain.
These methods really do squeeze the informational juice out of data in a way that simple, mean-estimating linear models and Pearson correlation measures can never hope to match.
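A quick illustration of that claim on synthetic data (the example is mine, not from the post): a strong but purely nonlinear signal that Pearson correlation reports as essentially no relationship, yet a random forest recovers almost completely.

```r
library(randomForest)
set.seed(3)
x <- runif(500, -3, 3)
y <- x^2 + rnorm(500, sd = 0.3)       # strong, but purely nonlinear, signal
cor(x, y)                             # Pearson: close to zero, by symmetry
rf <- randomForest(y ~ x, data = data.frame(x, y), ntree = 500)
tail(rf$rsq, 1)                       # OOB pseudo R-squared: most variance recovered
```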
Indeed, much of my recent work in areas as diverse as geophysical analysis, retail price elasticity modelling and customer acquisition in entertainment can be described as Big Crunch. In all cases, we see dramatic improvements in robustness, performance and reliability. The biggest challenge is to explain the difference between less reliable, classical measures, such as R-squared on training data, and the far more reliable, seemingly magical OOB error of random forests, and how these serve as building blocks for far more sophisticated solutions.
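That contrast is easy to demonstrate (again an illustrative sketch, not the client work described): a linear model's R-squared on its own training data is flattering, while the random forest's OOB estimate comes only from trees that never saw each observation.

```r
library(randomForest)
set.seed(1)
lin <- lm(mpg ~ ., data = mtcars)
summary(lin)$r.squared   # training R-squared: optimistic, measured in-sample

rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
tail(rf$rsq, 1)          # OOB pseudo R-squared: each row is predicted only by
                         # trees whose bootstrap sample excluded that row
```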
Finally, the Big Crunch mini-FAQ.
Is “Big Crunch” actually just “Big Data”? Yes, if you like, in a way. Similar principles usually apply.
Can/should you Hadoop/MapReduce it? Sure. Highly parallelisable as a rule (a sketch follows after this FAQ).
Can you deploy Big Crunch on commercial systems? Sure, why not. But do you really need to? (In fact, we just deployed a Big Crunch solution in SAS, after prototyping in R.)
Is Big Crunch difficult? It shouldn’t be. Not if you have the right people. They are around. And your people can become the right people with a bit of mentoring. A stats degree helps.
Is Big Crunch expensive? In people, moderately. In systems and software: hey, Hadoop is open source. And so is R. And the same applies to Big Data too. So no.
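On the parallelisation point above: resampling workloads are embarrassingly parallel, since each replicate is independent. A minimal sketch with R's parallel package (the model, replicate count and core count are illustrative; mclapply forks on Unix-alikes, so use mc.cores = 1 on Windows):

```r
# Each bootstrap replicate is independent, so replicates map cleanly onto
# cores, or, at larger scale, onto Hadoop/MapReduce workers.
library(parallel)
boot_slope <- mclapply(1:1000, function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)        # resample rows
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))["wt"]     # refit, keep the slope
}, mc.cores = 4)
quantile(unlist(boot_slope), c(0.025, 0.975))        # bootstrap 95% interval
```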
“For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of Big data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, with Statsoft Statistica getting the top commercial tool spot.”
The sample is arguably skewed, but if so, the skew over-represents users at the forefront of the field.
It would be interesting to compare the numbers from several previous years in a single chart.
Last night saw two back-to-back events at Deloitte in Melbourne:
The A1 meeting went well, attended by approximately 20 people. It was arranged by Yuval Marom, the founder and convener of MelbURN, and chaired by the dynamic Richard Fraccaro, head of Melbourne’s A1 chapter.
The presentation consisted of an update on what A1 has achieved since it was founded and revealed at MelbURN in October last year. The recent AIPIO presentation, the Canberra launch, the Sydney Chapter think tank and this website/knowledge repository all rated a mention.
A healthy discussion followed on the nature and needs of analytics education, prompted by a question regarding the role of academia.
But this was not the main event.
Following right after, and attended by 45 or so, came the MelbURN event, titled “Experiences with using SAS and R in insurance and banking”, presented with human, unassuming polish by Hong Ooi, a statistician at ANZ Bank.
This was, without exaggeration, one of the best R presentations I have ever seen.
Hong taught the audience some very important things about banking and finance, rigorous statistics and data representation, and demonstrated masterful use of the R language and key R packages such as plyr.
More importantly, he provided not one, but multiple case studies, and in each a comparison of R and SAS, as well as ways of combining the two together. This included calling R from SAS, and using R to generate SAS code.
Most striking for me was the comparison of SAS with R in a live, corporate financial context, and the presentation of R as a viable, robust, industrial strength option, with some unique advantages, and admitted weaknesses.
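I won't reproduce Hong's examples here, but a hypothetical sketch of the "using R to generate SAS code" pattern he mentioned might look like this: fit a model in R, then emit a SAS DATA step that scores it in production. The data set names and model are illustrative, not Hong's:

```r
# Hypothetical sketch only: fit in R, emit SAS scoring code.
fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- coef(fit)
sas <- sprintf(
  "data scored; set work.input;\n  score = %.6f + (%.6f)*wt + (%.6f)*hp;\nrun;",
  b["(Intercept)"], b["wt"], b["hp"])
cat(sas)   # paste or %include the result into a SAS session
```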
I hope that Hong can present this again to the Sydney Users of R Forum (SURF).
His presentation slides can be found here.
A video of the presentation will hopefully be available soon.
Stephen and I will be in Melbourne until Saturday if any A1 blog readers want to meet before then.
Consider two visions of Business Analytics.
The analytics function is small. Tiny, in fact: just an analyst and a PC, and maybe a junior data wrangler. This team provides ad-hoc reporting, analytics and insights to the CEO. Results are usually delivered verbally, as snippets of email or as PowerPoint slides. These results influence, support and shape the CEO’s strategic decisions, leading to significant changes in the direction of the organisation. Systems costs are minimal; analysis is performed using Microsoft Office, PostgreSQL, R and RapidMiner. The value-add of the function is obvious to its customer (the CEO), and the CEO supports any data acquisition requests. There is no complicated “deployment”, there are no stakeholders, and there is no internal sales or change management process. Analytics supports the most important decisions in the organisation in the simplest and most direct way, and relies on IT only to obtain data and to run its tools of the trade.
The Analytics function is huge: a team of analysts, data wranglers, managers and communicators, continuously crunching more or less the same data to create predictive models of individual behaviour (churn, acquisition, credit risk etc.), validate these models, and “deploy” them. The system relies on expensive brand-name software for ETL, modelling, deployment and campaign management. Data acquisition is a regular fight with IT, as is deployment, which requires IT understanding, buy-in and non-interference. There are also complex human value chains involving process workers internally and at the coal-face, who must be relied on to correctly execute the actions directed by predictive modelling, to collect and pass on the resulting data, and to support the ongoing use of predictive modelling. Such a function supports or replaces decisions made by junior staff, in the hope of adding marginal value to existing business processes. Analytics here is inherently dependent on, and quite often mistaken for, IT.
Which is more valuable?
Which is more quickly deployable?
Which introduces more risk?
Which is more likely to be noticed and supported by senior management?
Which is more transformational?
Which is more expensive?
Which is more typical?
This is not an “either/or” choice. There is obviously a need for both strategic and operational analytics. But which one should you do first? Which one is more likely to succeed, add value and thrive? And which one is most likely to break free of the shackles of IT?
The choice seems obvious, yet Business Analytics seems synonymous with IT-centric operational analytics almost everywhere you look. Now why would that be?