
A few thoughts on Big Data, and how it matters, as well as where the hype does more harm than good. Also some thoughts on what I call “Big Crunch”, Big Data’s Siamese twin, and one I am well acquainted with, more so than Big Data proper.

First of all, the hype. A colleague in the industry spoke to me recently regarding an all too common scenario.

The colleague is effectively the analytics manager of the business. The trouble is, the business does not really have much of an idea about what data analytics is, apart from the fact that it has something to do with big brand software, which the company has in abundance. There is far less awareness that analytics requires direct engagement and understanding from management. Most importantly, it requires clean, accessible and appropriate data before anything can happen. Getting there takes a clear, concerted effort, and usually consumes the bulk of the investment in time, money and political capital. Certainly, there is no realisation that what the business needs done is easily served by open source and commodity software, but does require investment in good staff, and appropriate, coordinated effort from IT and management.

So far, so typical A1 case study, could really be anyone, and probably an adequate description of at least 100 cases in the city of Sydney alone. The Big Data twist is that a senior manager has been sending the colleague material on Big Data, and how important it is. Again, this is an increasingly common occurrence.

Naturally, the colleague is exasperated. First of all, the business is not really one that collects or benefits from terabytes of data per day. Secondly, they have not even got their small data right. It is neither clean, nor reliable, nor ready for analysis. Finally, the manager in question has done little to get appropriately educated or involved in analytics, but has bought deeply into the Big Data hype, seemingly without understanding it too much either.

And here endeth yet another, all too typical cautionary tale.
But Analyst First is more about “Thou Might Want To” than about “Thou Shalt Not”.
So here are the “Thou Shalt” Analyst First commandments on Big Data:

Hear ye o senior manager/sponsor of Analytics!
Before you worry about Big Data, or dare speak its name:

1. Get your Small Data right first. Organised, cleaned, and analysed.
2. Be engaged in analytics. Investing in analytics is like investing in a Gym membership, or an education – you don’t just spend the money and make it someone else’s problem. It only works if you are directly engaged.
3. Invest in the best Human Infrastructure you can.
4. Squeeze all the value you can out of your small data first.
5. Use Big Crunch on small data before moving to Big Data.

So point 5 brings me to Big Crunch.
OK, admittedly, Big Crunch is actually the subject of an upcoming post, but the intro and teaser is here (just as A1 itself started life as a teaser at an R presentation to MelbURN almost 2 years ago).

So, what is Big Crunch?
Big Crunch is my name for the very cool stuff you can do with small data given heaps of computation. Much of my recent work has made extensive use of tools such as resampling, and its specific manifestations in Random Forests, k-fold cross-validation, and other advanced predictive modelling techniques.
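For readers who have not met resampling before, here is a minimal sketch of the simplest version of the idea: a percentile bootstrap interval for the median of a tiny sample. It is written in plain Python rather than R, and the function name `bootstrap_interval`, the sample values and the number of resamples are all made up for illustration. It is not the code behind any of the work described in this post.

```python
import random
import statistics


def bootstrap_interval(data, statistic, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement many times and recompute the
    # statistic each time. This is where the "heaps of computation" go.
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Twelve made-up observations, including one outlier.
sample = [12.1, 9.8, 15.3, 11.0, 30.2, 10.4, 13.7, 9.1, 12.9, 11.6, 14.2, 10.9]
low, high = bootstrap_interval(sample, statistics.median)
print("sample median:", statistics.median(sample))
print("approximate 95% interval:", low, "to", high)
```

Twelve data points, five thousand resamples: the data is small, the crunch is (relatively) big. The same pattern works for statistics that have no convenient textbook formula for their uncertainty.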

I have over the last couple of years developed a number of new, tortuously computationally complex (but deliciously parallelisable) techniques for measurement of such beasts as robust predictive variable importance, generalised nonlinear correlation and forecast performance. I use these to squeeze information out of data sets big, and, more importantly, small.

Incidentally, in practice “small” includes huge data sets with tiny classes, where the task is classification, and the business problem may be customer retention, acquisition or fraud detection.
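A rough illustration of why such data is "small" in practice: with 50 positive cases in 100,000 rows, everything the model can learn about the rare class comes from those 50 rows, so any resampling scheme has to share them out carefully. The sketch below assigns rows to cross-validation folds class by class (stratification). The class counts, the helper name `stratified_folds` and the choice of five folds are assumptions made up for this example.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Assign each row a fold number 0..k-1, balancing every class across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [None] * len(labels)
    for indices in by_class.values():
        # Shuffle within the class, then deal the rows out like cards.
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[idx] = position % k
    return folds


# Made-up data: 50 rare cases (label 1) among 100,000 rows.
labels = [1] * 50 + [0] * 99950
folds = stratified_folds(labels, k=5)
rare_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
print(sorted(rare_per_fold.items()))
```

Each fold ends up with the same number of rare cases, so every held-out estimate is based on a handful of positives at best. That is the sense in which a huge table can still be a small data problem.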

Big Crunch applies extreme statistical sophistication to small data sets, just the very thing you need to get the very limited amount of information out.
These methods really do squeeze the informational juice out of data the way simple, mean estimating, linear models and Pearson correlation measures cannot ever hope to do.

Indeed, much of my recent work in areas as diverse as geophysical analysis, retail price elasticity modelling and customer acquisition in entertainment can be described as Big Crunch. In all cases, we see dramatic improvements in robustness, performance and reliability. The biggest challenge is to explain the difference between less reliable, classical measures such as R-squared on training data, and the far more reliable, seemingly magical OOB (out-of-bag) error on random forests, and how these serve as building blocks for far more sophisticated solutions.
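The gap between R-squared on training data and an out-of-bag estimate can be seen in a toy sketch. To keep it dependency-free this is not a random forest: it bags a one-nearest-neighbour regressor, which memorises its training data, on synthetic data. The model choice, the data-generating formula, the function names and the 200 bootstrap rounds are all assumptions made for illustration. Only the out-of-bag bookkeeping mirrors what a random forest does.

```python
import random


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def nearest_neighbour_predict(train_x, train_y, x):
    """Predict y as the y of the closest training x (a model that memorises)."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def train_and_oob_r2(xs, ys, n_rounds=200, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    # Training R-squared: score the model on the same rows it was fitted to.
    train_pred = [nearest_neighbour_predict(xs, ys, x) for x in xs]
    oob_sum = [0.0] * n
    oob_count = [0] * n
    for _ in range(n_rounds):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        in_bag = set(bag)
        bag_x = [xs[i] for i in bag]
        bag_y = [ys[i] for i in bag]
        for i in range(n):
            if i not in in_bag:
                # Row i was left out of this bag, so this prediction is honest.
                oob_sum[i] += nearest_neighbour_predict(bag_x, bag_y, xs[i])
                oob_count[i] += 1
    scored = [i for i in range(n) if oob_count[i] > 0]
    oob_pred = [oob_sum[i] / oob_count[i] for i in scored]
    oob_actual = [ys[i] for i in scored]
    return r_squared(ys, train_pred), r_squared(oob_actual, oob_pred)


# Synthetic data: a weak linear signal buried in noise.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(40)]
ys = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]
train_r2, oob_r2 = train_and_oob_r2(xs, ys)
print("R-squared on training data:", train_r2)
print("R-squared on out-of-bag predictions:", oob_r2)
```

The training figure is perfect because every row is its own nearest neighbour, which says nothing about how the model will do on new data. The out-of-bag figure only ever scores a row with models that never saw it, which is why it is the one worth believing.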

Finally, the Big Crunch mini-FAQ.

Is “Big Crunch” actually just “Big Data”? Yes, if you like, in a way. Similar principles usually apply.
Can/should you Hadoop/MapReduce it? Sure. Highly parallelisable as a rule.
Can you deploy Big Crunch on commercial systems? Sure, why not. But do you really need to? (In fact, we just deployed a Big Crunch solution in SAS, after prototyping in R.)
Is Big Crunch difficult? It shouldn’t be. Not if you have the right people. They are around. And your people can become the right people with a bit of mentoring. A stats degree helps.
Is Big Crunch expensive? In people, moderately. In systems and software: hey, Hadoop is open source. And so is R. And the same applies to Big Data too. So no.
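On the "highly parallelisable" point, here is a hedged sketch of why resampling splits up so easily: every chunk of bootstrap replicates is independent of every other, so the work is a map over seeds followed by a simple merge. The example uses Python's standard library rather than Hadoop or R, and the function names, chunk sizes and data are hypothetical.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def bootstrap_chunk(seed, data, n_reps):
    """Map step: one independent batch of bootstrap means with its own seed."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_reps)]


def bootstrap_means(data, n_chunks=8, reps_per_chunk=250, mapper=map):
    """Run the chunks through `mapper`, then merge them (the reduce step)."""
    work = partial(bootstrap_chunk, data=data, n_reps=reps_per_chunk)
    chunks = mapper(work, range(n_chunks))  # chunk seeds are 0..n_chunks-1
    return sorted(m for chunk in chunks for m in chunk)


def bootstrap_means_parallel(data, n_chunks=8, reps_per_chunk=250):
    """Same computation, with the chunks farmed out to worker processes."""
    with ProcessPoolExecutor() as pool:
        return bootstrap_means(data, n_chunks, reps_per_chunk, mapper=pool.map)


# Serial demonstration on made-up data. Because each chunk carries its own
# seed, the result does not depend on which worker ran it or in what order.
data = [12.1, 9.8, 15.3, 11.0, 30.2, 10.4, 13.7, 9.1, 12.9, 11.6, 14.2, 10.9]
first = bootstrap_means(data)
second = bootstrap_means(data)
print("reproducible:", first == second)
print("approximate 95% interval for the mean:", first[50], "to", first[1949])
```

To use the worker processes, call `bootstrap_means_parallel(data)` from under an `if __name__ == "__main__":` guard in a script. The chunks share nothing but the input data, which is exactly the shape of problem that MapReduce-style systems are built for.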
