Big Data’s causation and correlation issue

If Big Data came in a box, it would be stamped, “Warning: Correlation does not imply causation” on the side. There’s a common thread among Big Data stories, often told as exciting tales of wonder, that correlation somehow approximates causation. It sometimes gets expressed in oblique arguments that more data is better and in stories of the search for the perfect algorithm.

It isn’t that simple, as we wrote a while back: in some cases, like when choosing wine, small data actually matters far more than big. It can come down to simply whether the wine buyer likes wine with heavy tannins or not…so much for bouquet, texture and fruit.

Causation versus correlation

If you’re new to this area, I should explain that causation means A causes B, whereas correlation means that A and B tend to be observed together. These are very different things when it comes to Big Data, but the difference often gets glossed over or ignored. Whether correlation is “good enough” to act on without knowing the cause depends entirely on the problem being solved and the risks of being wrong.
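One way to see why correlation alone can't tell you about cause: the usual correlation measure is symmetric. Here is a minimal sketch in Python, assuming the Pearson coefficient as the measure and using made-up numbers for two hypothetical quantities `a` and `b`. Nothing here comes from a real data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up observations of two quantities, for illustration only
a = [20, 35, 50, 65, 80]
b = [3, 5, 8, 9, 13]

r_ab = pearson(a, b)  # "how strongly does A go with B?"
r_ba = pearson(b, a)  # "how strongly does B go with A?"

print(f"corr(A, B) = {r_ab:.3f}")
print(f"corr(B, A) = {r_ba:.3f}")
```

Both calls give the same number. The calculation has no notion of direction, so a strong correlation is equally compatible with A causing B, B causing A, or something else driving both.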

Is correlation good enough? It depends…

Gil Press, writing in Forbes, explains this idea very well in his review of the recently published, widely discussed book Big Data: A Revolution that Will Transform How We Live, Work, and Think:

“For many everyday needs, knowing what not why is good enough.” The book is full of such examples from making better diagnostic decisions when caring for premature babies to which flavor Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” Big data analysis can be about correlations OR causation—it all depends, as it has always been, on what question we are asking, what problem we are solving, and what goal we are trying to achieve.

Gil isn’t the only one making this distinction. Algorithms by themselves don’t tell you what data means and, without human input or direction (in the form of hypotheses or data discrimination), can actually steer understanding of data in the wrong direction. Data science, it seems, requires a healthy sense of skepticism.

This is exactly what makes data scientists so hard to find…it isn’t about the ‘big-ness’ of data or the algorithm’s perfection. It is about knowing a great deal about the data so that the true meaning can be coaxed out, not squeezed out by a mindless process.

It will disappoint many to hear that there isn’t always an expensive, industrial solution to ever larger amounts of data. Instead, it often comes down to having great governance of data so that metadata (data about data) can be fully understood and taken into account.

Examples of correlation versus causation

Getting it wrong can be expensive, as shown in the Freakonomics example of mistaking correlation for causation. The State of Illinois almost sent books to every child in the state because studies showed that books in the home correlated with higher test scores. Later studies showed that children from homes with many books did better even if they never read them, leading researchers to correct their assumptions: homes where parents buy books tend to be homes where learning is encouraged and rewarded. Correlation versus causation in plain view. Illinois didn’t have money to waste going in the wrong direction and neither does today’s enterprise.
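The books story can be sketched as a toy simulation. To be clear about the assumptions: this is not the data from the studies, every number is invented, and the model assumes by construction that a hidden factor (called `engagement` here, a hypothetical stand-in for a home that encourages learning) drives both book ownership and test scores, while books themselves have no direct effect. It illustrates the trap rather than proving anything about Illinois.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(n_homes, give_books_to_all, seed=0):
    """Toy model: engagement drives both books and scores (an assumption)."""
    rng = random.Random(seed)
    books, scores = [], []
    for _ in range(n_homes):
        engagement = rng.random()                    # hidden factor, 0 to 1
        owned = 100 * engagement + rng.gauss(0, 10)  # engaged homes own more books
        if give_books_to_all:
            owned += 50                              # the state mails 50 books
        # Note that `owned` does not appear in the score at all
        score = 50 + 40 * engagement + rng.gauss(0, 5)
        books.append(owned)
        scores.append(score)
    return books, scores

books_before, scores_before = simulate(5000, give_books_to_all=False)
books_after, scores_after = simulate(5000, give_books_to_all=True)

r = pearson(books_before, scores_before)
mean_before = sum(scores_before) / len(scores_before)
mean_after = sum(scores_after) / len(scores_after)

print(f"correlation between books and scores: {r:.2f}")
print(f"mean score before the giveaway: {mean_before:.2f}")
print(f"mean score after the giveaway:  {mean_after:.2f}")
```

In this toy world, books and scores are strongly correlated, yet mailing books to every home leaves the scores exactly where they were, because the thing that actually matters never changed. Whether the real world looks like this model is precisely the question the correlation can't answer on its own.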

A simple explanation

Khan Academy, probably the best broad-based learning site on the Internet, has a great video lesson on the difference between correlation and causation that is a useful reminder of the limitations of data.



Categories: Data Analytics / Big Data

Author: Chris Taylor and Jeanne Roué-Taylor

He's a techie marketer, she's a French lawyer. When we write together, anything can happen.


4 Comments on “Big Data’s causation and correlation issue”

  1. July 8, 2013 at 1:05 pm #

    I find this exceptionally curious – “There’s a common thread among Big Data stories, often told as exciting tales of wonder, that correlation somehow approximates causation.”

    To be blunt I have never seen an experienced analytics or BI team make that mistake – this is statistics 101 and frankly isn’t a big data issue. It is a foundational competency of any analytics practitioner.

  2. July 12, 2013 at 11:48 pm #

    Not sure if this would solve the issue of how we could make the system understand to use relevant data when there are zillions of unused pieces of information on the cloud

  3. July 13, 2013 at 3:51 am #


    Put Integrative Thinking (Roger Martin, Rotman in Toronto, and his book ‘Opposable Mind’) together with Axiomatic Design (Nam Suh and his Theory of Innovation) and watch INNOVATION TAKE OFF!

    However, there has to be a SHOW-ME in a domain where there is both a NEED and a VISION. A paper submission is in process to WSPC co-authored with my ten-year Mentor Professor Emeritus Gunnar Sohlenius (KTH Stockholm) entitled

    This, however, will NOT fly (see pun below) until a SHARED VISION evolves within a cluster that have the prerequisite:

    Then, don’t just publish, but perform as a response to SHOW-ME demand by an ELEVATOR PITCH.

  4. September 2, 2013 at 3:39 pm #

    Very nice blog post. I absolutely love this site. Continue the good work!
