Here’s the original text of my latest news story for Nature.
A new statistical method discovers hidden correlations in complex data.
The American humorist Evan Esar once called statistics the science of producing unreliable facts from reliable figures. A new technique now promises to make those facts a whole lot more dependable.
Brothers David Reshef of the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, Yakir Reshef of the Weizmann Institute of Science in Rehovot, Israel, and their coworkers have devised a method to extract, from complex sets of data, relationships and trends that are invisible to other types of statistical analysis. They describe their approach in a paper in Science today [1].
“This appears to be an outstanding achievement”, says statistician Douglas Simpson of the University of Illinois at Urbana-Champaign. “It opens up whole new avenues of inquiry.”
Here’s the basic problem. You’ve collected lots of data on some property of a system that could depend on many governing factors. To figure out what depends on what, you plot them on a graph.
If you’re lucky, you might find that this property changes in some simple way as a function of some other factor: for example, people’s health gets steadily better as their wealth increases. There are well-known statistical methods for assessing how reliable such correlations are.
But what if there are many simultaneous dependencies in the data? Suppose, say, that people are also healthier if they drive less, a habit that might bear no obvious relation to their wealth (or might even be more prevalent among the less wealthy). The two effects can mask each other, leaving both relationships hidden from traditional searches for correlations.
The problems can be far worse. Suppose you’re looking at how genes interact in an organism. The activity of one gene could be correlated with that of another, but there could be hundreds of such relationships all mixed together. To a cursory ‘eyeball’ inspection, the data might then just look like random noise.
“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human”, says Yakir Reshef.
And the relationships are all the harder to tease out if you don’t know what you’re looking for in the first place – if you have no a priori reason to suspect that this depends on that.
The new statistical method that Reshef and his colleagues have devised aims to crack precisely those problems. It can spot many superimposed correlations between variables and measure exactly how tight each relationship is, according to a quantity they call the maximal information coefficient (MIC).
A MIC of 1 implies that two variables are perfectly correlated, but possibly according to two or more simultaneous and perhaps opposing relationships: a straight line and a parabola, say. A MIC of zero indicates that there is no relationship between the variables.
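To give a flavour of how such a score can be built, here is a minimal sketch in the same spirit as MIC: it discretizes two variables on a grid, computes their mutual information, normalizes it to lie between 0 and 1, and takes the best score over several grid resolutions. This is only an illustration of the idea, not the authors’ optimized algorithm, and the function names (`normalized_mi`, `mic_sketch`) are invented for this example.

```python
import numpy as np

def normalized_mi(x, y, nx, ny):
    """Mutual information of x and y discretized on an nx-by-ny grid,
    normalized by log(min(nx, ny)) so the score lies in [0, 1]."""
    counts, _, _ = np.histogram2d(x, y, bins=(nx, ny))
    p = counts / counts.sum()               # joint distribution over grid cells
    px = p.sum(axis=1, keepdims=True)       # marginal of x
    py = p.sum(axis=0, keepdims=True)       # marginal of y
    nz = p > 0                              # avoid log(0) on empty cells
    mi = (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()
    return mi / np.log(min(nx, ny))

def mic_sketch(x, y, max_bins=8):
    """Crude stand-in for MIC: the best normalized score over a range
    of grid resolutions (the real method searches grids far more cleverly)."""
    return max(normalized_mi(x, y, i, j)
               for i in range(2, max_bins + 1)
               for j in range(2, max_bins + 1))
```

On perfectly related data the score approaches 1, while on independent noise it stays near 0, mirroring the behaviour of MIC described above; the published method additionally bounds the grid sizes and optimizes the partition within each grid.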
To demonstrate the power of their technique, the researchers applied it to a diverse range of problems. In one case they looked at factors that influence people’s health globally in data collected by the World Health Organization. Here they were able to tease out superimposed trends – for example, how female obesity increases with income in the Pacific Islands, where it is considered a sign of status, while in the rest of the world there is no such link.
In another example, the researchers identified genes that were expressed periodically, but with differing cycle times, during the cell cycle of yeast. And they uncovered groups of human gut bacteria that proliferate or decline when diet is altered, finding that some bacteria are abundant precisely when others are not. Finally, they identified which performance factors for baseball players are most strongly correlated to their salaries.
Reshef cautions that finding statistical correlations is only the start of understanding. “At the end of the day you'll need an expert to tell you what your data mean”, he says. “But filtering out the junk in a data set in order to allow someone to explore it is often a task that doesn't require much context or specialized knowledge.”
He adds that “our hope is that this tool will be useful in just about any field that is amassing large amounts of data.” He points to genomics, proteomics, epidemiology, particle physics, sociology, neuroscience, earth and atmospheric science as just some of the scientific fields that are “saturated with data”.
Beyond this, the method should be valuable for ‘data mining’ in sports statistics, social media and economics. “I could imagine financial companies using tools like this to mine the vast amounts of data that they surely keep, or their being used to track patterns in news, societal memes, or cultural trends”, says Reshef.
One of the big remaining questions is about what causes what: the familiar mantra of statisticians is that “correlation does not imply causality”. People who floss their teeth live longer, but that doesn’t mean that flossing increases your lifespan.
“We see the issue of causality as a potential follow-up”, says Reshef. “Inferring causality is an immensely complicated problem, but has been well studied previously.”
Biostatistician Raya Khanin of the Memorial Sloan-Kettering Cancer Center in New York acknowledges the need for a technique like this but reserves judgement about whether we yet have the measure of MIC. “I’m not sure whether its performance is as good as and different from other measures”, she says.
For example, she questions the findings about the mutual exclusivity of some gut bacteria. “Having worked with this type of data, and judging from the figures, I'm quite certain that some basic correlation measures would have uncovered the same type of non-coexistence behavior,” she says.
Another bioinformatics specialist, Simon Rogers of the University of Glasgow in Scotland, also welcomes the method but cautions that the illustrative examples are preliminary at this stage. Of the yeast gene linkages, he says, “one would have to do more evaluation to see if they are biologically significant.”
1. Reshef, D. N. et al. Science 334, 1518–1524 (2011).