A group of researchers has come up with a new data mining tool that aims not just to find trends in data, but to find which combinations of data produce the simplest and most reliable correlations. The implications are huge, especially for some of the new giant datasets coming out:
But what if there are many simultaneous dependencies in the data? Suppose that you are looking at how genes interact in an organism. The activity of one gene could be correlated with that of another, but there could be hundreds of such relationships all mixed together. To a cursory inspection, the data might look like random noise.
“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human,” says Yakir Reshef.
Of course, it’s all statistical voodoo that computes a value called the “Maximal Information Coefficient”, or MIC. They’ve already had some success with the kind of problem described above:
In another example, the researchers identified genes that were expressed periodically, but with differing cycles, during the cell cycle of brewer’s yeast (Saccharomyces cerevisiae). They also uncovered groups of human gut bacteria that proliferate or decline when diet is altered, finding that some bacteria are abundant precisely when others are not. Finally, the team identified performance factors for baseball players that are strongly correlated to their salaries.
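For a rough sense of what a score like MIC measures, here is a minimal Python sketch. It is not the researchers' algorithm. It assumes the commonly described idea behind MIC (lay grids of different resolutions over a scatter plot, measure how much mutual information each grid captures, normalize, and keep the best score), and it swaps the real search over grid placements for plain equal-frequency bins. The function names and the 0.6 exponent on the grid-size budget are assumptions made for illustration.

```python
import numpy as np


def equal_frequency_bins(values, k):
    """Label each value with one of k rank-based bins of roughly equal size."""
    ranks = np.argsort(np.argsort(values))
    return (ranks * k) // len(values)


def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information, in bits, between two sequences of bin labels."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)  # count points in each grid cell
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over x bins
    py = joint.sum(axis=0, keepdims=True)  # marginal over y bins
    filled = joint > 0
    return float(np.sum(joint[filled] * np.log2(joint[filled] / (px @ py)[filled])))


def approx_mic(x, y, exponent=0.6):
    """Simplified MIC-style score (assumption: equal-frequency grids only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** exponent  # assumed cap on grid cells: nx * ny <= n**0.6
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        x_bins = equal_frequency_bins(x, nx)
        for ny in range(2, int(budget // nx) + 1):
            y_bins = equal_frequency_bins(y, ny)
            mi = mutual_information(x_bins, y_bins, nx, ny)
            # Normalize so that grids of different sizes are comparable.
            best = max(best, mi / np.log2(min(nx, ny)))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 500)
    print("linear  :", approx_mic(x, 2 * x + 1))
    print("parabola:", approx_mic(x, (x - 0.5) ** 2))
    print("noise   :", approx_mic(x, rng.uniform(0, 1, 500)))
```

In this sketch a straight line and a parabola both score close to 1 while unrelated noise scores close to 0, which is the appeal: the score responds to the strength of a relationship rather than to its shape.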
Of course, not everyone is convinced of the algorithm’s usefulness, but no doubt it’ll make a good paper at an upcoming event. Hopefully, after some more eyes have looked it over, it’ll find a good home in the visualization and analysis toolbox.
via Tangled relationships unpicked: Nature News & Comment, via insideHPC