Stories from December 20th, 2011

Data Mining with the Maximal Information Coefficient

A group of researchers have come up with a new data mining tool that aims to not just find trends in data, but find which combinations of data produce the simplest and most reliable correlations.  The implications are huge, especially when looking at some of the new giant datasets coming out:

But what if there are many simultaneous dependencies in the data? Suppose that you are looking at how genes interact in an organism. The activity of one gene could be correlated with that of another, but there could be hundreds of such relationships all mixed together. To a cursory inspection, the data might look like random noise.

“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human,” says Yakir Reshef.

Of course, it’s all statistical voodoo that computes a value called the “Maximal Information Coefficient”, or MIC.  They’ve already had some success in the above example:

In another example, the researchers identified genes that were expressed periodically, but with differing cycles, during the cell cycle of brewer’s yeast (Saccharomyces cerevisiae). They also uncovered groups of human gut bacteria that proliferate or decline when diet is altered, finding that some bacteria are abundant precisely when others are not. Finally, the team identified performance factors for baseball players that are strongly correlated to their salaries.

Of course, not everyone is convinced of the usefulness of the algorithm, but no doubt it’ll be a good published paper at an upcoming event.  Hopefully after some more eyes look over it, it’ll find a good home in the visualization and analysis toolbox.
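For readers curious what sits behind the coefficient, here is a rough Python sketch. It is not the researchers’ algorithm: the published method also searches over where to place the grid lines, while this version only tries equal-frequency grids, so treat it as a simplified stand-in. The function names (`equal_frequency_bins`, `mutual_information`, `mic_like_score`) are made up for illustration, the cap of n^0.6 grid cells is an assumption borrowed from the published description of MIC rather than anything stated in the article, and the data is synthetic.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Label each value with a bin index so that bins hold about equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


def mutual_information(labels_x, labels_y):
    """Mutual information, in bits, between two sequences of bin labels."""
    n = len(labels_x)
    joint = Counter(zip(labels_x, labels_y))
    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    return sum(
        (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in joint.items()
    )


def mic_like_score(x, y, exponent=0.6):
    """Best normalized mutual information over small equal-frequency grids.

    Simplification: real MIC also optimizes where the grid lines fall.
    """
    budget = len(x) ** exponent  # assumed cap on the number of grid cells
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        bins_x = equal_frequency_bins(x, nx)
        ny = 2
        while nx * ny <= budget:
            bins_y = equal_frequency_bins(y, ny)
            mi = mutual_information(bins_x, bins_y)
            best = max(best, mi / math.log2(min(nx, ny)))
            ny += 1
        nx += 1
    return best


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data: a parabola (strong but non-linear) and pure noise.
n = 500
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
parabola = [v * v for v in xs]
rng = random.Random(0)
noise = [rng.random() for _ in range(n)]

print("parabola: pearson=%.3f score=%.3f"
      % (pearson(xs, parabola), mic_like_score(xs, parabola)))
print("noise:    pearson=%.3f score=%.3f"
      % (pearson(xs, noise), mic_like_score(xs, noise)))
```

The parabola is the interesting case: a linear correlation measure sees almost nothing there, while a grid-based score picks the relationship up. That is the kind of dependency the tool is meant to surface in large datasets.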

via Tangled relationships unpicked : Nature News & Comment, via insideHPC

Science

 
Stories from October 6th, 2011

Spatiotemporal Data Mining in R

If you’re an R user, you may want to check out the 2-part series at R-bloggers on visualizing spatiotemporal datasets.

There are many visual methods used to identify patterns in space and time. I’ve discussed some in prior threads and will show a few others briefly here. One of the most difficult questions I often hear from others regarding markov type approaches, is how to identify states to be processed.

It’s a pretty technical article full of some great information.  I’ve never used R myself, but I can tell I’m going to have to learn.

via Spatiotemporal Data Mining: 2 | (R news & tutorials).

Science

 
Stories from June 20th, 2011

Engineering Data Analysis with R and ggplot2


Lots of people wouldn’t really consider tools like ‘ggplot’ true visualization tools, but in some disciplines it’s exactly what’s called for: simple visualization with no fuss.  A talk given by Hadley Wickham, and available online, discusses its use along with the popular statistics package R.

Data analysis, the process of converting data into knowledge, insight and understanding, is a critical part of statistics, but there’s surprisingly little research on it. In this talk I’ll introduce some of my recent work, including a model of data analysis. I’m a passionate advocate of programming: that data analysis should be carried out using a programming language, and I’ll justify this by discussing some of the requirements of good data analysis (reproducibility, automation and communication). With these in mind, I’ll introduce you to a powerful set of tools for better understanding data: the statistical programming language R, and the ggplot2 domain specific language (DSL) for visualisation.

via Engineering Data Analysis (with R and ggplot2) – a Google Tech Talk given by Hadley Wickham | (R news & tutorials).

Science

 
Stories from January 5th, 2010

Kitware Releases ParaView 3.6.2

Kitware has just released a minor point release of ParaView with two very important additions:  a new Python trace, and new statistical algorithms.

ParaView’s Python interface was revamped; an exciting new extension to the interface is Python trace. The goal is to generate human readable, not overly verbose, Python scripts that mimic a user’s actions in the GUI. The “Python Trace” article in the October Source, discusses this functionality in greater detail.

ParaView 3.6.2 also includes a collection of statistical algorithms to compute: descriptive statistics (mean, variance, min, max, skewness, kurtosis); compute contingency tables; perform k-means analysis; examine correlations between arrays; and perform principal component analysis on arrays.

The statistical algorithms alone are a huge addition, worthy of a major release.  You can read more details in the ParaView Wiki; this is one capability that has been seriously lacking from ParaView for some time. Great to see it continuing to grow and remain competitive!
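To make the list of descriptive statistics concrete, here is a small plain-Python sketch of what those quantities are. This is not ParaView’s API or its implementation: the function name `describe` is hypothetical, and the conventions (sample variance, moment-based skewness, excess kurtosis) are assumptions, since packages differ on these definitions.

```python
def describe(values):
    """Return basic descriptive statistics for a list of numbers.

    Assumed conventions: sample variance (n - 1 denominator), and
    skewness / excess kurtosis built from population central moments.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # 2nd central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # 3rd central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # 4th central moment
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": m2 * n / (n - 1),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }


stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

For this symmetric toy input the skewness comes out as zero and the excess kurtosis is negative, which is what you would expect from evenly spread values with no heavy tails.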

via Kitware – News.

Science

 
Stories from September 4th, 2009

The Importance of Visualizing Data, via Anscombe’s Quartet

An article from Jodi McDermott re-introduced me to something I hadn’t heard of in years: Anscombe’s Quartet.  For the unfamiliar, it’s a collection of 4 small datasets whose summary statistics are nearly identical (same mean, variance, correlation, and fitted regression line).

What’s my point, you might ask? When each one of these datasets is plotted out visually, they have completely different appearances (just click on the link above and you’ll see what I mean). There are outliers where one would not expect to see them — identifying both opportunities and risks in your data depending on what you are analyzing. However, one would never see the variance in data patterns if it was not plotted in a chart or graph (or analyzed data point by data point).
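The quartet is small enough to check directly. The Python sketch below hard-codes the values as they are usually reproduced from Anscombe’s 1973 paper and prints the summary statistics for each dataset, along with the range of y so the differences hiding behind the matching numbers are visible. The helper names `pearson` and `summarize` are just illustrative choices.

```python
import math
import statistics

# Anscombe's quartet; datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def summarize(x, y):
    """Collect the summary statistics the quartet is famous for."""
    return {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),  # sample variance
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
        "corr": pearson(x, y),
        "min_y": min(y),
        "max_y": max(y),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"corr={s['corr']:.3f} y range={s['min_y']}..{s['max_y']}")
```

The means, variances, and correlations line up to two or three decimal places, while the y ranges do not, which is a first hint of what a plot shows immediately.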

via MediaPost Publications Visualizing Data 09/04/2009.

Science

 
Stories from July 22nd, 2009

Genome Visualization by the NCGR Team

In an effort to locate a genetic basis for schizophrenia, the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico established the Schizophrenia Genome Project.  Taking genetic data from 14 patients and 6 controls, they found themselves searching for 11,500 candidate genes amongst 16.7 billion bases.  How to find them?  Statistical analysis and visualization.

NCGR analysts used principal components analysis and hierarchical clustering to assess the data. The variance attributable to disease status was higher for the Illumina digital expression data than from conventional array analysis. “Visualization tools, such as Principal Component Analysis, readily separated the cases and controls, we spotted differences right away,” says Schilkey.
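As a rough illustration of how principal component analysis can pull cases and controls apart, here is a Python sketch on made-up numbers. Only the group sizes (14 patients, 6 controls) come from the article; the expression values, the number of genes, and the size of the shift are invented purely for demonstration, and the power-iteration routine is a textbook shortcut, not NCGR’s pipeline.

```python
import math
import random

rng = random.Random(42)
N_GENES = 5  # hypothetical number of measured genes


def make_sample(is_case):
    """Invented expression profile: cases are shifted in the first two genes."""
    return [
        rng.gauss(3.0 if (is_case and g < 2) else 0.0, 0.5)
        for g in range(N_GENES)
    ]


labels = ["case"] * 14 + ["control"] * 6
samples = [make_sample(label == "case") for label in labels]


def center(rows):
    """Subtract each column's mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov, found by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


centered = center(samples)
pc1 = first_component(covariance(centered))
scores = [sum(v * w for v, w in zip(row, pc1)) for row in centered]

case_scores = [s for s, label in zip(scores, labels) if label == "case"]
control_scores = [s for s, label in zip(scores, labels) if label == "control"]
print("case scores on PC1:    %.2f to %.2f" % (min(case_scores), max(case_scores)))
print("control scores on PC1: %.2f to %.2f" % (min(control_scores), max(control_scores)))
```

When the two printed ranges do not overlap, a one-dimensional plot of the first component is already enough to tell the groups apart, which matches the “spotted differences right away” remark.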

via Bio-IT World.

Science

 
Stories from July 2nd, 2009

Find Yuppies and SugarDaddies with Townme

An interesting website called “TownMe” has appeared that merges Google Maps with census data in a clever way.  Enter a city, and then see your city overlaid with a heatmap indicating concentrations of “Yuppies”, “Cougars”, “Sugar Daddies”, “Starving Students”, “Baby Mommas” or “Baby Daddies”, or simply “People overextending themselves on rent”.  The accompanying image is a visualization of Starving Students in Austin, TX. It’s entertaining, and surprisingly accurate.   Each city is given a numerical rating, and areas within the city are ranked on the side for the chosen category.

The site is still under active development, and the developers have contacted me asking for ideas and comments.  So if you have thoughts on new areas or features to consider, post ’em here!

Science

 
Stories from June 14th, 2009

GPU Accelerated Statistical Programming in R

The statistical environment ‘R’ is used widely in biomechanical research.  The University of Michigan has managed to implement GPGPU acceleration in the toolkit in such a way that it’s nearly transparent to the user and offers some amazing speedups of frequently used algorithms.

If an algorithm involves computing the elements of a large matrix, we can often merely assign each thread executing on the GPU a portion of a row and/or column. Algorithms for which we have implemented GPU enabled versions include the calculations of distances between sets of points (R dist function), hierarchical clustering (R hclust function), Pearson and Kendall correlation coefficients (similar to R cor function), and the Granger test (‘granger.test’ in the R MSBVAR package).

The graph on the right shows how a Granger test scales from 200 to 1000 random variables: the blue line (the GPU version) remains almost flat, while the red line (the CPU version) climbs steeply to about 5000 seconds.
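The row-per-thread idea in the quoted description can be mimicked on the CPU. The sketch below is plain Python, not the Michigan group’s GPU code: a thread pool stands in for GPU threads, each task fills one row of a Euclidean distance matrix (roughly what R’s dist computes), and the names `distance_row` and `distance_matrix` are hypothetical. It will not be fast, since Python threads share one interpreter lock; it only shows why this kind of work splits up so cleanly.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def distance_row(points, i):
    """Distances from point i to every point; needs no data from other rows."""
    return [math.dist(points[i], q) for q in points]


def distance_matrix(points, workers=4):
    """Build the full matrix by handing each row to a separate task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: distance_row(points, i),
                             range(len(points))))


# Three toy points chosen so the distances are easy to check by eye.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = distance_matrix(points)
for row in matrix:
    print(row)
```

Because no row depends on any other, the same decomposition scales to thousands of threads on a GPU, which is where speedups like the ones described here come from.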

via R GPGPU.

Science

VizWorld.com is a production of VizWorld, LLC © 2009