The brains behind one of my favorite visualization tools, ParaView, the guys at Sandia National Labs, have turned their sights on new prey: textual analysis. Their new tool, “ParaText”, can process huge collections of text in parallel across massive supercomputers, churning through 500-million-word collections in under a day. (War and Peace is only 560,000 words.)

ParaText distributes a different subset of documents to each processor, which then analyzes that subset on its own. Because the developers worked to minimize communication between processors and keep ParaText scalable, the result is a tool that can run in a variety of environments, including on a grid or cloud. It can be embedded in any application using a native C++ API, Python, or Java. A standalone ParaText MPI executable can be run from the command line. Or ParaText can be deployed as a web service using a RESTful API.
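To make the "different subset per processor" idea concrete, here is a minimal Python sketch of the general pattern. It is not ParaText's API: the function names (`partition`, `count_terms`, `parallel_term_counts`) are made up for illustration, the "analysis" is just a word count, and a thread pool stands in for the MPI processes a real run would use (so it shows the structure, not real speedup). The point is that each worker touches only its own documents, and the only communication is one merge at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(documents, num_workers):
    """Deal documents round-robin into num_workers disjoint subsets."""
    return [documents[i::num_workers] for i in range(num_workers)]


def count_terms(subset):
    """Analyze one subset locally: count lowercased, whitespace-split terms."""
    counts = Counter()
    for text in subset:
        counts.update(text.lower().split())
    return counts


def parallel_term_counts(documents, num_workers=4):
    """Count terms per subset independently, then merge the partial results."""
    subsets = partition(documents, num_workers)
    # Hypothetical stand-in for MPI ranks: each worker sees only its subset.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_terms, subsets))
    # The single communication step: combine the per-worker counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    docs = ["war and peace", "peace and quiet", "war games"]
    print(parallel_term_counts(docs, num_workers=2))
```

Because no worker needs anything from another until the final merge, the same shape carries over to looser environments like a grid or cloud, where communication is expensive.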

ParaText is based on the existing Titan Informatics Toolkit, created by Sandia and Kitware.

via Textual analysis in parallel | iSGTW.