The brains behind one of my favorite visualization tools, ParaView, the guys at Sandia National Labs, have turned their sights on new prey: textual analysis. Their new tool, “ParaText,” can process huge collections of text in parallel across large supercomputers, churning through 500-million-word collections in under a day. (War and Peace is only 560,000 words.)
ParaText distributes a different subset of documents to each processor, and each processor analyzes its own subset. Because the developers worked to minimize communication between processors and make ParaText scalable, the result is a tool that can run in a variety of environments, including on a grid or cloud. It can be embedded in any application using a native C++ API, Python, or Java. A standalone ParaText MPI executable can be run from the command line, or ParaText can be deployed as a web service using a RESTful API.
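To make the "different subset per processor" idea concrete, here is a minimal Python sketch of the general pattern. It is not ParaText's API or its actual algorithm: the function names (`partition`, `count_terms`, `analyze`) are made up for illustration, simple word counting stands in for the real analysis, and threads stand in for the MPI processes a real run would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(documents, num_workers):
    """Deal documents round-robin into num_workers disjoint subsets."""
    return [documents[i::num_workers] for i in range(num_workers)]


def count_terms(subset):
    """Analyze one subset locally, without talking to other workers."""
    counts = Counter()
    for text in subset:
        counts.update(text.lower().split())
    return counts


def analyze(documents, num_workers=4):
    """Count terms over all documents by merging per-worker results."""
    subsets = partition(documents, num_workers)
    # Threads are a stand-in here (assumption); ParaText itself uses MPI.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_terms, subsets))
    # The only shared step: merge one small summary per worker.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    docs = ["war and peace", "peace and quiet", "war of words"]
    print(analyze(docs, num_workers=2).most_common(3))
```

The point of the design is that each worker only ever touches its own documents, and the merge at the end moves a small summary rather than the raw text, which is the kind of low-communication layout that lets a job spread across a grid or cloud.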
ParaText is based on the existing Titan Informatics toolkit, created by Sandia and Kitware.
Color me skeptical here. While words do mean things, their meaning is also fluid. For example, think about any of the idioms that you know and how a foreigner has trouble understanding them. I am sure that as an American I would have trouble understanding a German idiom based just on the meaning of the words themselves. Computers can get you in the ballpark, as Google Translate shows, but they cannot get you all the way there. Thus the article's comment, “It can be used to analyze data, compare documents, find similar documents across languages, summarize or grade essays, and as part of decision-making applications,” makes me skeptical.