Mining Text Can Boost Research

Article excerpt

Text mining denotes software and computing processes that apply linguistic, statistical, and machine learning techniques to discover and exploit meaning hidden in text.

But "hidden" as used here is misleading. While documents may be senseless collections of bits to a computer, we mere humans have no trouble reading and classifying them and using their information content. Most texts of interest are highly structured, fitting well-defined forms (books, letters, articles, scientific papers, conversations) and following logical, even if irregular, grammatical rules, with semantic meaning that can be inferred from context, syntax, and word morphology.

Simply put, text mining enables machines to do what scientists, researchers, lawyers, librarians--readers of every kind--have been doing without conscious reflection for as long as text has existed. Text mining not only replicates human abilities but also magnifies them. It allows us to work with documents in languages that are foreign to us, to process large volumes of information very quickly, and to tease out complex patterns that are indiscernible without the application of statistical techniques. Text mining is a research tool, one that both weans machines from their traditional diet of rigidly structured and formatted data and greatly extends our reach.

Text mining replicates many librarianship functions. It can be used to infer taxonomies--classification schemes--suitable for diverse subject-matter domains, and it can automate classification of individual documents according to those categorizations. It can summarize texts and facilitate sophisticated searches. It can identify and extract entities--for instance, names of people, places, chemical compounds, and diseases; dates; e-mail addresses; and so on. It can handle concepts such as reputation and sentiment. And it can discern facts such as events and linkages that characterize and interrelate the discovered entities and concepts.
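To make entity extraction concrete, here is a minimal sketch in Python that pulls e-mail addresses and dates out of a sentence with regular expressions. The patterns, the function name extract_entities, and the sample sentence are assumptions made for illustration; they are not taken from any particular text-mining product, and production systems generally combine lexicons, grammars, and statistical models rather than relying on a pair of regular expressions.

```python
import re

# Illustrative patterns only (assumptions, not from the article).
# EMAIL_RE is a simplified address pattern; DATE_RE matches ISO-style dates only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Return a dict mapping each entity type to its matches, in text order."""
    return {
        "email": EMAIL_RE.findall(text),
        "date": DATE_RE.findall(text),
    }


sample = "Contact jane.doe@example.org before 2024-03-15 or write to bob@example.com."
entities = extract_entities(sample)
print(entities)
```

Names of people, places, chemical compounds, and diseases cannot be captured by fixed patterns like these; they typically call for domain lexicons or trained models.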

Uses of Text Mining

Text mining has been successfully introduced in a number of business domains. These areas are characterized by high information volume, well-defined goals, constrained vocabularies, and set business rules. Examples of text-mining applications include:

* Drug discovery: researchers seek to discover the effects of chemical compounds and therapies on medical symptoms;

* Genomics: gene sequences are correlated with physical expressions;

* Warranty-claims analysis, which seeks to understand defects and their causes and to identify patterns in claims that may indicate special conditions or fraud;

* Customer-relationship management for functions such as analysis of call-center notes and automated e-mail routing and processing;

* Media monitoring for corporations that seek to manage their reputations, follow trends, and respond quickly to public perceptions;

* Law enforcement, intelligence, and counter-terrorism.

The process starts with lexical analysis and tagging: breaking a text into constituent word, phrase, entity, and concept elements; marking those elements with XML (Extensible Markup Language) tags; and generating basic statistics and indexes. This starting step typically involves use of language- and subject-domain-specific lexicons and grammars. …
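The starting step described above can be sketched in a few lines of Python: split a text into tokens, wrap each token in an XML tag, and build simple counts and a positional index. Everything named here is a hypothetical choice for illustration, including the toy LEXICON, the tag names entity, w, and p, and the sample sentence; a real pipeline would use language- and domain-specific lexicons and grammars far richer than this.

```python
import re
from collections import Counter, defaultdict
from xml.sax.saxutils import escape

# Toy lexicon (an assumption): maps known terms to entity types.
LEXICON = {"aspirin": "compound", "headache": "symptom"}

# Runs of ASCII letters/digits are words; any other non-space character
# becomes a single-character token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    """Break text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def tag_tokens(tokens):
    """Wrap each token in a (hypothetical) XML tag: entity, w, or p."""
    parts = []
    for tok in tokens:
        kind = LEXICON.get(tok.lower())
        if kind:
            parts.append(f'<entity type="{kind}">{escape(tok)}</entity>')
        elif tok.isalnum():
            parts.append(f"<w>{escape(tok)}</w>")
        else:
            parts.append(f"<p>{escape(tok)}</p>")
    return "".join(parts)


def build_index(tokens):
    """Map each lowercased word to the token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        if tok.isalnum():
            index[tok.lower()].append(pos)
    return dict(index)


text = "Aspirin relieved the headache; aspirin is cheap."
tokens = tokenize(text)
tagged = tag_tokens(tokens)
index = build_index(tokens)
counts = Counter(t.lower() for t in tokens if t.isalnum())

print(tagged)
print(counts.most_common(3))
print(index["aspirin"])
```

The lexicon lookup is case-insensitive, so "Aspirin" and "aspirin" receive the same entity tag and share one index entry.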