Academic journal article Communications of the IIMA

Dear: A New Technique for Information Extraction and Context-Dependent Text Mining

Article excerpt

INTRODUCTION

The availability and increased power of computer technology, the decreased cost of disk storage space, and the connectivity provided by the Internet have created a situation in which vast numbers of unstructured text-based documents are stored electronically. Both data mining and text mining techniques are used to access text-based information. Data mining techniques require data to be stored in well-organized, structured formats, whereas text mining techniques can extract useful information from unstructured document collections, which makes them well suited to processing these documents. The goal of text mining is to structure document collections to improve the ability of users to retrieve and apply knowledge implicitly contained within those collections (Ikonomakis, Kotsiantis, & Tampakas, 2005). Text mining proceeds through three phases to accomplish this goal: pre-processing, pattern discovery, and visualization.

Within document collections, natural language conveys meaning through a maze of synonyms and domain-specific terms (Blake & Pratt, 2001). The text mining pre-processing phase cleans and analyzes document collections to transform implicit meaning into normalized and explicitly structured concepts. Pre-processing challenges include defining ways to manage the heterogeneity of terms and phrases that results from the increasingly dynamic and geographically dispersed contributions to document collections (Kwak & Yong, 2010).

The pattern discovery phase analyzes and derives distributions of concepts across document collections to help users filter and identify relevant documents. A challenge of pattern discovery is to retrieve manageable subsets of relevant documents and then alert users to further context-dependent queries that may refine their initial results. A challenge of visualization, the third phase of text mining, is to support the user by providing dynamic graphs that show relevant relationships among the identified documents (Feldman & Sanger, 2009).
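The first two phases described above can be illustrated with a minimal Python sketch. It is not the system developed in this research. The concept map, the sample documents, and the function names `preprocess` and `concept_distribution` are all hypothetical, chosen only to show how surface terms might be normalized into concepts and how the distribution of those concepts across a collection might be derived.

```python
import re
from collections import Counter

# Hypothetical mapping from surface terms to normalized concepts.
# A real system would derive this from richer linguistic resources.
CONCEPT_MAP = {
    "assessment": "evaluation",
    "evaluation": "evaluation",
    "appraisal": "evaluation",
    "curriculum": "program",
    "program": "program",
}


def preprocess(text):
    """Pre-processing: lowercase, tokenize, and map known terms to concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [CONCEPT_MAP[t] for t in tokens if t in CONCEPT_MAP]


def concept_distribution(documents):
    """Pattern discovery: count how many documents mention each concept."""
    counts = Counter()
    for text in documents.values():
        # Use a set so each document contributes at most once per concept.
        counts.update(set(preprocess(text)))
    return counts


# Invented sample collection, for illustration only.
docs = {
    "doc1": "Annual assessment of the nursing curriculum.",
    "doc2": "Faculty appraisal guidelines.",
    "doc3": "Parking regulations for visitors.",
}
distribution = concept_distribution(docs)
```

In this toy collection, "assessment" and "appraisal" both normalize to the same concept, so two documents are grouped together even though they share no surface term.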

Taken together, these challenges provide the justification for this research project. Our goal is to design and explore a text mining system that quickly and easily retrieves information stored in numerous text documents. Further, retrieval of this information should not rely on the end user knowing all the various search terms under which it may have been stored. Rather, it should allow useful information to be located via search terms familiar to the end user. The technique developed in this research utilizes two externally developed, freely available word context systems along with a custom search program written in F#, Microsoft's new declarative .NET language. The methodology is explained and demonstrated through a system designed to improve the semantic match between a document collection and an end user's knowledge requirements. This project designs and demonstrates techniques to bridge the gap between unstructured document collections and text mining tools to support retrieval and evaluation of knowledge within those collections.
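The idea of locating information through terms familiar to the end user can be sketched as a simple query expansion. The sketch below is in Python rather than F#, and it does not reproduce the program described in this article. The `RELATED_TERMS` table is a hypothetical stand-in for the word context systems, and the documents and function names are invented for illustration.

```python
import re

# Hypothetical word-context data: each familiar term lists related terms
# that authors in other disciplines might have used instead.
RELATED_TERMS = {
    "assessment": {"evaluation", "appraisal", "review"},
    "course": {"class", "module"},
}


def expand_query(query):
    """Return the query terms plus any related terms from the lookup table."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    expanded = set(terms)
    for term in terms:
        expanded |= RELATED_TERMS.get(term, set())
    return expanded


def search(query, documents):
    """Return the names of documents containing any expanded query term."""
    wanted = expand_query(query)
    hits = []
    for name, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if wanted & words:
            hits.append(name)
    return sorted(hits)


# Invented sample collection, for illustration only.
documents = {
    "biology_report": "Program evaluation for the biology major.",
    "history_memo": "Peer review of capstone projects.",
    "facilities_note": "Schedule for building maintenance.",
}
results = search("assessment", documents)
```

Here the query "assessment" matches two documents that never use that word, because the expanded query also contains "evaluation" and "review". A plain keyword match on "assessment" alone would return nothing.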

PROBLEM DESCRIPTION AND RELATED RESEARCH

In its simplest form, the problem addressed by this research is to develop a method that can find information stored in unstructured text documents using a query vocabulary that may not match the vocabulary used in the stored documents. To illustrate this situation, assume that a new assessment office is created within a university. The purpose of the office is to coordinate the various programs offered by the colleges within the university and to devise consistent assessment schemes that can be applied uniformly across all colleges. The initial task of the new Vice-President of University Assessment is to determine how the existing assessment systems work. This requires searching through voluminous documents written by different authors in different disciplines over an extended period of time. …
