Academic journal article
By Sheikh, Mahmudul; Conlon, Sumali
Journal of International Technology and Information Management, Vol. 19, No. 4
Most electronic documents are available in free-text format. The ability to analyze free-text information can play a crucial role in decision making through pattern discovery (Lo & Hsieh, 2003). The enormous volume of available documents can be processed semi-automatically by a closed or an open system (Banko & Etzioni, 2008). Important applications such as electronic governance (Zhang, Lin, Lin, & Hsieh, 2008) may use information extracted from these documents to gain competitiveness and public trust by enforcing transparency. An effectively designed Information Extraction (IE) system can be useful in this regard. An IE system is designed to extract user-specified items or pre-defined events (Srihari, Li, Niu, & Cornell, 2008) from the text documents of a specific domain.
The items to be extracted are specific words or phrases in the sentences of the input documents. An IE system can be used to fill the fields of a relational database table from text documents. Examples of such fields include the name of a company or its type of business. Saving the extracted items in a relational database makes them easier to interpret. The extracted items can also be used to fill out forms specified by a user. The information saved in the database can be further processed to identify relevant correlations.
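The idea of filling database fields from text can be illustrated with a minimal sketch. It is not the system described in this article: the sentence pattern, the `company` table, its column names, and the sample text are all hypothetical, and a single regular expression stands in for a real extraction component.

```python
import re
import sqlite3

# Hypothetical pattern: "<Company Name> is a/an <business type> company".
# A real IE system would rely on much richer linguistic analysis.
PATTERN = re.compile(
    r"(?P<name>[A-Z][\w&]*(?: [A-Z][\w&]*)*) is an? "
    r"(?P<type>[a-z]+(?: [a-z]+)*?) company"
)


def extract_items(text):
    """Return (company name, business type) pairs found in free text."""
    return [(m.group("name"), m.group("type")) for m in PATTERN.finditer(text)]


def fill_table(conn, rows):
    """Store extracted items in an illustrative relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS company (name TEXT, business_type TEXT)"
    )
    conn.executemany("INSERT INTO company VALUES (?, ?)", rows)
    conn.commit()


# Made-up sample text, used only for illustration.
text = (
    "Acme Tools is a manufacturing company. "
    "Later, Blue River Foods is a food processing company."
)

conn = sqlite3.connect(":memory:")
fill_table(conn, extract_items(text))
rows = conn.execute(
    "SELECT name, business_type FROM company ORDER BY name"
).fetchall()
print(rows)
```

Once the items are in a table, ordinary SQL queries (for example, grouping by `business_type`) can be used for the further processing mentioned above.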
Compared to the extraction of structured data performed by, for example, an ERP system (Wu, Hsieh, Shin, & Wu, 2005), extracting information from a free-text domain is a challenging problem. The reason is that only the contextually relevant target words or phrases have to be extracted. Because purely statistical methods do not consider the contexts of words, methods such as the Naive Bayes classifier or Average Mutual Information (Carven et al., 2000) are not effective in extracting information. Most state-of-the-art IE systems use a combination of statistical and machine learning methods (Freitag, 2000). Iria and Ciravegna (2006) developed an ontology learning (Suchanek, Sozio, & Weikum, 2009) and document classification method that represents language resources, such as syntactic and semantic parsers, in a way that is independent of the extraction process. Yildiz and Miksch (2007) proposed an unsupervised rule learning method (Downey, Schoenmackers, & Etzioni, 2007; Sekine & Oda, 2007) based on ontological structures. Syntactic and semantic analyses have been found to improve precision (Feldman, Aumann, Finkelstein-Landau, Hurvitz, Regev, & Yaroshevich, 2002). Xu, Uszkoreit, Li, and Felger (2008) designed a system that extracts linguistic grammar rules from a semantic seed. Kim, Jeong, Lee, Ko, and Lee (2008) developed an alignment-based pattern matching technique that can extract relationships between two arguments. Jain, Ipeirotis, and Gravano (2008) developed a technique to process structured queries from the relations extracted from unstructured texts. Stevenson and Greenwood (2009) found that IE models that use the relevant portions of a dependency tree (Wu & Weld, 2008) perform better. A rule-based decision tree was found effective in extracting information from online resumes (Bhargavi, Jyothi, Jyothi, & Sekar, 2008).
The use of a deep-level ontology structure (Welty & Murdock, 2006) and the use of multiple ontologies from the same domain (Wimalasuriya & Dou, 2009) have been found effective in improving performance. Assigning weights to syntactic features according to their co-occurrences in the related class has been found effective in determining ontologies about person names and their geographic locations (Tanev & Magnini, 2008).
Enterprise applications of IE systems require higher accuracy and scalability (Chiticariu, Li, Raghavan, & Reiss, 2010). Like many other systems, the one by Pennacchiotti and Pantel (2009) combined a pattern-matching method with a distributional method. …