Academic journal article Library Philosophy and Practice

Metadata and Linked Data in Word Sense Disambiguation

Article excerpt

Introduction

Word Sense Disambiguation (WSD) has been described as an "AI-complete" problem (Mallery, 1998), i.e., a task that is relatively easy for people but considerably more difficult for machines. If someone submits a query containing a polysemous word (e.g., "plant," "bass," or "mercury"), how is an information retrieval system to determine which sense of the word is intended? Tried-and-tested methods exist, such as defaulting to the predominant sense of the word (McCarthy, Koeling, Weeds, & Carroll, 2004) or examining the words adjacent to the query term to determine the statistically most likely meaning (Jurafsky & Martin, 2009; Manning & Schütze, 1999), but these methods often produce less-than-satisfactory results [often around 70%] (Navigli, 2009). Furthermore, these methods have depended heavily on manually created knowledge sources (Edmonds, 2000), which are expensive to build and subject to change, creating what is termed a knowledge acquisition bottleneck (Gale, Church, & Yarowsky, 1992). Linked Data technologies (Berners-Lee, 2006), however, allow us to draw on existing ontologies and lexica, which can then be exploited to improve the automatic semantic understanding of a word. This paper examines several systems that purport to disambiguate words by using Linked Data, and some of the models these systems use to ensure interoperability.
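The two baseline strategies mentioned above can be illustrated with a minimal sketch. It is not drawn from any of the cited works: the sense inventory, the frequency counts, the signature words, and the function names are all hypothetical, invented purely for illustration. In addition, the context method shown here is a simple word-overlap count, a much cruder stand-in for the statistical models the cited textbooks describe.

```python
# Hypothetical toy sense inventory (all values invented for illustration):
# word -> sense label -> made-up corpus frequency and "signature" context words.
SENSES = {
    "bass": {
        "fish": {"freq": 30, "signature": {"fishing", "lake", "caught", "river"}},
        "music": {"freq": 70, "signature": {"guitar", "band", "play", "sound"}},
    },
}


def most_frequent_sense(word):
    """Baseline 1: always return the sense with the highest (toy) frequency."""
    senses = SENSES[word]
    return max(senses, key=lambda s: senses[s]["freq"])


def context_sense(word, context):
    """Baseline 2 (simplified): pick the sense whose signature words overlap
    most with the words surrounding the query term; fall back to the most
    frequent sense when no signature word appears in the context."""
    tokens = set(context.lower().split())
    senses = SENSES[word]
    overlap = {s: len(tokens & info["signature"]) for s, info in senses.items()}
    best = max(overlap, key=overlap.get)
    if overlap[best] == 0:
        return most_frequent_sense(word)
    return best


if __name__ == "__main__":
    print(most_frequent_sense("bass"))
    print(context_sense("bass", "he caught a bass in the lake"))
```

Even in this toy form, the dependence on a hand-built inventory is visible: every new word or sense requires someone to supply frequencies and signature words, which is the bottleneck the paragraph above describes.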

Literature Review

The most complete treatment of WSD is arguably Agirre & Edmonds [eds.] (2007), which presents a detailed definition of the problem, along with its history and numerous algorithms used in practice. Kwong (2013) offers slightly more recent coverage, along with predictions as to how WSD methods will evolve in the near future. Generalists may find the survey by Navigli (2009) sufficient, or the chapters covering WSD in either Jurafsky & Martin (2009) or Manning & Schütze (1999). SemEval, originally named Senseval (Kilgarriff, 1998), is an ongoing evaluation project that serves as a baseline for assessing various WSD methods, including many that will be examined in this paper.

Linked Linguistic Open Data (LLOD) is heavily dependent on metadata, and any consideration of it requires an examination of its standards. A brief history of linguistic annotation can be found in Palmer & Xue (2013). Bird & Simons (2003a) and Ide, Romary, & de la Clergerie (2004) proposed sets of best practices for linguistic annotations, while Simons, Bird, & Spanne (2008) offered a more recent set of recommendations that specifically suggested language codes from ISO 639-3 be used in metadata. Ide & Pustejovsky (2010) suggested a list of best practices for language technology metadata, focusing heavily on the work of OLAC and the European Language Resources Association (ELRA). Gracia, Montiel-Ponsoda, Cimiano, Gómez-Pérez, Buitelaar, & McCrae (2012) considered the issue of Linked Data being stored in different languages, and suggested that techniques such as ontology localization, ontology mapping, and cross-lingual ontology-based information access and presentation would help prevent information from being locked up in linguistic data silos. Gayo, Kontokostas, & Auer (2013) presented a set of best practices for multilingual linked open data, and pointed out that SPARQL queries can be improved if tags are identified by language. Reviews of specific linguistic annotation schemes include: the Open Language Archives Community [OLAC] metadata set (Bird & Simons, 2003b); the General Ontology for Linguistic Description [GOLD] (Farrar & Langendoen, 2003); ISOcat, a Data Category Registry (DCR) for ISO TC 37 (terminology and other language and content resources) (Kemps-Snijders, Windhouwer, Wittenburg, & Wright, 2009); the ISO/TC 37/SC 4 standard (Lee & Romary, 2010); the lemon (LExicon Model for ONtologies) model (McCrae, Aguado-de-Cea, Buitelaar, Cimiano, Declerck, Gómez-Pérez, …
