In a cohesive text, language units--whether they are words, phrases, or entire sentences--are connected through a variety of relations, which contribute to the overall meaning of the text and maintain the cohesive structure of the text and the discourse unity. Since the early ages of artificial intelligence, associative or semantic networks have been proposed as representations that enable the storage of such language units and the relations that interconnect them and that allow for a variety of inference and reasoning processes, simulating some of the functionalities of the human mind. The symbolic structures that emerge from these representations correspond naturally to graphs--where text constituents are represented as vertices and their interconnecting relations form the edges in the graph.
The last decade has brought on a number of exciting research papers that apply graph-based methods to an increasingly large range of natural language problems, ranging from lexical acquisition to text summarization. In this article, we overview several of these methods and their application to natural language processing. To reflect the fact that the algorithms and representations originate in different communities--natural language processing and graph theory--we will be using a dual vocabulary to describe these methods: networks are graphs, nodes are vertices, and links are edges.
In terms of graph-based representations, depending on the natural language-processing application, a variety of node and edge types have been used. Text units of various sizes and characteristics can be added as vertices in the graph, for example, words, collocations, word senses, entire sentences, or even entire documents. Note that the graph nodes do not have to belong to the same category. For example, both sentences and
words can be added as vertices in the same graph. Edges can represent cooccurrence (such as two words that appear in the same sentence or in the same dictionary definition), collocation (for example, two words that appear immediately next to each other or that may be separated by a conjunction), syntactic structure (for example, the parent and child in a syntactic dependency), and lexical similarity (for example, cosine between the vector representations of two sentences).
In terms of graph-based algorithms, the main methods used so far can be classified into: (1) semisupervised classification (Zhu and Ghahramani 2002; Zhu and Lafferty 2005; Toutanova, Manning, and Ng 2004; Radev 2004; Otterbacher, Erkan, and Radev 2005), where random walks or relaxation are applied on mixed sets of labeled and unlabeled nodes; (2) network analysis (Masucci and Rodgers 2006; Caldeira et al. 2006), where network properties such as diameter, centrality, and so on, are calculated; (3) graph-based clustering methods (Pang and Lee 2004, Widdows and Dorow 2002), such as min-cut methods; (4) minimum spanning-tree algorithms (McDonald et al. 2005).
In this article, we overview several graph-based approaches for natural language-processing tasks, which we broadly group into three main categories. First, we review research work done in the area of syntax, including syntactic parsing, prepositional attachment, and coreference resolution. We then describe methods used in lexical semantics, including word-sense disambiguation, lexical acquisition, and sentiment and subjectivity analysis. Finally, we review several natural language-processing applications that rely on graph methods, including text summarization, passage retrieval, and keyword extraction.
[FIGURE 1 OMITTED]
In this section we will discuss three papers, addressing methods for syntactic parsing (McDonald et al. 2005), prepositional attachment (Toutanova, Manning, and Ng 2004), and coreference resolution (Nicolae and Nicolae 2006).
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic (McDonald et al. …