Exploring Lexicographic Ontologies for Hierarchically Organizing the Greek Wikipedia Articles


1. Introduction

Wikipedia is one of the most successful worldwide collaborative efforts to assemble user-generated content that can serve as an informational reference source for the online population. Currently, Wikipedia hosts millions of articles on a variety of topics and across different languages, and it has been incorporated into several computer-based applications. A crucial factor in its success is its open nature, which enables everyone to edit, revise and/or question (via talk pages) the article contents. Considering the remarkable growth and extensive use of Wikipedia, a question arises naturally: how can we assess the quality of Wikipedia or, put differently, how can we ensure that the content it provides is useful to its readers? In an attempt to shed light on this issue, several researchers have proposed methods for assessing the quality of Wikipedia articles and for helping Wikipedia editors provide high-quality, well-organized information in them. Most existing methods concentrate on the English Wikipedia (because it is the richest and most widely used), although there have been successful attempts at assessing and/or improving Wikipedia for other natural languages.

In this article, we study the structural quality of the Greek Wikipedia and, specifically, we investigate how its contents can be organized so that users can navigate them successfully. Our study is therefore situated in the area of information management and combines tools and techniques from the field of natural language processing. In particular, we introduce a model that exploits the WordNet [8] semantic network for hierarchically organizing the Greek Wikipedia articles [I]. The motivation for our study is to turn the Greek Wikipedia corpus into a structured data source; we selected WordNet as our reference guide for data structuring because it hierarchically organizes the concepts it contains based on their underlying semantic relations. The goal of our work is to demonstrate experimentally the contribution of semantic networks to the hierarchical organization of unstructured online content. To this end, we have designed and implemented a model that automatically captures the semantic relationships that hold between the Wikipedia categories and, based on the identified semantic links, organizes the categories into a thematic hierarchy.

To unravel the semantics of the Wikipedia categories and to derive evidence about their relations, our model explores the information encoded in WordNet, a rich source of highly structured semantic information. The contribution of WordNet is most pronounced in the process of disambiguating the terms used to name the Wikipedia categories, as we will discuss later in the paper. In brief, our model follows a three-step approach. First, it matches the Wikipedia category names to their corresponding WordNet nodes in order to extract their senses. Second, it disambiguates the categories that match several WordNet nodes based on their estimated semantic similarity to other categories with which they co-occur in the Wikipedia articles. Third, having detected the semantics of each Wikipedia category, it borrows the hierarchical structure of the category names from WordNet and applies it to organize the categories into thematic hierarchies. Based on these steps, our model automatically assigns the Wikipedia categories to hierarchical structures and thereby facilitates the organization of the Wikipedia articles that have been classified under the corresponding categories. The experimental evaluation of our model indicates that WordNet is a valuable source for semantically organizing unstructured thematic data.
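
To make the three steps above concrete, the following minimal Python sketch runs them on a toy example. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the dictionaries HYPERNYM and SENSES are small hypothetical stand-ins for WordNet, the English category names are invented for readability, and the path-based similarity function is only one plausible choice, since the similarity measure is not specified in this section.

```python
from collections import defaultdict

# Hypothetical stand-in for WordNet: sense -> hypernym (parent) sense.
HYPERNYM = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "feline": "animal",
    "lion": "feline",
    "jaguar.animal": "feline",
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
    "jaguar.car": "car",
}

# Step 1 (matching): category name -> candidate senses. Hypothetical data.
SENSES = {
    "jaguar": ["jaguar.animal", "jaguar.car"],
    "lion": ["lion"],
    "feline": ["feline"],
    "vehicle": ["vehicle"],
}


def path_to_root(sense):
    """Return the hypernym chain from a sense up to the root."""
    path = []
    while sense is not None:
        path.append(sense)
        sense = HYPERNYM[sense]
    return path


def similarity(a, b):
    """Assumed measure: 1 / (1 + path length through the lowest common ancestor)."""
    depth_in_b = {s: i for i, s in enumerate(path_to_root(b))}
    for i, s in enumerate(path_to_root(a)):
        if s in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[s])
    return 0.0


def disambiguate(category, cooccurring):
    """Step 2: pick the sense most similar to the co-occurring categories."""
    candidates = SENSES.get(category, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    def score(sense):
        total = 0.0
        for other in cooccurring:
            other_senses = SENSES.get(other, [])
            if other_senses:
                total += max(similarity(sense, o) for o in other_senses)
        return total

    return max(candidates, key=score)


def build_hierarchy(resolved):
    """Step 3: borrow hypernym chains to link resolved senses into a tree.

    `resolved` maps category name -> chosen sense; the result maps each
    parent sense to the set of its child senses.
    """
    children = defaultdict(set)
    for sense in resolved.values():
        if sense is None:
            continue
        path = path_to_root(sense)
        for child, parent in zip(path, path[1:]):
            children[parent].add(child)
    return children


if __name__ == "__main__":
    # Categories that co-occur in one hypothetical article.
    article_categories = ["jaguar", "lion", "feline"]
    resolved = {
        c: disambiguate(c, [o for o in article_categories if o != c])
        for c in article_categories
    }
    print(resolved)
    print(dict(build_hierarchy(resolved)))
```

In this toy setting, the ambiguous category "jaguar" resolves to its animal sense when it co-occurs with "lion" and "feline", and to its car sense when it co-occurs with "vehicle". A real system would replace the two dictionaries with lookups against the actual WordNet resource.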

The remainder of the paper is organized as follows. We begin our discussion with a brief overview of relevant works. …