Academic journal article Journal of Digital Information Management

Taxonomy-Based Document Clustering

Article excerpt

1. Introduction

Document clustering is one of the most important tasks in text mining. It is also one of the major applications of machine learning and data mining [1]. Many applications, such as natural language processing and information retrieval, use document clustering techniques [2]. All of these applications rely on the capability of text categorization techniques to deal with natural language documents.

Document representation is an important issue in document clustering. It can affect the text categorization process and its performance [3]. Most research in text categorization assumes that a document is a Bag-Of-Words (BOW). In other words, in this representation the smallest segment of information in text data is a word token, not a letter or a sentence. BOW is a dictionary-based representation, and it ignores the spatial relationship between terms in the document.
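To make the BOW idea concrete, the following is a minimal sketch, assuming lowercasing and whitespace tokenization (the excerpt does not specify a tokenizer). The function name `bag_of_words` is hypothetical. It shows that two sentences with the same words in a different order receive the same representation, which is the loss of positional information described above.

```python
from collections import Counter


def bag_of_words(text):
    """Return word-token counts for a text; word order is discarded.

    Tokenization by lowercasing and splitting on whitespace is an
    assumption made for this sketch only.
    """
    tokens = text.lower().split()
    return Counter(tokens)


first = bag_of_words("the cat chased the dog")
second = bag_of_words("the dog chased the cat")

# Both sentences map to the same bag, although their meanings differ.
print(first == second)
print(first["the"])
```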

The BOW representation alone is not sufficient to serve as a feature vector in a text categorization task. One problem is that the size of the representation differs from document to document. One solution is to employ the Vector Space Model (VSM) to represent the BOW of the documents. VSM, originally a representation model for information retrieval systems, was first proposed by Salton [4]. In this model, every term, which is a single word or a group of words depending on whether one works with words or phrases, represents one dimension of the feature space. Every document is represented by a vector of terms, and each term has either a binary or a weighted value. There are several weighting schemes, such as Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and Term Frequency Constraint (TFC). The length of this vector equals the size of the dictionary, which is the set of all distinct words occurring in the data set. The jth entry of the vector represents the weight or score of the jth term of the dictionary in the document. This process is called term indexing. The VSM representation has some disadvantages, which include ignoring four important aspects of natural language text [5]: (i) term dependencies and correlation, (ii) text structure, (iii) grammar and language model, and (iv) the sequence of terms in the document. Some advanced vector space models, such as Latent Semantic Indexing (LSI) [6] and Latent Dirichlet Allocation (LDA) [7], address synonymy and polysemy in text analysis problems. For example, LSI explores the hidden semantic structure in a document collection. The drawbacks of extended VSM approaches such as LSI are their computational expense and poor scalability.
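Term indexing with a weighted VSM can be sketched as follows. This is an illustration under stated assumptions, not the weighting used in the paper: it uses one common TF-IDF variant, raw term frequency multiplied by log(N / df), and the names `build_dictionary` and `tfidf_vectors` are hypothetical. Every document vector has the same length as the dictionary, and its jth entry is the weight of the jth dictionary term.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Return the sorted list of all distinct terms in the collection."""
    return sorted({term for doc in docs for term in doc})


def tfidf_vectors(docs):
    """Map tokenized documents to fixed-length TF-IDF vectors.

    Assumed weighting for this sketch: tf(t, d) * log(N / df(t)).
    """
    dictionary = build_dictionary(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[term] * math.log(n_docs / df[term]) for term in dictionary]
        )
    return dictionary, vectors


docs = [
    "clustering groups similar documents".split(),
    "retrieval ranks documents for a query".split(),
    "clustering and retrieval share a vector model".split(),
]
dictionary, vectors = tfidf_vectors(docs)

# All vectors share one length, regardless of document size.
print(len(dictionary), [len(v) for v in vectors])
```

Note that a term occurring in every document would receive weight zero under this variant, since log(N / N) = 0.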

In this paper, a new approach to representing text data is proposed. The method translates the document clustering problem into query processing. The intuition behind this approach is that if a set of documents belongs to the same cluster, we can expect them to respond similarly to the same queries, which can be any combination of terms from the dictionary. In information retrieval, the target is to retrieve the document(s) relevant to a query; in document clustering, the goal is to find relevant queries that generate high-quality clusters (with low inter-cluster and high intra-cluster similarities).

Just as the proposed method translates document clustering into query processing, it also transforms feature selection into a query generation problem. In this paper, we propose to generate relevant and non-redundant queries from the domain taxonomy extracted from the document collection. Using this new model, the terms in the BOW model are transformed into the similarity scores of the Bag-Of-Queries (BOQ) model. The effectiveness of the proposed approach is evaluated by extensive numerical experiments using benchmark document data.
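The query-based representation might be pictured with the sketch below. It rests on several assumptions that the excerpt does not confirm: each query is treated as a plain set of terms, the similarity score is the cosine between a document's term-count vector and the query's binary vector, and the two queries are hand-written placeholders rather than queries generated from a taxonomy. The names `cosine` and `bag_of_queries` are hypothetical.

```python
import math
from collections import Counter


def cosine(doc_counts, query_terms):
    """Cosine similarity between term counts and a binary query vector."""
    dot = sum(doc_counts[term] for term in query_terms)
    doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    query_norm = math.sqrt(len(query_terms))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)


def bag_of_queries(doc_tokens, queries):
    """Represent a document by its similarity score to each query."""
    counts = Counter(doc_tokens)
    return [cosine(counts, query) for query in queries]


# Placeholder queries; the paper derives its queries from a domain taxonomy.
queries = [{"stock", "market"}, {"football", "match"}]
docs = [
    "stock market prices fell".split(),
    "the market for stock options".split(),
    "football match ended late".split(),
]
boq = [bag_of_queries(doc, queries) for doc in docs]

for scores in boq:
    print([round(s, 3) for s in scores])
```

In this toy collection the first two documents respond strongly to the first query and not at all to the second, while the third does the opposite, which matches the intuition that documents in the same cluster respond similarly to the same queries. The resulting vectors have one entry per query rather than one per dictionary term.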

The paper consists of seven sections. Following the introduction, Section 2 briefly introduces query-based document representation. …
