Academic journal article Journal of the Association for Information Systems

A Hybrid Attribute Selection Approach for Text Classification

Article excerpt

Abstract

The application of text mining in organizations is growing. Text classification, an important type of text mining problem, is characterized by a large attribute space and therefore requires an efficient and effective attribute selection procedure. There are two general attribute selection approaches: the filter approach and the wrapper approach. While the wrapper approach is potentially more effective in finding the best attribute subset, it is cost-prohibitive in most text classification applications. In this paper, we propose a hybrid attribute selection approach that is both efficient and effective for text classification problems. We apply the proposed approach to detect and prevent Internet abuse in the workplace, which is becoming a major problem in modern organizations. The empirical evaluations we conducted using a variety of classification algorithms, indexing schemes, and attribute selection methods demonstrate the utility of the proposed approach. We found that combining the filter and wrapper approaches not only boosts the accuracies of text classifiers but also significantly reduces computational costs.

Keywords: text mining, text classification, data mining, attribute selection, Internet abuse detection

(ProQuest: ... denotes formulae omitted.)

1. Introduction

As organizations are flooded with massive volumes of textual data, such as written documents, web pages, and emails, several have started to apply text mining techniques to sift through the unstructured or semi-structured data and discover useful patterns and models (Fan et al. 2006). Text mining and data mining utilize similar machine learning techniques, but work with different types of data (unstructured/semi-structured vs. structured). Text classification is an important type of text mining problem, where the class (a categorical dependent variable) of a document is predicted based on several attributes (independent variables) describing the document. Examples of text classification include junk e-mail filtering (Sakkis et al. 2003; Schneider 2003), web page classification (Chen and Hsieh 2006; Kwon and Lee 2003), anticipatory event detection (He et al. 2007), and online deception detection (Zhou et al. 2004). Internet abuse detection is another domain where text classification techniques can be applied. Various machine learning techniques can be used to automatically learn classification models (called classifiers) based on training examples with known cases of abuse and non-abuse. The learned classifiers can then be applied to predict the classes of new documents.
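
The learn-then-predict cycle described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes learner with add-one smoothing, which is only one of many possible classifiers and is not necessarily one evaluated in this paper. The training texts, the labels, and the names `TRAIN`, `train_nb`, and `predict_nb` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training examples (known cases of abuse and non-abuse).
TRAIN = [
    ("online casino bonus win big", "abuse"),
    ("poker tournament casino night", "abuse"),
    ("quarterly budget report meeting", "non-abuse"),
    ("project schedule meeting notes", "non-abuse"),
]


def train_nb(examples):
    """Learn per-class document counts and term counts from (text, label) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for a new document."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for token in text.lower().split():
            if token in vocab:  # terms never seen in training are ignored
                count = term_counts[label][token]
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log((count + 1) / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_nb(TRAIN)
print(predict_nb(model, "casino poker bonus"))
print(predict_nb(model, "budget meeting schedule"))
```

Here every distinct term is an attribute, which is why even a small document collection produces a large attribute space.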

The support vector machine (SVM) has been found to be one of the most accurate text classifiers across a large number of existing document collections (Chakrabarti 2003). But, as we argue in this paper, accuracy is only one of the performance measures for text classifiers; other measures, such as attribute selection time, classifier training time, and classifier testing time, are equally, if not more, important. The question that arises, then, is whether it is possible to boost the performance of other classifiers to bring them closer to SVM accuracy levels. To address that question, we propose a hybrid attribute selection approach that combines the filter and wrapper approaches.

Attribute selection (also called feature selection), i.e., selecting a subset of the attributes (features) that are most relevant to a classification problem, is a common preprocessing step. There are two general attribute selection approaches: the filter approach and the wrapper approach (Dash and Liu 1997; Hall and Holmes 2003; Witten and Frank 2005). In the filter approach, the attributes are evaluated by some relevance measure and filtered without invoking a learning algorithm. In the wrapper approach, the learning algorithm used to build the classifier is wrapped into the attribute selection procedure, so that multiple classifiers can be generated based on different subsets of attributes, and the subset that results in the best performance can be selected. …
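
The difference between the two approaches can be made concrete with a small sketch. Everything in it is an assumption made for illustration and not the procedure evaluated in this paper: the toy corpus, the relevance score (difference in class-conditional document frequency), the vote-based learner, the greedy forward search, and the names `filter_rank`, `classify`, `loo_accuracy`, and `wrapper_select`. The last two lines show one plausible way of chaining a filter and a wrapper; the excerpt does not specify how the proposed hybrid approach combines them.

```python
from collections import Counter

# Hypothetical toy corpus: each document is (set of terms, class label).
DOCS = [
    ({"casino", "bonus", "win"}, "abuse"),
    ({"casino", "poker", "meeting"}, "abuse"),
    ({"poker", "bonus", "report"}, "abuse"),
    ({"meeting", "report", "budget"}, "work"),
    ({"budget", "report", "win"}, "work"),
    ({"meeting", "budget", "schedule"}, "work"),
]
VOCAB = sorted(set().union(*(terms for terms, _ in DOCS)))


def filter_rank(docs, vocab):
    """Filter approach: rank attributes by a relevance score, no learner involved."""
    n = Counter(label for _, label in docs)

    def score(term):
        # Illustrative score: gap between the fractions of "abuse" and "work"
        # documents that contain the term.
        df = Counter(label for terms, label in docs if term in terms)
        return abs(df["abuse"] / n["abuse"] - df["work"] / n["work"])

    return sorted(vocab, key=lambda t: (-score(t), t))


def classify(train, terms, selected):
    """Toy learner: each training document votes once per shared selected term."""
    votes = Counter()
    for train_terms, label in train:
        votes[label] += len(terms & train_terms & selected)
    if sum(votes.values()) == 0:  # no evidence: fall back to the majority class
        votes = Counter(label for _, label in train)
    return max(votes, key=votes.get)


def loo_accuracy(docs, selected):
    """Leave-one-out accuracy of the toy learner on the selected attributes."""
    hits = sum(
        classify(docs[:i] + docs[i + 1:], terms, selected) == label
        for i, (terms, label) in enumerate(docs)
    )
    return hits / len(docs)


def wrapper_select(docs, candidates):
    """Wrapper approach: greedy forward selection driven by classifier accuracy.

    Returns (selected attributes, accuracy, number of classifier evaluations).
    """
    selected, best, evaluations = set(), loo_accuracy(docs, set()), 1
    while True:
        best_term = None
        for term in candidates:
            if term in selected:
                continue
            acc = loo_accuracy(docs, selected | {term})
            evaluations += 1
            if acc > best:
                best, best_term = acc, term
        if best_term is None:  # no candidate improves accuracy: stop
            return selected, best, evaluations
        selected.add(best_term)


full = wrapper_select(DOCS, VOCAB)                           # wrapper on all terms
hybrid = wrapper_select(DOCS, filter_rank(DOCS, VOCAB)[:4])  # filter first, then wrap
print(full, hybrid)
```

On this toy corpus both runs select the same three terms, but the run over the filtered candidates needs 11 classifier evaluations instead of 27. The figures only illustrate why the wrapper's cost grows with the number of candidate attributes; they say nothing about the results reported in the paper.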
