Journal of Digital Information Management

A Feature Selection Method to Handle Imbalanced Data in Text Classification


1. Introduction

The volume of digital documents available online is growing steadily, and text classification is the key technology for processing and organizing text data. A major problem in text classification, however, is the high dimensionality of the feature space, which can easily reach hundreds of thousands of terms. Because many of these terms are redundant or irrelevant, such high dimensionality can degrade classifier performance. An effective way to address this problem is to reduce the dimensionality of the feature space [1-2].

Dimension reduction methods include feature extraction and feature selection. In feature extraction, a new feature set is generated by combining or transforming the original features. In feature selection, a subset is chosen from the original set without transforming the feature space. Feature selection is performed with three kinds of methods: embedded, wrapper, and filter. Embedded and wrapper methods rely on a learning algorithm, whereas the filter method is independent of such algorithms. Because the filter method is simpler and has lower computational complexity than the other two, it has been widely used in text classification. Many filter methods have been proposed, including the improved Gini index, information gain (IG), chi-square statistics (CHI), document frequency (DF), orthogonal centroid feature selection (OCFS), and the DIA association factor (DIA) [3]. The new feature selection (NFS) method proposed in the current paper is a filter method as well.
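
To make the filter paradigm concrete, the short sketch below scores each term by its document frequency (DF), one of the filter methods listed above, and keeps the k best-scoring terms. The function name, the threshold k, and the toy corpus are illustrative assumptions, not details taken from the paper.

from collections import Counter

def df_filter(documents, k):
    """Keep the k terms that occur in the largest number of documents."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term at most once per document
    return {term for term, _ in df.most_common(k)}

# Toy corpus of tokenized documents; select the two most widespread terms.
docs = [["football", "match", "goal"],
        ["football", "league"],
        ["stock", "market", "goal"]]
print(df_filter(docs, 2))  # {'football', 'goal'}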

Imbalanced data are frequently encountered in text classification because the number of positive samples is usually considerably smaller than the number of negative samples. Imbalanced data generally cause classifiers to perform poorly on the minority class. The aim of text classification on imbalanced datasets is therefore to improve the classification performance on the minority class without degrading the overall performance of the classifier. Solutions have been proposed at both the data and algorithmic levels. Data-level solutions include oversampling the minority class and undersampling the majority class. Algorithmic-level solutions include designing new algorithms or optimizing existing ones to improve classifier performance.
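
As a rough illustration of the two data-level strategies just mentioned, the sketch below performs random oversampling of the minority class and random undersampling of the majority class on plain Python lists; it is a generic example, not the resampling scheme used later in this paper.

import random

def random_oversample(minority, majority):
    """Duplicate randomly chosen minority samples until the classes are balanced."""
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_undersample(minority, majority):
    """Randomly drop majority samples until the classes are balanced."""
    return minority, random.sample(majority, len(minority))

# Toy example: 3 positive (minority) documents versus 8 negative (majority) ones.
pos = ["d1", "d2", "d3"]
neg = ["d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11"]
balanced_pos, _ = random_oversample(pos, neg)
print(len(balanced_pos))  # 8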

Feature selection can be more important than the classification algorithm in highly imbalanced situations [4]. Selecting a word with strong class information can improve classifier performance; for example, the word "football" usually appears in the class "sports" [5]. In this paper, we propose an NFS method that differs from many existing feature selection methods. Using this method, we select terms that carry class information; we then combine the NFS method with data resampling technology to improve classification on imbalanced data. The effectiveness of the proposed method is demonstrated by experimental results.

2. Method

Many feature selection methods, such as mutual information (MI), IG, and CHI, have been used extensively in text classification [1-8]. CHI and MI are considered in this study and are described as follows:

Let t be any word; then, we can characterize its presence and absence in class c_i as follows:

A_i is the number of documents that contain word t and belong to class c_i;

B_i is the number of documents that do not contain word t and belong to class c_i;

C_i is the number of documents that contain word t and do not belong to class c_i; and

D_i is the number of documents that do not contain word t and do not belong to class c_i.

The CHI and MI metrics for a word t are defined by

CHI(t, c_i) = N(A_i D_i - B_i C_i)^2 / [(A_i + C_i)(A_i + B_i)(B_i + D_i)(C_i + D_i)] (1)

MI(t, c_i) = log[A_i N / ((A_i + C_i)(A_i + B_i))] (2)

where N = A_i + B_i + C_i + D_i is the total number of documents.
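
As a sketch, the CHI and MI scores in Eqs. (1) and (2) can be computed directly from the counts A_i, B_i, C_i, and D_i defined above. The small epsilon guard against division by zero and the toy counts are assumptions added for illustration.

import math

def chi_square(A, B, C, D):
    """CHI(t, c_i) as in Eq. (1); A, B, C, D are the per-class document counts."""
    N = A + B + C + D
    eps = 1e-12  # guard against an all-zero denominator
    return N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D) + eps)

def mutual_information(A, B, C, D):
    """MI(t, c_i) as in Eq. (2)."""
    N = A + B + C + D
    eps = 1e-12  # guard against log(0)
    return math.log((A * N + eps) / ((A + C) * (A + B) + eps))

# Toy counts for the word "football" in the class "sports" of a 100-document corpus.
A, B, C, D = 20, 10, 5, 65
print(chi_square(A, B, C, D), mutual_information(A, B, C, D))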
