Academic journal article Informatica Economica

Outlier Analysis of Categorical Data Using NAVF

Article excerpt

Outlier mining is an important task that discovers data records whose behavior is exceptional compared with the other records in the dataset. Outliers do not conform to the other data objects in the dataset. There are many effective approaches for detecting outliers in numerical data, but only a limited number for categorical datasets. We propose an algorithm, NAVF (Normally distributed Attribute Value Frequency), to detect outliers in categorical data. The algorithm utilizes the frequent pattern data mining method. It avoids the problem of having to supply the number of outliers k in order to obtain optimal accuracy in a classification model, as required by previous work such as Greedy, AVF, FPOF, and FDOD. The algorithm is applied to UCI ML Repository datasets, namely Nursery, Breast Cancer, Mushroom, and Bank, with numerical attributes excluded. The experimental results show that it is efficient for outlier detection in categorical datasets.

Keywords: Outliers, Categorical, AVF, NAVF, FPOF, FDOD

(ProQuest: ... denotes formulae omitted.)

1 Introduction

Outlier analysis is an important research field with many applications, such as credit card fraud detection, network intrusion detection, and the medical field. This analysis concentrates on detecting infrequent data records in a dataset.

Most existing systems concentrate on numerical or ordinal attributes. Categorical attribute values can sometimes be converted into numerical values, but this process is not always preferable. In this paper we present a simple method for categorical data.

The AVF method is one of the efficient methods for detecting outliers in categorical data. It calculates the frequency of each value in each data attribute and converts these frequencies into probabilities; it then obtains the attribute value frequency (AVF) score of each record by averaging the probabilities of its values, and selects the top k outliers as the records with the lowest AVF scores. The only parameter of this method is k, the number of outliers. FPOF is based on frequent patterns, which are adopted from the Apriori algorithm [1]. It calculates the frequent itemsets contained in each object, computes an FPOF score from their frequencies, and reports the k records with the lowest FPOF scores as outliers. This method takes more time to detect outliers than AVF. Its parameter is σ, a threshold value that decides which subsets of each data object are frequent. The next method is based on an entropy score. Greedy [2] is another method to detect outliers in categorical data. The previous approaches used to detect outliers are summarized in the next section.
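The AVF and FPOF scoring steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`avf_scores`, `fpof_scores`, `k_lowest`), the `max_len` cap on itemset size, and the toy records are all hypothetical. The FPOF part assumes the commonly used definition in which a record's score is the sum of the supports of the frequent itemsets it contains, divided by the total number of frequent itemsets, and it enumerates itemsets by brute force rather than with Apriori.

```python
from collections import Counter
from itertools import combinations


def avf_scores(records):
    """AVF score per record: mean relative frequency of its attribute values."""
    n, m = len(records), len(records[0])
    # Count how often each value occurs, separately for every attribute.
    counts = [Counter(rec[j] for rec in records) for j in range(m)]
    return [sum(counts[j][rec[j]] / n for j in range(m)) / m for rec in records]


def _itemsets(record, max_len):
    # Items are (attribute index, value) pairs, so equal values in
    # different attributes are never confused with each other.
    items = list(enumerate(record))
    for size in range(1, max_len + 1):
        yield from combinations(items, size)


def fpof_scores(records, min_support, max_len=2):
    """FPOF score per record (assumed definition, brute-force enumeration)."""
    n = len(records)
    support = Counter(s for rec in records for s in _itemsets(rec, max_len))
    # Keep itemsets whose relative support reaches the threshold (sigma).
    frequent = {s: c / n for s, c in support.items() if c / n >= min_support}
    if not frequent:
        return [0.0] * n
    return [
        sum(frequent.get(s, 0.0) for s in _itemsets(rec, max_len)) / len(frequent)
        for rec in records
    ]


def k_lowest(scores, k):
    """Indices of the k records with the lowest scores (the reported outliers)."""
    return sorted(range(len(scores)), key=scores.__getitem__)[:k]


# Hypothetical toy data: two categorical attributes per record.
records = [("red", "small"), ("red", "small"), ("red", "large"), ("blue", "tiny")]
print(k_lowest(avf_scores(records), 1))
print(k_lowest(fpof_scores(records, min_support=0.5), 1))
```

In both cases a low score marks a record built from rare values or rare value combinations, and the caller still has to choose k, which is the dependency that NAVF is intended to remove.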

2 Existing Approaches

Statistical Based

These methods adopt a parametric model that describes the distribution of the data, and the data are mostly univariate [3, 4]. Their main drawbacks are the difficulty of finding a correct model for different datasets and the fact that their efficiency decreases as the number of dimensions increases [4]. To rectify this problem, the principal component method can be used. Another way to handle high-dimensional datasets is to organize the data records in layers; however, these ideas are not practical for three or more dimensions.


Distance Based

Distance-based methods do not make any assumptions about the distribution of the data records; instead, they compute the distances between records. This leads to high computational complexity, so these methods are not useful for large datasets. Some improvements to the distance-based algorithms exist, such as that of Knorr et al. [5], who require that the fraction of dataset records lying close to an outlier be less than some threshold value. The complexity is still exponential in the number of nearest neighbours.
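A minimal sketch of a distance-based outlier test in this spirit is given below. It is an illustration under stated assumptions, not the algorithm of [5]: the function name `distance_based_outliers`, the parameters `d` (neighbourhood radius) and `max_fraction` (threshold), the use of Euclidean distance, and the toy points are all hypothetical choices. The nested loop over all pairs of records also shows why the cost grows quickly with dataset size.

```python
import math


def distance_based_outliers(points, d, max_fraction):
    """Flag records whose share of neighbours within distance d is below max_fraction."""
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the other records that lie within distance d of record i.
        near = sum(
            1 for j, b in enumerate(points) if j != i and math.dist(a, b) <= d
        )
        if near / (n - 1) < max_fraction:
            outliers.append(i)
    return outliers


# Hypothetical toy data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(points, d=2.0, max_fraction=0.5))
```

Because every record is compared with every other record, this naive version needs a number of distance computations that is quadratic in the number of records.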

Density Based

These methods are based on finding the density of the data and identifying outliers as those records lying in regions of low density. Breunig et al. calculate a local outlier factor (LOF) to identify whether or not an object has sufficient neighbours around it [6]. …
