Advanced Data Clustering Methods of Mining Web Documents

Article excerpt


This paper examines the use of advanced data clustering techniques in algorithms that employ abstract categories for the pattern matching and pattern recognition procedures used in data mining searches of web documents. With the rapid advances now taking place in data mining software technology, website managers and search engine designers have begun to struggle to maintain efficiency in "mining" for patterns of information and user behaviour. Part of the problem is the enormous amount of data being generated, which makes it difficult to search web document databases in real time. Real-time searching is critical for real-time problem solving, high-level documentation searches and the prevention of database security breaches.

The analysis of this problem will be followed by a detailed description of weaknesses in data mining methods, with suggestions for reducing pre-processing to improve the performance of search engine algorithms, and a recommendation of an optimum algorithm for this task. The first investigators to give serious thought to the problem of algorithm speed were researchers working in the area of database searches. The field is still in its infancy; most of the tools and techniques used for data mining today come from related fields such as pattern recognition, statistics and complexity theory. Only recently have researchers in these various fields begun interacting to solve mining and timing issues (Oliver & Armin, 2002).

Overview of the Methodology

The methodology employed in this paper will be experimental analysis, with the objective of testing the feasibility of abstract category data clustering algorithms for a real-world web application. To perform this test, a group of six linear-time clustering algorithms will be applied to a sample group of online web documents, simulating the activities of a web search engine looking for similar words, phrases or sequences in a large database of web articles, publications and records. The six techniques compared will be the K-Means, Single Pass, Fractionation, Buckshot, Suffix Tree and AprioriAll clustering algorithms.
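To make the idea of grouping documents by shared words concrete, the following is a minimal sketch of one of the techniques named above, Single Pass clustering, written in Python. It is an illustration under stated assumptions, not the implementation used in the study: the whitespace tokenizer, the cosine similarity measure, the threshold value of 0.3 and the function names `cosine` and `single_pass` are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(documents: Sequence[str], threshold: float = 0.3) -> List[List[int]]:
    """Cluster documents in one pass over the input.

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new cluster.
    Returns the document indices grouped by cluster.
    """
    profiles: List[Counter] = []    # summed term counts per cluster
    clusters: List[List[int]] = []  # document indices per cluster
    for index, text in enumerate(documents):
        terms = Counter(text.lower().split())  # naive whitespace tokenizer
        best, best_sim = -1, threshold
        for c, profile in enumerate(profiles):
            sim = cosine(terms, profile)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best == -1:
            profiles.append(terms)
            clusters.append([index])
        else:
            profiles[best].update(terms)  # fold the document into the cluster profile
            clusters[best].append(index)
    return clusters
```

Because every document is compared only with the current cluster profiles and is never revisited, the result depends on the order in which documents arrive, which is the usual trade-off of a single-pass approach.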

The procedure will be to measure the execution time of each test algorithm when clustering data sets consisting of whole documents, excerpts and keywords of a fixed quantity and size.
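A measurement procedure of this kind could be sketched as a small timing harness. The Python code below is only an assumed outline: the function `time_clustering`, the `repeats` parameter and the stand-in algorithm `group_by_first_word` are hypothetical names introduced for illustration, and the stand-in is a placeholder rather than one of the algorithms under test.

```python
import time
from typing import Callable, Dict, List, Sequence


def group_by_first_word(documents: Sequence[str]) -> List[List[str]]:
    """Hypothetical stand-in for a clustering algorithm under test."""
    groups: Dict[str, List[str]] = {}
    for doc in documents:
        words = doc.split()
        key = words[0].lower() if words else ""
        groups.setdefault(key, []).append(doc)
    return list(groups.values())


def time_clustering(
    algorithms: Dict[str, Callable[[Sequence[str]], object]],
    documents: Sequence[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each algorithm.

    Every algorithm is run `repeats` times on the same fixed data set and
    the minimum is kept, which reduces noise from other system activity.
    """
    results: Dict[str, float] = {}
    for name, cluster in algorithms.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            cluster(documents)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


if __name__ == "__main__":
    sample = ["web mining tools", "web search engines", "data clustering"]
    for name, seconds in time_clustering({"by_first_word": group_by_first_word}, sample).items():
        print(f"{name}: {seconds:.6f} s")
```

Keeping the data set fixed across runs, as the procedure specifies, means that differences in the reported times can be attributed to the algorithms rather than to the input.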

Purpose of the Study

The purpose of this study is to analyze and improve the use of data clustering techniques for creating abstract categories in algorithms, allowing data analysts to execute large-scale searches more efficiently. Increasing the efficiency of the search process requires a detailed knowledge of abstract categories, pattern matching techniques, and their relationship to search engine speed.

Data mining involves the use of search engine algorithms to look for hidden predictive information, patterns and correlations within large databases. The technique of data clustering divides datasets into mutually exclusive groups. The distance between groups is measured with respect to all the available variables, rather than only those variables that are specific predictors, to produce "abstract categories" for analysis. Search engine algorithms and user audit trails are complex, which leads to time-consuming quests for specific information. It is anticipated that the proposed study will identify the most efficient and effective data clustering algorithms for this purpose.

Background of the Problem

Data clustering is a technique employed for the purpose of analyzing statistical data sets. Clustering is the classification of similar objects into groups. This is accomplished by partitioning the data into subsets, known as clusters, so that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Essentially, the goal of clustering is to identify distinct groups within a dataset, and then place the data within those groups, according to their relationships with each other. …
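The idea of partitioning by proximity under a defined distance measure can be illustrated with a minimal K-Means sketch in Python. This is an example under stated assumptions rather than a reference implementation: it uses Euclidean distance, a fixed number of iterations, and a naive initialization that takes the first k points as starting centroids, and the names `euclidean` and `k_means` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points: Sequence[Point], k: int, iterations: int = 10) -> List[int]:
    """Assign each point to one of k clusters; returns a cluster label per point.

    Assumption: the first k points serve as the initial centroids. Practical
    implementations usually choose the starting centroids more carefully.
    """
    centroids: List[Point] = list(points[:k])
    labels: List[int] = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(k_means(data, k=2))
```

In this small data set the two points near the origin end up sharing one label and the two points near (10, 10) share the other, which matches the description of elements in a cluster being close to each other under the chosen distance measure.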