Discovering Interesting Association Rules in the Web Log Usage Data

Introduction

Due to the immense volume of Internet usage and web browsing in recent years, log files generated by web servers contain enormous amounts of web usage data that is potentially valuable for understanding the behaviour of website visitors. This knowledge can be applied in various ways, such as enhancing the effectiveness of websites through user personalization or developing directed web marketing campaigns (Anand, Mulvenna, & Chevalier, 2004; Cooley, Mobasher, & Srivastava, 1997).

Data mining methods, which, by definition, are suitable for automatic extraction of potentially interesting information from very large databases, are used to extract knowledge from the web usage log files. One of the popular data mining methods that has been used for this purpose is association rule finding (Kosala & Blockeel, 2000).

Originally, association rule mining algorithms were applied for the analysis of transactional databases (Agrawal, Imielinski, & Swami, 1993).

An association rule is defined as follows:

Let $I = \{i_1, \ldots, i_n\}$ be a set of items, and $T = \{t_1, \ldots, t_m\}$ a set of transactions, where each transaction $t_i$ consists of a subset of items in $I$. An association rule is then an implication of the form:

$X \rightarrow Y, \quad X \subset I, \quad Y \subset I, \quad X \cap Y = \emptyset$

An item set X has support s in T if s% of the transactions in T contain X.

An item set X is frequent if its support is higher than the user-specified minimum support.

The rule $X \rightarrow Y$ holds in T with confidence c if c% of the transactions in T that contain X also contain Y.
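
The support and confidence definitions above can be written compactly as follows. The names supp and conf are introduced here for illustration only and do not come from the article; the percentages s and c in the text correspond to these fractions multiplied by 100.

```latex
\mathrm{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|},
\qquad
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```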

The problem of mining association rules is to generate all association rules that are built from frequent item sets and whose confidence is greater than the user-specified minimum confidence.
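
To make the problem statement concrete, the following is a minimal Python sketch that enumerates every candidate item set by brute force and keeps the rules that pass both thresholds. It is an illustration under stated assumptions, not the algorithm used by the authors: the function names `support` and `mine_rules` are hypothetical, and practical miners such as Apriori prune the search space instead of enumerating it exhaustively.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force rule enumeration (illustrative; exponential in the number of items)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            supp_xy = support(candidate, transactions)
            # Keep only frequent item sets (support strictly above the threshold).
            if supp_xy <= min_support:
                continue
            # Split the frequent item set into every antecedent/consequent pair.
            for k in range(1, size):
                for antecedent in combinations(candidate, k):
                    x = frozenset(antecedent)
                    y = frozenset(candidate) - x
                    conf = supp_xy / support(x, transactions)
                    if conf > min_confidence:
                        rules.append((x, y, supp_xy, conf))
    return rules
```

For the four made-up transactions {bread, milk}, {bread, butter}, {bread, milk, butter} and {milk}, calling `mine_rules` with a minimum support of 0.4 and a minimum confidence of 0.9 keeps a single rule, {butter} → {bread}, with support 0.5 and confidence 1.0.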

While association rule finding algorithms are complete in that they find all rules that satisfy the defined constraints, they often produce a set of rules so large that it is difficult to exploit and to identify the rules that are truly interesting to the user. Various methods have been proposed to deal with this issue.

For example, a query language called "Mine Rule", originally developed for querying inductive databases, can be applied to mining the set of generated association rules (Meo, Luca Lanzi, Matera, Careggi, & Esposito, 2004). Furthermore, various methods have been proposed to prune the set of generated rules and discard irrelevant rules (Jaroszewicz & Simovici, 2002; Liu, Hsu, & Ma, 1999). Another area of research focuses on association rule 'interestingness measures', which help identify, within the set of generated association rules, the rules that give maximally useful information to the user (Tan, Kumar, & Srivastava, 2004). Some of the proposed association rule interestingness measures are all-confidence (Omiecinski, 2003), collective strength (Aggarwal & Yu, 1998), and conviction and lift (Brin, Motwani, Ullman, & Tsur, 1997).
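
As a rough illustration of three of the measures named above, the sketch below computes lift, conviction and all-confidence using their commonly cited definitions. The function names are hypothetical, the formulas are the standard textbook forms rather than quotations from the cited papers, and collective strength is omitted for brevity.

```python
import math


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= frozenset(t) for t in transactions) / len(transactions)


def lift(x, y, transactions):
    """supp(X u Y) / (supp(X) * supp(Y)); 1.0 means X and Y look independent."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


def conviction(x, y, transactions):
    """(1 - supp(Y)) / (1 - conf(X -> Y)); infinite when the rule never fails."""
    conf = support(set(x) | set(y), transactions) / support(x, transactions)
    if conf == 1.0:
        return math.inf
    return (1 - support(y, transactions)) / (1 - conf)


def all_confidence(itemset, transactions):
    """supp(Z) divided by the largest support of any single item in Z."""
    largest = max(support({item}, transactions) for item in itemset)
    return support(itemset, transactions) / largest
```

A lift close to 1 suggests that the antecedent and consequent occur together about as often as they would by chance, which is one way a high-confidence rule can still turn out to be uninformative.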

When applying association rule mining to web usage data, a web resource of a particular website is usually considered an item, while a website visitor session is considered a transaction of items. Here, a website visitor session is a set of web resources that a visitor requested during one event of browsing the website (Anand et al., 2004).
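
The mapping from log records to transactions described above might be sketched as follows. This is only an illustration: the record layout (visitor, timestamp, resource), the function name `sessionize` and the 30-minute inactivity timeout are assumptions made here and are not taken from the article, which does not say how sessions are delimited.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (visitor, timestamp, resource) records into sessions.

    Each session is a set of resources, i.e. one transaction of items.
    ASSUMPTION: a new session starts after `timeout` of visitor inactivity.
    """
    sessions = []
    last_seen = {}  # visitor -> (time of last request, index of current session)
    for visitor, timestamp, resource in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(visitor)
        if previous is None or timestamp - previous[0] > timeout:
            sessions.append(set())
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index].add(resource)
        last_seen[visitor] = (timestamp, index)
    return sessions


# Hypothetical log records: visitor id, request time, requested resource.
log = [
    ("v1", datetime(2024, 1, 1, 10, 0), "/index"),
    ("v1", datetime(2024, 1, 1, 10, 5), "/products"),
    ("v1", datetime(2024, 1, 1, 11, 0), "/index"),
    ("v1", datetime(2024, 1, 1, 11, 2), "/contact"),
    ("v2", datetime(2024, 1, 1, 10, 1), "/index"),
]
transactions = sessionize(log)
```

Here visitor v1 produces two sessions because of the 55-minute gap between requests, and visitor v2 produces one, so `transactions` holds three item sets that could be passed to an association rule miner.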

Although various interestingness measures and rule pruning methods have been applied to association rule mining of web usage data, extracting useful information from the set of generated association rules remains a difficult task (Geng & Hamilton, 2006; Huang, 2007).

Web usage data differs from market basket data in that it contains a large number of tightly correlated items (web resources or web pages) because of the link structure of a website. Web pages that are tightly linked together often occur in the same transaction, which is why the generated set of association rules contains a large number of so-called "hard" association rules that have very high confidence but are not truly interesting to the user. …