Principles of Data Mining. (Book Reviews)

David HAND, Heikki MANNILA, and Padhraic SMYTH. Cambridge, MA: MIT Press, 2001. ISBN 0-262-08290-X. xxxii+546 pp. $50.00 (H).

This book is a welcome addition to the steadily growing list of books that provide a statistical perspective on data mining, and it is likely to be of value to statisticians interested in data mining (and even more so to computer scientists and other nonstatisticians). The authors state in the Preface that "rather than discuss specific data mining applications at length... we have instead focused on the underlying theory and algorithms that provide the 'glue' for such applications." Accordingly, a substantial portion of the book is devoted to an exposition of the underlying principles of data mining. This is followed by a comprehensive introductory overview of the techniques. The content of the book is divided into three parts: Fundamentals, Data Mining Components, and Data Mining Tasks and Algorithms.

Fundamentals acquaints the reader with the elements of statistical thinking. It covers measuring, summarizing, and visualizing data and basic notions of statistical inference. This is largely elementary and familiar material for statisticians. Data Mining Components describes a common framework for analyzing data mining algorithms in terms of their basic components: task (e.g., regression, classification), structure (e.g., neural networks, linear discriminants), score functions (e.g., squared error loss, misclassification rate), optimization/search methods (e.g., gradient descent, greedy search), and data management (where the data will reside and how they will be accessed). Data Mining Tasks and Algorithms describes a whole range of data-mining techniques in the context of the general framework developed earlier. Individual chapters are devoted to density estimation and clustering, classification, regression, pattern discovery (uncovering local patterns in data), and retrieval by content (the problem of inexact queries, such as those faced by web-search engines).

Numerous books on data mining are available today. Along with this book, several others, including those by Bishop (1995); Duda, Hart, and Stork (2001); Hastie, Tibshirani, and Friedman (2001); and Ripley (1996), provide well-written, systematic coverage of various data-mining topics, with an emphasis on underlying statistical principles. A major difference between this book and the others is in its organization. Foundational principles are developed first, and then techniques are presented separately, whereas most of the others present the principles intermingled with the techniques. The authors seek to cover an ambitiously large range of techniques after having already devoted a major portion of the book to developing the foundations. Consequently, the latter part of the book consists of introductory expositions of various techniques, which, though well presented, often lack sufficient detail and realistic examples. A few promising methods, like support-vector machines and boosting, receive only a passing mention. …