A Tools-Based Approach to Teaching Data Mining Methods

Introduction

Data mining is the process of discovering useful and previously unknown information and relationships in large data sets (Campos, Stengard, & Milenova, 2005; Tan, Steinbach, & Kumar, 2006). Accordingly, data mining is the purposeful use of information technology to implement algorithms from machine learning, statistics, and artificial intelligence that analyze large data sets in support of decision making.

The field of data mining grew out of limitations in standard data analysis techniques (Tan et al., 2006). Advancements in machine learning, pattern recognition, and artificial intelligence algorithms, coupled with computing trends (CPU power, massive storage devices, high-speed connectivity, and software academic initiatives from companies like Microsoft, Oracle, and IBM), enabled universities to bring data mining courses into their curricula (Jafar, Anderson, & Abdullat, 2008b). As a result, Computer Science and Information Systems programs have been aggressively introducing data mining courses into their curricula (Goharian, Grossman, & Raju, 2004; Jafar, Anderson, & Abdullat, 2008a; Lenox & Cuff, 2002; Saquer, 2007).

Computer Science programs focus on the deep understanding of the mathematical aspects of data mining algorithms and their efficient implementation. They require advanced programming and data structures as prerequisites for their courses (Goharian et al., 2004; Musicant, 2006; Rahal, 2008).

Information Systems programs, on the other hand, focus on the data analysis and business intelligence aspects of data mining. Students learn the theory of data mining algorithms and their applications, then use tools that implement the algorithms to build mining models and analyze data for decision support. Accordingly, a first course in programming, a database management course, and a statistical data analysis course suffice as prerequisites. For Information Systems programs, a data-centric approach to data mining that emphasizes algorithm understanding and process automation, similar to Jafar et al. (2008a) and Campos et al. (2005), is more appropriate. A data mining course in an Information Systems program has (1) an analytical component, (2) a tools-based, hands-on component, and (3) a rich collection of data sets.

(1) The analytical component covers the theory and practice of the lifecycle of a data mining analysis project, elementary data analysis, market basket analysis, classification and prediction (decision trees, neural networks, naive Bayes, logistic regression, etc.), cluster analysis and category detection, testing and validation of mining models, and finally the application of mining models for decision support and prediction. Textbooks by Han and Kamber (2006) and Tan et al. (2006) provide comprehensive coverage of the terminology, theory, and algorithms of data mining.

(2) The hands-on component requires the use of tools to build projects based on the algorithms learned in the analytical component. We chose Microsoft Excel with its data mining add-in(s) as the front-end and Microsoft's Cloud Computing and SQL Server 2008 data mining computing engines as the back-end. Microsoft Excel is ubiquitous, and it is a natural front-end for elementary data analysis and presentation of data. Its data mining add-in(s) are available as a free download. The add-in(s) are automatically configured to send data to Microsoft's Cloud Computing engine server. The server performs the necessary analysis and returns the results to Excel, which presents them in tabulated and chart formats. Using wizards, the add-in(s) are also easily configured to connect to a SQL Server 2008 instance running Analysis Services, sending data and receiving analysis results back into Excel for presentation. The add-in(s) provide a rich, wizard-based, uniform graphical user interface to manage the data, the data mining models, the configurations, and the pre- and post-views of data and mining models. …
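
To make one of the topics listed under the analytical component concrete, the following is a minimal Python sketch of the two basic measures used in market basket analysis: the support of an itemset and the confidence of an association rule. It is an illustration only, not the Excel add-in workflow described above. The toy transactions and the function names (support, confidence, frequent_pairs) are assumptions introduced for this example and do not come from the article.

```python
from itertools import combinations

# Hypothetical toy transactions (assumed for illustration; not data from the article).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base


def frequent_pairs(transactions, min_support):
    """All item pairs whose support is at least min_support (brute force)."""
    items = sorted(set().union(*transactions))
    pairs = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            pairs[pair] = s
    return pairs


if __name__ == "__main__":
    print(support({"diapers", "beer"}, transactions))
    print(confidence({"diapers"}, {"beer"}, transactions))
    print(frequent_pairs(transactions, min_support=0.5))
```

With these toy transactions, the rule {diapers} -> {beer} holds in three of the four baskets that contain diapers. The brute-force pair enumeration is chosen for readability; algorithms such as Apriori, covered in the textbooks cited above, prune the search instead of testing every combination.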