Academic journal article Journal of Information Systems Education

A Realistic Data Cleansing and Preparation Project

Article excerpt

1. INTRODUCTION

Database courses commonly use cleansed data stored in well-defined, simplified relations on common domains for lectures and assignments on query languages, including SQL. Examples of these well-analyzed and abridged 'toy' domains include employee/department/project and student/course/enrollment (Wagner et al., 2003). Simplified relations on common domains are also used regularly in popular database textbooks (Hoffer et al., 2008; Elmasri et al., 2010).

For database modeling, students are frequently asked to model a simplified application and design relation schemas accordingly. These relations are then populated with data prepared by either the instructors or the students. For example, in a Web-based multi-media database project, the instructor provided rather specific instructions for designing and populating the database (Holliday et al., 2009). At the other extreme, in another project, students were required to propose and model their own fictitious applications and populate the designed database in Oracle with their own data (Tuttle, 2002). In either case, the data tend to be relatively clean, well-defined, structured, small, single-sourced, artificial and simplified.

By contrast, data sources in real-world applications can be dirty, ambiguous, poorly structured, complicated and voluminous (Zhang et al., 2003). Increasingly, companies use a diverse collection of data sources to support their wide range of old and new applications. Unlike traditional clean, internal data, these data sources may be created and provided by external entities, and they may not be designed for a given application of a specific company. In fact, with the advances of Web services, the data producers may not know who will use the data or in what ways. They may design the data format to be generic enough to accommodate the basic shared needs of a large set of clients. Thus, a given data consumer may need to cleanse, disambiguate, filter and format the data before storing them in its own database, in order to satisfy its application requirements and formats.
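The four preparation steps named above (cleansing, disambiguating, filtering and formatting) can be illustrated with a minimal Python sketch. It is not taken from the article: the feed contents, the field names (name, state, price), the "N/A" placeholder and the helper functions clean_record and cleanse are all hypothetical, chosen only to show what a data consumer might do to an externally supplied feed before loading it into its own database.

```python
import csv
import io

# Hypothetical raw feed from an external provider; the field names and
# values are invented for illustration and do not come from the article.
RAW_FEED = """name,state,price
 Acme Corp ,ca,"1,200.50"
ACME CORP,CA,1200.50
Widget Co,N/A,300
,NY,45
"""


def clean_record(row):
    """Return a normalized record, or None if the row is unusable."""
    # Cleanse: collapse stray whitespace and normalize capitalization.
    name = " ".join(row["name"].split()).title()
    state = row["state"].strip().upper()
    # Filter: a record without a name cannot be used (assumed rule).
    if not name:
        return None
    # Disambiguate: treat placeholder strings as a true missing value.
    if state in {"", "N/A"}:
        state = None
    # Format: remove thousands separators so the price parses as a number.
    price = float(row["price"].replace(",", ""))
    return {"name": name, "state": state, "price": price}


def cleanse(raw_text):
    """Cleanse a CSV feed and drop records that are duplicates after normalization."""
    seen = set()
    result = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        record = clean_record(row)
        if record is None:
            continue
        key = (record["name"], record["state"], record["price"])
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


if __name__ == "__main__":
    for record in cleanse(RAW_FEED):
        print(record)
```

In this sketch the two spellings of the same company collapse into one record only after normalization, which is why duplicate detection runs last. A real feed would likely need fuzzier matching and error handling for unparseable values.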

To highlight the importance of data cleansing and elaborate on existing approaches for improving data quality, Hellerstein (2008) identified four sources of error in databases: data entry, measurement, data distillation and data integration. These errors occur frequently. As a result, a significant amount of effort is usually spent on data preparation in data-centric projects. In a survey of 187 data mining projects, 64% of respondents indicated that they spent more than 60% of their time on data preparation and cleaning (KDSurvey, 2003). Zhang, Zhang and Yang indicated that "in practice, it has been generally found that data cleaning and preparation takes approximately 80% of the total data engineering effort" (Zhang et al., 2003, p. 375). Likewise, data cleansing is mentioned in MSIS 2000 and MSIS 2006, the recent model curricula and guidelines for graduate degree programs in information systems (Gorgone et al., 2000; Gorgone et al., 2006). It is also an integral part of the well-known Extraction, Transformation, and Loading (ETL) process that is frequently covered in data warehousing courses (Rahm and Do, 2000).

Highly simplified, clean and well-defined common-domain 'toy' datasets have the advantage of being easy to teach, use and learn in database courses. They provide much value to the student learning process (Wagner et al., 2003) and are thus staples of popular database textbooks (for example, Hoffer et al., 2008; Elmasri et al., 2010) and of project assignments in database courses. However, they do not prepare IS students well for the complexity of real-world data. Supplementary materials and project assignments on cleansing and preparing realistic data sources can therefore be highly beneficial.

Yet, data cleansing and preparation have neither been well discussed in database textbooks nor well reported in technical papers on database education. …
