Academic journal article Informing Science: the International Journal of an Emerging Transdiscipline

Data Quality in Linear Regression Models: Effect of Errors in Test Data and Errors in Training Data on Predictive Accuracy

Introduction

There is strong evidence (e.g., Laudon, 1986; Morey, 1982; Redman, 1992, 1995, 1996) that data stored in organizational databases have a significant rate of errors. The effect of data errors on the outputs of computer-based models has been investigated by a number of researchers (e.g., Ballou and Pazer, 1985; Ballou et al., 1987; Bansal et al., 1993). This investigation builds on this prior research by examining the effect of data quality on linear regression models. A financial application of a linear regression model is used to examine this question.

Data errors may affect the predictive accuracy of linear regression models in two ways. First, the training data used to build the model may contain errors. Second, even if the training data are free of errors, a user may input test data containing errors once the model is used for forecasting.

In general, when claims about the predictive accuracy of linear regression models are made, it is assumed that data used to train the models and data input to make predictions are free of errors. In this study we relax this assumption by asking two questions: (1) What is the effect of errors in test data on predictions made using linear regression models? and (2) What is the effect of errors in training data on predictions made using linear regression models? The first question is focused on the effect of data errors when the model is used for forecasting. The second question is focused on the effect of data errors during model construction.
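The two questions above can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used in this study: the synthetic data, the 10% error rate, the 20% error magnitude, and the helper names (`fit_ols`, `predict`, `inject_errors`, `mae`) are all assumptions chosen for illustration. It fits an ordinary least-squares model, then compares mean absolute error when errors are injected into the test inputs only (question 1) and into the training inputs only (question 2).

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least-squares fit with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new inputs."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def inject_errors(X, rate, magnitude, rng):
    """Return a copy of X in which a random fraction `rate` of the entries
    is scaled up or down by the relative amount `magnitude`."""
    X_err = X.copy()
    mask = rng.random(X.shape) < rate
    signs = rng.choice([-1.0, 1.0], size=X.shape)
    X_err[mask] *= 1.0 + signs[mask] * magnitude
    return X_err

def mae(y_true, y_pred):
    """Mean absolute error of the predictions."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic data (an assumption; not the financial data used in the study).
rng = np.random.default_rng(0)
n_train, n_test = 200, 100
X = rng.uniform(1.0, 10.0, size=(n_train + n_test, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=n_train + n_test)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Baseline: error-free training data and error-free test data.
clean_model = fit_ols(X_train, y_train)
baseline = mae(y_test, predict(clean_model, X_test))

# Question 1: clean training data, errors in the test inputs.
X_test_err = inject_errors(X_test, rate=0.10, magnitude=0.20, rng=rng)
mae_test_errors = mae(y_test, predict(clean_model, X_test_err))

# Question 2: errors in the training inputs, clean test data.
X_train_err = inject_errors(X_train, rate=0.10, magnitude=0.20, rng=rng)
noisy_model = fit_ols(X_train_err, y_train)
mae_train_errors = mae(y_test, predict(noisy_model, X_test))

print(f"baseline MAE:              {baseline:.3f}")
print(f"MAE, errors in test data:  {mae_test_errors:.3f}")
print(f"MAE, errors in train data: {mae_train_errors:.3f}")
```

Varying `rate` and `magnitude` in such a sketch shows how each source of error degrades predictions; the specific numbers it prints depend entirely on the assumed data and should not be read as results of this study.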

An understanding of the effect of data errors on linear regression models is particularly important because the availability of inexpensive software packages for personal computers makes it feasible for end users to develop and use such models. Researchers have argued that end-user computing has increased the potential for data errors in computer applications (Boockholdt, 1989). As end users develop applications, it is possible that fewer data validation methods, such as logic tests and control totals, will be in place, and it is likely that less rigorous testing will occur before applications are used in production (Corman, 1988; Davis, 1984; Davis et al., 1983; Panko, 1998).

The remaining sections of this paper present (1) a review of relevant prior research on data quality, (2) a brief explanation of linear regression models, (3) a description of the linear regression models constructed in the study, (4) a discussion of the methodology of the two experiments, (5) the results of the two experiments, and (6) conclusions.

Background

Data quality is generally recognized as a multidimensional concept (Wand and Wang, 1996; Wang and Strong, 1996). While no single definition of data quality has been accepted by researchers working in this area, there is agreement that data accuracy, currency, completeness, and consistency are important areas of concern (Agmon and Ahituv, 1987; Ballou and Pazer, 1985; Davis and Olson, 1985; Fox et al., 1993; Huh et al., 1990; Madnick and Wang, 1992; Wand and Wang, 1996; Wang and Strong, 1996; Zmud, 1978). This investigation adopts the conceptualization of data quality proposed by Ballou and Pazer (1985) that includes four dimensions: accuracy, timeliness, completeness, and consistency. This study is primarily concerned with data accuracy, defined as conformity between a recorded data value and the corresponding actual data value.
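The definition of data accuracy given above can be made concrete with a short sketch. The function name `item_error_rate` and the sample values are hypothetical, introduced only for illustration; the sketch counts the fraction of recorded data items that do not conform to their actual values.

```python
def item_error_rate(recorded, actual):
    """Fraction of recorded items that differ from the corresponding actual items."""
    if len(recorded) != len(actual):
        raise ValueError("recorded and actual must have the same length")
    if not recorded:
        return 0.0
    mismatches = sum(r != a for r, a in zip(recorded, actual))
    return mismatches / len(recorded)

# Hypothetical values: the second recorded item has two digits transposed.
recorded = [100, 250, 75, 310, 42]
actual = [100, 205, 75, 310, 42]

rate = item_error_rate(recorded, actual)
print(f"item error rate: {rate:.0%}")  # one of five items is inaccurate
```

This treats accuracy as an exact match per item; for continuous quantities, a tolerance-based comparison might be a more suitable assumption.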

Prior research has found that organizational databases are generally not free of errors (e.g., Laudon, 1986; Morey, 1982; Redman, 1992, 1995). Between one and twenty percent of data items in critical organizational databases are estimated to be inaccurate (Laudon, 1986; Madnick and Wang, 1992; Morey, 1982; Redman, 1992).

Data quality problems have been found to affect the accuracy and timeliness of economic data published by the United States government (Hershey, 1995; Morgenstern, 1963). …
