Evaluating the Technical Quality of Educational Tests Used for High-Stakes Decisions


Tests are used to make decisions about students, including decisions about promotion, in-grade retention, and high school graduation. This article provides criteria for evaluating the technical quality of tests used for high-stakes decisions in education. The intent is to provide educators with information for evaluating their testing programs.

In the past decade, there has been an increase in the use of tests in Pre K--12 education; these tests are used to assist in making high-stakes decisions that have important and long-lasting effects on students. For example, tests are used to identify students who need remedial work, to inform in-grade retention decisions, and to determine eligibility for high school graduation. In some districts, tests are used to identify students who are ready to begin their education (kindergarten readiness) as well as those who are ready to exit the system with a high school diploma (high school graduation tests). Some educators attribute this interest in testing to the educational reform--accountability movement (Linn, 1998). It seems that this interest in high-stakes testing will not decrease in the near future, as evidenced by the central role of testing in the Bush Education Plan.

The focus of this article is on large-scale Pre K--12 educational tests that are used to make high-stakes educational decisions about students. Teachers in classroom settings administer many tests, some formal and some informal. In fact, Stiggins (1991) estimated that teachers spend more than 30% of their time in some form of assessment of their students. Although much of what I present in this article could (and probably should) apply to teacher-developed classroom assessments, my intent is to focus on technical quality indicators for tests that have major consequences for students in their progress through the Pre K--12 educational system.


Large-scale, high-stakes assessments are designed to serve a variety of purposes, including acting as a barometer of efforts to improve education through the educational reform movement. This movement is directed, in part, toward holding schools responsible for delivering demonstrable student performance. On the basis of accountability models, schools have been rated and ranked on their ability to deliver the educational product they are accountable for achieving: student performance on prearticulated content standards. These test results can be useful in instructional planning, especially if the timing of test administration and score reporting yields meaningful and useful information for instructional planners, including teachers, parents, and students. Knowing what students know and are able to do as a function of instruction is important both for improving instruction and for providing appropriate educational opportunities and experiences for students. These tests can (and should) be useful for these and other appropriate educational purposes. Test results can be invaluable in making informed decisions about student learning, instructional programs, and school performance.

However, for the results of these tests to support sound and appropriate decisions, the technical quality of the tests must be adequate for these purposes. In the rush to meet educational reform and accountability goals, some policy makers, and others responsible for selecting tests for high-stakes testing programs, have neglected the issue of test quality. There is good reason to be concerned about the validity of decisions made about students, instructional programs, and schools when those decisions are based on ineffective, unfair, or inappropriate tests. Although it is possible to make good decisions on the basis of sound assessments, bad decisions are almost guaranteed when they are based on poorly constructed, inappropriate, technically weak assessments. …