"MAKE-OR-BREAK EXAMS GROW, BUT BIG Study Doubts Value" intoned a front-page New York Times headline in December 2002. The article continued, "Rigorous testing that decides whether students graduate, teachers win bonuses, and schools are shuttered does little to improve achievement and may actually worsen academic performance and dropout rates, according to the largest study ever on the issue:" Thus a deeply flawed study was catapulted to national prominence. More important, its conclusions were opposite those found through rigorous scientific studies.
The report in question, authored by Arizona State University researchers Audrey Amrein and David Berliner, purported to examine student-performance trends on national exams in states where legislators have attached "high stakes" to test scores. High-stakes testing has become a lightning rod as more and more states adopt accountability measures in response to the mandates of the federal No Child Left Behind Act. While it is crucial to analyze and debate the wisdom of such policies, the discussion must be informed by evidence of the highest quality. The controversial nature of high-stakes testing has led to the hurried release and dissemination of research that lacks scientific rigor, of which the Amrein and Berliner study is one of the more egregious examples.
This says much about the standards for research in education today. The situation is so contentious that in 2000 the National Research Council found it necessary to convene a panel to decide which scientific principles should apply to educational research--the kind of question that other fields of social science settled long ago. In the case at hand, Amrein and Berliner trumpet the fact that their report was reviewed by a panel of four scholars based at other schools of education, yet this should only be a source of greater concern. Sharing a paper with sympathetic colleagues is no substitute for a system of blind peer review--a bedrock principle of scientific research.
Here we closely examine Amrein and Berliner's underlying data and methodology. Our results are astonishing: applying basic statistical techniques to their data reverses nearly every one of their conclusions. Later we also present the results of separate research on accountability that we conducted for a June 2002 Federal Reserve Bank of Boston conference. Rigorous analysis reveals that accountability policies have had a positive impact on test scores during the past decade.
The Unscientific Method
Amrein and Berliner identified 28 states where test scores are used to determine various consequences, such as bonuses for teachers, the promotion of students, or allowing children to transfer out of a failing school. These stakes go beyond less controversial accountability measures such as publishing test scores in the newspaper. The states range from Georgia and Minnesota--where the only penalty is experienced by students who fail a high-school graduation exam--to North Carolina and Texas, where the authors found a total of six stakes each, stakes that affect both schools and students.
Once Amrein and Berliner identified the high-stakes states, they looked at changes in the average scores students earned on the National Assessment of Educational Progress (NAEP). Choosing this test as a basis for considering the impact of high-stakes tests on students in the 4th and 8th grades (ages 9 and 13, respectively) is a sensible idea, because the validity and reliability of NAEP, often called the "nation's report card," are well accepted. It is a test for which students cannot easily be prepped and, since the performance of individual school districts, schools, or students is not reported, there is little incentive to cheat or even to prepare for the test. It also provides a neutral standard for assessing the effects of state policies. But if the Arizona State team's decision to look at NAEP scores was correct, less can be said for their other analytical choices. …