On the Importance of Using Balanced Booklet Designs in PISA

The effect of using a balanced rather than an unbalanced booklet design on major PISA results was examined. The responses of 39,573 students who participated in the PISA-E 2006 assessment in Germany were re-analyzed. Using an unbalanced booklet design instead of the original booklet design led to an increase in mean reading performance of about six points on the PISA scale and altered the gender gap in reading to different degrees in the 16 federal states of Germany. For students with an immigration background, reading performance was significantly higher for the unbalanced design than for the original design. The relationship between self-reported effort while taking the test and reading performance was stronger for the unbalanced design than for the original design. The results underline the importance of using a balanced booklet design in PISA in order to avoid or minimize bias in population parameter estimates.

Key words: booklet design, testing, large-scale assessment, item response theory, Programme for International Student Assessment


Large-scale assessments (LSAs) of student achievement aim to measure what populations of students know and can do in specified content areas. In LSAs, large samples of students are assessed. Analyses of the observed responses make it possible to draw valid conclusions about the achievement levels in the underlying population of students. Many countries or federal states within countries run national LSAs. In the United States of America, for example, the National Assessment of Educational Progress (NAEP; e.g., Jones & Olkin, 2004) has been conducted since 1969. The attainment of the German national educational standards is also assessed with LSA methodology (e.g., Stanat, Pant, Böhme, & Richter, 2012). In addition to the national initiatives, several international LSAs have been initiated. One of the first was the Pilot Twelve-Country Study (Foshay, Thorndike, Hotyat, Pidgeon, & Walker, 1962), which was conducted in 1960. Some of the best known current international LSAs are the Programme for International Student Assessment (PISA; e.g., OECD, 2010), the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2009), and the Progress in International Reading Literacy Study (PIRLS; e.g., Mullis, Martin, Kennedy, Trong, & Sainsbury, 2009). The importance of national and international LSAs has increased steadily over the last decades. Today, these studies represent a core aspect of many educational systems around the globe.

Typically, LSAs strive to obtain reliable and valid information about student achievement in one or several content domains of interest. From a measurement point of view, the aspects of student achievement focussed on by LSAs are generally conceptualized as complex constructs. Often, differentiations in terms of subdimensions and/or facets are made. These subdimensions and/or facets represent aspects like cognitive processes, content areas, situations, or other systematizations of the respective content domain (e.g., OECD, 2009a, for PISA 2009). Due to the complexity of the constructs at stake, large sets of items are required for their adequate operationalization. As an example, in PISA, about 150 to 200 items are employed to measure students' literacy in reading, mathematics, and science in each assessment. Such large numbers of items cannot be presented as a whole to each student within a realistic amount of testing time. Therefore, in LSAs, each student is generally randomly given one of several test forms called booklets. Each booklet contains a subset of the complete item pool that can be sensibly answered by a student within a reasonable testing time.
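The basic idea of distributing an item pool over booklets and handing each student one booklet at random can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the procedure used in PISA: the pool size of 12 items, the four clusters, the cyclic rotation of two clusters per booklet, and the function names `build_booklets` and `assign_booklets` are all hypothetical choices made for illustration.

```python
import random


def build_booklets(item_ids, n_clusters, clusters_per_booklet):
    """Split an item pool into clusters and rotate them cyclically into booklets.

    Hypothetical helper: the cyclic rotation is an illustrative choice only.
    """
    # Deal the items round-robin into clusters of (nearly) equal size.
    clusters = [item_ids[i::n_clusters] for i in range(n_clusters)]
    booklets = []
    for b in range(n_clusters):
        # Booklet b holds clusters b, b+1, ... (wrapping around at the end).
        chosen = [clusters[(b + k) % n_clusters] for k in range(clusters_per_booklet)]
        booklets.append([item for cluster in chosen for item in cluster])
    return booklets


def assign_booklets(student_ids, n_booklets, seed=0):
    """Randomly give each student the index of one booklet (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {s: rng.randrange(n_booklets) for s in student_ids}


# Assumed toy pool of 12 items; real assessments use far larger pools.
items = [f"item{i:03d}" for i in range(1, 13)]
booklets = build_booklets(items, n_clusters=4, clusters_per_booklet=2)
assignment = assign_booklets(range(100), n_booklets=len(booklets))
```

In this toy layout, every booklet contains only half of the pool, yet all items are administered to some students, so the full pool is covered across the sample even though no single student sees it.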

The way the items are assigned to the booklets is specified by a booklet design (Frey, Hartig, & Rupp, 2009; Gonzales & Rutkowski, 2010; Yousfi & Böhme, 2012). …

