Psychological Test and Assessment Modeling

Modeling Test Context Effects in Longitudinal Achievement Data: Examining Position Effects in the Longitudinal German PISA 2012 Assessment

During the last decades, longitudinal studies have become increasingly popular in the psychological and educational sciences and have been adopted in many recent large-scale studies of student achievement, such as NAEP, NEPS, and PISA (Ramm et al., 2006). Such studies place high demands on the psychometric quality of the tests that are administered repeatedly in order to derive achievement scores that can be compared across time. However, the IRT models routinely employed in large-scale assessments assume that the probability of observing a correct response depends only on the item characteristics (i.e., item parameters) and the students' proficiencies. As a consequence, the influence of the context in which items are presented to the students is neglected, although Brennan (1992), for example, listed a number of contextual characteristics that are likely to affect students' test scores. Well-known examples of test context effects (TCEs) include position effects (PEs; Leary & Dorans, 1985) and effects of domain order (DOEs; Harris, 1991). PEs refer to the phenomenon that items become more difficult the later they are presented in a test. DOEs apply to tests consisting of items from different domains, such as mathematics, science, and reading; they manifest themselves in changes in item responses in reaction to the sequence of domains that precedes a specific item or section of the test. Other examples of TCEs include difficulty effects caused by the sequencing of items or sections (e.g., easy-to-hard versus hard-to-easy orders), effects of testing time, and effects of the ordering of response options. A complete list of all possible kinds of TCEs is difficult or even impossible to derive. Nevertheless, there is broad agreement in current research that PEs are the most prevalent type of TCE and affect almost all school achievement tests (Leary & Dorans, 1985).
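As a point of reference, the standard Rasch model underlying many large-scale assessments, together with one common way of extending it by a person-specific position effect (in the spirit of Debeer et al., 2014), can be sketched as follows. The notation is illustrative and is not taken from the article itself:

\[
P(X_{pi} = 1 \mid \theta_p) \;=\; \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad
\operatorname{logit} P(X_{pi} = 1) \;=\; \theta_p - \beta_i + \gamma_p \, \operatorname{pos}(i),
\]

where \(\theta_p\) is the proficiency of person \(p\), \(\beta_i\) the difficulty of item \(i\), \(\operatorname{pos}(i)\) the (centered) position of item \(i\) in the booklet, and \(\gamma_p\) a person-specific position effect. The standard model corresponds to the restriction \(\gamma_p = 0\) for all persons; a negative average \(\gamma_p\) captures items becoming effectively more difficult toward the end of the test.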

TCEs can be regarded as a threat to the validity of inferences about changes in proficiency levels for two reasons. First, in many longitudinal studies, the booklet design is changed across assessments, so that the test scores derived on the different measurement occasions are affected by TCEs to different degrees. In this scenario, changes in the test design are the sole reason for biased change estimates. As a consequence, group differences in proficiency gains should not be sensitive to TCEs, because all individuals are equally affected by the changes in the assessment design. Second, TCEs can be conceived of as individuals' reactions to the features of the test form provided (e.g., Debeer, Buchholz, Hartig, & Janssen, 2014), so that the strength of these reactions can differ across time even when the test design remains unchanged. In this scenario, the reason for changes in TCEs is located on the person side, so that group differences in proficiency gains are biased when the size and/or pattern of TCEs changes across time in a group-specific way. Of course, in real applications, changes in TCEs can be due to both reasons (i.e., changes in the assessment design and changes in individuals' reactions to the test).
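The bias mechanism can be made explicit with a small worked decomposition; the error terms below are introduced here purely for illustration and do not appear in the article:

\[
\hat{\theta}_{pt} = \theta_{pt} + e_{pt}
\quad\Longrightarrow\quad
\hat{\Delta}_p = \hat{\theta}_{p2} - \hat{\theta}_{p1}
= (\theta_{p2} - \theta_{p1}) + (e_{p2} - e_{p1}),
\]

where \(e_{pt}\) denotes the TCE-induced distortion of person \(p\)'s score at occasion \(t\). The estimated change is thus biased by \(e_{p2} - e_{p1}\). If this difference is identical for all individuals (e.g., purely design-driven, as in the first scenario), mean change is biased but group differences in gains are not; if it varies across groups (person-driven TCEs, as in the second scenario), group comparisons of gains are biased as well.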

In the present article, we investigate TCEs in the German longitudinal extension of the PISA 2012 assessment. This study involved a second assessment of a subsample of the 9th graders who participated in the PISA 2012 study and who were retested in the 10th grade. We propose an IRT model that makes it possible to assess TCEs operating at the level of item clusters, which are the main building blocks of the test designs employed in large-scale assessments of student achievement such as PISA. To this end, we specify booklet effects (i.e., effects of test forms) that are assumed to vary between item clusters. The model furthermore allows these effects to vary between measurement occasions and between student groups (e.g., school types). These analyses shed light on many highly relevant questions. …
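One way such a cluster-level booklet-effect model could be written, purely as a sketch of the structure described above and not as the authors' exact specification, is:

\[
\operatorname{logit} P(X_{pit} = 1) \;=\; \theta_{pt} - \beta_i + \delta_{b(p,t),\, c(i)}^{(g(p),\, t)},
\]

where \(\theta_{pt}\) is the proficiency of person \(p\) at occasion \(t\), \(\beta_i\) the difficulty of item \(i\), \(c(i)\) the cluster containing item \(i\), \(b(p,t)\) the booklet administered to person \(p\) at occasion \(t\), and \(\delta\) a booklet effect specific to that cluster. The superscripts indicate that \(\delta\) may additionally be allowed to vary across student groups \(g(p)\) and occasions \(t\); constraining the \(\delta\) terms to be equal across occasions or groups yields testable restrictions on whether TCEs change over time or differ between groups.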
