A Reply to Loerke, Jones, and Chow (1999) on the "Psychometric Benefits" of Linked Items
Pope, Gregory A., Harley, Dwight D., Journal of Instructional Psychology
The article by Loerke, Jones, and Chow (1999), "Psychometric benefits of soft-linked scoring algorithms in achievement testing," contains many methodological and psychometric errors. These include (a) ignoring basic assumptions of classical test theory and item response theory, and (b) incorrectly using and interpreting item and test statistics such as point-biserial correlation coefficients and KR-20 reliability coefficients. It is essential that education professionals be made aware of these problems and appreciate that the findings of the study cannot be used as "evidence" that linked items have psychometric benefits.
The article "Psychometric benefits of soft-linked scoring algorithms in achievement testing" by Loerke, Jones, and Chow (1999) has some fundamental problems that education professionals should be aware of. The main problem is that the authors do not take into consideration basic assumptions of classical test theory when concluding that any type of "linked item" has psychometric benefits. The authors used item and test statistics inappropriately, and then made inaccurate claims that they had demonstrated empirically the "reliability superiority" of "soft-linked items" over "hard-linked items". We believe it is important to point out the problems with the article so that readers do not draw the wrong conclusions about so-called "linked items".
Linked items are groups of multiple-choice items that are related to each other on a test (e.g., "Use your answer from question 20 to answer question 21"). The authors distinguish between "soft" and "hard" linked items, a distinction that is less important than the basic psychometric assumptions that linked items violate. A significant problem with the Loerke, Jones, and Chow (1999) article is that its results are obvious without any analysis being conducted. Having two or more soft-linked items on a test means that the related items are more likely to be scored as correct, because an incorrect response to the first question does not mean the other questions will be scored as incorrect. Additionally, selection of the keyed response to the second, third, or fourth item in the link chain will also result in that item being scored as correct. It is therefore expected that the average of the point-biserial correlation coefficients for the soft-linked items will be higher than that for the hard-linked items, because more items in the soft-linked group will correlate positively with the total test score. This result is common sense and does not need to be empirically tested.
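The scoring difference described above can be sketched in a few lines of Python. This is a minimal illustration under assumed rules, not the algorithm used by Loerke, Jones, and Chow (1999): the function names, the `follows_from_previous` flags, and the response data are all hypothetical.

```python
def score_hard(responses, key):
    """Hard-linked scoring: an item is correct only if it matches the keyed response."""
    return [int(r == k) for r, k in zip(responses, key)]


def score_soft(responses, key, follows_from_previous):
    """Soft-linked scoring (assumed rule): an item is correct if it matches the key
    OR if it follows correctly from the examinee's own answer to the previous item."""
    return [
        int(r == k or carried)
        for r, k, carried in zip(responses, key, follows_from_previous)
    ]


# Hypothetical three-item link chain for one examinee.
key = ["B", "C", "A"]
responses = ["A", "D", "A"]
# Assumed flags: item 2 is wrong against the key but consistent with the
# examinee's (incorrect) answer to item 1; item 1 has no predecessor.
follows_from_previous = [False, True, False]

hard_scores = score_hard(responses, key)
soft_scores = score_soft(responses, key, follows_from_previous)
print(hard_scores, soft_scores)
```

Because the soft rule only adds ways of being scored correct, an examinee's soft-linked score on a chain can never be lower than the hard-linked score on the same responses.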
The authors also misinterpret the function of a point-biserial correlation, drawing conclusions about item and test reliability rather than discrimination. Point-biserial correlation coefficients are discrimination indices (i.e., they indicate how well an item discriminates between examinees of different ability levels), not measures of reliability. As a result, statements made in the paper about the "reliability superiority" of soft-linked items over hard-linked items are completely unfounded. As for the treatment of reliability itself, the authors did not calculate the KR-20 reliability coefficients appropriately, given that the linked items would need to be treated as testlets.
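To illustrate the distinction between the two statistics, the following Python sketch computes a point-biserial correlation for one item and a reliability coefficient for the whole test, first at the item level and then with two linked items combined into a single testlet. The data are invented for illustration, the helper names are hypothetical, and coefficient alpha (which equals KR-20 when every unit is scored 0/1) is used here as one common way to handle testlet scores; it is an assumption, not a procedure prescribed in the original article.

```python
from statistics import mean, pvariance


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item and total scores.
    This is a discrimination index, not a reliability estimate.
    Note: the total here includes the item itself (uncorrected)."""
    mi, mt = mean(item), mean(totals)
    cov = mean([(x - mi) * (t - mt) for x, t in zip(item, totals)])
    return cov / (pvariance(item) ** 0.5 * pvariance(totals) ** 0.5)


def coefficient_alpha(units):
    """Coefficient alpha over scoring units (one list of scores per unit).
    Reduces to KR-20 when every unit is dichotomous."""
    k = len(units)
    totals = [sum(row) for row in zip(*units)]
    unit_variance = sum(pvariance(u) for u in units)
    return k / (k - 1) * (1 - unit_variance / pvariance(totals))


# Hypothetical 0/1 scores for six examinees on four items.
item1 = [1, 1, 1, 0, 0, 0]
item2 = [1, 1, 0, 1, 0, 0]  # first item of a linked pair
item3 = [1, 1, 0, 1, 0, 0]  # second item, fully dependent on item2
item4 = [1, 0, 1, 0, 1, 0]

items = [item1, item2, item3, item4]
totals = [sum(row) for row in zip(*items)]

r_pb = point_biserial(item1, totals)

# Item-level coefficient treats the linked pair as two independent items.
alpha_items = coefficient_alpha(items)

# Testlet-level coefficient scores the linked pair as one unit (0, 1, or 2).
testlet = [a + b for a, b in zip(item2, item3)]
alpha_testlets = coefficient_alpha([item1, testlet, item4])

print(round(r_pb, 3), round(alpha_items, 3), round(alpha_testlets, 3))
```

In this invented data set the item-level coefficient is higher than the testlet-level coefficient, because the dependence between the linked items is counted as if it were consistency across independent measurements.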
Having outlined some of the specific problems with the Loerke, Jones, and Chow (1999) paper, we now turn to its theoretical problems. The basis of classical test theory is that an observed score equals a true score plus error (O = T + E). One of the fundamental assumptions of classical test theory is that the error terms of the test items are uncorrelated (Nunnally & Bernstein, 1994). This is achieved by making the items locally independent of one another (i.e., every item on a test must be an independent observed measurement of ability). …
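For reference, the assumptions named above can be written compactly. The item subscripts i and j, the response variables U, and the ability parameter θ are notation introduced here for illustration; only O = T + E appears in the text.

```latex
% Classical test theory decomposition for item i
O_i = T_i + E_i

% Uncorrelated errors across distinct items
\mathrm{Cov}(E_i, E_j) = 0 \quad \text{for } i \neq j

% Local independence: given ability, responses to distinct items are independent
P(U_i = u_i,\, U_j = u_j \mid \theta) = P(U_i = u_i \mid \theta)\, P(U_j = u_j \mid \theta)
```

An instruction such as "Use your answer from question 20 to answer question 21" makes the response to one item depend on the response to another even at a fixed ability level, which is the situation the third expression rules out.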