Academic journal article Measurement and Evaluation in Counseling and Development

Characterizing Measurement Error in Scores across Studies: Some Recommendations for Conducting "Reliability Generalization" Studies

Article excerpt

T. Vacha-Haase (1998) proposed her "reliability generalization" methodology to characterize (a) typical score reliability for a measure across studies, (b) the variability of score reliabilities, and (c) what measurement protocol features predict the variability in score reliabilities across administrations. The present article provides recommendations on how to conduct these studies.

**********

As noted by Gronlund and Linn (1990), "Reliability refers to the results obtained with an evaluation instrument and not to the instrument itself. Thus, it is more appropriate to speak of the reliability of 'test scores' or the 'measurement' than of the 'test' or the 'instrument'" (p. 78). This view is echoed in Standards 2.1 and 2.2 of the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA/APA/NCME, 1999) testing standards. Similar sentiments were expressed by the APA Task Force on Statistical Inference (Wilkinson & APA Task Force on Statistical Inference, 1999), which recommended that authors "provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric" (p. 596).

Part of the logic for reporting score reliability in all studies relates directly to the Task Force mandate to also report effect sizes because "interpreting the size of observed effects requires an assessment of the reliability of the scores" (Wilkinson & APA Task Force on Statistical Inference, 1999, p. 596). As Reinhardt (1996) observed,

reliability is critical in detecting effects in substantive research. For example, if a dependent variable is measured such that the scores are perfectly unreliable, the effect size in the study will unavoidably be zero, and the results will not be statistically significant at any sample size, including an incredibly large one. (p. 3)

Accordingly, appropriate interpretation of observed effect sizes should involve an examination of the reliability of the obtained scores.
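The attenuating effect of unreliability that Reinhardt describes follows from classical test theory: the observed correlation equals the true-score correlation shrunk by the square root of the product of the two variables' score reliabilities. A minimal sketch of that relation (the function names are illustrative, not from the article):

```python
from math import sqrt

def attenuated_r(true_r, rxx, ryy):
    # Classical test theory: observed r = true r * sqrt(rxx * ryy),
    # so measurement error in either variable shrinks the observed effect.
    return true_r * sqrt(rxx * ryy)

def disattenuated_r(observed_r, rxx, ryy):
    # Spearman's correction for attenuation: an estimate of the
    # true-score correlation given the observed r and the reliabilities.
    return observed_r / sqrt(rxx * ryy)

# Perfectly unreliable scores (rxx = 0) force the observed effect to zero,
# matching Reinhardt's point about statistical significance at any n.
print(attenuated_r(0.50, 0.0, 0.90))              # 0.0
print(round(attenuated_r(0.50, 0.80, 0.70), 3))   # 0.374
```

Note that even respectable reliabilities of .80 and .70 shrink a true-score correlation of .50 to an observed value near .37, which is why effect-size interpretation requires knowing the reliability of the scores in hand.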

Unfortunately, as Henson, Kogan, and Vacha-Haase (2001) noted, "the incorrect but common phraseology concerning the 'reliability of the test' leads many to incorrectly assume that reliability inures to tests rather than scores" (p. 407). This misperception fails to honor the fact that an instrument may yield scores with varied reliabilities on repeated administrations (cf. Crocker & Algina, 1986; Dawis, 1987; Gronlund & Linn, 1990; Pedhazur & Schmelkin, 1991; Thompson, 1994; Vacha-Haase, 1998). Thompson and Vacha-Haase (2000) discussed the "etiology of [this] endemic misspeaking about reliability" and noted the dangers of using "the phrase 'the reliability of the test' as a telegraphic shorthand in place of truthful but longer statements (e.g., 'the reliability of the test scores')" (p. 178).

Rather than present the reliability of obtained scores, researchers at times report coefficients from test manuals or previous studies as relevant for their data. However, as Pedhazur and Schmelkin (1991) noted, "such information may be useful for comparative purposes, but it is imperative to recognize that the relevant reliability estimate is the one obtained for the sample used in the study under consideration" (p. 86). Vacha-Haase, Kogan, and Thompson (2000) called the use of prior coefficients for present data "reliability induction," because this decision reflects a generalization from a specific instance (e.g., a test manual coefficient) to a more general state of affairs (e.g., future scores from the test).

This process may be legitimate when the reliability induction occurs for a sample that is comparable with the inducted sample in terms of both composition and variability. As Crocker and Algina (1986) observed,

reliability is a property of the scores on a test for a particular group of examinees. …
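Pedhazur and Schmelkin's point that the relevant estimate is the one computed for the sample in hand is straightforward to act on: for a multi-item measure, an internal-consistency coefficient such as Cronbach's alpha can be computed directly from the item scores being analyzed. A sketch in plain Python (the function name and toy data are illustrative assumptions, not from the article):

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    # item_scores: one inner list of k item scores per examinee.
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores);
    # population variances are used here, since the n's cancel in the ratio.
    k = len(item_scores[0])
    item_vars = [pvariance([person[i] for person in item_scores])
                 for i in range(k)]
    total_var = pvariance([sum(person) for person in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Alpha is a property of these particular scores: a different sample of
# examinees would generally yield a different coefficient for the same test.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0 (items perfectly consistent)
```

Computing the coefficient from one's own data, rather than inducting a manual value, is exactly the practice the quoted passage recommends.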
