Reliability Generalization: An HLM Approach

Article excerpt

Hierarchical data structures have been identified in educational and psychological measurement, and statistical approaches have been developed to partition score variances across the multiple levels of the data hierarchy. On the basis of classical test theory and the current HLM literature, a reliability index is constructed for unconditional and conditional models to facilitate generalization of reliability estimation across various test settings.

**********

Reliability is an important index in educational and psychological measurement. According to a joint committee of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1985), "Reliability refers to the degree to which test scores are free from errors of measurement" (p. 19). Since a decrease in measurement error is often associated with an increase in measurement consistency across circumstances, "reliability generalization may provide an important tool for characterizing score quality" (Vacha-Haase, 1998, p. 16). The purpose of this study is to discuss conditional and unconditional hierarchical models that account for measurement errors at different levels, which are essential to the generalization of reliability. Since test score reliability depends on many conditions of the test and the examinees, empirical factors need to be introduced at multiple levels to describe these conditions and to facilitate generalization of reliability estimation across different settings.

Literature Review

Classical test theory represents one of the cornerstones of educational and psychological measurement (Lord & Novick, 1968; Pedhazur & Schmelkin, 1991). Pedhazur and Schmelkin (1991) recounted: "Since it was proposed by Spearman (1904), the true-score model, or what has come to be known as classical test theory, has been the dominant theory guiding estimation of reliability" (p. 83). Specifically, Novick, Jackson, and Thayer (1971) elaborated,

 
   In the classical test theory model, the observed score $x$ on a person is
   taken to have expectation $\tau$, the true score for that person. The error
   score is defined by $e = x - \tau$. The corresponding random variables
   defined over persons are related by the equation

   (1.1) $X = T + E$

   with $\mathcal{E}(E \mid \tau) = 0$. (p. 261)

Regarding the computation of reliability, Novick, Jackson, and Thayer (1971) added,

 
   The reliability (intraclass correlation) of a test is defined as

   (1.3) $\rho^2_{XT} = \sigma^2_T / \sigma^2_X = \sigma^2_T / (\sigma^2_T + \sigma^2_E) = \rho_{XX'}$

   where $X$ and $X'$ are parallel measurements. (p. 261)
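As a purely illustrative application of equation (1.3), suppose the hypothetical variance components are $\sigma^2_T = 80$ and $\sigma^2_E = 20$ (values chosen only to show the arithmetic, not drawn from the article). The reliability would then be

   $\rho^2_{XT} = 80 / (80 + 20) = .80$,

that is, 80% of the observed score variance would be attributable to true score variance.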

In a test containing multiple items, a student's response to each item can be treated as an indicator of the true score; thus, the responses to a set of items comprise multiple indicators of individual performance. The hierarchical data structure arises from the fact that item responses are nested within each student. In addition, factors at the student level can be employed to reflect different test conditions, such as differences in student demographics and past experiences, as well as the instructional coverage of the test content. Hence, consideration of these multilevel factors is essential to a proper generalization of reliability assessment across various learning and testing environments.
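To make the nesting concrete, the following is a minimal sketch, not the article's own computation, of how the variance components of an unconditional two-level model (item responses nested within students) might be estimated and then combined into the intraclass correlation of equation (1.3). The simulated data, the column names student and score, the sample sizes, and the use of the statsmodels mixed-model routine are all illustrative assumptions.

   # Hypothetical illustration: estimate the variance components of an
   # unconditional two-level model (item responses nested within students)
   # and form the intraclass correlation of equation (1.3).
   import numpy as np
   import pandas as pd
   import statsmodels.formula.api as smf

   rng = np.random.default_rng(0)

   # Simulated long-format data: 200 students by 10 items (hypothetical sizes).
   n_students, n_items = 200, 10
   true_scores = rng.normal(50, 9, n_students)        # between-student SD of ~9
   errors = rng.normal(0, 6, (n_students, n_items))   # within-student SD of ~6
   data = pd.DataFrame({
       "student": np.repeat(np.arange(n_students), n_items),
       "score": (true_scores[:, None] + errors).ravel(),
   })

   # Unconditional (intercept-only) two-level model: score_ij = gamma_00 + u_j + e_ij
   result = smf.mixedlm("score ~ 1", data, groups=data["student"]).fit()

   var_true = float(result.cov_re.iloc[0, 0])   # between-student (true score) variance
   var_error = float(result.scale)              # within-student (error) variance

   # Equation (1.3): reliability as an intraclass correlation.
   reliability = var_true / (var_true + var_error)
   print(f"sigma^2_T = {var_true:.2f}, sigma^2_E = {var_error:.2f}, "
         f"rho = {reliability:.3f}")

With the same components, the reliability of a student's mean across $J$ items would take the form $\sigma^2_T / (\sigma^2_T + \sigma^2_E / J)$, a standard result in the HLM literature and the direction in which the unconditional and conditional formulations proceed.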

Vacha-Haase (1998) searched the PsycINFO database for articles published from 1984 to July 1997 and conducted a meta-analysis on issues of reliability generalization. She noted,

 
   Of the articles reviewed for the present study, 65.76% made absolutely no 
   reference to reliability. At the other extreme, authors of only 13.06% of 
   the articles reported reliability coefficients for the data analyzed in the 
   respective studies. …