Academic journal article Journal of the Indian Academy of Applied Psychology

Evaluation of Inter-Rater Agreement and Inter-Rater Reliability for Observational Data: An Overview of Concepts and Methods

Article excerpt

(ProQuest: ... denotes formulae omitted.)

Although practitioners, researchers and policymakers often use the terms IRA and IRR interchangeably, there is a technical distinction between agreement and reliability (LeBreton & Senter, 2008; de Vat, Terwee, Tinsley & Weiss, 2000). In general, IRR is a generic term for rater consistency: it relates to the extent to which raters can consistently distinguish different items on a measurement scale. Some measurement experts, however, define it as the consistency between evaluators regardless of the absolute value of each evaluator's rating. In contrast, IRA measures the extent to which different raters assign the same precise value to each item being observed. In other words, IRA is the degree to which two or more evaluators using the same scale assign the same rating to an identical observable situation. Thus, unlike IRR, IRA measures consistency in the absolute values of the evaluators' ratings. The distinction between IRR and IRA is further illustrated in the hypothetical example in Table 1 (Tinsley & Weiss, 2000).

In Table 1, the agreement measure shows how frequently two or more evaluators assign exactly the same rating (e.g., if both give a rating of "4" they are in agreement), whereas the reliability measure reflects the relative similarity between two or more sets of ratings. Therefore, two evaluators who have little to no agreement could still have high IRR. In this scenario, Raters 1 and 2 agree on the relative performance of the four teachers because the ratings each assigned increase monotonically, with Teacher A receiving the lowest score and Teacher D receiving the highest score. However, although they agreed on the relative ranking of the four teachers, they never agreed on the absolute level of performance. As a consequence, the level of IRR between Raters 1 and 2 is perfect (1.0), but there is no agreement (0.0). By contrast, Raters 3 and 4 agree on both the absolute level and the relative order of teacher performance. Thus, they have both perfect IRR (1.0) and perfect IRA (1.0).
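The contrast described above can be reproduced with a minimal sketch. The ratings below are hypothetical values chosen to match the pattern described in the text; they are not the entries of Table 1, which is not reproduced in this excerpt. The sketch also assumes the Pearson correlation as the consistency (IRR) index and the proportion of exact matches as the agreement (IRA) index; the function names `pearson_r` and `exact_agreement` are illustrative only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def exact_agreement(x, y):
    """Proportion of items on which both raters give the identical rating."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# Hypothetical ratings for Teachers A, B, C, D (assumed, not taken from Table 1).
rater_1 = [1, 2, 3, 4]
rater_2 = [2, 3, 4, 5]  # same ordering as rater_1, but always one point higher
rater_3 = [1, 2, 3, 4]
rater_4 = [1, 2, 3, 4]  # identical to rater_3

irr_12 = pearson_r(rater_1, rater_2)
ira_12 = exact_agreement(rater_1, rater_2)
irr_34 = pearson_r(rater_3, rater_4)
ira_34 = exact_agreement(rater_3, rater_4)

print("Raters 1 and 2: consistency =", irr_12, "agreement =", ira_12)
print("Raters 3 and 4: consistency =", irr_34, "agreement =", ira_34)
```

Under these assumed ratings, Raters 1 and 2 show perfect consistency with zero exact agreement, while Raters 3 and 4 show both, which mirrors the pattern the paragraph describes.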

Another way to think about the distinction is that IRA is based on a "criterion-referenced" interpretation of the rating scale: there is some level or standard of performance that counts as good or poor. IRR, on the other hand, is based on a "norm-referenced" view: the order of the ratings with respect to the mean or median defines good or poor, rather than the rating itself. Typically, IRA is more important in high-stakes decisions about performance and planning, whereas IRR is more frequently used in research studies where the only interest is the consistency of raters' judgments about the relative levels of performance (Gwet, 2012).

Measurement of key indices

The following methods are commonly used to calculate IRR/IRA indices:


If a measurement procedure consistently assigns the same score to individuals or objects with equal values, the instrument is considered reliable. In other words, the reliability of a measure indicates the extent to which it is without bias and hence ensures consistent measurement across time and across the various items in the instrument. It is an indication of the stability (or repeatability) and consistency (or homogeneity) with which the instrument measures the concept, and it helps to assess the "goodness" of a measure (Shekharan & Bougie, 2010; Zikmund, 2003).

... (1)

Percent Agreement

The percentage of absolute agreement is the simplest to understand (Altman, 1991). One simply calculates the number of times raters agree on a rating, then divides by the total number of ratings. Thus, this measure can vary between 0 and 100%. Other names for this measure include percentage of exact agreement and percentage of specific agreement. It may also be useful to calculate the percentage of times the ratings fall within one performance level of one another (e.g., count as agreement cases in which rater one gives Teacher-A 4 points and rater two gives Teacher-A 5 points). …
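The calculation described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not a formula taken from the article: the function name `percent_agreement`, its `tolerance` parameter, and the sample ratings are all hypothetical. Setting `tolerance=0` counts only exact matches, while `tolerance=1` also counts ratings that fall within one performance level of each other.

```python
def percent_agreement(ratings_a, ratings_b, tolerance=0):
    """Percentage of items on which two raters agree.

    tolerance=0 counts only identical ratings (exact agreement);
    tolerance=1 also counts ratings that differ by one level.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must rate the same number of items.")
    if not ratings_a:
        raise ValueError("At least one rated item is required.")
    agreements = sum(
        1 for a, b in zip(ratings_a, ratings_b) if abs(a - b) <= tolerance
    )
    return 100.0 * agreements / len(ratings_a)


# Hypothetical ratings of five teachers by two raters (assumed data).
rater_one = [4, 3, 5, 2, 4]
rater_two = [5, 3, 4, 2, 1]

exact_pct = percent_agreement(rater_one, rater_two)
adjacent_pct = percent_agreement(rater_one, rater_two, tolerance=1)

print("Exact agreement (%):", exact_pct)
print("Agreement within one level (%):", adjacent_pct)
```

With these assumed ratings, the two raters match exactly on two of five teachers and fall within one level on four of five, so the within-one-level figure is noticeably higher than the exact figure.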
