Academic journal article Composition Studies

Immodest Witnesses: Reliability and Writing Assessment

Article excerpt

For many writing program administrators, the assessment concept reliability presents considerable challenges. We may shy away from what seems to be a dauntingly complex psychometric concept requiring sophisticated statistical analyses. We may worry that achieving consistency in scoring requires us to simplify what we assess, thereby narrowing our construct of writing. We may fear that the norming traditionally employed to ensure consistent scoring inhibits meaningful, engaged reader responses to student writing. Certainly as a writing program administrator I have had these concerns. And my response has been, I think, typical: I have held my nose and calculated and reported reader agreement rates.
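To make the arithmetic behind those agreement rates concrete, here is a minimal sketch, assuming two readers scoring the same set of essays on a holistic scale: it computes simple percent agreement and Cohen's kappa (agreement corrected for chance). The scores are hypothetical, and the code illustrates common practice rather than any particular program's procedure.

```python
# Illustrative sketch: summarizing paired reader scores as
# percent agreement and Cohen's kappa. All scores are hypothetical.
from collections import Counter

def percent_agreement(scores_a, scores_b):
    """Proportion of essays on which the two readers assign the same score."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def cohens_kappa(scores_a, scores_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(scores_a)
    observed = percent_agreement(scores_a, scores_b)
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two readers on a 1-6 holistic scale
reader_1 = [4, 3, 5, 2, 4, 6, 3, 4]
reader_2 = [4, 3, 4, 2, 5, 6, 3, 4]
print(percent_agreement(reader_1, reader_2))  # 0.75
print(round(cohens_kappa(reader_1, reader_2), 2))  # ~0.67
```

Programs that report "adjacent agreement" typically count scores within one point of each other as matches; the same sketch can be adapted by loosening the equality test accordingly.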

We may think of such a response as acceding to a stringent psychometric expectation, but this narrow insistence on inter-rater reliability is likely to confound psychometricians. The psychometric concept of reliability, even in its most traditional form, is far broader than inter-rater agreement. It also includes intra-rater agreement: the extent to which a single rater scores consistently. And it considers instrument reliability, which measures internal consistency and consistency among parallel forms of assessments.1 Ironically, then, we in our writing programs tend to focus on a narrower approach to reliability than we may feel has been imposed on us.
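For readers less familiar with instrument reliability, a brief illustrative sketch may help. Cronbach's alpha is one standard index of internal consistency; the rubric traits and student scores below are hypothetical, and the computation is a generic one rather than anything prescribed here.

```python
# Illustrative sketch: Cronbach's alpha as an index of internal consistency
# for a multi-part instrument (e.g., rubric traits summed into a total score).

def cronbachs_alpha(item_scores):
    """item_scores: list of items, each a list of scores across test-takers."""
    k = len(item_scores)          # number of items (rubric traits)
    n = len(item_scores[0])       # number of test-takers

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    item_variances = sum(variance(item) for item in item_scores)
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    return (k / (k - 1)) * (1 - item_variances / variance(totals))

# Hypothetical trait scores (rows = rubric traits, columns = students)
traits = [
    [4, 3, 5, 2, 4],   # focus
    [4, 3, 4, 2, 5],   # development
    [3, 3, 5, 2, 4],   # style
]
print(round(cronbachs_alpha(traits), 2))  # ~0.93
```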

Moreover, the general acceptance in writing programs of inter-rater reliability tethers us to a classical psychometric tradition in which reliability is defined as consistency. Although teachers and scholars in our field tend to view writing and reading as rhetorical acts, many of our programs operate on assessment concepts and practices that demand highly controlled, arhetorical approaches to reading and writing. Most writing programs I have taught in and visited use some variant of the holistic scoring model developed by the Educational Testing Service (ETS) in the 1970s: design a rubric; train scorers to apply it to student work using "anchor" papers (or portfolios) and practice sessions; double score some or all of the student artifacts; reconcile discrepant scores; calibrate as necessary (see Neal; White; Wilson). The goal, though often framed in terms of fairness, is really consistency: scorers should arrive at the same score, irrespective of their own reading preferences and habits (Huot 88). This concession to reliability-as-consistency may be a gesture of self-preservation: certainly I have felt that my writing programs could not afford to be perceived as unreliable. But if we believe that writing and reading are fundamentally rhetorical acts, we should ask if arhetorical reliability-as-consistency comes at too high a price.
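The double-scoring and reconciliation step in that model is easy to sketch. The one-point adjacency threshold and the third-reader rule below are common conventions offered only as an illustration; they are not drawn from this article.

```python
# Illustrative sketch: each artifact is read twice; adjacent scores are
# averaged, and wider splits go to a third, adjudicating reader.
# The threshold and names are hypothetical conventions.

def reconcile(score_1, score_2, adjudicate, max_gap=1):
    """Return a final score; send discrepant pairs to a third reader."""
    if abs(score_1 - score_2) <= max_gap:
        return (score_1 + score_2) / 2
    return adjudicate()  # third reader resolves the discrepancy

# Hypothetical use: a third reader assigns 4 to a 2-versus-5 split
print(reconcile(2, 5, adjudicate=lambda: 4))  # 4
```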

Fortunately, as I will show, reliability-as-consistency is not our only option. Reliability is a rich and multivalent concept within and across various disciplinary discourses, and it need not be jettisoned or grudgingly accepted. Indeed, an interdisciplinary inquiry into theories of reliability begins to yield a theory of rhetorical reliability that is consonant with widely held beliefs in composition studies about the nature of reading and writing. This theory, I believe, can reinvigorate our assessment work and allow us to frame (to borrow Peggy O'Neill's term) reliability in ways that articulate and advance rhetorical understandings of writing and writing assessment.

Although writing program administrators and writing assessment scholars need to think beyond inter-rater reliability, I take this concept as my focus in this article both because it has mesmerized writing assessment in composition studies (see Broad, What; Elliot; Huot; O'Neill; White) and because even this narrow but important concept requires more careful theoretical investigation than we have devoted to it. In our programs and in our writing assessment literature, we tend to think of inter-rater reliability as a scoring problem: How can we achieve sufficiently high agreement rates? …
