Computerized testing is rapidly becoming a practical option in admissions, licensure and certification, educational placement, and guidance settings. This testing method offers many practical advantages to examinees, such as more frequent administrations, instantaneous scoring and reporting, and greater test reliability (Bennett, Steffen, Singley, & Morley, 1997). In addition, computerized testing allows quick and easy item analyses, which in educational practice can serve to improve and guide instruction, evaluate items, and improve the properties of the overall test (Hsu & Yu, 1989). However, arguably the main benefit of computerized testing is one that often goes unsaid: cost. In contexts where rater training is extensive and the number of examinees to score is large, the use of expert raters may be extremely expensive, if not impossible (Clauser, Margolis, Clyman, & Ross, 1997).
Yet computerized testing still shares a problem traditionally associated with paper-and-pencil testing: limitations in the use of item formats. Although multiple-choice (MC) items are the most frequently used format in computerized testing, they still carry the same "baggage" of criticisms that they receive in paper-and-pencil situations. The main problem with MC items is their tendency to test mainly at the knowledge level of the cognitive domain (Putnam, 1992). In addition, MC items can be quite susceptible to guessing and testwiseness strategies (Towns & Robinson, 1993; Rogers & Yang, 1996). Braswell and Kupin (1993) also note that although MC items are reliable indicators of examinee ability and achievement, complementary formats can provide a more appropriate target for instruction.
In response to these criticisms of MC items, many test constructors have been using the power of computerized testing to create new types of "hybrid" computer-scorable MC items that were not previously possible in the traditional paper-and-pencil setting. Many open-ended response items have now been incorporated into computer-administered tests because they can be scored automatically by computer algorithms. Some of the best examples are the "mathematical-expression" item used by the Educational Testing Service on the Graduate Record Examinations (Bennett et al., 1997), "computerized long-menu" items (Schuwirth, Van der Vleuten, Stoffers, & Peperkamp, 1996), and "automated scoring algorithms for complex assessments" under development by the National Board of Medical Examiners (Clauser et al., 1997). In addition, positive results have been reported for computer scoring of essays (Page & Petersen, 1995), assessment of architectural problem solving (Bejar, 1991), computer programming (Braun, Bennett, Frye, & Soloway, 1990), and hypothesis formation (Kaplan & Bennett, 1994). However, to date, most of these items are still prototypes under development.
The Student Evaluation Branch of Alberta Education has responded to the criticisms of MC items by developing a number of computer-scorable item formats. The first stage of this development was the numerical-response (NR) item format, which requires the examinee to produce a numeric answer and record it in a numeric field rather than simply choosing the correct answer from a list of alternatives. The response is then scanned and computer scored by comparing the value recorded by the student to the computer's key. Because examinees are required to recall rather than recognize the correct answer, the biases due to testwiseness are drastically reduced (Towns & Robinson, 1993). Furthermore, the chance of guessing the correct answer is essentially zero, so the format effectively reduces two of the main concerns associated with MC items. Currently, this format is being used for a wide range of question types, with calculation-based, sequencing, and matching items being the most common. …
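The key-comparison step described above for NR items can be illustrated with a minimal Python sketch. This is not the Student Evaluation Branch's scoring procedure, whose details are not given here. The function name `score_nr_item`, the dichotomous 0/1 scoring, and the optional `tolerance` parameter are all assumptions introduced for illustration.

```python
from decimal import Decimal, InvalidOperation


def score_nr_item(recorded: str, key: str, tolerance: str = "0") -> int:
    """Score one numerical-response item dichotomously (hypothetical helper).

    `recorded` is the text captured from the examinee's numeric field,
    `key` is the keyed answer, and `tolerance` is an assumed allowance
    around the key (exact match by default).
    """
    try:
        value = Decimal(recorded.strip())
    except InvalidOperation:
        # Blank or non-numeric fields cannot match the key.
        return 0
    if not value.is_finite():
        # Reject special values such as NaN or Infinity.
        return 0
    # Decimal arithmetic avoids binary floating-point rounding surprises
    # when comparing values such as 12.4 and 12.5.
    return 1 if abs(value - Decimal(key)) <= Decimal(tolerance) else 0


# Usage: one examinee's recorded responses scored against the keys.
keys = ["12.5", "340", "0.07"]
responses = ["12.50", "34", ""]
total = sum(score_nr_item(r, k) for r, k in zip(responses, keys))
```

Because the examinee must produce the value rather than select it, a blank or malformed field simply scores zero in this sketch; how an operational program treats such responses is a policy decision not specified in the text.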