Does the Gender of Examiners Influence Their Marking?
Jackie Greatorex and John Bell, Research in Education
Three awarding bodies in England - the Assessment and Qualifications Alliance (AQA), London Qualifications, and Oxford, Cambridge and RSA Examinations (OCR) - administer the General Certificate of Secondary Education (GCSE) examination and A levels. GCSE and A level are national assessments normally taken in a series of subjects by sixteen-year-olds and eighteen-year-olds respectively.
Within the field of educational assessment there is a large literature on sex bias, for example Spear (1984) and Gipps and Murphy (1994). Studies have investigated the sex bias of whole tests or of individual questions and/or their associated mark schemes. The concern is that particular groups, whether gender, ethnic or other types of group, gain lower marks than others. Professional judgement is needed to determine when a disparity amounts to a bias. A major issue for many years was that men gained higher marks than women on the high-stakes Scholastic Aptitude Test (SAT), which is used to determine college admissions and scholarship awards in the United States (see Lynn and Mau, 2001). Items might be biased because of the emphasis placed upon particular skills or the context in which a problem is set. For example, Dwyer (1976, in Murphy, 1978) found that the mathematics section of the SAT was biased towards males owing to the inclusion of more geometry than algebra problems. Graf and Ridell (1972) found that the same mathematical problem set in a male-friendly and in a female-friendly context resulted in different levels of achievement by the two sexes. Wood (1978) found that girls did better on GCE O-level¹ examination questions about females or female-stereotyped contexts, e.g. a girl's ordeal at a dinner dance, while boys performed better on questions about the Crimean War and/or a man looking back on a boyhood spent near a railway line. Boaler (1994), however, found that it was the realism of the context as well as the extent of the sex stereotyping in the examination question that affected girls' performance. The literature cited above tends to refer to individual questions, but Stobart et al. (1992) show evidence that in GCSE and O levels different types of assessment and different subjects differentially affect the achievement of the sexes: girls tend to outperform boys in all subjects except the sciences, and girls tend to do well on course work whereas boys tend to do well on objective (multiple-choice) tests. They warn that 'Equal outcomes should not therefore be contrived by manipulation of assessment techniques' (Stobart et al., 1992, p. 261).
Other studies have focused upon another form of bias. Gipps (1994) explains that bias can occur when the score given by an examiner is consciously or unconsciously affected by factors other than the candidate's achievement, e.g. sex, ethnic origin, school or handwriting. In the context of UK higher education, Newstead and Dennis (1990) examined inter-rater reliability for blind and non-blind grading and found no sex bias (i.e. no favouritism towards students of one sex) in the grading of undergraduate students. Baird (1996) investigated sex bias in the marking of Chemistry and English Literature A-level examinations, using a blind marking approach, and found that marks were not affected by the sex of the examinee. For 'live' GCE and GCSE examinations, however, blind marking would be a considerable logistical challenge.
Examiners can also award marks for answers that illustrate skills, knowledge and/or values irrelevant to the test but valued by the markers. Even in a blind marking scenario the sex of the examinee might be inferred from the candidate's handwriting (girls' is perceived to be neater) and stereotypes can come into play. The Scottish Examining Board (1992) investigated marker practices in English and History using scripts that varied in the achievement of the centre (school or college) and in the handwriting, gender and ethnic origin of the candidate. The only significant effect found was that typed scripts gained lower marks than handwritten scripts. Green et al. (2003) found a similar effect: typed scripts (of eleven-year-olds) were marked down compared with their handwritten counterparts. In Baird's (1996) study candidates' marks were affected neither by the sex of the examinee nor by the style of their handwriting. Some of the results of one of the most recent studies of examiner bias are published in Francis et al. (2003). The research project involved fifty History and fifty Psychology lecturers (from higher education) marking anonymised essays, one by a male student and one by a female. The lecturers were not moderated, so that their personal criteria for awarding high marks/classifications could emerge; indeed, there was only about 50 per cent agreement between the respondents on the appropriate classification for the essays. There was a slight tendency for female lecturers to prioritise writing ability, the use of evidence and wide reading, whereas males were more concerned about a 'slapdash approach', lack of preparation and lack of critical analysis (Francis et al., 2002). The research also included interviews with the lecturers, from which it emerged that they thought there were differences between the genders in terms of ability and academic essay writing. It is not clear from this research whether lecturers of either or both sexes preferred male or female styles of essay writing.² The literature indicates that the more information an individual has about someone else the less stereotypes come into play (Delia, 1972), so if the assessor knows the student any sex bias may be reduced (Bradley, 1984).
Additionally, there is the issue of whether examiners' characteristics affect their marking. In the testing of English as a second language, experienced examiners have sometimes been found to be more lenient than inexperienced examiners (Ruth and Murphy, 1988; Weigle, 1998). There is also evidence that the severity and leniency of marking by examiners of the same status in UK examinations may vary (e.g. Newton, 1996). Francis et al. (2002), in the study described above, found that marking severity did not vary greatly according to the gender of the examiner. But examiner behaviour has been found to vary between groups, e.g. according to professional background, subject specialism and gender (Hamp-Lyons, 1990; Vann et al., 1991), presumably because each group has a unique frame of reference.
Gender schema theory holds that sex is biologically determined and dichotomous, whereas gender is socially constructed and continuous. Sex-typed (masculine or feminine) individuals tend to view the world from a gendered perspective and to identify themselves as a stereotypical male or female, depending upon the general stereotype of an ideal male or female in their culture (Bem, 1987). Morgan (2002) argues that within sociological work masculinity and femininity generally appear as an assumed extension of the sex of those studied, and there is little work which attempts to separate masculinity and femininity from sex. The educational assessment literature also sometimes confounds sex and gender, as sex often appears to be used as a proxy for gender. Even studies of examiner bias whose arguments are based on gender being socially constructed go on to use the biological sex of the examiner as a proxy for gender, e.g. Francis et al. (2002). In other contexts, however, it has been found that the masculinity/femininity of the rater and the task performed can affect the rater's judgements or ratings - see, for example, Murray (1976).
The first issue researched here is whether male and female examiners (of GCSE English, History and Food Technologies papers examined in the summer 2001 session) respond differently to the performance and answers of candidates of different sexes. The second is whether masculine and/or feminine examiners respond differently to the performance and answers of candidates of different sexes. In this article both the biological sex and the masculinity and/or femininity of the examiners will be considered. The masculinity and/or femininity of the examiners was measured with the Bem Sex Role Inventory (BSRI). Respondents are categorised as:
1 Masculine (high on the masculine scale and low on the feminine scale).
2 Feminine (low on the masculine scale and high on the feminine scale).
3 Androgynous (high on the masculine and feminine scales).
4 Undifferentiated (low on the masculine and feminine scales).
Each scale constitutes a list of traits to which the participants respond with an indication of how well each trait describes them. Some of the traits are not part of the masculine and feminine scales - they are 'fillers'.
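As a concrete illustration of this scoring procedure, the sketch below shows, in Python, how a median-split classification of the kind Bem describes might be implemented. The trait subsets, the 1-7 response scale and the sample medians are illustrative assumptions, not the published instrument or the exact procedure used in this study.

```python
# A minimal, hypothetical sketch of BSRI-style scoring and median-split
# classification. Trait subsets and responses are illustrative only.
from statistics import mean, median

MASCULINE_ITEMS = ["assertive", "independent", "forceful"]  # illustrative subset
FEMININE_ITEMS = ["warm", "gentle", "compassionate"]        # illustrative subset
# Filler items (e.g. "truthful", "happy") are answered but never scored.

def scale_scores(responses):
    """Mean 1-7 self-rating on each scale; filler items are ignored."""
    masc = mean(responses[item] for item in MASCULINE_ITEMS)
    fem = mean(responses[item] for item in FEMININE_ITEMS)
    return masc, fem

def classify(masc, fem, masc_median, fem_median):
    """Assign one of the four categories listed above by median split."""
    high_m, high_f = masc > masc_median, fem > fem_median
    if high_m and high_f:
        return "androgynous"
    if high_m:
        return "masculine"
    if high_f:
        return "feminine"
    return "undifferentiated"

# The medians are taken over the whole sample of respondents, e.g.:
# masc_median = median(scale_scores(r)[0] for r in all_responses)
```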
The Bem Sex Role Inventory was developed in the 1970s (Bem, 1974, 1979), and a reasonable question is whether its use is still valid. Auster and Ohm (2000) found that the masculinity and femininity scales retain some validity. The Inventory is a valid way of capturing sex typing as a series of masculine and feminine traits, but it is limited as a proxy for gender, since it measures only gendered personality traits (Blanchard-Fields et al., 1994), and personality traits are not the whole story. For instance, West and Zimmerman (1987) argue that gender is something people do in their social interaction rather than a set of traits residing in individuals; from this perspective gender is fundamentally about social interaction and relationships. There may be other components which constitute gender - for example, attitudes, stereotypes, behaviour, social relations, abilities and interests (Ashmore, 1990). The aim of this study is to investigate any relationship that may exist between examiners' self-perception of masculinity and femininity and their marking of male and female examinees.
There were four samples of examiners and associated samples of candidates' work marked by those examiners. The first three samples constituted the unit (examination paper) level marks awarded by examiners to the candidates in three GCSE subjects (English, Food Technologies and History). English was used because there was a fairly even number of male and female examiners, History because there were more male than female examiners and Food Technologies because there were no male examiners but candidates of both sexes. The fourth was a sample of the item (question) level marks awarded by some examiners to candidates in an English examination.
The BSRI was mailed to all examiners who had marked the GCSEs in the summer 2001 session. These examiners were told that the individual results of the BSRI would remain confidential to the Research and Evaluation Division (RED) of University of Cambridge Local Examinations Syndicate (UCLES) and that they would be reported as a group. The sex distribution of examiners who marked each paper and returned the questionnaire is given in Table 1. The response rate was 70 per cent for English, 65 per cent for Food Technologies and 61 per cent for History. Awarding bodies store a variety of operational data, including the marks achieved at the unit level by each candidate in each qualification and the candidates' sex. These data were matched with the examiners for each paper to undertake the analysis described below.
Awarding bodies do not store operational information about the item-level marks gained by GCSE candidates as a matter of course (although, for instance, scores for multiple-choice questions might be stored). These data therefore had to be collected from the scripts themselves. When the item-level data were collected from the English scripts (in the autumn term of 2001) many scripts were unavailable for research purposes, as they were being used for operational purposes, which take priority; 133 of 142 examiners were sampled. Every seventh script from each examiner's marking was taken (scripts are stored in examiner, centre and then candidate number order), and this was repeated twenty times for each examiner; where an examiner had only a small number of scripts, every fourth script was taken instead, and examiners with too few scripts even for this procedure were excluded from the sample. The item-level data were keyed into a database. The shortage of available scripts meant that the sample for an examiner was often taken from a single centre, although some examiners had an allocation of only one centre in any case. Experience from other examination research studies suggests that this method of sampling does not introduce systematic biases. The sample of examiners in Table 1 was not necessarily unbiased, however, because of non-response. Judged by the mean and standard deviation of the marks awarded, both samples of English examiners were representative of the examiners who marked English in the summer 2001 session.
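Our reading of this sampling rule can be summarised in a short sketch (Python); the function name, the twenty-script requirement and the exact fall-back behaviour are our interpretation of the paragraph above, not an operational specification.

```python
# Hypothetical restatement of the script-sampling rule described above.
def sample_scripts(scripts, n_required=20):
    """scripts: one examiner's allocation, in examiner/centre/candidate order.
    Take every 7th script, twenty times; fall back to every 4th script for
    small allocations; return None (examiner excluded) if still too few."""
    for step in (7, 4):
        sampled = scripts[step - 1::step]
        if len(sampled) >= n_required:
            return sampled[:n_required]
    return None
```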
A multi-level model was used to explore the scores awarded by the examiners at the unit and item level in relation to (1) the sex of the examiner and the examinees, (2) the sex-role orientation of the examiner and the sex of the examinees. It should be noted that all marks analysed in this article are the raw marks awarded by the examiners and not the final marks which have been subjected to quality control procedures.
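To indicate the kind of model involved, here is a minimal sketch using Python's statsmodels as a stand-in for the authors' multilevel modelling software; the data file and the column names ('mark', 'examiner_sex', 'candidate_sex', 'examiner_id') are assumptions for illustration only.

```python
# Sketch of a two-level model of raw marks: candidates (level 1) nested
# within examiners/centres (level 2). Data file and column names are
# hypothetical; statsmodels stands in for the authors' software.
import pandas as pd
import statsmodels.formula.api as smf

scores = pd.read_csv("unit_level_marks.csv")  # hypothetical file

# Fixed effects: examiner sex, candidate sex and their interaction.
# Random intercept: one per examiner, capturing examiner/centre variation.
model = smf.mixedlm(
    "mark ~ examiner_sex * candidate_sex",
    data=scores,
    groups=scores["examiner_id"],
)
result = model.fit()
print(result.summary())  # fixed effects plus the two variance components
```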
An analysis of the masculine and feminine scores derived from the BSRI revealed one significant difference (Greatorex and Bell, 2002): the female History examiners had a lower average score on the femininity index than the female examiners in the other two subjects. This explains why the BSRI classified so few History examiners, male or female, as feminine (Table 2).
Item-level data were analysed using a multi-level model. There were two limitations in the data set. First, although the candidates were offered a choice between questions 3 and 4, almost all candidates attempted question 3 (and no one in the sample of item-level data attempted question 4), which simplified the modelling process. Second, in almost all cases all the scripts marked by a given examiner were from the same centre, so it was possible to fit only a two-level model: examiner/centre and candidate.
In Table 3 the results of four models are presented. The first model is the simplest, using no explanatory variables. On average, candidates obtained 11.3 marks on each question. For 95 per cent of centres/examiners the marks per question varied by plus or minus three marks. (Note that because of the allocation of examiners to centres and the allocation of scripts it is not possible to obtain separate variance component estimates for examiners and centres. However, centres vary in the quality of their candidates, so the size of the variance is not unexpected.) The individual candidates vary by plus or minus six marks after allowing for centre/examiner variation. In the second model parameters for the individual questions are fitted. The parameter for question 1 is set to 0 and for question 2 the parameter estimate indicates that it is 0.9 marks more difficult than question 1. Question 3 is 0.5 marks more difficult than question 1.
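To see where the 'plus or minus' figures come from: assuming normally distributed random effects, a 95 per cent coverage interval is ±1.96 standard deviations at the relevant level, so the quoted figures back-translate roughly as follows (our reconstruction, not values reported in Table 3):

\[
\pm 1.96\,\sigma_u \approx \pm 3 \;\Rightarrow\; \sigma_u \approx 1.5,\; \sigma_u^{2} \approx 2.3 \quad \text{(centre/examiner level)},
\]
\[
\pm 1.96\,\sigma_e \approx \pm 6 \;\Rightarrow\; \sigma_e \approx 3.1,\; \sigma_e^{2} \approx 9.4 \quad \text{(candidate level)}.
\]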
In model III the examiner effect was added and all parameter estimates are significant, except that there was no examiner sex difference. In model IV an interaction between candidates' biological sex and question 2 was added. The parameter estimate indicates that male candidates tended to obtain 0.5 of a mark less on question 2, 'Explain how each writer, through his content and use of language, engages the interest of the reader.' (The question paper was accompanied by two articles about tornadoes which the candidates were instructed to read.) Models using the masculinity and femininity scores were also fitted in place of the sex of the examiner, and there were no significant results.
A similar modelling process was carried out with the total mark for the question paper (the unit level). The results for English at the unit level are presented in Table 4. Examiner sex, masculinity and femininity were not significant. This is similar to the results of the item-level analysis but is based on a larger sample of 104 examiners (the item-level analysis was based on sixty-eight examiners). In this case the status of the examiner was significant: Team Leaders tended to be relatively more generous than Assistant Examiners. A 95 per cent confidence interval for the examiner-level variation is ±1.6 marks, and examiner variation accounts for 4 per cent of the total variation. Issues relating to sex, masculinity, femininity and seniority are explored in more detail in Greatorex and Bell (2002).
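The two examiner-level figures are mutually consistent, as a little back-calculation shows (again our illustration under a normality assumption, not numbers reported in Table 4):

\[
\sigma_u \approx \frac{1.6}{1.96} \approx 0.82, \qquad \sigma_u^{2} \approx 0.67, \qquad
\frac{\sigma_u^{2}}{\sigma_u^{2} + \sigma_e^{2}} \approx 0.04 \;\Rightarrow\; \sigma_u^{2} + \sigma_e^{2} \approx 17,
\]

so the candidate-level variance is roughly 16, a standard deviation of about four marks, consistent with candidates' achievement being by far the larger source of variation.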
The same analysis was carried out for the History and Food Technologies GCSE marks at the unit level. The results are not presented, as there were no significant findings.
For Food Technologies, History and English further models were used to explore the interaction between (1) examiner's sex and candidate's sex, (2) examiner's sex-role orientation and candidate's sex, but they are not given here, as there were no interaction effects and no significant differences.
The results given above for the item-level and unit-level data were based upon an analysis of the raw marks gained by the candidates before quality-control procedures such as scaling were applied. The results should be interpreted with this caveat in mind.
Although some significant sex differences were found, care is needed in interpreting them because of the effects of tiering (the English paper was a higher-tier paper and the Food Technologies paper a lower-tier one). Overall, the results suggest that, as a general rule, the sex and sex-role orientation of examiners, and interactions between candidate's sex and examiner's sex, do not affect the marks that candidates gain at the unit level. Moody (1999) argues that examining is a male-dominated culture, but the results of this study indicate that this has not resulted in bias against girls or boys in the marking. Given the literature and the quality procedures of awarding bodies, this is a positive but not unexpected finding.
However, there were two significant differences found in English:
1 There was a small sex bias on one item.
2 The status of the examiners was a significant factor: the more senior examiners (Team Leaders) were more generous than the Assistant Examiners.
The difference made to candidates' marks was very small, so the bias on this item is unlikely to have affected candidates' grades. It is possible that the difference in achievement arises because question 2 is more girl-friendly, as to answer it candidates might put themselves in the place of the author and/or of different audiences. (Question 2 was accompanied by the two articles about tornadoes which candidates were instructed to read.) Dr Madsen Pirie, President of the Adam Smith Institute, claims that GCSE examination questions are more girl-friendly than those of the GCSE's predecessor, the O-level examinations (Pirie, 2001). The example given by Pirie is that a GCSE question may require empathy, e.g. 'How might you have felt as a child growing up in Nazi Germany?', whereas an O-level question demanding fact and understanding would have asked candidates to outline the main arguments presented in the 1689 Bill of Rights and the Act of Settlement of 1701 and the effect they may have had on Catholics. Given that different questions can be girl/boy-friendly and lead to inequality in achievement at the item level, such questions should arguably be avoided unless they are an integral part of the subject and removing them would reduce the validity of the examination (Wood, 1991). Additionally, it can be argued that boy/girl-friendly questions perpetuate gender stereotypes. On the other hand, Kiwan et al. (1999) found that candidates used schemas (familiar scenarios stored in memory) to answer and understand examination questions, especially in a stressed situation, so arguably showing males and females in non-stereotypical roles makes questions less accessible.
The finding that Team Leaders are more generous markers than the Assistant Examiners suggests that Team Leaders are more confident about giving candidates the benefit of the doubt. This finding might be explained in terms of the experience of the examiners, since, on average, Team Leaders are more experienced than Assistant Examiners. This accords with the research findings of Ruth and Murphy (1988) and Weigle (1998), cited earlier.
In conclusion, sex and gender bias is something which should be monitored in GCSE marking but is unlikely to be found to an extent that affects grades. Question papers should continue to be scrutinised for girl/boy-friendly questions, which should arguably be avoided. Differences in the severity and leniency of marking seem to be due to factors other than the examiner's sex and sex-role orientation and/or the candidates' sex. Indeed, the greatest source of variance was the candidates' achievement, which is as it should be.
For many years there has been concern about whether sex bias exists in various assessments. A literature review reveals little if any sex bias in UK national assessments at GCSE. Oxford, Cambridge and RSA examiners, for three case study subjects, completed the Bem Sex Role Inventory, a self-report inventory which measures the extent to which respondents are sex-typed. The responses were used to investigate (1) whether there was any sex bias in the marking, (2) the relation between the sex of the examiners and their marking. In the English examination one question was biased by half a mark in favour of girls, and it was found that the more senior the examiner the more generous the marking. These findings concur with results in other areas like testing English as a second language. It is concluded that sex bias should be monitored but is unlikely to be found.
Key words Examination, GCSE, Marking, Sex bias, Masculinity, Femininity.
1 Ordinary levels were generally taken by sixteen-year-olds in England before the GCSE was introduced as a replacement.
2 This information is available at www.unl.ac.uk/ipse/academicwriting.htm.
Ashmore, R. D. (1990), 'Sex, gender and the individual', in L. A. Pervin (ed.), Handbook of Personality: Theory and Research, New York: Guilford Press.
Auster, C. J., and Ohm, S. C. (2000), 'Masculinity and femininity in contemporary American society: a reevaluation using the Bem Sex Role Inventory', Sex Roles 43, 499-528.
Baird, J. (1996), 'What's in a name? Experiments with Blind Marking in A-level Examinations', paper presented at the British Psychological Society conference, London, 17-18 December.
Bem, S. (1974), 'The measurement of psychological androgyny', Journal of Consulting and Clinical Psychology 42, 155-62.
_____(1979), 'The theory and measurement of androgyny: a reply to Pedhazur-Tetenbaum and Locksley-Colten critiques', Journal of Personality and Social Psychology 37, 1047-54.
_____(1987), 'Masculinity and femininity only exist in the mind of the perceiver', in J. M. Reinisch, L. Rosenblum and S. Sanders (eds), Masculinity/Femininity: Basic Perspectives, Oxford: Oxford University Press.
Blanchard-Fields, F., Suhrer-Roussel, L., and Hertzog, C. (1994), 'A confirmatory factor analysis of the Bem Sex Role Inventory: old questions, new answers', Sex Roles 30, 423-57.
Boaler, J. (1994), 'When do girls prefer football to fashion? An analysis of female underachievement in relation to "realistic" mathematics contexts', British Educational Research Journal 20, 551-64.
Bradley, C. (1984), 'Sex bias in the evaluation of students', British Journal of Social Psychology 23, 147-53.
Delia, J. (1972), 'Dialects and the effects of stereotyping on interpersonal attraction and cognitive processes in impression formation', Quarterly Journal of Speech 58, 285-97.
Dwyer, C. A. (1976), 'Test Content in Mathematics and Science: the Consideration of Sex', paper presented at the annual meeting of the American Educational Research Association, San Francisco; cited in Murphy, R. J. L. (1978), 'Sex differences in examination performance: do these reflect differences in ability or sex-role stereotyping?', Educational Review 30, 259-63.
Francis, B., Read, B., and Robson, J. (2002), 'First-class or Failure? Assessing the Assessment of Undergraduate Writing (with Particular Attention to Gender)', paper presented at the British Educational Research Association conference in Exeter, 12-14 September.
Francis, B., Read, B., Melling, L., and Robson, J. (2003), 'Lecturers' perceptions of gender and undergraduate writing style', British Journal of Sociology of Education 24, 357-73.
Gipps, C. V. (1994), Beyond Testing: Towards a Theory of Educational Assessment, London: Falmer Press.
Gipps, C., and Murphy, P. (1994), A Fair Test? Assessment, Achievement and Equity, Milton Keynes: Open University Press.
Graf, R. G., and Ridell, J. C. (1972), 'Sex differences in problem solving as a function of problem context', Journal of Educational Research 65, 451-2.
Greatorex, J., and Bell, J. F. (2002), 'What Makes a Senior Examiner?', paper presented at the British Educational Research Association conference, Exeter, 12-14 September. Available at www.ucles-red.cam.ac.uk.
Green, S., Johnson, M., O'Donovan, N., and Sutton, P. (2003), 'Changes in Key Stage 2 Writing from 1995 to 2002', paper presented at the UK Reading Association conference, Cambridge, 11-13 July. Available at www.ucles-red.cam.ac.uk.
Hamp-Lyons, L. (1990), 'Second language writing: assessment issues', in B. Kroll (ed.), Second Language Writing: Research Insights for the Classroom, Norwood NJ: Ablex.
Kiwan, D., Pollitt, A., and Ahmed, A. (1999), paper presented at the British Psychological Society conference, London, 20-1 December. Available from www.ucles-red.cam.ac.uk.
Lynn, R., and Mau, W-C. (2001), 'Gender differences on the Scholastic Aptitude Test, the American College Test and college grades', Educational Psychology 21 (2), 133-6.
Moody, J. (1999), 'Jobs for the Boys? An Investigation into the Under-representation of Women in Senior Examining Positions with OCR', unpublished M.Ed. dissertation, University of Bristol.
Morgan, D. H. J. (2002), Gender - Key Variables. Available from http://qb.soc.surrey.ac.uk/resources/keyvariables/morgan.htm.
Murphy, R. J. L. (1978), 'Sex differences in examination performance: do these reflect differences in ability or sex-role stereotyping?' Educational Review 30, 259-63.
Murray, B. (1976), 'Androgyny and Sex-role Stereotypes: Women's Health and Self-perceptions and Perceptions of Psychological Health in Others', doctoral dissertation, California School of Professional Psychology. Dissertation Abstracts International 37, 1444B, University Microfilms No. 76-79,645.
Newstead, S. E., and Dennis, I. (1990), 'Blind marking and sex bias in student assessment', Assessment and Evaluation in Higher Education 15, 132-9.
Newton, P. E. (1996), 'The reliability of marking of GCSE scripts: Mathematics and English', British Educational Research Journal 22, 405-20.
Pirie, M. (2001), 'How exams are fixed in favour of girls', Spectator, 20 January.
Ruth, L., and Murphy, S. (1988), Designing Writing Tasks for the Assessment of Writing, Norwood NJ: Ablex.
Scottish Examining Board (1992), 'Investigation into the Effects of the Characteristics of Candidates and Presenting Centres on Possible Marker Bias', internal report, Edinburgh: Scottish Examination Board.
Spear, M. G. (1984), 'Sex bias in science teachers' ratings of work and pupil characteristics', European Journal of Science Education 6, 369-77.
Stobart, G., Elwood, J., and Quinlan, M. (1992), 'Gender bias in examinations: how equal are the opportunities?' British Educational Research Journal 18, 261-76.
Vann, R. J., Lorenz, F. O., and Meyer, D. M. (1991), 'Error gravity: faculty response to errors in the written discourse of nonnative speakers of English', in L. Hamp-Lyons (ed.), Assessing Second Language Writing in Academic Contexts, Norwood NJ: Ablex.
Weigle, S. (1998), 'Using FACETS to model rater training effects', Language Testing 15, 263-87.
West, C., and Zimmerman, D. H. (1987), 'Doing gender', Gender and Society 1, 125-51.
Wood, R. (1978), 'Sex differences in answers to English Language comprehension items', Educational Studies 4, 157-65.
_____(1991), Assessment and Testing, Cambridge: Cambridge University Press.
Jackie Greatorex and John Bell University of Cambridge Local Examinations Syndicate
This research is based on work undertaken by the University of Cambridge Local Examinations Syndicate for Oxford Cambridge and RSA Examinations (OCR). The opinions expressed in the article are those of the authors and are not to be taken as the opinions of the University of Cambridge Local Examinations Syndicate (UCLES) or any of its subsidiaries.
Address for correspondence
Jackie Greatorex, Research and Evaluation Division, University of Cambridge Local Examinations Syndicate, 1 Hills Road, Cambridge CB1 2EU. E-mail firstname.lastname@example.org…
Publication information: Research in Education 71 (May 2004), pp. 25 ff. © Manchester University Press.