A test is considered biased when assumptions and decisions based on its results could unfairly advantage particular groups of people because the score captures unintended demographic constructs such as gender, socio-economic status and ethnicity. In narrower terms, then, a test is defined as biased when its score reflects additional socio-demographic variables on top of the intended construct. For example, individuals of a particular ethnic group may systematically underperform on a reading test not because they are less able to read but because they are less familiar with the context of the items included in the test. Such a test is considered biased because, apart from the student's reading ability, it also measures factors related to ethnicity.
Against the backdrop of a growing need for multilanguage variants of achievement and personality tests, Fons J. R. van de Vijver and Ype H. Poortinga, in Adapting Educational and Psychological Tests for Cross-Cultural Assessment (2005), list three types of bias in relation to test adaptation: construct bias, method bias and item bias. Construct bias refers to the fact that constructs may differ across cultural groups. For instance, in the majority of intelligence tests, intelligence appears to be defined as comprising reasoning and logical thinking plus some degree of acquired information and memory. Research in non-Western societies, however, has shown that notions of intelligence there are wider and also involve social aspects.
Method bias has two subtypes: instrument bias and administration bias. In mental testing, a well-known cause of instrument bias is how familiar subjects are with the stimuli and response formats. The authors cite a study examining the perceptual skills of children in the U.K. and Zambia. The task was to reproduce figures using paper and pencil, iron-wire modeling (which is common in Zambia), plasticine, and hand positions. As expected, the U.K. children did better with paper-and-pencil representations and the Zambian children with iron-wire modeling. Administration bias arises from communication difficulties between tester and testee. Item bias, or differential item functioning, stems most importantly from weak translations and differences in word connotations. A key feature of definitions of item bias is that it is conditional on the level of a trait or ability: examinees from different groups who are equally able should have equal chances of answering an item correctly. For example, American children were likely to score higher than Dutch children of comparable ability on the question "Who is the president of the United States?"
The Journal of the American Enterprise Institute (July/August 2007) reported on a debate in the United States about test bias in relation to the SAT, the standardized tests for college admissions. Some argued that the tests were biased in favor of privileged children and described them as a "wealth test." The SAT tests were originally devised as a way of picking out talented students regardless of race, faith or wealth, and it was hoped that poorer students from inner-city neighborhoods would benefit from the assessments. However, many were disappointed and criticized the tests as a "negative force." The Huffington Post (August 17, 2011) discussed concerns that the SAT had a racial bias and highlighted a study from Harvard suggesting that questions favored white students because they used language more familiar to those students than to non-white groups. This claim was rejected by Laurence Bunin, Senior Vice President of the SAT program at the College Board, who said: "The test is a fair test that helps mirror what's going on in this country."
Joy Matthews-López, an expert in the field of educational testing, explains that even with the best intentions, bias can still find its way into test materials. She cites the example of sports-related questions in an assessment of math skills: male test-takers may hold an advantage because they are more likely to have the background knowledge needed to answer such questions, while female test-takers may find them more difficult, so the questions could be seen as biased. Matthews-López also examines differential item functioning (DIF), a statistical procedure used to screen standardized tests for violations of fairness; in particular, it asks whether a question might be unfair to a section of the population based on race or gender. Matthews-López has called for more research to be carried out in this field. She previously worked as a measurement statistician for the Educational Testing Service.
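DIF screening of the kind described above is commonly carried out with the Mantel-Haenszel procedure, which compares two groups' odds of answering an item correctly after matching examinees on total test score, so that only same-ability comparisons count. The sketch below is a minimal illustration of that idea; the function names, the synthetic counts, and the rough ±1.5 flagging threshold on the ETS delta scale are illustrative assumptions, not details taken from the sources discussed here.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio of a correct answer across score strata.

    Each stratum is a tuple of counts for one total-score level:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    A value near 1.0 means both groups, matched on ability, have
    similar odds of answering the item correctly (little or no DIF).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

def mh_delta(alpha):
    """Rescale the odds ratio to the ETS delta metric; values beyond
    roughly +/-1.5 are conventionally flagged as sizable DIF."""
    return -2.35 * math.log(alpha)

# Synthetic example: at every matched score level, both groups answer
# with identical odds, so the item shows no DIF (odds ratio = 1).
no_dif = mantel_haenszel_odds_ratio([(8, 2, 8, 2)] * 3)

# Here the focal group does worse at every matched score level,
# so the odds ratio rises well above 1 and the delta flags DIF.
dif = mantel_haenszel_odds_ratio([(8, 2, 4, 6)] * 3)
```

Because the comparison is made within each total-score stratum, a plain difference in average scores between groups is not, by itself, flagged as DIF; only items where equally able examinees differ are.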