The main purpose of this study was to develop a Rasch Measurement Physical Fitness Scale (RMPFS) based on physical fitness indicators routinely used in Hong Kong primary schools. A total of 9,439 records of students' performances on physical fitness indicators, retrieved from the database of a Hong Kong primary school, were used to develop the Rasch scale. Following a series of iterative Rasch analyses that adopted the "data should fit the model" approach, four physical fitness indicators (i.e., 6-min run, 9-min run, 1-min sit-ups, and dominant handgrip) were successfully calibrated to form the RMPFS. The RMPFS and its indicators showed fit to the Rasch model sufficient for the intended purpose of measuring children's overall fitness. The overall physical fitness measure reflects children's fitness on three core components of physical fitness (i.e., cardio-respiratory fitness, muscular endurance, and muscular strength). Advantages of the RMPFS are discussed, and recommendations for future research follow. The findings of this study provide a sounder basis for interpreting children's physical fitness assessment results.
Key words: Rasch measurement, physical fitness, primary school, data should fit the model
Given the important role that physical fitness should play in children's lives, fitness assessment or testing is intuitively a crucial part of physical education, which aims to promote a healthy and physically active lifestyle. However, fitness testing in schools has been criticized for decades, and even its necessity for children has been seriously questioned (Liu, 2008). The 2008 special issue on youth fitness testing in Measurement in Physical Education and Exercise Science (MPEES) discussed different perspectives on the topic in depth. For example, from a pedagogical perspective, fitness tests should be implemented as formative evaluation, so that their results inform teaching and learning in physical education (Silverman, Keating, & Phillips, 2008). From the perspective of promoting physical activity, fitness assessments are expected to provide accurate measures that carry important information about children's health-related fitness levels and can therefore help optimize the effectiveness of physical education (Welk, 2008). Moreover, there is no doubt that the use and interpretation of fitness assessment have important educational, pedagogical, and psychological consequences (Mahar & Rowe, 2008). In summary, the editors and authors of the MPEES special issue agreed that youth fitness testing can serve a useful purpose in school settings if used in the correct way. This article aims to extend the discussion of this "correct way" by showing how objective physical fitness measurement can be achieved from fitness testing scores.
Accurate measures of youth fitness are needed by both researchers and educators, regardless of their purposes (Mahar & Rowe, 2008). In traditional approaches, the routine practice is to assess different components of physical fitness (e.g., body composition, cardio-respiratory fitness, flexibility, muscular endurance, and muscular strength) with different indicators and to report and interpret children's abilities in each component as raw scores (in meters, kilograms, seconds, etc.) or percentile ranks. However, raw scores might not provide valid measures, because they have little inferential value (Wright, 1997; Wright & Mok, 2000). Their validity as representations of fitness levels rests on an unquestioned assumption, namely, that raw scores are equal-interval. Unfortunately, unless they are used to derive further criterion measures (e.g., VO₂max estimated from scores on the 6-min or 9-min run test), raw scores indicate only the ordering of children's performances; they carry little inferential value about the size of the differences among scores in terms of "fitness." Meters indicate equal differences on a length or distance scale, but it is an act of faith to conclude that they indicate equal differences on a cardio-respiratory fitness scale. When meters serve as the score unit of the 6-min run test, they have only ordinal meaning and therefore might not yield valid measures of the underlying fitness component.
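The ordinal character of raw scores can be made concrete by converting them to log-odds (logits), the transformation at the heart of Rasch measurement. The sketch below is only an illustration under stated assumptions: it uses a hypothetical test scored from 0 to 20 and takes the log-odds of the proportion of the maximum score, which coincides with the dichotomous Rasch maximum-likelihood measure only when every item has a difficulty of 0 logits. The names raw_to_logit and MAX_SCORE are hypothetical and are not part of the study.

```python
import math


def raw_to_logit(raw: int, max_score: int) -> float:
    """Log-odds of a raw score relative to the maximum possible score.

    Extreme scores (0 or max_score) have no finite logit and are rejected.
    """
    if not 0 < raw < max_score:
        raise ValueError("extreme scores have no finite logit")
    p = raw / max_score
    return math.log(p / (1.0 - p))


MAX_SCORE = 20  # hypothetical test length, not taken from the study

# The same 2-point raw-score gain at two different places on the scale.
for low, high in [(10, 12), (16, 18)]:
    gap = raw_to_logit(high, MAX_SCORE) - raw_to_logit(low, MAX_SCORE)
    print(f"raw {low} -> {high}: {gap:.2f} logits")
# raw 10 -> 12: 0.41 logits
# raw 16 -> 18: 0.81 logits
```

An identical two-point raw gain corresponds to roughly twice as many logits near the top of the scale as in the middle, which is why equal raw-score differences cannot simply be assumed to reflect equal differences in the underlying trait.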
Another deficiency of traditional approaches to physical fitness assessment arises when results are interpreted in a norm-referenced framework, in which ranks or percentiles are used to describe students' performances on physical fitness indicators. Because such interpretations are both sample dependent and indicator dependent, they are often neither accurate nor comparable. Ranks and percentiles provide only an inexact basis for comparison among students and should instead be regarded as indicators of students' relative strengths and weaknesses (Williams, Harageones, Johnson, & Smith, 2000). Moreover, raw numbers or counts and norm-referenced ratings do not allow children's fitness to be assessed directly against an objective fitness standard, that is, one under which the measurement and interpretation of students' fitness levels are independent of both the sample and the indicators.
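The sample dependence of norm-referenced interpretation can be shown with a small sketch. It computes a percentile rank, using the common mid-rank convention for ties, for the same 6-min run distance against two reference groups. The distances and the percentile_rank helper are hypothetical illustrations, not data or procedures from the study.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def percentile_rank(score: float, reference: Sequence[float]) -> float:
    """Percentile rank of score within a reference sample (mid-rank for ties)."""
    ordered = sorted(reference)
    below = bisect_left(ordered, score)           # values strictly below score
    ties = bisect_right(ordered, score) - below   # values equal to score
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical 6-min run distances (meters) for two reference groups.
group_a = [800, 850, 900, 950, 1000]
group_b = [950, 1000, 1050, 1100, 1150]

child_distance = 950
print(percentile_rank(child_distance, group_a))  # 70.0
print(percentile_rank(child_distance, group_b))  # 10.0
```

The child's performance is unchanged, yet the percentile rank drops from 70 to 10 purely because the reference sample changed.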
Furthermore, administering the full set of fitness tests to a whole class of 40 or more students with the traditional approach is time consuming. Because a single total score might not provide a meaningful summary of different fitness indicators, multi-faceted profiles that report a score for each component of physical fitness are often regarded as more appropriate (Marsh, 1993). A by-product is that these assessment tasks in the physical education curriculum increase teachers' workloads and consume resources that could otherwise be put into teaching. There is little doubt that physical fitness is a multi-faceted concept, but the extent to which the multi-dimensional indices used in traditional approaches should rule out a uni-dimensional fitness index remains open to discussion and to evidence-based empirical investigation. The question addressed in this article is to what extent it is possible to generate a uni-dimensional index of physical fitness that provides interval-scale fitness measures for children, independent of sample and indicator, for estimating differences between groups of children and for tracking changes in fitness levels over time.
The Rasch model (Andrich, 1988; Rasch, 1960) provides ways to address the deficiencies inherent in traditional approaches to physical fitness assessment. First, Rasch analysis can transform non-linear raw scores into logit-scale measures that have constant interval meaning and thus provide objective, linear measurement from ordered category responses (Linacre, 2000, 2006a, p. 12). Second, the "parameter separation" or "invariance of parameters" property of the Rasch model (Bond & Fox, 2007, p. 71; Wright & Masters, 1982, p. 34) implies that the calibration of fitness indicators is sample-distribution-free and the calibration of persons is indicator-distribution-free along the fitness continuum. Sample-distribution-free calibration of fitness indicators means that the difficulty estimates of indicators (e.g., 6-min run, 1-min sit-ups) should be invariant, within measurement error, no matter which sample is used to calibrate them. Indicator-distribution-free calibration of persons means that the fitness estimate of any person should remain invariant, within measurement error, no matter which particular fitness indicators are used to measure that person's fitness. Therefore, direct person-person, item-item, and person-item comparisons can be made easily, based on their locations on the common logit scale. Finally, an overall fitness measure can be provided for a student even if he or she has not performed all of the physical fitness indicators that have been calibrated onto the fitness continuum.
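The parameter-separation and missing-indicator properties described above can be illustrated with a minimal sketch. For simplicity it assumes the dichotomous Rasch model, P = exp(θ − δ) / (1 + exp(θ − δ)) with person ability θ and indicator difficulty δ in logits, rather than the polytomous form that ordered fitness categories would require. The indicator names, difficulty values, and helper functions are hypothetical. item_contrast shows that the difference in a person's log-odds of success on two indicators equals the difference in their difficulties whatever the person's ability, and estimate_ability computes a maximum-likelihood ability from whichever calibrated indicators a child actually attempted.

```python
import math
from typing import Dict, Optional


def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: P(success) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


def item_contrast(theta: float, delta_i: float, delta_j: float) -> float:
    """Difference in a person's log-odds of success on items i and j.

    Under the Rasch model this equals delta_j - delta_i for every theta
    (parameter separation): the person parameter cancels out.
    """
    return log_odds(rasch_probability(theta, delta_i)) - log_odds(rasch_probability(theta, delta_j))


def estimate_ability(responses: Dict[str, Optional[int]],
                     difficulties: Dict[str, float],
                     tol: float = 1e-8, max_iter: int = 100) -> float:
    """Maximum-likelihood ability (logits) from the indicators actually attempted.

    Responses are 0/1; None marks an indicator the person did not attempt.
    Indicator difficulties are treated as already calibrated (known).
    """
    observed = {name: x for name, x in responses.items() if x is not None}
    raw = sum(observed.values())
    if raw == 0 or raw == len(observed):
        raise ValueError("extreme score: no finite maximum-likelihood estimate")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_probability(theta, difficulties[name]) for name in observed]
        # Newton-Raphson step: (observed - expected score) / test information,
        # limited to 1 logit per iteration for stability.
        step = (raw - sum(probs)) / sum(p * (1.0 - p) for p in probs)
        step = max(-1.0, min(1.0, step))
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical calibrated difficulties (logits); not estimates from the study.
difficulties = {"run": -1.0, "situps": 0.0, "handgrip": 1.0}
print(round(item_contrast(-2.0, -1.0, 1.0), 6), round(item_contrast(2.5, -1.0, 1.0), 6))  # 2.0 2.0

full = estimate_ability({"run": 1, "situps": 1, "handgrip": 0}, difficulties)
partial = estimate_ability({"run": 1, "situps": None, "handgrip": 0}, difficulties)
print(f"all three indicators: {full:.2f} logits; sit-ups missing: {partial:.2f} logits")
```

In the example, the child who skipped the sit-up indicator still receives an ability estimate on the same logit scale, although in practice an estimate based on fewer indicators carries a larger standard error.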
Unlike more general multi-dimensional or Item Response Theory (IRT) models and other (true score) statistical techniques that adopt a "the model fits the data" approach, manipulating different parameters to accommodate the idiosyncrasies of any dataset, the Rasch model requires that "data fit the model" (Andrich, 2004) for the purpose of achieving objective measurement. This is one of the key differences between Rasch-based studies and other quantitative studies in the human sciences. The Rasch model is held to be able to solve the basic measurement problem common to all social sciences (Andersen, 1995), and it has been applied in sport science and physical education by a growing number of researchers (see the reviews by Strauss, Busch, & Tenenbaum, 2007, and Tenenbaum, Strauss, & Busch, 2007, for more detail). For example, Rasch analysis has been used to calibrate physical function or competence (Zhu & Kurz, 1994), perception of sports games (Kang & Kang, 2006), and the difficulty levels of physical fitness indicators (Zhu & Safrit, 1993). Other studies have applied the Rasch model to develop or evaluate instruments used in exercise research. Hands and Larkin (2001) studied children's performance on different motor tasks and developed two separate uni-dimensional Rasch scales of motor abilities, one for boys and one for girls. Zhu, Timm, and Ainsworth (2001) modified an exercise barriers instrument and validated it within the Rasch model framework. Heesch, Masse, and Dunn (2006) used Rasch analysis to re-evaluate three commonly used scales: the Physical Activity Enjoyment Scale, the Benefits of Physical Activity Scale, and the Barriers to Physical Activity Scale. Busch et al. (2009) used a mixed Rasch model to investigate the construct validity of the German general motor fitness and coordination test for children. They found that two qualitatively different …