 Academic journal article Psychological Test and Assessment Modeling

# Application of Evolutionary Algorithm-Based Symbolic Regression to Language Assessment: Toward Nonlinear Modeling

## Article excerpt

(ProQuest: ... denotes formulae omitted.)

Linear regression is one family of linear mathematical functions that has been widely used in predictive modelling and achieved some degree of success in language assessment. Linear regression seeks to create mathematical solutions by which to predict output values from input values. A simple linear regression model can be mathematically expressed as follows:

$$\gamma = \alpha + \beta\chi \tag{1}$$

where

$\gamma$ = output value or dependent variable,

$\chi$ = input value or independent variable,

$\beta$ = slope, and

$\alpha$ = intercept.
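As a minimal sketch of how the intercept and slope in Equation (1) can be estimated, the snippet below uses ordinary least squares in plain Python. The function names (`fit_simple_linear`, `predict`) and the scores are hypothetical and invented for illustration; they are not taken from the study.

```python
def fit_simple_linear(xs, ys):
    """Return (alpha, beta) minimising squared error for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # The fitted line always passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


def predict(alpha, beta, x):
    """Predicted output value for a given input value."""
    return alpha + beta * x


# Hypothetical scores, invented for illustration only.
vocab = [10, 20, 30, 40, 50]
listening = [32, 52, 72, 92, 112]  # constructed as exactly 12 + 2 * vocab
alpha, beta = fit_simple_linear(vocab, listening)
print(alpha, beta)  # prints 12.0 2.0
```

Because the illustrative data lie exactly on a line, the fit recovers the intercept and slope used to build them; real test scores would leave residual error.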

The χ value is chosen to predict γ as accurately as possible. In practice, however, some data points often fall far from the linear regression line. These data points are known as "outliers," and the presence of multiple outliers can affect the linearity of the data and consequently worsen the model's fit and predictive power (Keith, 2006). As such, outliers are generally pruned in expert-informed predictive models (Hair, Black, Babin, & Anderson, 2010); otherwise, the resulting equation usually provides a relatively imprecise profile of the relationship between input and output data. Pruning, however, destroys (valid) data. Although nonlinear data distant from the linear regression line may appear chaotic, random, or "useless," in reality they reflect the influence of networks of interrelated variables, likely with meaningful interactions, which remain unexplored by linear models (Alamir, 1999).
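The sensitivity of a linear fit to outlying points can be illustrated with a small sketch. It computes the Pearson correlation for a perfectly linear toy data set and again after one point is moved far from the line. The data and the helper name `pearson_r` are assumptions made for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data, invented for illustration: y = 2x exactly.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
clean = pearson_r(xs, ys)

# Move the last observation far below the line to act as an outlier.
ys_with_outlier = ys[:-1] + [0]
with_outlier = pearson_r(xs, ys_with_outlier)

print(round(clean, 3), round(with_outlier, 3))  # prints 1.0 0.143
```

A single displaced point out of six is enough to pull the coefficient from 1.0 to about 0.14 in this toy case, which is why pruning is tempting and also why it changes what the model describes.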

Furthermore, destroying outlying data is typically not enough to render linear regression models highly accurate. A quick survey of the available literature shows that the average precision of regression models, as indicated by their correlation coefficients, is approximately 0.40, suggesting an inherently nonlinear (or less linear) relationship between the variables examined. Linear models may be able to predict the part of the data near the regression line, but they will estimate the large proportion of data lying far from the line with considerable imprecision (Keith, 2006).
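As a rough worked figure, and assuming the reported coefficient is read as a correlation $r$ between observed and model-predicted scores (the excerpt does not specify this), squaring it gives the share of variance such a model accounts for:

$$r \approx 0.40 \;\Rightarrow\; r^2 \approx 0.16, \qquad 1 - r^2 \approx 0.84$$

On that reading, a typical model would leave roughly 84% of the variance in the outcome unexplained.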

As a final issue, most studies applying linear regression do not attempt to test their postulated models with new data sets to examine whether their findings can be replicated (Keith, 2006). While this is not an intrinsic problem of regression modelling, the lack of validation samples can call into question the credibility of the models yielded by linear regression analysis.
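One common way to test a fitted model on new data is to hold out part of the sample before fitting. The sketch below is a minimal illustration of that idea rather than the procedure used in the study. The function names and the synthetic scores are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def r_squared(alpha, beta, xs, ys):
    """Proportion of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def holdout_split(pairs, n_test, seed=0):
    """Shuffle (x, y) pairs and hold out n_test of them as an unseen sample."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic scores, invented for illustration: a linear trend plus noise.
noise = random.Random(1)
pairs = [(x, 5 + 0.8 * x + noise.gauss(0, 4)) for x in range(40)]

train, test = holdout_split(pairs, n_test=12)
alpha, beta = fit_line(*zip(*train))  # fit on the training sample only
print("R^2 on training sample:", r_squared(alpha, beta, *zip(*train)))
print("R^2 on unseen sample:  ", r_squared(alpha, beta, *zip(*test)))
```

Comparing the two values shows how much of the apparent fit carries over to cases the model has not seen.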

This set of methodological problems suggests that many of the conclusions drawn from linear regression studies in language assessment research may be oversimplified and imprecise. Rather than defining imprecise linear models or omitting data points that cause "error," researchers can use a flexible data analysis technique to pinpoint the structure of both the linear and nonlinear elements of the data and test it across an unseen sample (Koza, 2010). Evolutionary algorithm-based (EA-based) symbolic regression, also called symbolic function identification, seeks to identify influential independent variables by discovering the mathematical functions that fit the data (Fogel & Corne, 1998). Symbolic regression builds models from the symbolic functional operators selected by the researcher and then applies a genetic programming algorithm, which selects the input variables and yields a final set of models (Koza, 2010). To choose the best model from this set, the researcher can use a number of fit and importance statistics (Schmidt & Lipson, 2009).
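To make the idea concrete, the sketch below searches over small expression trees built from a researcher-chosen operator set. It is a deliberately simplified, mutation-only evolutionary search with elitist truncation selection, not the full genetic programming procedure described by Koza (which also uses crossover) and not the software used in the study. The operator set, parameter values, and synthetic data are all assumptions made for illustration.

```python
import random

# Operators the search may combine; a hypothetical, deliberately small set.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def random_tree(rng, depth):
    """Grow a random expression over x, integer constants, and OPS."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.randint(-3, 3))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    """Compute the value of an expression tree at input x."""
    if tree[0] == "x":
        return x
    if tree[0] == "const":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))


def size(tree):
    """Number of nodes, used to stop expressions from bloating."""
    if tree[0] in ("x", "const"):
        return 1
    return 1 + size(tree[1]) + size(tree[2])


def sse(tree, xs, ys):
    """Sum of squared errors of the expression on the data (lower is better)."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys))


def mutate(tree, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ("x", "const") or rng.random() < 0.3:
        return random_tree(rng, 2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def evolve(xs, ys, pop_size=60, generations=30, max_size=31, seed=0):
    """Return (best_tree, history); history[i] is the best SSE in generation i."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: sse(t, xs, ys))
        history.append(sse(population[0], xs, ys))
        survivors = population[: pop_size // 4]  # keep the fittest quarter unchanged
        population = list(survivors)
        while len(population) < pop_size:
            child = mutate(rng.choice(survivors), rng)
            if size(child) <= max_size:  # discard oversized expressions
                population.append(child)
    best = min(population, key=lambda t: sse(t, xs, ys))
    return best, history


# Synthetic data from a known nonlinear rule, invented for illustration.
xs = list(range(-5, 6))
ys = [x * x + x + 1 for x in xs]
best, history = evolve(xs, ys)
print(best, history[0], history[-1])
```

Because the fittest quarter of each generation is carried over unchanged, the best error in `history` can never increase from one generation to the next. Whether the search recovers the generating rule depends on the seed and settings. In practice the candidate expressions would then be compared on fit statistics and checked against an unseen sample.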

To demonstrate the application of EA-based symbolic regression, this study uses data from a listening test, a vocabulary test, a grammar test, and the metacognitive awareness listening questionnaire (MALQ) (Vandergrift & Goh, 2012). Although both lexicogrammatical knowledge and metacognitive strategies have been posited to predict performance in listening comprehension tests (e.g., Buck, 2001; Goh, 2000; Vandergrift & Goh, 2012), neither set of variables has received enough empirical examination. …
