AI Magazine

Evaluation Methods for Machine Learning

Article excerpt

This year's workshop continued the discussion begun at last year's AAAI workshop on the same topic, which had concluded that there are serious drawbacks in the way we evaluate learning algorithms. In particular, the participants of last year's workshop had agreed on two points. First, our evaluation practices are too narrow: properties beyond accuracy (for example, interpretability, performance under changing conditions, and evaluation of transfer mechanisms) should also be tested, and accuracy itself should be treated more flexibly. Second, the University of California, Irvine (UCI) datasets do not reflect the variety of problems to which learning algorithms are applied in practice. The invited talks at this year's workshop were designed to address these two issues.

Regarding the first issue, Yu-Han Chang gave an interesting talk on evaluation metrics designed to measure the performance of transfer learning algorithms. Rich Caruana presented his investigation of various evaluation metrics, showing that the way a metric is used matters more than which metric is chosen. He showed that, in fact, without proper calibration the results we obtain from certain classes of metrics can be meaningless.
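To make the calibration point concrete, the sketch below is our own illustration in Python with scikit-learn; the synthetic dataset and the choice of naive Bayes are assumptions, not examples from Caruana's talk. It compares a probability-sensitive metric, the Brier score, before and after sigmoid (Platt) calibration: a ranking metric such as AUC barely moves, while the Brier score typically improves markedly, which is the sense in which uncalibrated scores can render certain metrics misleading.

    # Illustrative sketch only; model and data choices are assumptions.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.metrics import brier_score_loss, roc_auc_score

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Naive Bayes is known to produce over-confident probability estimates.
    raw = GaussianNB().fit(X_train, y_train)
    raw_proba = raw.predict_proba(X_test)[:, 1]

    # Sigmoid (Platt) calibration fitted with internal cross-validation.
    calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
    calibrated.fit(X_train, y_train)
    cal_proba = calibrated.predict_proba(X_test)[:, 1]

    # Probability-sensitive metric (Brier) vs. ranking metric (AUC).
    print("Brier (uncalibrated):", brier_score_loss(y_test, raw_proba))
    print("Brier (calibrated):  ", brier_score_loss(y_test, cal_proba))
    print("AUC   (uncalibrated):", roc_auc_score(y_test, raw_proba))
    print("AUC   (calibrated):  ", roc_auc_score(y_test, cal_proba))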

To address the second issue, we invited two researchers who do not develop new machine-learning algorithms but rather specialize in their application to specific areas. In particular, George Tzanetakis presented his research in audio processing and music information retrieval, while Andre Kushniruk shared the lessons he has learned about evaluation from his study of emerging health-care information systems. Both agreed that the issues encountered in their real-life domains go well beyond those raised by the UCI domains and that our evaluation practices fall short of what is really needed in their respective fields. For example, Kushniruk found that no performance measures were available for the kind of testing he was interested in (for example, a measure of low-cost portable usability), while Tzanetakis deplored the fact that we do not "listen to our data" enough and explained how commonly used procedures such as cross-validation cannot be applied to his field without serious prior thought about what exactly they compute and what the resulting numbers really mean. In addition to these talks, nine papers were presented. A number of these, like Rich Caruana's invited talk, took a high-level view of performance metrics, pitting them against one another. …
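Tzanetakis's caution about cross-validation can be illustrated with a small sketch, again our own, in Python with scikit-learn; the synthetic "clips" and groups are assumptions, not his data. When several examples come from the same recording or artist, plain k-fold cross-validation lets near-duplicates fall on both sides of a split and reports optimistic accuracy, whereas a group-aware split keeps each source entirely within one fold.

    # Illustrative sketch only; the grouping structure is a synthetic stand-in.
    import numpy as np
    from sklearn.model_selection import KFold, GroupKFold, cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n_groups, clips_per_group = 50, 10

    # Each "recording" contributes several highly correlated clips
    # that share one label.
    group_centers = rng.normal(size=(n_groups, 8))
    groups = np.repeat(np.arange(n_groups), clips_per_group)
    X = group_centers[groups] + 0.1 * rng.normal(size=(n_groups * clips_per_group, 8))
    y = np.repeat(rng.integers(0, 2, size=n_groups), clips_per_group)

    clf = RandomForestClassifier(random_state=0)
    naive = cross_val_score(clf, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
    grouped = cross_val_score(clf, X, y, groups=groups,
                              cv=GroupKFold(n_splits=5))

    print("Plain k-fold accuracy: %.3f" % naive.mean())    # inflated by leakage
    print("Group-aware accuracy:  %.3f" % grouped.mean())  # closer to honest estimate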
