A New AI Evaluation Cosmos: Ready to Play the Game?

By Jose Hernandez-Orallo, Marco Baroni, Jordi Bieger, Nader Chmait, David L. Dowe, Katja Hofmann, Fernando Martinez-Plumed, Claes Strannegard, and Kristinn R. Thorisson | AI Magazine, Fall 2017


We report on a series of new platforms and events dealing with AI evaluation that may change the way in which AI systems are compared and their progress is measured. The introduction of a more diverse and challenging set of tasks in these platforms can feed AI research in the years to come, shaping the notion of success and the directions of the field. However, the playground of tasks and challenges presented there may misdirect the field without some meaningful structure and systematic guidelines for its organization and use. Anticipating this issue, we also report on several initiatives and workshops that are putting the focus on analyzing the similarity and dependencies between tasks, their difficulty, what capabilities they really measure and--ultimately--on elaborating new concepts and tools that can arrange tasks and benchmarks into a meaningful taxonomy.

**********

Through the integration of more and better techniques, more computing power, and the use of more diverse and massive sources of data, AI systems are becoming more flexible and adaptable, but also more complex and unpredictable. There is thus an increasing need for a better assessment of their capabilities and limitations, as well as concern about their safety (Amodei et al. 2016). Theoretical approaches might provide important insights, but only through experimentation and evaluation tools will we achieve a more accurate assessment of how an actual system operates over a series of tasks or environments.

Several AI experimentation and evaluation platforms have recently appeared, forming a new cosmos of AI environments. They facilitate the creation of varied tasks for evaluating and training a wide range of algorithms. The platform interfaces usually follow the reinforcement learning (RL) paradigm, where interaction takes place through incremental observations, actions, and rewards. This is a very general setting, and seemingly any task can be framed within it.
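
To make this shared interaction protocol concrete, the sketch below shows the generic agent-environment loop in Python. The environment interface and the `RandomAgent` class are hypothetical stand-ins for illustration, not any particular platform's API.

```python
# A minimal sketch of the observation-action-reward loop shared by these
# platforms. The `env` object is a hypothetical stand-in for a platform's
# task interface, not any real API; the agent here simply acts at random.
import random

class RandomAgent:
    """Baseline agent that ignores observations and rewards."""
    def __init__(self, actions):
        self.actions = actions  # the set of actions the task allows

    def act(self, observation, reward):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=1000):
    """Run one episode and return the cumulative reward."""
    observation = env.reset()          # initial observation
    reward, total = 0.0, 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)
        observation, reward, done = env.step(action)  # one interaction step
        total += reward
        if done:                       # task signals episode end
            break
    return total
```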

These platforms are different from the Turing test--and other more traditional AI evaluation benchmarks proposed to replace it--as summarized by an AAAI 2015 workshop (1) and a recent special issue of the AI Magazine. (2) Actually, some of these platforms can integrate any task and hence in principle they supersede many existing AI benchmarks (Hernandez-Orallo 2016) in their aim to test general problem-solving ability.

This topic has also attracted mainstream attention. For instance, the journal Nature recently featured a news article on the topic (Castelvecchi 2016). In summary, a new and uncharted territory for AI is emerging, which deserves more attention and effort within AI research itself.

In this report, we first give a short overview of the new platforms, and then briefly report on two 2016 events focusing on (general-purpose) AI evaluation (using these platforms or others).

New Playground, New Benchmarks

Many different general-purpose benchmarks and platforms have recently been introduced, and they are increasingly adopted in research and competitions to drive and evaluate AI progress.

The Arcade Learning Environment (3) is a platform for developing and evaluating general AI agents on a variety of Atari 2600 games. The platform is used to compare, among others, approaches such as RL (see, for example, Mnih et al. [2015]), model learning, model-based planning, imitation learning, and transfer learning. A limitation of this environment is its small number of games, which can lead to overspecialization. The Video Game Description Language (VGDL) (4) follows a similar philosophy, but new two-dimensional (2D) arcade games can be generated from a flexible set of rules.
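
As a rough illustration, the fragment below runs a random agent on ALE through its Python bindings. The module name and ROM-loading conventions differ across ALE releases, so treat this as a sketch under those assumptions rather than a drop-in script.

```python
# Sketch of a random agent on the Arcade Learning Environment via its
# Python bindings. Module name and ROM-loading details vary across ALE
# releases (e.g., `from ale_py import ALEInterface` in newer versions).
import random
from ale_python_interface import ALEInterface

ale = ALEInterface()
ale.loadROM(b"breakout.bin")           # path to an Atari 2600 ROM
actions = ale.getMinimalActionSet()    # actions meaningful for this game

total_reward = 0.0
while not ale.game_over():
    total_reward += ale.act(random.choice(actions))  # act() returns reward
print("Episode reward:", total_reward)
```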

OpenAI Gym (5) (Brockman et al. 2016) provides a diverse collection of RL tasks and an open-source interface for agents to interact with them, as well as tools and a curated web service for monitoring and comparing RL algorithms. The environments, formalized as partially observable Markov decision processes, range from classic control and toy text to algorithmic problems, 2D and three-dimensional (3D) robots, as well as Doom, board, and Atari games. …
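
For a sense of this interface, the fragment below sketches a random agent on a classic-control task, using the Gym API as it stood around the time of this article (later Gym and Gymnasium releases changed the reset and step signatures).

```python
# Sketch of the Gym interface described above, using the API as it stood
# around this article's publication; later Gym/Gymnasium releases changed
# the reset() and step() signatures.
import gym

env = gym.make("CartPole-v0")            # a classic-control task
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random placeholder policy
    observation, reward, done, info = env.step(action)
    total_reward += reward
print("Episode reward:", total_reward)
```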
