Anthropic used Pokémon to benchmark its newest AI model. Yes, really.
In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
A unique feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.
That came in handy in Pokémon Red, apparently.
Whereas a previous version of Claude, Claude 3.0 Sonnet, failed to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.
Now, it’s not clear how much computing Claude 3.7 Sonnet required to reach those milestones, or how long each took. Anthropic said only that the model performed 35,000 actions to reach the last of those gym leaders, Lt. Surge.
It surely won’t be long before some enterprising developer finds out.
Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.