Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.
That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.
In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models, OpenAI’s o1 among them, sometimes “give up” and provide answers they know aren’t correct.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.
“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a bit longer to arrive at solutions, typically seconds to minutes longer.
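For readers curious what evaluation on a benchmark like this involves, here is a minimal sketch of a scoring loop over question-and-answer riddles. It is purely illustrative: the riddle format, the `ask_model` stub, and the exact-match scoring are assumptions, not details taken from the study.

```python
# Hypothetical sketch of scoring a model on puzzle-style riddles.
# The riddle format, ask_model() stub, and exact-match scoring are
# illustrative assumptions, not the researchers' actual pipeline.

def ask_model(question: str) -> str:
    """Stand-in for a call to a reasoning model; plug in a real client here."""
    raise NotImplementedError

def normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences are not counted as errors."""
    return text.strip().lower()

def evaluate(riddles: list[dict]) -> float:
    """Return the fraction of riddles answered correctly.

    Each riddle is assumed to look like {"question": "...", "answer": "..."}.
    """
    correct = sum(
        normalize(ask_model(r["question"])) == normalize(r["answer"])
        for r in riddles
    )
    return correct / len(riddles)
```

Under a setup like this, a model’s reported score is simply the share of the roughly 600 riddles it answers correctly.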
At least one model, DeepSeek’s R1, gives solutions it knows to be incorrect for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.
The models make other bizarre choices, too, like giving a wrong answer only to immediately retract it, trying to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no apparent reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.
![NPR Sunday Puzzle benchmark results](https://techcrunch.com/wp-content/uploads/2025/02/Screenshot-2025-02-06-at-12.31.38AM.png?w=680)
“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren’t — capable of.”