Monday, June 29, 2026

People are using Super Mario to benchmark AI now


Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

It wasn’t quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario.


GamingAgent, which Hao developed in-house, fed the AI in-game screenshots along with basic instructions, such as “If an obstacle or enemy is near, move/jump left to dodge.” The AI then generated inputs in the form of Python code to control Mario.
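The loop described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GamingAgent's actual code: `Observation`, `build_request`, `run_generated_code`, `run_agent`, `fake_model`, and the `press` helper are all hypothetical names, and only the instruction string comes from the article.

```python
from dataclasses import dataclass
from typing import Callable, List

# Instruction text quoted in the article; all names below are hypothetical.
INSTRUCTIONS = "If an obstacle or enemy is near, move/jump left to dodge"


@dataclass
class Observation:
    frame_id: int
    screenshot: bytes  # placeholder for an emulator screenshot


def build_request(obs: Observation) -> dict:
    """Bundle the instructions and the latest screenshot for the model."""
    return {
        "instructions": INSTRUCTIONS,
        "frame": obs.frame_id,
        "image": obs.screenshot,
    }


def run_generated_code(code: str, pressed: List[str]) -> None:
    """Run model-written Python that may only call press(button).

    Emptying __builtins__ limits what the snippet can reach, but it is
    not a real sandbox.
    """
    namespace = {"__builtins__": {}, "press": pressed.append}
    exec(code, namespace)


def run_agent(
    capture: Callable[[], Observation],
    model: Callable[[dict], str],
    steps: int,
) -> List[str]:
    """Repeat: grab a screenshot, ask the model for code, run that code."""
    pressed: List[str] = []
    for _ in range(steps):
        request = build_request(capture())
        run_generated_code(model(request), pressed)
    return pressed


def fake_model(request: dict) -> str:
    """Stand-in for a real model call; always runs right and jumps."""
    return "press('right')\npress('jump')"
```

Swapping `fake_model` for a real model call and `pressed.append` for an emulator input function would turn the stub into a working agent. How GamingAgent actually executes the generated code is not described in the article.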

Still, Hao says that the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models like OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.

One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while (seconds, usually) to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.
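To put that delay in game terms, here is a back-of-the-envelope sketch. It assumes the game runs at roughly 60 frames per second, and the latencies and hazard distance are made-up illustrative values, not measurements from Hao AI Lab. The function names are hypothetical.

```python
NES_FPS = 60  # approximate frame rate; an assumption for illustration


def frames_elapsed(latency_s: float, fps: int = NES_FPS) -> int:
    """Number of game frames that pass while the model is deciding."""
    return round(latency_s * fps)


def reacts_in_time(
    latency_s: float, frames_until_hazard: int, fps: int = NES_FPS
) -> bool:
    """True if the decision arrives before the hazard reaches Mario."""
    return frames_elapsed(latency_s, fps) <= frames_until_hazard


if __name__ == "__main__":
    # Hypothetical hazard half a second away (30 frames at 60 fps).
    for latency in (0.2, 2.0):
        print(latency, frames_elapsed(latency), reacts_in_time(latency, 30))
```

Under these assumptions, a model that answers in a fraction of a second can still react to a hazard half a second away, while one that deliberates for a couple of seconds is acting on a screenshot that is already far out of date.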

Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.

The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”

At least we can watch AI play Mario.


