Sunday, June 28, 2026

Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks, and how they’re reported by AI labs, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it gives a model 64 tries to answer each problem in a benchmark, then takes the answer generated most frequently for each problem as the final answer. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
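To make the scoring rule concrete, here is a minimal Python sketch of consensus scoring as described above, next to a score that counts only each problem's first attempt. The function names, the toy answers, and the tie-breaking rule (the earliest-seen answer wins a tie) are assumptions for illustration. The article does not describe the exact evaluation setup that xAI or OpenAI used.

```python
from collections import Counter


def consensus_answer(attempts):
    """Return the most frequent answer among the attempts at one problem."""
    # most_common(1) yields [(answer, count)]; on a tie, the answer
    # encountered first is returned (an assumption, not a documented rule
    # of any lab's evaluation).
    return Counter(attempts).most_common(1)[0][0]


def score_cons_at_k(attempts_per_problem, correct_answers, k=64):
    """Fraction of problems whose majority answer over k attempts is correct."""
    hits = sum(
        consensus_answer(tries[:k]) == truth
        for tries, truth in zip(attempts_per_problem, correct_answers)
    )
    return hits / len(correct_answers)


def score_first_attempt(attempts_per_problem, correct_answers):
    """Fraction of problems answered correctly on the first attempt only."""
    hits = sum(
        tries[0] == truth
        for tries, truth in zip(attempts_per_problem, correct_answers)
    )
    return hits / len(correct_answers)


# Hypothetical toy data: 3 problems, 5 attempts each (k=5 instead of 64).
toy_truth = [204, 25, 907]
toy_attempts = [
    [113, 204, 204, 204, 70],  # first try wrong, majority answer right
    [25, 25, 25, 16, 25],      # right either way
    [902, 902, 907, 902, 33],  # wrong either way
]

first = score_first_attempt(toy_attempts, toy_truth)
cons = score_cons_at_k(toy_attempts, toy_truth, k=5)
print(f"first attempt: {first:.2f}")  # first attempt: 0.33
print(f"cons@5:        {cons:.2f}")   # cons@5:        0.67
```

In this toy case the same set of attempts scores twice as high under consensus voting as it does on first attempts alone, which is why a chart that mixes the two modes can mislead.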

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and their strengths.
