Debates over AI benchmarks, and how they're reported by AI labs, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn't the case.
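To make the difference concrete, here is a minimal Python sketch of the two scoring styles. It is an illustration only, not any lab's actual evaluation code: the function names, the toy answers, and the use of four tries instead of 64 are all assumptions made for the example.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled tries."""
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems answered correctly on the first try only."""
    correct = sum(
        attempts[0] == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


def score_cons_at_k(attempts_per_problem, answer_key, k):
    """Fraction of problems where the majority answer over k tries is correct."""
    correct = sum(
        consensus_answer(attempts[:k]) == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: three problems, four tries each (a real run would use 64).
attempts = [
    ["204", "204", "17", "204"],  # first try right, majority right
    ["9", "81", "81", "81"],      # first try wrong, majority right
    ["1", "2", "2", "3"],         # first try wrong, majority wrong
]
answer_key = ["204", "81", "3"]

print(score_at_1(attempts, answer_key))          # 1 of 3 problems correct
print(score_cons_at_k(attempts, answer_key, 4))  # 2 of 3 problems correct
```

On this made-up data, the same model scores one out of three on its first try and two out of three under consensus voting, which is why putting one model's consensus score next to another model's single-try score can mislead.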
Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores for AIME 2025 at "@1," meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and openAI's TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.


