A Meta exec on Monday denied a rumor that the company trained its new AI models to present well on specific benchmarks while concealing the models’ weaknesses.
The executive, Ahmad Al-Dahle, VP of generative AI at Meta, said in a post on X that it’s “simply not true” that Meta trained its Llama 4 Maverick and Llama 4 Scout models on “test sets.” In AI benchmarks, test sets are collections of data used to evaluate the performance of a model after it’s been trained. Training on a test set could misleadingly inflate a model’s benchmark scores, making the model appear more capable than it actually is.
Over the weekend, an unsubstantiated rumor that Meta artificially boosted its new models’ benchmark results began circulating on X and Reddit. The rumor appears to have originated from a post on a Chinese social media site from a user claiming to have resigned from Meta in protest over the company’s benchmarking practices.
Reports that Maverick and Scout perform poorly on certain tasks fueled the rumor, as did Meta’s decision to use an experimental, unreleased version of Maverick to achieve better scores on the benchmark LM Arena. Researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.
Al-Dahle acknowledged that some users are seeing “mixed quality” from Maverick and Scout across the different cloud providers hosting the models.
“Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” Al-Dahle said. “We’ll keep working through our bug fixes and onboarding partners.”


