
Meta pushes back on Llama 4 benchmark cheating allegations


Last week, Meta released new versions of its Llama family of large language models (LLMs), introducing Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth as part of its advanced multimodal AI system.

Scout is designed to run on a single Nvidia H100 GPU and offers a context window of 10 million tokens. Maverick is larger than Scout and, according to Meta, matches the performance of OpenAI's GPT-4o and DeepSeek-V3 in coding and reasoning tasks while using fewer active parameters.

The largest of the three, Behemoth, boasts 288 billion active parameters and a total of 2 trillion parameters, with Meta claiming it surpasses models like GPT-4.5 and Claude 3.7 Sonnet on STEM benchmarks.

Shortly after the release, rumors began to spread that Meta had trained the Llama 4 models on benchmark test sets, inflating their rankings. The rumor reportedly originated with a supposed Meta whistleblower on a Chinese website, who announced their resignation in the following post (translated):

After repeated training, the performance of the internal model still fails to reach open-source SOTA levels, and is even far behind them. Company leadership suggested mixing various benchmark test sets into the post-training process, aiming to produce a result that "looks okay" across multiple metrics. If the deadline set for the end of April isn't met, they may stop further investment.

After Llama 4 was released yesterday, there were already many poor real-world performance results shared on X (Twitter) and Reddit. As someone who’s currently also active in academia, I find this practice unacceptable.

Therefore, I’ve submitted my resignation request and have explicitly asked that my name not be included in Llama 4’s Technical Report. I’ve also heard that Meta’s VP of AI resigned for the same reason.

The rumor quickly spread to X and Reddit, prompting a response from Ahmad Al-Dahle, VP of generative AI at Meta, who denied the allegations, stating that they were "simply not true" and that Meta "would never do that."

The rumor sounded believable in no small part because of multiple reports on X that the version of Maverick publicly available to developers behaved differently from the version Meta showcased on LMArena.

Also, Meta itself acknowledged that the Maverick hosted on LMArena was an "experimental chat version":

Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.
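
For context, LMArena ranks models with an Elo-style rating computed from pairwise human preference votes, so a score like 1417 only carries meaning relative to other models' ratings. As a rough illustration, here is a minimal sketch of the standard Elo expected-score formula; this is not LMArena's actual methodology, and the 1380 comparison rating below is hypothetical:

```python
# Minimal sketch of the standard Elo expected-score formula, for intuition only;
# LMArena's leaderboard is computed differently (a statistical fit over all votes).

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given their Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical example: a model rated 1417 vs. a rival rated 1380.
print(f"{expected_win_rate(1417, 1380):.1%}")  # ~55.3% expected win rate
```

In other words, even a rating gap of several dozen points translates to only a modest expected head-to-head advantage, which is why which exact model variant is being rated matters so much.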

Al-Dahle also offered an explanation for the "mixed quality" reported across different services, stating that because the models were released as soon as they were ready, it would take several days for all the public implementations to get "dialed in."
