
Last week, Meta released new versions of its Llama large language model (LLM), introducing Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth as part of its advanced multimodal AI system.
Scout is designed to run on a single Nvidia H100 GPU and offers a context window of 10 million tokens. Maverick is larger than Scout and, according to Meta, matches the performance of OpenAI's GPT-4o and DeepSeek-V3 on coding and reasoning tasks while using fewer active parameters.
The largest of the three, Behemoth, boasts 288 billion active parameters and a total of 2 trillion parameters, with Meta claiming that it surpasses models like GPT-4.5 and Claude 3.7 Sonnet on STEM benchmarks.
Shortly after the release, rumors began to spread that Meta had trained the Llama 4 models, including Maverick, on benchmark test sets, causing them to rank higher in benchmarks than they otherwise would. The rumor was reportedly started by a supposed Meta whistleblower on a Chinese website, who resigned after making the following post (translated):
After repeated training, the performance of the internal model still fails to reach open-source SOTA levels, and is even far behind them. Company leadership suggested mixing various benchmark test sets into the post-training process, aiming to produce a result that “looks okay” across multiple metrics. If the set deadline at the end of April isn’t met, they may stop further investment.
After Llama 4 was released yesterday, there were already many poor real-world performance results shared on X (Twitter) and Reddit. As someone who’s currently also active in academia, I find this practice unacceptable.
Therefore, I’ve submitted my resignation request and have explicitly asked that my name not be included in Llama 4’s Technical Report. I’ve also heard that Meta’s VP of AI resigned for the same reason.
The rumor quickly spread across X and Reddit, prompting a response from Ahmad Al-Dahle, VP of generative AI at Meta, who denied the allegations, stating that they were "simply not true" and that Meta "would never do that."
The rumor sounded believable in no small part because of multiple reports on X that the version of Maverick publicly available to developers behaved differently from the version Meta showcased on LMArena.
The Llama 4 model that won in LM Arena is different than the released version. I have been comparing the answers from Arena to the released model. They aren't close. The data is worth a look also as it shows how LM Arena results can be manipulated to be more pleasing to humans.
— Ethan Mollick (@emollick) April 8, 2025
Also, Meta itself acknowledged that the Maverick hosted on LMArena was an "experimental chat version":
Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.
Al-Dahle also offered an explanation for the "mixed quality" that has been reported across different services, stating that since the models dropped as soon as they were ready, it would take several days for all the public implementations to get "dialed in."