Recently, Meta released Llama 4, a new family of large language models consisting of Scout, Maverick, and Behemoth. On the LMArena leaderboard, Llama 4 Maverick (Llama-4-Maverick-03-26-Experimental) placed 2nd, beating models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash and trailing only Gemini 2.5 Pro.
But pretty soon, the cracks began to show as users noticed differences in behavior between the Maverick version used in the benchmark and the one available to the public. This led to accusations that Meta was cheating, prompting a response from a Meta executive on X:
We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models. That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were…
— Ahmad Al-Dahle (@Ahmad_Al_Dahle) April 7, 2025
LMArena acknowledged that Meta failed to abide by its policies, apologized to the public, and issued a policy update.
We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)
Early…
— lmarena.ai (formerly lmsys.org) (@lmarena_ai) April 8, 2025
Now, the unmodified release version of the model (Llama-4-Maverick-17B-128E-Instruct) has been added to LMArena, and it ranks 32nd. For the record, older models like Claude 3.5 Sonnet, released last June, and Gemini-1.5-Pro-002, released last September, rank higher.

In a statement to TechCrunch, a Meta spokesperson said that Llama-4-Maverick-03-26-Experimental was specially tuned for chat and performed well on the LMArena benchmark, adding that the company is "excited" to see what developers will build now that an open-source version of Llama 4 has been released.