Move over DeepSeek: Alibaba's Qwen2.5-Max surpasses DeepSeek-V3 in benchmarks

The news headlines for the last week have been dominated by DeepSeek thanks to the launch of its new reasoning model, R1. DeepSeek's main non-reasoning model, DeepSeek-V3, arrived in December with impressive benchmark scores of its own. Now, Chinese firm Alibaba has released Qwen2.5-Max, which surpasses DeepSeek-V3 and, in some tests, GPT-4o-0806 and Claude-3.5-Sonnet-1022.

Similar to DeepSeek, Qwen2.5-Max is touchy about Chinese political issues and doesn't answer questions about them. On Qwen Chat, it just says you've exceeded your quota limit when you try those queries, but it answers fine once you change the topic.

The benchmarks Alibaba used to test its model against the competition included MMLU-Pro, which tests knowledge through college-level problems; LiveCodeBench, which assesses coding capabilities; LiveBench, which tests general capabilities across a broad range of tasks; and Arena-Hard, which approximates human preferences.

In Arena-Hard, Qwen2.5-Max came first with a score of 89.4; its closest competitor was DeepSeek-V3 at 85.5. In MMLU-Pro, Claude Sonnet won with a score of 78.0 compared to Qwen2.5-Max's 76.1. Qwen2.5-Max also came second to Claude Sonnet on the GPQA-Diamond benchmark, scoring 60.1 against Claude's 65.0.

In LiveCodeBench, Qwen2.5-Max scored 38.7 compared to Claude's 38.9. Finally, in LiveBench, Qwen2.5-Max won with a score of 62.2 compared to DeepSeek-V3's 60.5.
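To make the margins easier to see, the head-to-head scores quoted above can be laid out in a few lines of Python. This is only an illustrative sketch: it uses the figures reported in this article and nothing else, and the names `SCORES`, `margins` and `wins` are made up for the example rather than taken from Alibaba's materials.

```python
# Scores exactly as quoted in this article:
# benchmark -> (Qwen2.5-Max score, rival named in the article, rival's score)
SCORES = {
    "Arena-Hard": (89.4, "DeepSeek-V3", 85.5),
    "MMLU-Pro": (76.1, "Claude-3.5-Sonnet", 78.0),
    "GPQA-Diamond": (60.1, "Claude-3.5-Sonnet", 65.0),
    "LiveCodeBench": (38.7, "Claude-3.5-Sonnet", 38.9),
    "LiveBench": (62.2, "DeepSeek-V3", 60.5),
}


def margins(scores):
    """Return Qwen2.5-Max's score minus the rival's, rounded to one decimal."""
    return {
        name: round(qwen - rival, 1)
        for name, (qwen, _rival_name, rival) in scores.items()
    }


def wins(scores):
    """Return the benchmarks where Qwen2.5-Max is ahead of the quoted rival."""
    return [name for name, margin in margins(scores).items() if margin > 0]


if __name__ == "__main__":
    for name, margin in margins(SCORES).items():
        rival_name = SCORES[name][1]
        print(f"{name}: {margin:+.1f} points vs {rival_name}")
    print("Ahead on:", ", ".join(wins(SCORES)))
```

Laid out this way, the two wins are by 3.9 and 1.7 points, while the LiveCodeBench loss comes down to just 0.2 points.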

The firm also ran some other benchmarks, but it couldn't include some models, such as GPT-4o and Claude, in those tests due to their closed nature.

The new Qwen2.5-Max is available via an API for developers who want to integrate it into their platforms, while end users can access it through Qwen Chat. The latter option lets you use Artifacts and generate images or video. There is also a button to enable web search, but for now it is marked as coming soon.

There's no doubt that researchers from US tech firms will be adding the recent Qwen2.5 research paper to their reading lists to figure out how they can further optimize their own models.