Move over DeepSeek: Alibaba's Qwen2.5-Max surpasses DeepSeek-V3 in benchmarks

The news headlines for the last week have been dominated by DeepSeek thanks to the launch of its new reasoning model, R1, which improves responses to queries. DeepSeek's main non-reasoning model, DeepSeek-V3, arrived in December with impressive benchmark scores of its own, but now Chinese firm Alibaba has released Qwen2.5-Max, which surpasses DeepSeek-V3 and, in some tests, GPT-4o-0806 and Claude-3.5-Sonnet-1022.

Similar to DeepSeek, Qwen2.5-Max is touchy about Chinese political issues and doesn't answer those questions at all. On Qwen Chat, it just says you've exceeded your quota limit when you try those queries, but it answers fine when you change the topic.

The benchmarks that Alibaba used to test its model against the competition included MMLU-Pro, which tests knowledge through college-level problems; LiveCodeBench, which assesses coding capabilities; LiveBench, which comprehensively tests general capabilities; and Arena-Hard, which approximates human preferences.

In Arena-Hard, Qwen2.5-Max came first with a score of 89.4; its closest competitor was DeepSeek-V3 at 85.5. In MMLU-Pro, Claude Sonnet won with a score of 78.0 compared to Qwen2.5-Max's 76.1. Qwen2.5-Max also came in second place to Claude Sonnet on the GPQA-Diamond benchmark, with a score of 60.1 compared to Claude's 65.0.

In LiveCodeBench, it scored 38.7 compared to Claude's 38.9. Finally, in LiveBench, Qwen2.5-Max won with a score of 62.2 compared to DeepSeek-V3's 60.5.
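To put those margins side by side, here is a minimal Python sketch that tabulates only the scores quoted in this article and prints how far Qwen2.5-Max is ahead of or behind the other model named for each benchmark. The `SCORES` table and the `qwen_margin` helper are illustrative names made up for this example, not anything published by Alibaba.

```python
# Scores exactly as quoted in this article. Only the head-to-head pairs
# the article reports are listed; this is not Alibaba's full results table.
SCORES = {
    "Arena-Hard": {"Qwen2.5-Max": 89.4, "DeepSeek-V3": 85.5},
    "MMLU-Pro": {"Qwen2.5-Max": 76.1, "Claude-3.5-Sonnet": 78.0},
    "GPQA-Diamond": {"Qwen2.5-Max": 60.1, "Claude-3.5-Sonnet": 65.0},
    "LiveCodeBench": {"Qwen2.5-Max": 38.7, "Claude-3.5-Sonnet": 38.9},
    "LiveBench": {"Qwen2.5-Max": 62.2, "DeepSeek-V3": 60.5},
}


def qwen_margin(benchmark):
    """Return Qwen2.5-Max's score minus the other quoted model's score."""
    scores = SCORES[benchmark]
    qwen = scores["Qwen2.5-Max"]
    # Each entry holds exactly one other model, so take the first non-Qwen score.
    other = next(v for k, v in scores.items() if k != "Qwen2.5-Max")
    # Round to one decimal place to hide floating-point noise.
    return round(qwen - other, 1)


for name in SCORES:
    # A positive margin means Qwen2.5-Max is ahead; negative means it trails.
    print(f"{name}: {qwen_margin(name):+.1f}")
```

Seen this way, the biggest lead is on Arena-Hard and the biggest gap is on GPQA-Diamond, while the LiveCodeBench difference is only a fraction of a point.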

Alibaba also ran a further set of benchmarks, but it couldn't include some models, such as GPT-4o and Claude, due to their closed nature.

The new Qwen2.5-Max is available via an API for developers to integrate into their platforms, and end users can access it via Qwen Chat. The latter option lets you use Artifacts and generate images or video. There is also a button to enable web search, but it is marked as coming soon.

There's no doubt that researchers from US tech firms will be adding the recent Qwen2.5 research paper to their reading lists to figure out how they can further optimize their own models.
