
Microsoft: Our Phi-4-reasoning takes on larger models, matches DeepSeek-R1 performance


Today, Microsoft announced Phi-4-reasoning, a 14B-parameter small reasoning model that is said to deliver strong performance on complex reasoning tasks. Microsoft trained the new model via supervised fine-tuning of Phi-4 on a curated set of "teachable" prompts, with reasoning demonstrations generated using o3-mini. Microsoft also introduced Phi-4-reasoning-plus, a variant of Phi-4-reasoning that is further trained with reinforcement learning and delivers even better performance by generating longer reasoning traces.
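To make the training recipe concrete, here is a minimal, generic supervised fine-tuning sketch using Hugging Face's TRL library. This is not Microsoft's actual pipeline: the dataset contents are hypothetical stand-ins for the curated prompts and teacher-generated reasoning traces, and the base-model hub ID should be verified against the model card.

```python
# Generic SFT sketch with Hugging Face TRL -- illustrative, not Microsoft's pipeline.
# The prompt/completion pairs below are hypothetical stand-ins for the curated
# "teachable" prompts and o3-mini-generated reasoning traces the article describes.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

train_data = Dataset.from_list([
    {
        "prompt": "A train leaves at 9:00 travelling 60 km/h...",              # curated prompt
        "completion": "<think>step-by-step reasoning...</think> Answer: ...",  # teacher trace
    },
])

trainer = SFTTrainer(
    model="microsoft/phi-4",              # assumed base-model hub ID; verify on the model card
    train_dataset=train_data,
    args=SFTConfig(output_dir="phi4-reasoning-sft"),
)
trainer.train()
```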

According to Microsoft's whitepaper, the new Phi-4-reasoning models outperform several larger open-weight models, such as DeepSeek-R1-Distill-Llama-70B, and even match the performance of the full DeepSeek-R1 model on certain benchmarks. They are also said to outperform Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.0 Flash Thinking on all tasks except GPQA and Calendar Planning.


The impressive claimed performance of Phi-4-reasoning suggests that careful data curation for supervised fine-tuning (SFT) is effective for reasoning language models, and that performance may be further improved using reinforcement learning (RL).

Phi-4-reasoning has several limitations as well. First, like the base Phi-4 model, it primarily supports English text. Second, its coding ability is mainly trained on Python using common packages. Third, its context length is limited to 32k tokens. Additional limitations are detailed in the whitepaper.

Introducing Phi-4-reasoning, adding reasoning models to the Phi family of SLMs.

The model is trained with both supervised finetuning (using a carefully curated dataset of reasoning demonstration) and Reinforcement Learning.

📌Competitive results on reasoning benchmarks with… pic.twitter.com/p2FkjD4qfu

— Ahmed Awadallah (@AhmedHAwadallah) May 1, 2025

Microsoft stated that these new Phi-4-reasoning models are designed to accelerate research on language models. They are expected to be useful for developing AI applications in memory- or compute-constrained environments, latency-bound scenarios, and reasoning-intensive tasks.

Interested developers can check out these new models at Hugging Face and Azure AI Foundry.
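For a quick local test, the sketch below loads the model with Hugging Face transformers. The hub ID microsoft/Phi-4-reasoning is an assumption based on Microsoft's naming; verify the exact ID on the model card before running.

```python
# Minimal inference sketch using Hugging Face transformers.
# The model ID is assumed from Microsoft's naming; verify it on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumption: the actual hub ID may differ
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Reasoning models emit a long chain of thought before the final answer,
# so allow a generous max_new_tokens within the 32k-token context window.
messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```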
