Apple's AI models still trail OpenAI's GPT-4o despite latest update

At WWDC 2025, Apple unveiled its latest AI advancements, including a new Foundation Models framework for developers.

At WWDC 2025, Apple announced several updates related to Apple Intelligence for both developers and consumers. With the new Foundation Models framework, developers can now add AI features to their apps that work offline, preserve user privacy, and are available free of charge. The framework is built on Apple’s own in-house AI models.

Apple also unveiled a new generation of language foundation models. According to Apple, these updated models are faster, more efficient, and offer improved tool use, better reasoning capabilities, multimodal support for image and text inputs, and support for 15 languages.

Apple Intelligence includes two foundation models:

A 3-billion-parameter model that runs on-device using Apple Silicon.
A server-based mixture-of-experts model optimized for Private Cloud Compute.

Apple noted that the on-device 3B language model is not designed to be a general-purpose chatbot. Instead, it is intended to perform text-related tasks such as summarization, entity extraction, text understanding, refinement, short dialogues, and creative content generation, among others.

The big question is how well Apple’s models perform compared to other leading models on the market. Rather than using standard AI benchmarks, Apple shared results from its own internal evaluations of fundamental language and reasoning capabilities.

According to Apple’s text-based evaluations, its on-device 3B model performs favorably against the slightly larger Qwen-2.5-3B and competitively against the larger Qwen-3-4B and Gemma-3-4B in English. Its server-based model performs slightly better than Llama-4-Scout but falls short compared to Qwen-3-235B and OpenAI’s proprietary GPT-4o.

In evaluations involving image input, Apple’s on-device model outperforms InternVL and Qwen and performs competitively against Gemma. Apple’s server model beats Qwen-2.5-VL but trails Llama-4-Scout and GPT-4o.

These results highlight how far Apple still has to go in foundational AI capabilities. It seems Apple chose GPT-4o as the point of comparison to make its models' performance appear relatively decent. Had Apple compared its results against OpenAI’s latest o-series models or Google’s Gemini 2.5 Pro, the gap would likely appear much wider. It will be interesting to see how Apple navigates the AI era with its in-house capabilities in the years ahead.