Back in October 2024, OpenAI announced the Realtime API, enabling developers to build low-latency, multimodal experiences in their apps. Since then, thousands of developers have used the Realtime API to build natural speech-to-speech experiences in their apps and services.
Today, OpenAI announced gpt-realtime, its most advanced speech-to-speech model that is better at following complex instructions, calling tools with a lower error rate, and generating speech that is more natural and expressive. OpenAI claims that this new model is better at interpreting system messages and developer prompts.
When the Realtime API was launched last year, it came with 6 different voices, and later two more were added. Today, OpenAI is announcing two new voices: Marin and Cedar. Along with the new voices, the existing 6 voices have also been updated to make them sound more natural.
OpenAI mentioned that this new gpt-realtime model can better understand audio with greater accuracy, and it performs better on benchmarks as well:
- Big Bench Audio: gpt-realtime scored 82.8% accuracy, beating the previous realtime model from December 2024, which scored 65.6%.
- MultiChallenge audio benchmark: gpt-realtime scored 30.5%, a significant improvement over the previous model from December 2024, which scored 20.6%.
- ComplexFuncBench audio eval: gpt-realtime scored 66.5%, while the previous model from December 2024 scored 49.7%.
In addition to the new model and voices, OpenAI also announced several updates to the API. The Realtime API now supports remote MCP servers, image inputs, and phone calling through Session Initiation Protocol (SIP). Finally, developers can now save and reuse prompts.
Despite these improvements, OpenAI has reduced the price of the Realtime API. The new gpt-realtime API is 20% cheaper when compared to gpt-4o-realtime-preview at $32 / 1M audio input tokens ($0.40 for cached input tokens) and $64 / 1M audio output tokens.
With these meaningful performance improvements and a surprising price drop, OpenAI is positioning gpt-realtime as a compelling choice for developers building the next generation of voice-first experiences.