Google announces Gemini 3.1 Flash-Lite model for high-volume developer workloads

Google has announced Gemini 3.1 Flash-Lite, the fastest and most cost-efficient model in its Gemini 3 series. The model targets high-volume developer workloads and is available now in preview: developers can reach it through the Gemini API in Google AI Studio, and enterprises through Vertex AI.

⚡ Excited to announce Gemini 3.1 Flash-Lite! We’ve set a new standard for efficiency and capability to give developers our fastest, most cost-effective Gemini 3 model yet.

We engineered this model with thinking levels, allowing it to handle high-volume queries instantly, while… pic.twitter.com/h6kRl8LkJq
— Jeff Dean (@JeffDean) March 3, 2026

Gemini 3.1 Flash-Lite is priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens. That pricing undercuts larger models, which matters to developers and enterprises that make a large number of API calls on a limited budget.
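To get a feel for what those rates mean at volume, here is a rough back-of-the-envelope sketch. The two prices come from the announcement; the workload (one million requests at 500 input and 100 output tokens each) and the function name `estimate_cost` are invented for illustration, and real bills may differ depending on terms not covered here.

```python
from decimal import Decimal

# Preview prices quoted in the announcement, in USD per 1M tokens.
INPUT_PRICE_PER_M = Decimal("0.25")
OUTPUT_PRICE_PER_M = Decimal("1.50")


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost for the given token counts."""
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) / million * INPUT_PRICE_PER_M
    output_cost = Decimal(output_tokens) / million * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical workload: 1M requests, 500 input and 100 output tokens each.
requests = 1_000_000
total = estimate_cost(requests * 500, requests * 100)
print(f"${total:.2f}")  # prints "$275.00"
```

Under those assumed numbers, input tokens account for $125 and output tokens for $150, so output length drives the bill even though far fewer output tokens are generated.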

Google says 3.1 Flash-Lite outpaces Gemini 2.5 Flash, with a 2.5× faster Time to First Token and 45% higher output speed, based on Artificial Analysis benchmarking. On the strength of those gains, Google claims the model is well suited to real-time, high-frequency experiences.

The Gemini team wrote the following regarding the Gemini 3.1 Flash-Lite model launch:

3.1 Flash-Lite can tackle tasks at scale, like high-volume translation and content moderation, where cost is a priority. And it can also handle more complex workloads where more in-depth reasoning is needed, like generating user interfaces and dashboards, creating simulations, or following instructions.

On the Arena.ai leaderboard, Google claims Flash-Lite reaches an Elo score of 1432. It also posts 86.9% on GPQA Diamond and 76.8% on MMMU Pro, beating even the larger Gemini 2.5 Flash model and several smaller models from other AI labs, including OpenAI and Anthropic.

A notable addition for developers is built-in thinking levels in Google AI Studio and Vertex AI, which let them choose how much reasoning the model applies to each task. That control helps when balancing cost, latency, and answer depth in production workflows.
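One way a team might use this in practice is to route each kind of task to a reasoning depth before the request is sent. The sketch below is purely illustrative and does not call the Gemini API: the level names (`low`, `medium`, `high`), the task labels, and the `plan_request` helper are all assumptions, since the announcement does not list the levels or parameter names the API accepts.

```python
from dataclasses import dataclass

# Hypothetical level names; the announcement does not enumerate them.
LEVELS = ("low", "medium", "high")

# Hypothetical routing table, loosely based on the workloads Google mentions:
# cost-sensitive bulk tasks get less reasoning, complex generation gets more.
TASK_LEVELS = {
    "translation": "low",
    "content_moderation": "low",
    "ui_generation": "high",
    "simulation": "high",
}


@dataclass(frozen=True)
class RequestPlan:
    task: str
    thinking_level: str


def plan_request(task: str, default: str = "medium") -> RequestPlan:
    """Pick a thinking level for a task, falling back to a default."""
    level = TASK_LEVELS.get(task, default)
    if level not in LEVELS:
        raise ValueError(f"unknown thinking level: {level!r}")
    return RequestPlan(task=task, thinking_level=level)


for task in ("translation", "ui_generation", "summarization"):
    print(plan_request(task))
```

Keeping the routing in one table makes the cost and latency trade-off explicit and easy to tune; the chosen level would then be passed to whatever request parameter the actual API exposes.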
