The voice agent stack everyone built last year suddenly looks expensive

OpenAI's Realtime reset and what it means for the rest of the stack

OpenAI just put GPT-5-class reasoning inside the audio loop, took the Realtime API to GA, and priced translation at 3.4 cents a minute. The pipeline everyone stitched together last year suddenly looks expensive.

If you've been quietly building a voice agent over the last 18 months, this week is the one where you stop and re-read your architecture diagram. On May 7, 2026, OpenAI shipped three new voice models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — and pulled the Realtime API out of beta into general availability. That last bit is the one procurement teams care about. The rest changes how these systems get built.

The short version: voice agents are graduating from stitched-together pipelines to single-model, full-duplex, tool-using systems. The implication is broader than a model upgrade. It changes who you buy from, how you architect, and how you cost a deployment.

What actually shipped

GPT-Realtime-2 is the headline. It's a single speech-to-speech model where reasoning happens inside the audio loop rather than between separate transcription, LLM, and synthesis steps. The spec sheet matters: a 128K context window (up from 32K), max 32K output tokens, and a five-level adjustable reasoning effort dial — minimal, low, medium, high, xhigh — with low set as the latency-friendly default.

A handful of features that used to live in your prompt scaffolding are now first-class API surface. Preambles let the model say "let me check that" while a tool call runs, killing the dead-air problem. Parallel tool calls can be narrated audibly ("checking your calendar, looking that up now") so the user knows something is happening. Recovery behaviour is more graceful — instead of going silent or hallucinating, the model now says something like "I'm having trouble with that right now." There's tone control for calm, empathetic, or upbeat delivery, better handling of healthcare vocabulary and proper nouns, and two new voices, Cedar and Marin, available only through the Realtime API.
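
For a concrete sense of that surface, here's a minimal session-setup sketch. It assumes the GA API keeps the WebSocket session.update shape of earlier Realtime releases; the model id, voice name, and reasoning-effort field are taken from the launch description above, so treat the exact keys as assumptions rather than documented names.

```python
# Minimal session setup sketch. The event shape follows earlier
# Realtime API releases; the model id, voice name, and reasoning
# field come from the launch description and are assumptions,
# not documented parameters. Requires websockets >= 13.
import asyncio
import json
import os

import websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed id


async def configure_session() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "marin",                # one of the two Realtime-only voices
                "reasoning": {"effort": "low"},  # the latency-friendly default
                "instructions": (
                    "Before any tool call, say a short preamble like "
                    "'let me check that'. Narrate parallel tool calls "
                    "('checking your calendar, looking that up now'). If "
                    "audio is unclear or a tool fails, say so plainly "
                    "instead of guessing. Keep a calm, empathetic tone."
                ),
            },
        }))
        print(json.loads(await ws.recv())["type"])  # expect a session.* ack


asyncio.run(configure_session())
```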

The two companion models specialise where Realtime-2 generalises. GPT-Realtime-Translate covers 70+ input languages and 13 output languages, and is designed to keep up with speakers who switch languages mid-conversation and vary in regional pronunciation. GPT-Realtime-Whisper is a streaming, low-latency successor to the original Whisper, with developer-controllable latency settings — lower delay for earlier partial text, higher delay for cleaner transcripts.
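
If that latency dial surfaces as a session parameter (a guess; no parameter names shipped with the summary above), tuning the trade-off might look something like the payloads below. They reuse the transcription-session shape from existing Realtime releases; the model id and the latency field are assumptions.

```python
# Transcription session payloads only; reuse the WebSocket setup from
# the sketch above. "gpt-realtime-whisper" and the "latency" knob are
# assumptions based on the launch description, not documented names.
low_delay = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-realtime-whisper",  # assumed model id
            "latency": "low",   # hypothetical: earlier partial text, rougher finals
        },
    },
}

clean_finals = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-realtime-whisper",
            "latency": "high",  # hypothetical: later, cleaner transcripts
        },
    },
}
```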

The numbers, and the price tag

The benchmarks confirm a real generational jump. On Big Bench Audio, Realtime-2 (high) hits 96.6% versus 81.4% for Realtime-1.5 — a 15.2-point lift that pushes the benchmark close to saturation. On Audio MultiChallenge, which measures multi-turn instruction-following in realistic spoken dialogue, the xhigh variant scores 48.5% versus 34.7%. Real progress, and a reminder that production voice agents are still nowhere near solved. Independent results back this up: Scale AI puts Realtime-2 at #1 on its Audio MultiChallenge S2S leaderboard, and Artificial Analysis reports a time-to-first-audio of 1.12 seconds at minimal reasoning, climbing to 2.33 seconds at high.

Pricing is where this gets strategic. Realtime-2 itself comes in at parity with the previous model — $32 per million audio input tokens ($0.40 per million for cached input), $64 per million output tokens. The intelligence upgrade lands without a price hike. The aggressive moves are in the companion models: Translate at $0.034 per minute, Whisper at $0.017 per minute. Those numbers undercut most existing per-minute enterprise translation pipelines and put real pressure on standalone STT vendors.
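
A quick back-of-envelope on what that means per minute. The per-token prices are from the announcement; the tokens-per-minute rates are assumptions carried over from earlier Realtime models (roughly 10 audio input tokens and 20 output tokens per second), so check them against your own usage logs.

```python
# Back-of-envelope per-minute cost for a Realtime-2 call. Prices are
# from the announcement; the tokens-per-minute rates are assumptions
# carried over from earlier Realtime models (~10 audio input tokens
# and ~20 output tokens per second). Check against your usage logs.
INPUT = 32.00 / 1_000_000    # $ per fresh audio input token
CACHED = 0.40 / 1_000_000    # $ per cached input token
OUTPUT = 64.00 / 1_000_000   # $ per audio output token

IN_TOK_PER_MIN = 600         # assumed
OUT_TOK_PER_MIN = 1_200      # assumed


def per_minute(cached_fraction: float = 0.0) -> float:
    """Blended $/min; cached_fraction is the share of input served from cache."""
    fresh = IN_TOK_PER_MIN * (1 - cached_fraction) * INPUT
    cached = IN_TOK_PER_MIN * cached_fraction * CACHED
    return fresh + cached + OUT_TOK_PER_MIN * OUTPUT


print(f"no caching: ${per_minute(0.0):.4f}/min")   # ~$0.0960
print(f"80% cached: ${per_minute(0.8):.4f}/min")   # ~$0.0808; output audio dominates
```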

Where it lands first

OpenAI's launch list reads like a market map. Zillow, Glean, Genspark, Bluejay, Intercom, Priceline, and Foundation Health on the realtime side. BolnaAI, Vimeo, and Deutsche Telekom on translation. The published metrics are unusually concrete: Zillow reports a 26-point lift in call success rate on its hardest adversarial benchmark, going from 69% to 95% after prompt optimisation, plus better Fair Housing compliance behaviour — a regulated-industry blocker that has held up many voice deployments. Glean measured a 42.9% relative increase in helpfulness for real-time voice interactions. Genspark's "Call for Me" agent saw a +26% effective conversation rate with fewer dropped calls. BolnaAI logged 12.5% lower word error rates across Hindi, Tamil, and Telugu using Translate compared to alternatives.

These numbers come from internal evaluations with prompt optimisation, not third-party benchmarks. Treat them as directional, not gospel. But they line up with three voice patterns that are clearly emerging: voice-to-action (Zillow), systems-to-voice (Priceline narrating a delay and re-route), and voice-to-voice (Deutsche Telekom doing live translation across customer support).

The stack just collapsed

For 18 months, every serious production voice agent has been a stitched-together pipeline. Whisper or Deepgram for ASR. GPT or Claude for reasoning. ElevenLabs or Cartesia for TTS. Bespoke turn-taking, barge-in, and recovery logic in between. Latency budgets, interruption semantics, and tool-call observability all had to be hand-built in the seams.

Realtime-2 compresses that pipeline into one inference loop. The Realtime API's GA status, with EU data residency and enterprise privacy commitments, removes the last "still in beta" objection from procurement reviews.

The bigger operational implication: voice apps now need to be architected as stateful real-time systems, not prompt-response endpoints. OpenAI's accompanying voice prompting guide pushes developers toward reasoning-effort tuning, preamble design, tool-call UX, unclear-audio recovery, and long-session state management. Voice-agent quality from here is a harness problem more than a model-selection problem.
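
Concretely, "stateful real-time system" means one long-lived connection, an event-dispatch loop, and session state the harness owns. A skeleton of that loop, with event names drawn from recent Realtime API releases (they have shifted between versions, so verify against the current reference):

```python
# Skeleton of the event loop at the heart of a stateful voice agent:
# one long-lived socket, dispatch on server events, state owned by the
# harness. Event names follow recent Realtime API releases and have
# shifted between versions -- verify against the current reference.
import json


class VoiceSession:
    def __init__(self) -> None:
        self.turns = 0          # long-session state lives here, not in the prompt
        self.tool_failures = 0  # feeds recovery and escalation decisions

    async def run(self, ws) -> None:
        async for raw in ws:
            event = json.loads(raw)
            kind = event["type"]
            if kind == "input_audio_buffer.speech_started":
                self.turns += 1  # user started (or barged into) a turn
            elif kind == "response.function_call_arguments.done":
                # Tool-call UX: the model should already be speaking its
                # preamble while this runs. The tool name arrives on the
                # matching response.output_item event in most versions.
                await self.execute_tool(ws, event["call_id"], event["arguments"])
            elif kind == "response.done":
                pass             # log latency, token usage, reasoning effort
            elif kind == "error":
                self.tool_failures += 1
                print("server error:", event.get("error"))

    async def execute_tool(self, ws, call_id: str, arguments: str) -> None:
        ...  # run the tool, then send the output back as a conversation item
```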

Who just got squeezed

ElevenLabs is the most-funded pure-play voice company in the market, with a $500M Series D at an $11B valuation and roughly $330M ARR entering 2026. Its Agents pricing ($0.08–$0.12 per minute depending on tier) is now visibly above OpenAI's bundled translation and transcription rates, and the Cedar/Marin voices materially close the naturalness gap. Deepgram and AssemblyAI compete on the cascaded-pipeline thesis — that dedicated STT models still beat multimodal approaches on entity capture for things like phone numbers, addresses, and medical codes. That argument still holds for high-stakes regulated workflows. It just got narrower.

Google Gemini Live is the closest peer. Artificial Analysis flags Gemini 3.1 Flash Live Preview High at the same 96.6% Big Bench Audio score. Google's edge is broader language coverage and Workspace and Search distribution. OpenAI's edge is reasoning depth and tool ergonomics. Anthropic remains the conspicuous absence — Claude has a mobile voice mode and a push-to-talk option in Claude Code, but no realtime, full-duplex API for builders. For production voice at scale, Anthropic isn't currently in the conversation.

The European footnote that matters

If you're an EU-based buyer reading this, there's a wrinkle worth knowing about. Microsoft resells gpt-realtime through Azure OpenAI Service, which is the natural procurement path for many enterprises with existing EAs. But European regional availability for the Realtime models on Azure has been narrow — in practice, Sweden Central has been the primary EU region for realtime model deployments, with most other European Azure regions (West Europe, Germany West Central, France Central, UK South) not supporting realtime models for in-region data residency.

That single-region reality bit hard on January 27, 2026, when a Sweden Central outage took realtime and voice endpoints offline for several hours, and customers with EU residency requirements had no in-region failover option. Microsoft hasn't published an ETA for expanding realtime model availability to other EU regions. If you're architecting for both GDPR data residency and high availability on Azure, that's a real constraint to design around — multi-region failover within the EU is not yet a thing for these models. Worth checking the live region matrix before you commit.

What to actually do about this

A few practical moves for the next few weeks.

For builders, pilot Realtime-2 on the hardest 10% of your call traffic first — that's where the Zillow-style 26-point lifts show up; the easy 90% was already solved. Set reasoning.effort to low by default and only escalate per-turn when complexity warrants it; time-to-first-audio more than doubles between minimal and high. Use cached input aggressively — at $0.40 versus $32 per million tokens, prompt caching is the single biggest cost lever in production. And treat preambles, parallel-tool narration, and recovery phrases as core UX, not nice-to-haves.
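
One way to wire up that per-turn escalation, as a heuristic sketch: the effort levels come from the spec sheet, but whether effort can be overridden per response rather than per session is an assumption to verify against the GA docs, and the trigger words are purely illustrative.

```python
# Heuristic per-turn reasoning-effort escalation. The effort levels
# come from the spec sheet; whether effort can be overridden per
# response (rather than via session.update) is an assumption to
# verify, and the trigger words are purely illustrative.
import json

HARD_SIGNALS = ("refund", "cancel", "dispute", "medical", "legal")


def pick_effort(transcript: str, failed_attempts: int) -> str:
    if failed_attempts >= 2:
        return "high"    # stuck: spend latency on getting it right
    if any(word in transcript.lower() for word in HARD_SIGNALS):
        return "medium"  # complex or high-stakes intent
    return "low"         # latency-friendly default


async def respond(ws, transcript: str, failed_attempts: int = 0) -> None:
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"reasoning": {"effort": pick_effort(transcript, failed_attempts)}},
    }))
```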

For decision-makers, re-bid voice contracts. If you signed an enterprise translation or transcription deal before May 7, the ground has moved. Translate at $0.034/min and Whisper at $0.017/min reset baseline expectations. Don't single-vendor by default either — ElevenLabs voice quality, Deepgram and AssemblyAI entity-capture accuracy, and Anthropic-grade reasoning still have differentiated places in the stack. Budget 30–50% of your total deployment cost for harness work: escalation, brand-voice tuning, evaluation, analytics. The model is now the easy bit.

And keep an eye on ChatGPT consumer rollout. OpenAI explicitly noted these capabilities haven't yet reached ChatGPT Voice. When they do, end-user expectations of every voice product on the market reset upward overnight.

The honest caveats

A couple of reality checks before you re-architect anything. Most of the headline benchmarks were run at high or xhigh reasoning, but production traffic mostly runs at low — the user-facing experience won't always feel like the marketing numbers. Several of the customer-reported gains come from internal evaluations, not independent third-party benchmarks. And none of this removes the work around guardrails, escalation, and observability. Voice-specific failure modes — accidental activations, prompt injection through audio, emotional manipulation — are still underexplored at scale.

Still, the strategic signal is unambiguous. The gap between "voice that can actually do work" and what's running in production has narrowed by more than at any point in the last 18 months. For European builders especially — where the Azure region story is still catching up to the API story — the coming months are going to be about figuring out which interactions to automate, on which stack, with what guardrails. It's exactly the kind of architectural decision that the European ecosystem tends to chew through together at events like ECS, where the "what works in EU production" conversation often reveals more than the launch posts do.