Gemini Live Upgraded: Real-Time Voice Translation and Proactive Audio Make AI a Listening Partner
Someone in a meeting is speaking a language you don't know. A human interpreter is expensive, and waiting for a translated transcript breaks the flow of conversation. This old problem just got a new answer from Google.
In May 2026, Google announced a major update to Gemini 2.5 Flash Native Audio. Two features stand out: live speech-to-speech translation and Proactive Audio. These aren't just incremental additions; they represent a shift in how AI handles voice.
Live Speech-to-Speech Translation: Carrying the Voice, Not Just the Words

Standard translation apps work in three stages: speech to text, text translation, text to speech. Each step adds latency, and along the way something essential is lost: the speaker's tone, pace, emotional charge, and cadence. Much of what a voice communicates lies in how something is said, not what is said.
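The cascade described above can be sketched in a few lines. This is a toy model, not how any particular translation app is built: the `Utterance` type, the word-for-word `GLOSSARY`, and the per-stage latency figures are all hypothetical and chosen only to make the two costs visible.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    words: str
    prosody: dict = field(default_factory=dict)  # e.g. pace, pitch


# Hypothetical per-stage delays in milliseconds, for illustration only.
STAGE_LATENCY_MS = {
    "speech_to_text": 300,
    "text_translation": 150,
    "text_to_speech": 250,
}

# Toy glossary standing in for a real translation model.
GLOSSARY = {"hello": "hola", "friend": "amigo"}


def speech_to_text(utterance: Utterance) -> str:
    # Transcription keeps the words; plain text has no slot for prosody.
    return utterance.words


def translate_text(text: str) -> str:
    # Word-for-word lookup; unknown words pass through unchanged.
    return " ".join(GLOSSARY.get(word, word) for word in text.split())


def text_to_speech(text: str) -> Utterance:
    # Synthesis must fall back to default prosody, since none was passed along.
    return Utterance(words=text, prosody={"pace": "default", "pitch": "default"})


def cascade(utterance: Utterance) -> Utterance:
    return text_to_speech(translate_text(speech_to_text(utterance)))


def total_latency_ms() -> int:
    # The stages run one after another, so their delays add up.
    return sum(STAGE_LATENCY_MS.values())


original = Utterance("hello friend", {"pace": "fast", "pitch": "rising"})
translated = cascade(original)
print(translated.words)    # translated text
print(translated.prosody)  # the speaker's pace and pitch are gone
print(total_latency_ms())  # summed delay across the three stages
```

The point of the sketch is structural: once the first stage reduces speech to text, nothing downstream can recover the original delivery.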
Gemini's live speech-to-speech translation works differently.
Put in earphones and start a conversation. The other person's speech is translated in real time. The critical distinction: the tone, speed, and pitch of the original voice are preserved in the translation. If the speaker is animated and rapid, the translation arrives with the same energy. A question sounds like a question. A carefully paced explanation arrives the same way.
The system supports 24 languages and 30 HD voices. It's available now in Google AI Studio and Vertex AI, with rollout to Gemini Live and Search Live underway.
Proactive Audio: Responds to You, Not to the Room
The second feature, Proactive Audio, tackles a different problem.
AI voice assistants have an inherent tension. They need to be always listening to be useful, but "always listening" means responding to the TV, nearby conversations, background noise, and anything else that sounds like a command. Raise the sensitivity, and you get false triggers. Lower it, and the assistant misses genuine requests.
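The sensitivity trade-off can be made concrete with a small sketch. The scores below are invented for illustration and do not come from any real assistant; each sample pairs a hypothetical "sounds like a command" score with whether the speech was actually directed at the device. A lower threshold corresponds to higher sensitivity.

```python
# (score, directed_at_device) pairs; all values are made up for illustration.
SAMPLES = [
    (0.95, True),   # clear command
    (0.70, True),   # mumbled request
    (0.40, True),   # quiet request
    (0.80, False),  # TV dialogue that sounds like a command
    (0.50, False),  # nearby conversation
    (0.10, False),  # background noise
]


def count_errors(samples, threshold):
    """Return (false_triggers, missed_requests) for a given trigger threshold."""
    false_triggers = sum(
        1 for score, directed in samples if score >= threshold and not directed
    )
    missed_requests = sum(
        1 for score, directed in samples if score < threshold and directed
    )
    return false_triggers, missed_requests


for threshold in (0.3, 0.6, 0.9):
    false_triggers, missed = count_errors(SAMPLES, threshold)
    print(f"threshold={threshold}: false_triggers={false_triggers}, missed={missed}")
```

With these samples, no single threshold drives both error counts to zero: moving it only trades one kind of mistake for the other.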
Proactive Audio takes a different approach.
"The model generates text transcripts and audio responses only for queries directed to the device, and does not respond to non-device directed queries." (Google official documentation)
Instead of detecting a wake word, the model reads the intent and direction of speech. Is this person talking to me? Or are they talking to someone else in the room? Context determines whether a response is generated. Both audio and text transcripts are produced, but only when the signal is relevant.
Proactive Audio is currently in Preview, accessible to developers via the Gemini API.
Multi-Turn Conversation Quality
A less headline-grabbing improvement that matters practically: multi-turn conversation quality has been significantly strengthened.
Previous versions would sometimes lose context from earlier turns during extended conversations. With the updated 12-25 model version, context from prior turns is maintained more reliably, producing "more cohesive conversations" according to Google.
Function calling accuracy and instruction following have also improved, which matters for developers building voice agents that connect to production services and APIs.
Educational Implications
As an EdTech CEO, I think about where this connects to learning environments.
Live translation in education:
- International online classes where students speak their native language and teachers understand in real time
- Students from multilingual families following instruction in a second language
- Study abroad programs where language barriers can be bridged without undermining the immersion experience
Proactive Audio in education:
- AI assistants in classrooms that respond to the teacher's prompts without firing on every student conversation
- Self-directed learning setups for younger students, where AI stays quiet during exploration and steps in when a direct question is asked
The language learning angle is worth pausing on. Real-time translation is a powerful aid, but learning a language requires exposure to the target language without a safety net. The decision about when to enable translation and when to turn it off is an important pedagogical choice, not a technical one.
Technical Summary
| Parameter | Details |
|---|---|
| Model | Gemini 2.5 Flash Native Audio (12-25) |
| Languages | 24 |
| HD Voice Options | 30 |
| Translation Method | Direct speech-to-speech (minimized text intermediary) |
| Preserved in Translation | Tone, pace, pitch |
| Proactive Audio | Preview (Gemini API developers) |
| Multi-Turn Improvement | Enhanced context retention across turns |
| Function Calling | Improved accuracy |
Tips
- Try live translation: Enable the Live API in Google AI Studio and set your input and output languages. The 24-language support covers most global business and academic contexts.
- Enable Proactive Audio: Set `proactiveAudio: true` in your API configuration. Currently available through the developer Preview.
- Test tone preservation: Speak the same sentence in a calm tone and then an animated tone. Comparing the translated outputs gives you a direct feel for how faithfully the feature preserves the speaker's affect.
- Specify the 12-25 model: Explicitly target `gemini-2.5-flash-native-audio-12-25` to get the multi-turn improvements. Older model versions don't carry the update.
- Function calling for voice agents: The improved instruction following makes this a good time to build or revisit voice agents connected to internal tools (customer service, scheduling, information retrieval), where accuracy directly affects user experience.
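The two configuration tips above might look roughly like this in Python. This is a sketch under stated assumptions, not a verified integration: the nested `proactivity` / `proactive_audio` keys are my reading of the Live API's snake_case form of `proactiveAudio`, the `build_live_config` helper is hypothetical, and the model ID is copied from this article. Check all three against the current Gemini API documentation before relying on them.

```python
# Model ID as written in this article; confirm it against the current model list.
MODEL_ID = "gemini-2.5-flash-native-audio-12-25"


def build_live_config(proactive_audio: bool = True) -> dict:
    """Build a session config dict for a Live API connection.

    Hypothetical helper. The key names are assumptions about the Live API
    config shape and may differ between SDK versions.
    """
    config = {"response_modalities": ["AUDIO"]}
    if proactive_audio:
        # Assumed snake_case equivalent of `proactiveAudio: true`.
        config["proactivity"] = {"proactive_audio": True}
    return config


config = build_live_config()
print(MODEL_ID)
print(config)
```

A dict like this would be passed, together with the model ID, as the session configuration when opening a Live API connection. Keeping the flag behind a helper makes it easy to run the same voice agent with Proactive Audio on and off and compare how often it responds to speech that was not meant for it.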
Sources
- Google Blog, "Gemini 2.5 Native Audio upgrade, plus text-to-speech model updates": https://blog.google/products-and-platforms/products/gemini/gemini-audio-model-updates/
- Google DeepMind Blog, "Gemini 2.5's native audio capabilities": https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-2-5-native-audio/
- Android Central, "Google's upgraded Gemini 2.5 Flash Native Audio model makes AI more conversational": https://www.androidcentral.com/apps-software/ai/googles-upgraded-gemini-2-5-flash-native-audio-model-makes-ai-more-conversational
- Google Cloud Blog, "Gemini Live API available on Vertex AI": https://cloud.google.com/blog/products/ai-machine-learning/gemini-live-api-available-on-vertex-ai
- eWeek, "Gemini 2.5 Flash Native Audio Gets Major Voice Upgrade": https://www.eweek.com/news/google-gemini-2-5-flash-native-audio-update/