Concept. Also: conversational AI, voice interface, speech AI

Voice AI

AI systems that speak and listen in real time, handling the turn-taking, latency, and prosody of natural conversation rather than simply converting speech to text and back.

Learn first

Consumer AI, Frontier models

In a nutshell

Voice AI is the discipline of building AI that converses through speech in real time. The hard problems are not transcription or synthesis in isolation, which are largely solved, but their combination: the system must detect when a speaker has finished, respond without a perceptible gap, carry emotional tone, and handle interruptions. Early voice assistants (Siri, Alexa) chained speech-to-text, a language model, and text-to-speech as separate passes, and the accumulated latency broke the rhythm of conversation. Newer approaches process audio end-to-end, letting the model read prosody directly and respond in kind. The gap between tolerable and natural is roughly 200 milliseconds, about the pause human speakers typically leave between turns.
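To make the "detect when a speaker finishes" problem concrete, here is a minimal sketch of silence-based end-of-turn detection. It is an illustration only, not how any particular assistant works: the function name `detect_end_of_turn`, the 20 ms frame length, the energy threshold, and the synthetic input are all assumptions, and production systems generally rely on trained voice-activity or turn-taking models rather than a fixed energy cutoff.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed frame length; hypothetical, not taken from any real system


def detect_end_of_turn(
    frame_energies: Iterable[float],
    energy_threshold: float = 0.01,   # assumed cutoff between silence and speech
    silence_timeout_ms: int = 200,    # how long silence must last to end the turn
) -> Optional[int]:
    """Return the time in ms at which the turn is declared over, or None."""
    # Number of consecutive silent frames required (ceiling division).
    frames_needed = -(-silence_timeout_ms // FRAME_MS)
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            # Speech frame: the speaker is (still) talking, so reset the counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence only counts once some speech has been heard.
            silent_run += 1
            if silent_run >= frames_needed:
                return (index + 1) * FRAME_MS
    return None


# Synthetic input: 100 ms of leading silence, 600 ms of speech, 400 ms of silence.
energies = [0.0] * 5 + [0.2] * 30 + [0.0] * 20
print(detect_end_of_turn(energies))  # 900: speech ends at 700 ms, plus the 200 ms timeout
```

The sketch also shows why latency is hard to remove: the silence timeout is spent before any response can begin, so a 200 ms timeout alone uses up the entire budget for a natural-feeling reply. Shortening it risks cutting the speaker off mid-pause.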