Voice AI
AI systems that speak and listen in real time — handling the full turn-taking, latency, and prosody of natural conversation rather than converting speech to text and back.
Voice AI is the discipline of building AI that converses through speech in real time. The hard problems are not transcription or synthesis in isolation, which are largely solved, but their combination: the system must detect when a speaker has finished, respond without a perceptible gap, carry emotional tone, and handle interruption. Early voice assistants (Siri, Alexa) chained speech-to-text, a language-understanding stage, and text-to-speech as separate passes, and the delays of the individual passes added up to latency that breaks the rhythm of conversation. Newer approaches process audio end-to-end, letting the model read prosody directly and respond in kind. Gaps between turns in human conversation are typically around 200 milliseconds, so a difference of that order is what separates a tolerable response from a natural one.
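To make the "detect when a speaker finishes" problem concrete, here is a minimal sketch of a silence-based end-of-turn detector. It is an illustration only, not how any particular assistant works: the function names, the 20 ms frame size, the energy threshold, and the 200 ms silence window are all assumptions chosen for the example, and production systems typically rely on trained voice-activity and turn-taking models rather than a fixed energy threshold.

```python
from typing import Iterable, Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def detect_end_of_turn(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 20,              # assumed frame duration
    energy_threshold: float = 0.01,  # assumed speech/silence cutoff
    silence_ms: int = 200,           # assumed silence needed to end a turn
) -> Optional[int]:
    """Return the time in ms at which the turn is declared over, or None.

    The turn ends once speech has been heard and is then followed by
    `silence_ms` of consecutive low-energy frames.
    """
    needed = silence_ms // frame_ms  # consecutive quiet frames required
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            heard_speech = True
            quiet_run = 0            # speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return (i + 1) * frame_ms
    return None


# Hypothetical input: 100 ms of silence, 200 ms of speech, 300 ms of silence.
speech = [0.5] * 160
silence = [0.0] * 160
frames = [silence] * 5 + [speech] * 10 + [silence] * 15
print(detect_end_of_turn(frames))  # 500
```

The sketch also shows the trade-off behind the latency problem: a shorter silence window lets the system respond sooner but risks cutting in during a mid-sentence pause, while a longer one is safer but adds its full length to every response.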
