Real-Time AI Voice Agent

01 — Problem

What was hard about this

Real-time voice AI lives or dies on perceived latency. In human conversation the gap between turns is typically around 200ms; a gap longer than about 500ms feels awkward. A naive STT → LLM → TTS pipeline serializes three slow systems, each waiting for the previous one to finish, so the caller hears a multi-second pause after every utterance. To avoid that, the agent has to start working on its reply before the caller has finished speaking.

02 — Architecture

How the pieces fit

Audio streams in both directions over a single WebSocket. STT emits partial transcripts continuously; the router decides when the utterance is stable enough to commit to an LLM call. TTS audio chunks start playing while the LLM is still streaming tokens.
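A minimal sketch of that wiring, assuming each stage is an asyncio coroutine and the stages are connected by queues. The stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins, not the API of Deepgram or any other provider. In a real service the first queue would be fed from the WebSocket and the last one drained back into it. For brevity this stub replies only once the input ends; deciding when to commit is the router's job, covered under Decisions.

```python
import asyncio

STOP = None  # sentinel marking end of stream


async def fake_stt(audio_in, partials_out):
    """Stand-in STT: emits a growing partial transcript per audio frame."""
    words = []
    while (frame := await audio_in.get()) is not STOP:
        words.append(frame)
        await partials_out.put(" ".join(words))
    await partials_out.put(STOP)


async def fake_llm(partials_in, tokens_out):
    """Stand-in router + LLM: replies to the final partial, token by token."""
    last = ""
    while (partial := await partials_in.get()) is not STOP:
        last = partial
    for token in f"you said: {last}".split():
        await tokens_out.put(token)
    await tokens_out.put(STOP)


async def fake_tts(tokens_in, audio_out):
    """Stand-in TTS: turns each token into an 'audio chunk' as it arrives."""
    while (token := await tokens_in.get()) is not STOP:
        await audio_out.put(f"<audio:{token}>")
    await audio_out.put(STOP)


async def run_pipeline(frames):
    """Wire the three stages together and collect the outbound audio chunks."""
    audio_in, partials, tokens, audio_out = (asyncio.Queue() for _ in range(4))
    for frame in frames:
        audio_in.put_nowait(frame)
    audio_in.put_nowait(STOP)
    stages = [
        asyncio.create_task(fake_stt(audio_in, partials)),
        asyncio.create_task(fake_llm(partials, tokens)),
        asyncio.create_task(fake_tts(tokens, audio_out)),
    ]
    chunks = []
    while (chunk := await audio_out.get()) is not STOP:
        chunks.append(chunk)
    await asyncio.gather(*stages)
    return chunks
```

All three stages run concurrently, so each one can consume items as soon as the stage before it produces them.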

03 — Decisions

Trade-offs I'd defend in an interview

01 Stream everything, never wait

Deepgram's STT emits partial transcripts every ~100ms. The LLM emits tokens as they're generated. TTS speaks chunks as they arrive. Every stage starts producing output before the previous stage finishes, so perceived latency is set by how long each stage takes to produce its *first chunk*, not by the sum of the stages' full run times.
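One way to make the effect concrete is a simplified model in which each stage can start as soon as the previous one emits its first chunk. Time to first audio is then roughly the sum of the stages' first-chunk delays rather than the sum of their full run times. The numbers below are illustrative assumptions only, not measurements from this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_chunk_ms: int  # time until the stage emits its first output
    total_ms: int        # time until the stage has emitted everything


def serial_latency_ms(stages):
    """Naive pipeline: each stage waits for the previous one to finish."""
    return sum(s.total_ms for s in stages)


def streaming_latency_ms(stages):
    """Streaming pipeline: each stage starts on the previous stage's first chunk."""
    return sum(s.first_chunk_ms for s in stages)


# Illustrative values, chosen only to show the shape of the comparison.
STAGES = [
    Stage("stt", first_chunk_ms=100, total_ms=800),
    Stage("llm", first_chunk_ms=250, total_ms=1500),
    Stage("tts", first_chunk_ms=150, total_ms=1200),
]

print(serial_latency_ms(STAGES), streaming_latency_ms(STAGES))
```

With these assumed values the serialized pipeline takes 3500ms before the caller hears anything, while the streaming one takes 500ms.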

02 Interrupt-aware router for utterance commitment

When does the caller actually mean 'go'? Pause detection is fragile (some people speak slowly, some don't pause). The router watches for transcript stability — when the last N partial transcripts agree on the same text, the utterance is committed and sent to the LLM. If the caller resumes speaking mid-commit, the in-flight LLM call is cancelled.
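The stability rule can be sketched as a small state machine. This is a hedged illustration, not the project's actual router: the class name `StabilityRouter`, the default `n=3`, and the string return values are all assumptions made for the example.

```python
from collections import deque


class StabilityRouter:
    """Commits an utterance once the last `n` partial transcripts are identical."""

    def __init__(self, n=3):
        self.n = n
        self._recent = deque(maxlen=n)  # sliding window of recent partials
        self.committed = None           # text currently sent to the LLM, if any

    def on_partial(self, text):
        """Feed one partial transcript. Returns 'commit', 'cancel', or None."""
        text = text.strip()
        if self.committed is not None:
            if text and text != self.committed:
                # Caller resumed speaking: the in-flight LLM call should be cancelled.
                self.committed = None
                self._recent.clear()
                self._recent.append(text)
                return "cancel"
            return None
        self._recent.append(text)
        stable = len(self._recent) == self.n and len(set(self._recent)) == 1
        if stable and text:
            self.committed = text
            return "commit"
        return None


router = StabilityRouter(n=3)
partials = ["book", "book a", "book a slot", "book a slot",
            "book a slot", "book a slot", "book a slot for"]
events = [router.on_partial(p) for p in partials]
```

Here the fifth partial is the third identical one in a row, so it triggers the commit; the final partial differs from the committed text and triggers the cancel. A larger `n` trades responsiveness for fewer premature commits.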

03 Barge-in support

When the agent is mid-sentence and the caller starts talking, the agent has to stop. The router detects new STT input, sends a cancellation signal to the LLM, and immediately drops the in-flight TTS audio buffer. This is the single biggest UX difference between a 'voice AI demo' and something a real customer would tolerate.
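A minimal sketch of the cancel-and-flush step, assuming the agent's turn runs as an asyncio task that feeds a playback queue. `speak`, `drain`, `barge_in`, and the stream SID value are hypothetical names for illustration. The `clear` message follows the shape Twilio documents for emptying buffered audio on a bidirectional Media Stream, but treat the exact payload as an assumption to check against the current docs.

```python
import asyncio
import json


async def speak(tokens, playback, delay_s=0.01):
    """Stand-in for LLM -> TTS: pushes one audio chunk per token."""
    for token in tokens:
        await asyncio.sleep(delay_s)  # simulated generation time
        await playback.put(f"<audio:{token}>")


def drain(queue):
    """Drop everything still waiting to be played; return how many chunks were dropped."""
    dropped = 0
    while not queue.empty():
        queue.get_nowait()
        dropped += 1
    return dropped


async def barge_in(turn, playback, stream_sid):
    """Cancel the in-flight turn, flush local audio, and build a 'clear' message."""
    turn.cancel()
    try:
        await turn
    except asyncio.CancelledError:
        pass  # expected: we cancelled the turn ourselves
    dropped = drain(playback)
    clear_msg = json.dumps({"event": "clear", "streamSid": stream_sid})
    return dropped, clear_msg


async def demo():
    playback = asyncio.Queue()
    turn = asyncio.create_task(speak(["word"] * 100, playback))
    await asyncio.sleep(0.05)  # the agent talks for a moment, then the caller cuts in
    dropped, clear_msg = await barge_in(turn, playback, "MZ-hypothetical")
    return turn.cancelled(), playback.empty(), dropped, clear_msg
```

Flushing the local queue alone is not enough if audio has already been handed to the telephony side, which is why the sketch also builds a message asking the far end to discard what it has buffered.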

04 Twilio Media Streams instead of WebRTC

WebRTC would give lower latency, but it requires the caller to be in a browser or app. Twilio Media Streams forwards audio from an ordinary call on the public switched telephone network: any phone, anywhere, no app required. The latency hit (~80-150ms round-trip over the carrier network) is acceptable for the use case (customer service, scheduling, lead qualification).
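For a sense of what the service handles on the wire, here is a hedged sketch of decoding and encoding Media Streams messages. It assumes the documented message shape (JSON with an `event` field and base64-encoded 8 kHz mu-law audio under `media.payload`); the helper names are hypothetical, and the field names should be verified against Twilio's current reference.

```python
import base64
import json

MULAW_SAMPLE_RATE_HZ = 8000  # assumed format: 8 kHz mu-law, mono, one byte per sample


def parse_inbound(raw):
    """Return decoded mu-law bytes for a 'media' event, else None."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # e.g. 'connected', 'start', 'stop'
    return base64.b64decode(msg["media"]["payload"])


def build_outbound(stream_sid, mulaw_audio):
    """Wrap raw mu-law bytes in a 'media' message to play back to the caller."""
    payload = base64.b64encode(mulaw_audio).decode("ascii")
    return json.dumps(
        {"event": "media", "streamSid": stream_sid, "media": {"payload": payload}}
    )


def duration_ms(mulaw_audio):
    """One byte per sample, so bytes divided by sample rate gives the duration."""
    return len(mulaw_audio) * 1000 / MULAW_SAMPLE_RATE_HZ


frame = b"\xff" * 160  # 160 samples of mu-law silence
wire = build_outbound("MZ-hypothetical", frame)
```

At 8 kHz, a 160-byte frame is 20ms of audio, which gives a feel for how small the chunks moving through the pipeline are.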

04 — Outcomes

What shipped

End-to-end perceived latency well under 1 second on a real phone call
Barge-in works reliably — caller can cut the agent off and the agent immediately yields
Interrupt-aware router prevents 'commit and regret' loops when speech is hesitant
Single FastAPI service handles the entire bidirectional audio pipeline

05 — Next

What I'd do if this had another sprint

Add an eval harness: scripted dialogues + measured turn-taking latency + transcript accuracy
Add a tool-calling layer (e.g. lookup calendar, book appointment) with structured-output validation
Persist conversation transcripts + audio for offline review
Test fallback paths: STT timeout, LLM provider outage, TTS quota exceeded
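As a possible starting point for the fallback-path item, the sketch below wraps a provider call in a timeout and returns a canned value on failure. `with_fallback`, the stub coroutines, and the fallback phrase are assumptions for illustration; nothing here reflects behaviour already in the project.

```python
import asyncio


async def with_fallback(primary, fallback_value, timeout_s):
    """Await `primary`; on timeout or connection failure, return `fallback_value`."""
    try:
        return await asyncio.wait_for(primary, timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return fallback_value


async def slow_stt():
    """Stub: an STT call that takes far longer than the allowed budget."""
    await asyncio.sleep(1.0)
    return "transcript"


async def failing_llm():
    """Stub: an LLM call that fails as it would during a provider outage."""
    raise ConnectionError("provider outage")


async def healthy_tts():
    """Stub: a call that completes within budget."""
    return "audio"


REPROMPT = "Sorry, could you repeat that?"
```

A test for each fallback path can then assert that the caller hears the re-prompt instead of silence when a provider stalls or errors.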

06 — Visual proof

See it in code
