01 — Problem
What was hard about this
Real-time voice AI lives or dies on perceived latency. In human conversation the gap between turns is typically around 200ms; anything past 500ms feels awkward. A naive STT → LLM → TTS pipeline serializes three slow systems, each waiting for the previous one to finish, so the user hears a multi-second pause after every utterance. To close that gap, the voice agent has to start working on its reply before the caller has finished speaking.
02 — Architecture
How the pieces fit
03 — Decisions
Trade-offs I'd defend in an interview
01 Stream everything, never wait
Deepgram's STT emits partial transcripts every ~100ms. The LLM emits tokens as they're generated. TTS speaks chunks as they arrive. Every stage starts producing output before the previous stage finishes, so perceived latency is driven by how quickly each stage produces its *first chunk*, not by the sum of the full stage durations.
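A minimal sketch of the chunk-by-chunk handoff between the LLM and TTS stages, using Python async generators. `fake_llm_tokens` is a hypothetical stand-in for a streaming LLM client, and the sentence-boundary flush rule is an assumption for illustration, not necessarily how this project chunks its output.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm_tokens(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client: yields one token at a time."""
    for token in f"Sure, I can help with {prompt}. What time works?".split():
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token + " "


async def sentence_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into speakable chunks, flushing at each sentence boundary."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


async def run_pipeline(prompt: str) -> list[str]:
    """Consume chunks as they become available instead of waiting for the full reply."""
    spoken = []
    async for chunk in sentence_chunks(fake_llm_tokens(prompt)):
        spoken.append(chunk)  # a real system would hand this chunk to TTS here
    return spoken


if __name__ == "__main__":
    print(asyncio.run(run_pipeline("scheduling")))
```

The first sentence is available for speech as soon as its closing period arrives, while the LLM is still generating the second one.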
02 Interrupt-aware router for utterance commitment
When does the caller actually mean 'go'? Pause detection is fragile (some people speak slowly, some don't pause). The router watches for transcript stability — when the last N partial transcripts agree on the same text, the utterance is committed and sent to the LLM. If the caller resumes speaking mid-commit, the in-flight LLM call is cancelled.
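The stability rule can be sketched as a small state machine. The class name, the `required_matches` default of 3, and the `"commit"` / `"cancel"` / `"wait"` return values are all assumptions made for this illustration; the source does not state the actual value of N or the router's interface.

```python
from collections import deque


class StabilityCommitter:
    """Commit an utterance once the last N partial transcripts agree.

    Hypothetical sketch: N (required_matches) and the event names are
    illustrative choices, not values taken from the project.
    """

    def __init__(self, required_matches: int = 3):
        self.required = required_matches
        self.recent = deque(maxlen=required_matches)
        self.committed = None

    def on_partial(self, text: str) -> str:
        """Feed one partial transcript; returns 'commit', 'cancel', or 'wait'."""
        normalized = " ".join(text.lower().split())
        if not normalized:
            return "wait"  # ignore empty partials emitted during silence
        self.recent.append(normalized)

        if self.committed is not None:
            if normalized != self.committed:
                # Caller resumed speaking after the commit: cancel the in-flight call.
                self.committed = None
                return "cancel"
            return "wait"

        stable = len(self.recent) == self.required and len(set(self.recent)) == 1
        if stable:
            self.committed = normalized
            return "commit"
        return "wait"
```

On `"commit"` the router would send the text to the LLM; on `"cancel"` it would cancel that call and keep listening until the longer utterance stabilizes.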
03 Barge-in support
When the agent is mid-sentence and the caller starts talking, the agent has to stop. The router detects new STT input, sends a cancellation signal to the LLM, and immediately drops the in-flight TTS audio buffer. This is the single biggest UX difference between a 'voice AI demo' and something a real customer would tolerate.
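A minimal sketch of the cancel-and-drop step using `asyncio` task cancellation. `generate_reply` is a hypothetical stand-in for the combined LLM and TTS stages, and the queue stands in for the outbound audio buffer; the real service's structure may differ.

```python
import asyncio


async def generate_reply(outbound: asyncio.Queue) -> None:
    """Hypothetical stand-in for LLM + TTS: queues audio chunks over time."""
    for i in range(100):
        await outbound.put(f"audio-chunk-{i}")
        await asyncio.sleep(0.01)


def drain(queue: asyncio.Queue) -> int:
    """Drop every chunk that has been queued but not yet played."""
    dropped = 0
    while not queue.empty():
        queue.get_nowait()
        dropped += 1
    return dropped


async def barge_in(reply_task: asyncio.Task, outbound: asyncio.Queue) -> int:
    """Stop generation first, then discard buffered audio so playback halts."""
    reply_task.cancel()
    try:
        await reply_task
    except asyncio.CancelledError:
        pass  # expected: the reply task was cancelled on purpose
    return drain(outbound)


async def demo() -> tuple[bool, int, bool]:
    outbound = asyncio.Queue()
    reply_task = asyncio.create_task(generate_reply(outbound))
    await asyncio.sleep(0.05)  # the agent is mid-sentence
    dropped = await barge_in(reply_task, outbound)  # caller starts talking
    return reply_task.cancelled(), dropped, outbound.empty()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Cancelling before draining matters: if the buffer were emptied first, the still-running task could queue another chunk in between.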
04 Twilio Media Streams instead of WebRTC
WebRTC would give lower latency but requires a browser or app on the caller's side. Twilio Media Streams bridges an ordinary call on the public switched telephone network to the service over a WebSocket, so it works from any phone, anywhere, with no app required. The latency hit (~80-150ms round-trip over the carrier network) is acceptable for the use case (customer service, scheduling, lead qualification).
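For orientation, a sketch of the JSON messages exchanged on a Media Streams WebSocket: inbound `media` events carry base64-encoded audio, and the service can send `media` and `clear` events back. The field names follow Twilio's documented message format as I understand it and should be checked against the current docs; the helper function names are hypothetical.

```python
import base64
import json
from typing import Optional


def parse_twilio_message(raw: str) -> tuple[str, Optional[bytes]]:
    """Return (event, audio_bytes); audio is only present on 'media' events."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "media":
        # Payload is base64-encoded telephony audio (mu-law at 8 kHz).
        return event, base64.b64decode(msg["media"]["payload"])
    return event, None


def outbound_media(stream_sid: str, audio: bytes) -> str:
    """Build a message that plays synthesized audio back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })


def clear_message(stream_sid: str) -> str:
    """Build a message asking Twilio to drop audio it has buffered (for barge-in)."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

In a FastAPI service these helpers would sit inside a WebSocket route, with inbound audio forwarded to STT and outbound messages produced from the TTS stream.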
04 — Outcomes
What shipped
- End-to-end perceived latency well under 1 second on a real phone call
- Barge-in works reliably — caller can cut the agent off and the agent immediately yields
- Interrupt-aware router prevents 'commit and regret' loops when speech is hesitant
- Single FastAPI service handles the entire bidirectional audio pipeline
05 — Next
What I'd do if this had another sprint
- Add an eval harness: scripted dialogues + measured turn-taking latency + transcript accuracy
- Add a tool-calling layer (e.g. lookup calendar, book appointment) with structured-output validation
- Persist conversation transcripts + audio for offline review
- Test fallback paths: STT timeout, LLM provider outage, TTS quota exceeded
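One possible shape for the fallback-path work, sketched with `asyncio.wait_for`. The function name, the 2-second default timeout, and the fallback line are assumptions for illustration; none of them come from the shipped service.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical canned response used when a provider is slow or unavailable.
FALLBACK_LINE = "Sorry, I'm having trouble right now. Could you repeat that?"


async def reply_with_fallback(
    call_llm: Callable[[str], Awaitable[str]],
    prompt: str,
    timeout_s: float = 2.0,
) -> str:
    """Return the provider's reply, or a canned line on timeout or outage."""
    try:
        return await asyncio.wait_for(call_llm(prompt), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return FALLBACK_LINE
```

A test for each path can then inject a fake `call_llm` that sleeps past the timeout, raises `ConnectionError`, or returns normally.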