Real-Time Voice AI: Building Speech-to-Speech Agents with Sub-Second Latency
How I reduced Claritel's voice agent initial response time from 8s to 3s — covering WebSocket Architecture, Pipecat, Deepgram, ElevenLabs, and the specific optimizations that cut latency by 62%.
when i joined Eniac (later acquired by Claritel), the voice agent took 8 seconds to respond after the caller stopped speaking. 8 seconds in a voice call feels broken. people were hanging up thinking the call dropped.
one week later we got it under 3 seconds. this post is exactly how.
i'm Mohd Mursaleen, an AI backend engineer based in Bengaluru. this is a practical breakdown of the pipeline, bottlenecks, and optimizations that actually moved latency.
Why Voice AI Latency Is Hard
text AI latency is mostly "time to first token." voice latency is a chain with separate budgets:
- STT (Speech-to-Text): caller audio to text
- LLM inference: text response generation
- TTS (Text-to-Speech): response text to audio
at 8s, rough split was:
- STT: ~800ms (end-of-speech wait)
- LLM (TTFT + generation): ~5.5s
- TTS: ~1.2s
- Network + WebSocket: ~500ms
LLM was the main bottleneck. but just swapping to a "faster model" would hurt quality too much for production calls. we needed architecture changes.
The Architecture: Pipecat + FastAPI + WebSocket
pipeline was built on Pipecat. Pipecat gives a processor chain model where each stage passes frames downstream.
backend was FastAPI handling WebSocket connections from telephony providers (first Plivo, later LiveKit for direct WebRTC). each active call got an isolated asyncio pipeline.
```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.network.websocket_server import WebsocketServerTransport

# settings: app config object (defined elsewhere in the codebase)

async def create_pipeline(transport: WebsocketServerTransport) -> Pipeline:
    """
    Builds the voice agent pipeline for a single call session.

    Args:
        transport: WebSocket transport wrapping the caller's audio stream

    Returns:
        Configured Pipecat pipeline ready to run
    """
    stt = DeepgramSTTService(
        api_key=settings.DEEPGRAM_API_KEY,
        model="nova-2-phonecall",
        language="en-US",
    )
    llm = OpenAILLMService(
        api_key=settings.OPENAI_API_KEY,
        model="gpt-4o-mini",
    )
    tts = ElevenLabsTTSService(
        api_key=settings.ELEVENLABS_API_KEY,
        voice_id=settings.VOICE_ID,
        model="eleven_turbo_v2_5",
    )
    return Pipeline([transport.input(), stt, llm, tts, transport.output()])
```

mental model:
transport.input() is the inbound caller audio stream.
transport.output() is the outbound synthesized voice.
everything in between is composable.
Optimization 1: Streaming TTS (1.1s saved)
largest win. old flow waited for full LLM output before TTS. new flow used ElevenLabs streaming API so TTS starts from early chunks.
with Pipecat this fit naturally:
LLM emits TextFrame as tokens arrive, TTS starts on first sentence boundary.
key trick was sentence boundary heuristic:
start at first ., ?, or ! after at least 40 characters.
too early sounds choppy; too late loses latency.
result: ~1.1s saved on initial response because user hears audio before full text is done.
Optimization 2: Model and Context Window Tuning (2.2s saved)
LLM stage (~5.5s) had three issues.
Model size: moved first-turn path from gpt-4-turbo to gpt-4o-mini.
median TTFT dropped ~2.8s → ~0.9s.
quality drop was acceptable for this constrained phone-support domain.
Context bloat: prompt carried ~6,000 tokens of few-shot examples every turn. we moved examples to retrieval and only injected on low-confidence turns. normal turn prompt dropped to ~800 tokens.
History growth: by turn 8 context hit ~12,000 tokens. we added sliding window + summarization every 6 turns, keeping active context under ~3,000 tokens.
combined gain: ~2.2s.
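the sliding-window-plus-summarization idea can be sketched like this. this is a minimal illustration, not the production code: `summarize` stands in for an LLM summarization call, and `KEEP_RECENT` is an assumed window size the post doesn't specify.

```python
from typing import Callable

SUMMARIZE_EVERY = 6  # compact history every 6 turns, as described above
KEEP_RECENT = 4      # assumed number of recent messages kept verbatim

def compact_history(
    history: list[dict],
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Collapse older messages into one summary message every N turns,
    keeping the most recent messages verbatim."""
    turns = len(history)
    if turns < SUMMARIZE_EVERY or turns % SUMMARIZE_EVERY != 0:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = {"role": "system", "content": f"Conversation so far: {summarize(old)}"}
    return [summary] + recent
```

run after every turn; on non-multiple-of-6 turns it's a no-op, so the active context stays bounded without summarizing constantly.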
Optimization 3: VAD Tuning (400ms saved)
VAD decides when user finished speaking. default conservative setup waited 800ms silence.
for phone calls we tuned aggressively:
- End-of-speech silence: 800ms → 400ms
- Minimum speech duration: 300ms
- Pre-buffer: 150ms
on Plivo codec (G.711 μ-law, 8kHz), this saved ~400ms/turn without noticeable mid-sentence cutoffs.
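to make the tuned values concrete, here's a minimal end-of-speech detector sketch. it assumes 20ms audio frames and a per-frame `is_speech` decision from an upstream VAD model; the real pipeline used Pipecat's built-in VAD rather than this hypothetical class.

```python
FRAME_MS = 20          # assumed frame size
END_SILENCE_MS = 400   # end-of-speech silence, tuned down from 800ms
MIN_SPEECH_MS = 300    # ignore blips shorter than this

class Endpointer:
    """Tracks speech/silence frames and signals end of utterance."""

    def __init__(self) -> None:
        self.speech_ms = 0
        self.silence_ms = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame; returns True once the utterance has ended."""
        if is_speech:
            self.speech_ms += FRAME_MS
            self.silence_ms = 0  # silence counter resets on speech
            return False
        if self.speech_ms < MIN_SPEECH_MS:
            return False  # not enough speech yet to count as an utterance
        self.silence_ms += FRAME_MS
        return self.silence_ms >= END_SILENCE_MS
```

the minimum-speech check is what keeps line noise or a cough from triggering a turn; the silence threshold is the knob that directly trades latency against mid-sentence cutoffs.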
Optimization 4: CDR-Based Bottleneck Detection
one practical trick people skip: use production CDRs.
Plivo CDRs gave per-leg timing. i wrote a script that pulled calls where users hung up in first 15s (proxy for "agent felt too slow"), then parsed timestamps to find latency hotspots.
this exposed an issue synthetic tests missed: certain international/cellular call paths added ~800ms WebSocket handshake jitter. we added connection pre-warming for those patterns. jitter dropped below ~200ms.
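the core of that analysis looks something like the sketch below. field names (`duration`, `hangup_source`, `from_number`) are illustrative, not Plivo's exact CDR schema, and the real script pulled records via Plivo's API rather than taking a list of dicts.

```python
def early_hangups(cdrs: list[dict], max_secs: int = 15) -> list[dict]:
    """Calls the caller ended within `max_secs` seconds --
    our proxy for 'the agent felt too slow'."""
    return [
        c for c in cdrs
        if c["duration"] <= max_secs and c["hangup_source"] == "caller"
    ]

def hangup_rate_by_prefix(cdrs: list[dict], prefix_len: int = 3) -> dict[str, float]:
    """Early-hangup rate grouped by caller-number prefix, to surface
    problematic call paths (e.g. international/cellular routes)."""
    short_ids = {id(c) for c in early_hangups(cdrs)}
    totals: dict[str, int] = {}
    early: dict[str, int] = {}
    for c in cdrs:
        p = c["from_number"][:prefix_len]
        totals[p] = totals.get(p, 0) + 1
        early[p] = early.get(p, 0) + (1 if id(c) in short_ids else 0)
    return {p: early[p] / totals[p] for p in totals}
```

sorting the output by rate is what pointed us at the call paths that needed connection pre-warming.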
Final Latency Breakdown
after all optimizations:
| Stage | Before | After |
| --- | --- | --- |
| VAD (end-of-speech) | 800ms | 400ms |
| STT (Deepgram nova-2) | 800ms | 600ms |
| LLM TTFT | 2,800ms | 900ms |
| TTS start (first audio) | 1,200ms | 100ms (streaming) |
| Network/WebSocket | 500ms | 200ms |
| Total (initial response) | ~8,000ms | ~2,900ms |
headline metric was 62% faster initial response. business metric that mattered more: call completion rate (users staying through first agent turn) jumped from 61% to 89%.
for multi-agent memory/orchestration patterns, see my Champion architecture post.
Written by Mohd Mursaleen — AI backend engineer based in Bengaluru, India. Building voice AI systems and agent orchestration platforms. geekymd.me