Building a truly conversational AI agent is an exercise in orchestration. The straightforward approach—wait for user speech to end, send to LLM, wait for the full response, send to TTS, play audio—is a recipe for a sluggish, frustrating experience. To build an agent that feels alive, you need to master the art of asynchronous event handling.
At Lokutor, we’ve helped hundreds of developers optimize their voice stacks. This guide outlines the reference architecture for a voice agent with end-to-end turn latency in the ~500ms range.
The Core Loop: VAD, LLM, and TTS
The three critical components of any voice agent are:
- Voice Activity Detection (VAD): The “ears” that decide when the user has stopped talking.
- Large Language Model (LLM): The “brain” generating the response.
- Text-to-Speech (TTS): The “mouth” (Lokutor).
1. The VAD Problem: Silence vs. Pause
The biggest source of perceived latency is often not the model, but the VAD. If you wait for 1000ms of silence to confirm the user is done, you’ve already added 1 second of lag before you even start processing.
Best Practice:
- Aggressive VAD: Set your silence threshold low (e.g., 300-500ms).
- Speculative Execution: Send the transcript to the LLM while the user is possibly pausing. If they resume speaking, cancel the request.
- Server-Side ASR: Stream audio continuously to a provider like Deepgram or AssemblyAI for the fastest “End of Utterance” signal.
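The speculative-execution pattern above can be sketched as a small endpointing state machine. This is an illustrative sketch, not a Lokutor or ASR-provider API: the `Endpointer` class and its callbacks are hypothetical names, and the 400ms threshold is just one point in the 300-500ms range suggested above.

```javascript
// Hypothetical speculative endpointer. Feed it one VAD frame at a time;
// after `silenceMs` of continuous silence it speculatively commits the turn
// (e.g., kicks off the LLM request). If speech resumes afterwards, the
// pending commit is cancelled so the LLM request can be aborted.
class Endpointer {
  constructor(silenceMs = 400, onCommit = () => {}, onCancel = () => {}) {
    this.silenceMs = silenceMs;
    this.onCommit = onCommit;   // e.g., start the speculative LLM request
    this.onCancel = onCancel;   // e.g., abort it if the user keeps talking
    this.silenceStart = null;   // timestamp (ms) when silence began
    this.committed = false;
  }

  // tsMs: frame timestamp in ms; isSpeech: the VAD's per-frame decision
  feed(tsMs, isSpeech) {
    if (isSpeech) {
      if (this.committed) {
        this.onCancel();        // user resumed: the commit was premature
        this.committed = false;
      }
      this.silenceStart = null;
      return;
    }
    if (this.silenceStart === null) this.silenceStart = tsMs;
    if (!this.committed && tsMs - this.silenceStart >= this.silenceMs) {
      this.committed = true;
      this.onCommit();          // speculative "end of utterance"
    }
  }
}
```

Because the commit is speculative, cancelling the in-flight LLM request on resume costs only a wasted API call, while the common case saves the full silence-confirmation delay.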
2. The LLM Handoff: Streaming is Non-Negotiable
Never wait for the complete response. Modern LLM APIs stream tokens as they are generated, typically starting well under 200ms after receiving a prompt. Your backend should be a “pass-through” pipe that forwards these tokens to Lokutor as they arrive.
The Sentence Boundary Challenge: TTS engines generally perform better with full sentences or clauses to get the prosody (intonation) right. However, waiting for a full sentence adds latency.
- Lokutor’s Advantage: Our Flow Matching architecture is robust to partial context. You can push shorter chunks (3-5 words) and we maintain coherence, merging the prosody seamlessly with the previous chunk.
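One way to implement the chunking trade-off above is a small token buffer that flushes at clause boundaries once a minimum word count is reached. This is a sketch under assumptions: `makeChunker` and its `flush` callback are hypothetical names, and the 4-word floor and punctuation set are illustrative tuning knobs, not Lokutor requirements.

```javascript
// Buffer streamed LLM tokens and flush clause-sized chunks downstream
// (e.g., to tts.feed). Flushes only at punctuation boundaries, and only
// once the buffer holds at least `minWords` words, balancing prosody
// quality against latency.
function makeChunker(flush, minWords = 4) {
  let buf = '';
  return {
    push(token) {
      buf += token;
      const words = buf.trim().split(/\s+/).filter(Boolean);
      const atBoundary = /[.,;:!?]\s*$/.test(buf);
      if (words.length >= minWords && atBoundary) {
        flush(buf.trim());
        buf = '';
      }
    },
    end() {
      // Stream finished: flush whatever partial clause remains
      if (buf.trim()) flush(buf.trim());
      buf = '';
    },
  };
}
```

Lowering `minWords` trades a little prosodic context for earlier first audio; an engine that merges prosody across chunks lets you push that floor lower than a conventional TTS would tolerate.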
3. Handling Interruptions (Barge-In)
Real conversation involves interruption. Users shouldn’t have to wait for the bot to finish.
Implementation Pattern:
- Listen While Speaking: Your ASR stream must remain active even while your TTS is playing audio.
- Echo Cancellation: Ensure your user’s microphone doesn’t pick up the bot’s voice, or use software Acoustic Echo Cancellation (AEC).
- The “Kill Switch”: When the ASR detects user speech (“Voice Activity Start”), your client must immediately signal the audio player to flush its buffer.
- Do not just pause—clear the queue. Old audio is irrelevant context.
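The “clear, don’t pause” rule can be seen in a minimal playback queue. `PCMQueue` here is a hypothetical name (it reappears in the architecture sketch below), and the 24kHz/16-bit mono framing is an assumption for the math, not a Lokutor-mandated format.

```javascript
// Minimal playback queue illustrating barge-in handling. Pausing would
// leave stale audio that resumes later; clear() guarantees the bot never
// finishes a sentence the user already talked over.
class PCMQueue {
  constructor() { this.chunks = []; }
  push(pcm) { this.chunks.push(pcm); }          // producer: TTS audio frames
  shift() { return this.chunks.shift(); }       // consumer: the audio device
  clear() { this.chunks.length = 0; }           // barge-in: drop stale audio
  get pendingMs() {
    // Assuming 16-bit mono PCM at 24 kHz: 48 bytes per millisecond
    const bytes = this.chunks.reduce((n, c) => n + c.byteLength, 0);
    return bytes / 48;
  }
}
```

Tracking `pendingMs` is also useful for telemetry: it tells you how much stale speech a pause-instead-of-clear bug would have replayed.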
Code Pattern: The WebSocket Pipeline
Here is a high-level representation of an optimized event loop using Lokutor’s WebSocket API.
```javascript
// Conceptual architecture, not a drop-in implementation
class VoiceAgent {
  constructor() {
    this.asr = new SpeechService();
    this.llm = new LLMService();
    this.tts = new LokutorClient({
      voice: 'versa-v1-m1',
      latencyOptimization: 'ultra' // Hints decoder to prioritize TTFB
    });
    this.audioQueue = new PCMQueue();
  }

  onUserSpeechStart() {
    // 1. Immediate interruption: clear, don't pause
    this.audioQueue.clear();
    this.tts.interrupt();
    this.llm.cancelGeneration();
  }

  async onUserSpeechEnd(transcript) {
    // 2. Stream to LLM
    const stream = this.llm.generate(transcript);

    // 3. Pipe LLM tokens directly to TTS.
    //    Don't wait for full sentences.
    for await (const chunk of stream) {
      this.tts.feed(chunk);
    }
  }

  onAudioReceived(pcmData) {
    // 4. Queue PCM; playback starts as soon as data arrives
    this.audioQueue.push(pcmData);
  }
}
```
Latency Budget Analysis
In a highly optimized system using Lokutor, your budget looks like this:
| Component | Time | Notes |
|---|---|---|
| ASR Processing | 200ms | Detection of utterance end + transcription |
| Network RTT | 50ms | Edge compute helps here |
| LLM TTFT | 150ms | Time to First Token (highly variable) |
| Lokutor TTS | <90ms | Parallel generation starts on first token |
| Client Buffer | 30ms | Minimal buffer to prevent jitter |
| Total Turn-Taking | ~520ms | Approaching human parity |
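As a sanity check, the rows above sum to the quoted total (taking 90ms, the table’s upper bound, for the TTS row). The object keys below are just labels for this arithmetic, not configuration parameters.

```javascript
// Per-component latency budget from the table above, in milliseconds
const budget = {
  asrProcessing: 200,  // utterance-end detection + transcription
  networkRtt: 50,      // round trip, assuming edge compute
  llmTtft: 150,        // time to first token (highly variable)
  lokutorTts: 90,      // upper bound; generation starts on first token
  clientBuffer: 30,    // minimal jitter buffer
};
const totalMs = Object.values(budget).reduce((sum, ms) => sum + ms, 0);
// totalMs is 520, matching the ~520ms turn-taking total
```

Note that the LLM and TTS rows overlap in practice (synthesis starts on the first token), so the real wall-clock total can come in under the straight sum.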
Conclusion
Latency is a full-stack problem. Using the world’s fastest TTS doesn’t help if your VAD adds a second of delay. But once you tune your orchestration, Lokutor allows you to cross the uncanny valley of timing, creating agents that feel less like IVR menus and more like colleagues.