Building a truly conversational AI agent is an exercise in orchestration. The straightforward approach—wait for user speech to end, send to LLM, wait for the full response, send to TTS, play audio—is a recipe for a sluggish, frustrating experience. To build an agent that feels alive, you need to master the art of asynchronous event handling.
At Lokutor, we’ve helped hundreds of developers optimize their voice stacks. This guide outlines the reference architecture for a voice agent with end-to-end turn latency in the ~500ms range.
The Core Loop: VAD, LLM, and TTS
The three critical components of any voice agent are:
- Voice Activity Detection (VAD): The “ears” that decide when the user has stopped talking.
- Large Language Model (LLM): The “brain” generating the response.
- Text-to-Speech (TTS): The “mouth” (Lokutor).
1. The VAD Problem: Silence vs. Pause
The biggest source of perceived latency is often not the model, but the VAD. If you wait for 1000ms of silence to confirm the user is done, you’ve already added 1 second of lag before you even start processing.
Best Practice:
- Aggressive VAD: Set your silence threshold low (e.g., 300-500ms).
- Speculative Execution: Send the transcript to the LLM while the user is possibly pausing. If they resume speaking, cancel the request.
- Server-Side ASR: Stream audio continuously to a provider like Deepgram or AssemblyAI for the fastest “End of Utterance” signal.
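The speculative-execution pattern above can be sketched as a small endpointing state machine. This is an illustrative sketch, not a Lokutor or ASR-provider API: the `Endpointer` class and its callbacks are hypothetical names, and the 400ms threshold is just one point in the 300-500ms range suggested above.

```javascript
// Hypothetical speculative endpointer. Feed it one VAD frame at a time;
// after `silenceMs` of continuous silence it speculatively commits the turn
// (e.g., kicks off the LLM request). If speech resumes afterwards, the
// pending commit is cancelled so the LLM request can be aborted.
class Endpointer {
  constructor(silenceMs = 400, onCommit = () => {}, onCancel = () => {}) {
    this.silenceMs = silenceMs;
    this.onCommit = onCommit;   // e.g., start the speculative LLM request
    this.onCancel = onCancel;   // e.g., abort it if the user keeps talking
    this.silenceStart = null;   // timestamp (ms) when silence began
    this.committed = false;
  }

  // tsMs: frame timestamp in ms; isSpeech: the VAD's per-frame decision
  feed(tsMs, isSpeech) {
    if (isSpeech) {
      if (this.committed) {
        this.onCancel();        // user resumed: the commit was premature
        this.committed = false;
      }
      this.silenceStart = null;
      return;
    }
    if (this.silenceStart === null) this.silenceStart = tsMs;
    if (!this.committed && tsMs - this.silenceStart >= this.silenceMs) {
      this.committed = true;
      this.onCommit();          // speculative "end of utterance"
    }
  }
}
```

Because the commit is speculative, cancelling the in-flight LLM request on resume costs only a wasted API call, while the common case saves the full silence-confirmation delay.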
2. The LLM Handoff: Streaming is Non-Negotiable
Never wait for the complete response. Modern LLM APIs stream tokens as they are generated, typically starting well under 200ms after receiving a prompt. Your backend should be a “pass-through” pipe that forwards these tokens to Lokutor as they arrive.
The Sentence Boundary Challenge: TTS engines generally perform better with full sentences or clauses to get the prosody (intonation) right. However, waiting for a full sentence adds latency.
- Lokutor’s Advantage: Our Flow Matching architecture is robust to partial context. You can push shorter chunks (3-5 words) and we maintain coherence, merging the prosody seamlessly with the previous chunk.
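One way to implement the chunking trade-off above is a small token buffer that flushes at clause boundaries once a minimum word count is reached. This is a sketch under assumptions: `makeChunker` and its `flush` callback are hypothetical names, and the 4-word floor and punctuation set are illustrative tuning knobs, not Lokutor requirements.

```javascript
// Buffer streamed LLM tokens and flush clause-sized chunks downstream
// (e.g., to tts.feed). Flushes only at punctuation boundaries, and only
// once the buffer holds at least `minWords` words, balancing prosody
// quality against latency.
function makeChunker(flush, minWords = 4) {
  let buf = '';
  return {
    push(token) {
      buf += token;
      const words = buf.trim().split(/\s+/).filter(Boolean);
      const atBoundary = /[.,;:!?]\s*$/.test(buf);
      if (words.length >= minWords && atBoundary) {
        flush(buf.trim());
        buf = '';
      }
    },
    end() {
      // Stream finished: flush whatever partial clause remains
      if (buf.trim()) flush(buf.trim());
      buf = '';
    },
  };
}
```

Lowering `minWords` trades a little prosodic context for earlier first audio; an engine that merges prosody across chunks lets you push that floor lower than a conventional TTS would tolerate.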
3. Handling Interruptions (Barge-In)
Real conversation involves interruption. Users shouldn’t have to wait for the bot to finish.
Implementation Pattern:
- Listen While Speaking: Your ASR stream must remain active even while your TTS is playing audio.
- Echo Cancellation: Ensure your user’s microphone doesn’t pick up the bot’s voice, or use software Acoustic Echo Cancellation (AEC).
- The “Kill Switch”: When the ASR detects user speech (“Voice Activity Start”), your client must immediately signal the audio player to flush its buffer.
- Do not just pause—clear the queue. Old audio is irrelevant context.
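The “clear, don’t pause” rule can be seen in a minimal playback queue. `PCMQueue` here is a hypothetical name (it reappears in the architecture sketch below), and the 24kHz/16-bit mono framing is an assumption for the math, not a Lokutor-mandated format.

```javascript
// Minimal playback queue illustrating barge-in handling. Pausing would
// leave stale audio that resumes later; clear() guarantees the bot never
// finishes a sentence the user already talked over.
class PCMQueue {
  constructor() { this.chunks = []; }
  push(pcm) { this.chunks.push(pcm); }          // producer: TTS audio frames
  shift() { return this.chunks.shift(); }       // consumer: the audio device
  clear() { this.chunks.length = 0; }           // barge-in: drop stale audio
  get pendingMs() {
    // Assuming 16-bit mono PCM at 24 kHz: 48 bytes per millisecond
    const bytes = this.chunks.reduce((n, c) => n + c.byteLength, 0);
    return bytes / 48;
  }
}
```

Tracking `pendingMs` is also useful for telemetry: it tells you how much stale speech a pause-instead-of-clear bug would have replayed.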
Code Pattern: The WebSocket Pipeline
Here is a high-level representation of an optimized event loop using Lokutor’s WebSocket API.
```javascript
// Conceptual architecture, not a drop-in implementation
class VoiceAgent {
  constructor() {
    this.asr = new SpeechService();
    this.llm = new LLMService();
    this.tts = new LokutorClient({
      voice: 'versa-v1-m1',
      latencyOptimization: 'ultra' // Hints decoder to prioritize TTFB
    });
    this.audioQueue = new PCMQueue();
  }

  onUserSpeechStart() {
    // 1. Immediate interruption: clear, don't pause
    this.audioQueue.clear();
    this.tts.interrupt();
    this.llm.cancelGeneration();
  }

  async onUserSpeechEnd(transcript) {
    // 2. Stream to LLM
    const stream = this.llm.generate(transcript);

    // 3. Pipe LLM tokens directly to TTS.
    //    Don't wait for full sentences.
    for await (const chunk of stream) {
      this.tts.feed(chunk);
    }
  }

  onAudioReceived(pcmData) {
    // 4. Queue PCM; playback starts as soon as data arrives
    this.audioQueue.push(pcmData);
  }
}
```
Latency Budget Analysis
In a highly optimized system using Lokutor, your budget looks like this:
| Component | Time | Notes |
|---|---|---|
| ASR Processing | 200ms | Detection of utterance end + transcription |
| Network RTT | 50ms | Edge compute helps here |
| LLM TTFT | 150ms | Time to First Token (highly variable) |
| Lokutor TTS | <90ms | Parallel generation starts on first token |
| Client Buffer | 30ms | Minimal buffer to prevent jitter |
| Total Turn-Taking | ~520ms | Approaching human parity |
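As a sanity check, the rows above sum to the quoted total (taking 90ms, the table’s upper bound, for the TTS row). The object keys below are just labels for this arithmetic, not configuration parameters.

```javascript
// Per-component latency budget from the table above, in milliseconds
const budget = {
  asrProcessing: 200,  // utterance-end detection + transcription
  networkRtt: 50,      // round trip, assuming edge compute
  llmTtft: 150,        // time to first token (highly variable)
  lokutorTts: 90,      // upper bound; generation starts on first token
  clientBuffer: 30,    // minimal jitter buffer
};
const totalMs = Object.values(budget).reduce((sum, ms) => sum + ms, 0);
// totalMs is 520, matching the ~520ms turn-taking total
```

Note that the LLM and TTS rows overlap in practice (synthesis starts on the first token), so the real wall-clock total can come in under the straight sum.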
Conclusion
Latency is a full-stack problem. Using the world’s fastest TTS doesn’t help if your VAD adds a second of delay. But once you tune your orchestration, Lokutor allows you to cross the uncanny valley of timing, creating agents that feel less like IVR menus and more like colleagues.