When we talk about “latency” in engineering, we talk about milliseconds. But in conversation, we aren’t measuring time; we are measuring trust.
The Sacks-Schegloff-Jefferson Model
In 1974, sociologists Sacks, Schegloff, and Jefferson published a seminal paper on “Turn-Taking.” They analyzed thousands of hours of natural conversation and found a remarkable constant:
The average gap between one person stopping speaking and the next person starting is roughly 200 milliseconds.
This is faster than the human reaction time to a visual stimulus.
How is this possible? It implies that we do not listen, then think, then speak. We predict. We project the end of the speaker’s sentence and prepare our motor cortex to fire before they fall silent.
The Uncanny Valley of Time
When an AI voice agent introduces a delay of 800ms or 1000ms (typical for conventional cloud pipelines that chain speech recognition, a language model, and text-to-speech), it violates this fundamental biological rhythm.
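To make the arithmetic concrete, here is a back-of-the-envelope sketch of how such a gap accumulates in a cascaded pipeline. The stage names and timings are illustrative assumptions for the example, not measurements of any particular stack.

```python
# Illustrative latency budget for a cascaded voice pipeline.
# All stage timings below are assumptions, not measured values.

CASCADED_STAGES_MS = {
    "endpointing": 300,          # waiting for silence before deciding the user is done
    "speech_to_text": 150,       # final transcript from the STT service
    "llm_first_token": 250,      # first token from the language model
    "text_to_speech": 200,       # first audio chunk from TTS
    "network_and_buffering": 100,
}

def total_gap_ms(stages: dict[str, int]) -> int:
    """Each stage waits on the previous one, so the silent gap the
    user hears is simply the sum of all stage latencies."""
    return sum(stages.values())

print(f"Gap before the agent speaks: {total_gap_ms(CASCADED_STAGES_MS)} ms "
      "(vs. the ~200 ms human turn-taking norm)")
```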
This creates specific psychological effects:
- Perceived Incompetence: A study by Boltz (2005) found that delayed responses in interviews made candidates seem less intelligent and less honest. We subconsciously associate hesitation with “making things up.”
- Turn-Friction: If the gap is too long, the user assumes the AI didn’t hear them. They start saying “Hello?” or “Are you there?” just as the AI finally responds, and the resulting collision derails the conversation.
- Cognitive Load: Waiting forces the user to maintain “active attention.” In a seamless conversation, the flow is effortless. In a laggy one, the user is constantly managing the connection state in their head.
Engineering the “Backchannel”
To solve this, Lokutor targets a “Time to Insight” metric, not just Time to First Byte.
We enable:
- Backchanneling: The ability for the agent to emit “mm-hmm” or “right” during the user’s speech (overlapped audio), which signals active listening without claiming the floor.
- Rapid Barge-In: Because our generation is instantaneous, we can stop speaking the millisecond the user interrupts (see the sketch after this list). Traditional systems often have 500ms of audio buffered, meaning the AI keeps talking over you for half a second after you’ve interrupted. That “fighting for the floor” feeling is toxic to immersion.
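A minimal sketch of that barge-in behavior, assuming a hypothetical `player` with `write()`/`flush()` methods and a voice-activity detector that sets `user_speaking` the moment speech is detected (these names are stand-ins for illustration, not Lokutor’s actual API):

```python
import asyncio
from typing import AsyncIterator

FRAME_MS = 20  # tiny frames keep the worst-case overlap to roughly one frame

async def speak(frames: AsyncIterator[bytes],
                player,
                user_speaking: asyncio.Event) -> None:
    """Play synthesized audio frame by frame, yielding the floor
    as soon as the user starts speaking."""
    async for frame in frames:          # frames arrive as they are generated
        if user_speaking.is_set():      # barge-in detected by the VAD
            player.flush()              # drop anything already queued
            return                      # stop talking immediately
        player.write(frame)             # otherwise play this ~20 ms frame
```

The design point is the buffer, not the model: if half a second of audio is already queued at the output device, no amount of clever detection stops it from playing over the user.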
Conclusion
We aren’t obsessed with 100ms latency because it’s a nice round number. We are obsessed with it because it is the biological threshold for human connection. Anything slower is just an answering machine.