Have you ever talked to a voice assistant and experienced one of these frustrating moments?
The AI cuts you off mid-sentence because you paused to think
You finish speaking but wait... and wait... for a response that seems delayed
You say "My phone number is 555..." and the AI responds before you finish
You say "um" or "you know" and the AI thinks you're done talking
These problems stem from poor turn detection — the AI's ability to know when you've finished speaking and expect a response.
Turn detection (also called "endpointing" or "end-of-turn detection") is the process of determining when a speaker has completed their conversational turn and is waiting for a response.
Traditional Approach: Voice Activity Detection (VAD)
Early systems used simple silence detection: "If the user stops talking for 1-2 seconds, they must be done."
This fails because:
People naturally pause to think mid-sentence
Filler words like "um" and "uh" create brief silences
Background noise can be mistaken for speech
Different languages and cultures have different pause patterns
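The silence-timeout rule is simple enough to sketch in a few lines. This is a toy illustration; the frame size and both thresholds below are assumptions, not values from any particular system:

```python
# Toy silence-based endpointing: declare end-of-turn after N consecutive
# low-energy frames. All thresholds here are illustrative assumptions.

FRAME_MS = 10              # assumed frame hop
ENERGY_THRESHOLD = 0.01    # below this, a frame counts as silence
SILENCE_TIMEOUT_MS = 1000  # "user stopped for 1 second -> done"

def endpoint(frame_energies):
    """Return the frame index where end-of-turn is declared, or None."""
    needed = SILENCE_TIMEOUT_MS // FRAME_MS
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        silent_run = silent_run + 1 if energy < ENERGY_THRESHOLD else 0
        if silent_run >= needed:
            return i
    return None

# Speech (high energy), then a long pause: the endpoint fires one second
# into the silence, even if the speaker was only thinking mid-sentence.
frames = [0.5] * 50 + [0.001] * 200
print(endpoint(frames))  # -> 149
```

The failure modes above all live in this one counter: a thinking pause, a filler-word gap, or a noisy frame resets or trips `silent_run` with no regard for what was said.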
Modern Approach: Smart/Semantic Turn Detection
Smart turn detection uses AI to understand not just WHEN you stop talking, but WHETHER you're actually done. It analyzes:
Linguistic cues: Is the sentence grammatically complete?
Semantic context: Did the user answer the question fully?
Prosodic features: Does the intonation indicate finality?
Conversational context: What was the AI's last question?
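As a toy illustration of how these cues might combine, here is a hand-written heuristic. Production models learn this from data; every rule, feature name, and threshold below is an assumption:

```python
# Illustrative heuristic combining linguistic, prosodic, and timing cues.
# Real systems learn these rules from data; everything here is assumed.

FILLERS = {"um", "uh"}
STOP_PHRASES = {"wait", "hold on", "hang on"}

def turn_complete(transcript, silence_ms, intonation):
    """Decide end-of-turn from text + prosody + timing.

    intonation: 'falling' or 'rising' (assumed output of a pitch tracker).
    """
    text = transcript.strip().lower()
    # Explicit stop command: never take the turn.
    if any(p in text for p in STOP_PHRASES):
        return False
    # Trailing filler word: the speaker is still formulating.
    if text.rsplit(" ", 1)[-1].rstrip(".,") in FILLERS:
        return False
    # Rising intonation suggests more is coming (e.g. mid-list of digits).
    if intonation == "rising" and silence_ms < 1500:
        return False
    # Grammatical-completeness proxy: final punctuation plus a real pause.
    return text.endswith((".", "!", "?")) and silence_ms >= 300

print(turn_complete("I'd like to book a flight.", 500, "falling"))  # True
print(turn_complete("My phone number is 555", 400, "rising"))       # False
```

Even this crude version handles the phone-number case that pure silence detection gets wrong, because it consults more than the clock.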
For companies building voice AI products, turn detection directly impacts user satisfaction, task completion rates, call duration, brand perception, and conversion rates. It is often the difference between a voice AI that feels magical and one that feels broken.
Our investigation identified three main approaches to turn detection, each with distinct trade-offs.
Approach 1: Text-Based Turn Detection
How they work: Process the transcribed text (from speech-to-text) to determine whether a sentence is complete.
Best for: Systems that already have real-time transcription running.
Examples: TEN Turn Detection, LiveKit Turn Detector
Strengths:
Can understand semantic meaning ("I'm flying from San Francisco" → incomplete answer to "What are your departure AND arrival cities?")
Works with any audio quality since it only sees text
Can detect explicit stop commands ("wait", "hold on")
Weaknesses:
Dependent on transcription accuracy and latency
Misses prosodic cues (rising intonation, trailing off)
Can't detect that "123 456" was said with a rising tone signaling that more digits are coming
Approach 2: Audio-Based Turn Detection
How they work: Analyze raw audio waveforms to detect acoustic patterns that indicate turn completion.
Best for: Low-latency applications where every millisecond counts.
Examples: Pipecat Smart Turn, Krisp Turn-Taking, Easy Turn
Strengths:
Captures intonation, pitch, and speaking rhythm
No dependency on transcription
Can be extremely fast (12ms inference)
Often language-independent (prosody is universal)
Weaknesses:
Can't understand semantic meaning
Doesn't know if the answer is contextually complete
Sensitive to audio quality and background noise
Approach 3: Multimodal Turn Detection
How they work: Combine audio features and text transcription for the best of both worlds.
Best for: Applications prioritizing accuracy over latency.
Examples: Vogent Turn-80M, UltraVAD
Strengths:
Understands both what you said AND how you said it
Highest accuracy potential
Context-aware (knows the conversation history)
Weaknesses:
More complex to deploy
Requires both audio processing and transcription
Higher computational requirements
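A minimal way to picture multimodal fusion is score-level (late) fusion. Note this is a simplification: models like Vogent fuse embeddings inside the network rather than averaging scores, and the weights and threshold below are assumptions:

```python
# Minimal late-fusion sketch: blend a text-model score and an audio-model
# score. Real multimodal models fuse embeddings inside the network; this
# score-level version is a simplification, and the weights are assumed.

def fuse(text_score, audio_score, text_weight=0.6, threshold=0.5):
    """Both inputs are P(turn complete) in [0, 1]; returns the decision."""
    combined = text_weight * text_score + (1 - text_weight) * audio_score
    return combined >= threshold

# The text looks like a complete sentence, but rising pitch disagrees:
print(fuse(text_score=0.7, audio_score=0.1))  # 0.46 -> False
print(fuse(text_score=0.9, audio_score=0.8))  # 0.86 -> True
```

The first call shows the payoff of the approach: either channel alone would have been ambiguous, but together they resolve the turn correctly.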
TEN Turn Detection
Provider: Agora/TEN Framework
Input: Text only
Languages: English, Chinese
States Detected: 3 (finished, unfinished, wait)
Model Size: 7B parameters
Pros:
✓ Meets 90%+ accuracy threshold
✓ Detects "wait" commands
✓ Fully open source (Apache 2.0)
✓ 3-state classification
Cons:
✗ Only 2 languages supported
✗ Large model (7B parameters)
✗ Requires transcription first
✗ Higher latency
Resources: HuggingFace (TEN-framework/TEN_Turn_Detection) • GitHub (TEN-framework/ten-turn-detection)
LiveKit Turn Detector
Provider: LiveKit
Input: Text only
Languages: 13 (EN, FR, ES, DE, IT, PT, NL, ZH, JA, KO, ID, RU, TR)
States Detected: 2 (end/continue)
Inference: ~25ms
Pros:
✓ Good multilingual coverage (13 languages)
✓ Small model (0.5B params)
✓ Fast inference (25ms)
✓ Apache 2.0 license
Cons:
✗ No "wait" detection
✗ Only 2 states
✗ Requires transcription
✗ Accuracy varies by language
Resources: HuggingFace (livekit/turn-detector)
Pipecat Smart Turn v3.1
Provider: Daily/Pipecat
Input: Audio only
Languages: 23 (AR, BN, ZH, DA, NL, DE, EN, FI, FR, HI, ID, IT, JA, KO, MR, NO, PL, PT, RU, ES, TR, UK, VI)
Model Size: 8MB (CPU) / 32MB (GPU)
Inference: 12ms (CPU)
Pros:
✓ Fastest CPU inference (12ms)
✓ 23 languages supported
✓ Tiny model (8MB)
✓ Fully open (code + data + weights)
✓ Active community
Cons:
✗ No semantic understanding
✗ No backchannel detection
✗ Only 2 states
✗ Audio quality dependent
Resources: HuggingFace (pipecat-ai/smart-turn-v3) • GitHub (pipecat-ai/smart-turn)
Krisp Turn-Taking v2
Provider: Krisp
Input: Audio only
Languages: Language-independent (works on any language)
Model Size: 6M parameters
Inference: <20ms
Pros:
✓ Works on ANY language
✓ Smallest model (6M params)
✓ Very fast (<20ms)
✓ Production-proven
✓ Noise cancellation ecosystem
Cons:
✗ Closed source (SDK only)
✗ Proprietary/paid
✗ No semantic understanding
✗ Only 2 states
Resources: Krisp SDK (krisp.ai/developers/)
UltraVAD
Provider: Fixie/Ultravox
Input: Audio + Text (multimodal)
Languages: 26 (AR, BG, ZH, CS, DA, NL, EN, FI, FR, DE, EL, HI, HU, IT, JA, PL, PT, RO, RU, SK, ES, SV, TA, TR, UK, VI)
Model Size: ~1B parameters
Inference: 65-110ms
Pros:
✓ Best language coverage (26 languages)
✓ Context-aware
✓ Multimodal (audio + text fusion)
✓ Apache 2.0 license
Cons:
✗ Slower inference (65-110ms)
✗ Requires GPU
✗ Larger model (~1B params)
✗ Only 2 states
Resources: HuggingFace (fixie-ai/ultraVAD)
Easy Turn
Provider: ASLP-lab (Academic)
Input: Audio
Languages: Chinese only (English planned)
States Detected: 4 (complete, incomplete, backchannel, wait)
Training Data: 1,145 hours available
Pros:
✓ Only model with backchannel detection
✓ 4 states (most comprehensive)
✓ Fully open (code + 1,145 hrs data)
✓ Best benchmark dataset
Cons:
✗ Chinese only for now (English planned)
✗ Academic project
Resources: HuggingFace (ASLP-lab/Easy-Turn-Testset) • GitHub (ASLP-lab/Easy-Turn)
Vogent Turn-80M
Provider: Vogent
Input: Audio + Text (multimodal)
Languages: English only
Accuracy: 94.1%
Inference: 7ms (GPU)
Pros:
✓ Highest accuracy (94.1%)
✓ Fastest GPU inference (7ms)
✓ Context-aware
✓ Small for multimodal (80M params)
Cons:
✗ English only
✗ Restrictive license
✗ Requires GPU
✗ Only 2 states
Resources: HuggingFace (vogent/Vogent-Turn-80M) • GitHub (vogent/vogent-turn)
"I need the best accuracy for English" → Vogent Turn-80M (94.1% accuracy, multimodal)
"I need multilingual support" → UltraVAD (26 languages) or Smart Turn v3.1 (23 languages, faster)
"I need the fastest inference" → Vogent Turn-80M (7ms GPU) or Smart Turn v3.1 (12ms CPU)
"I need fully open source with training data" → Pipecat Smart Turn v3.1 or Easy Turn
"I need to detect backchannels (uh-huh, yeah)" → Easy Turn (only option, Chinese only currently)
"I need text-only with 90%+ accuracy" → TEN Turn Detection (90.6% finished, 98.4% unfinished)
"I need to support ANY language" → Krisp Turn-Taking v2 (language-independent, but closed source)
"I need the smallest model" → Krisp v2 (6M params) or Smart Turn v3.1 (8MB)
Want to train your own turn detection model? Here's a practical guide with example data you can use for text, audio, and multimodal approaches.
Collect Data: Gather conversational samples with turn boundaries labeled
Label Data: Mark each utterance as complete, incomplete, backchannel, or wait
Preprocess: Convert to model input format (text tokens, audio features, or both)
Train: Fine-tune a base model on your labeled data
Evaluate: Test on held-out data, measure accuracy per class
Deploy: Optimize for inference speed (quantization, ONNX export)
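To make the six steps concrete, here is a toy end-to-end run: hand-crafted features, plain logistic regression, and an accuracy check. The features, data, and hyperparameters are all illustrative assumptions; in practice you would fine-tune a pretrained model.

```python
import numpy as np

# Toy version of steps 3-5: featurize labeled utterances, train a logistic
# classifier with gradient descent, evaluate. Features and data are
# illustrative assumptions; real systems fine-tune a pretrained model.

FILLERS = {"um", "uh", "so"}

def featurize(text):
    words = text.lower().rstrip(".?!").split()
    return np.array([
        1.0 if text.strip().endswith((".", "?", "!")) else 0.0,  # punctuation
        1.0 if words and words[-1] in FILLERS else 0.0,          # trailing filler
    ])

data = [
    ("What time does the store close?", 1),   # complete
    ("Yes, that works for me.", 1),
    ("I'd like to book a flight to New York.", 1),
    ("I'm looking for a flight to, um", 0),   # incomplete
    ("My phone number is 555", 0),
    ("I was thinking, uh", 0),
]
X = np.stack([featurize(t) for t, _ in data])
y = np.array([label for _, label in data], dtype=float)

# Plain gradient descent on the logistic loss.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.5 * (X.T @ grad) / len(y)
    b -= 0.5 * grad.mean()

preds = (1 / (1 + np.exp(-(X @ w + b)))) >= 0.5
print((preds == y.astype(bool)).mean())  # training accuracy
```

With two features this is trivially separable; the point is only the shape of the loop: featurize, fit, measure per-class accuracy, then swap in a real model.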
Your training data should include the utterance, optional context, and a label.
For text-only models, you need transcript + context + label. See the downloadable JSONL file for complete examples.
COMPLETE Examples (User finished, expects response):
"What time does the store close?" → complete
"I'd like to book a flight to New York for next Tuesday." → complete
"Yes, that works for me." (context: "Would 3pm work?") → complete
INCOMPLETE Examples (User paused but will continue):
"I'm looking for a flight to, um" → incomplete
"My phone number is 555" → incomplete
"I'm departing from San Francisco." (context: "What are your departure AND arrival cities?") → incomplete
BACKCHANNEL Examples (Acknowledgment, not taking turn):
"Uh-huh" (context: AI still explaining) → backchannel
"Yeah" (context: AI describing process) → backchannel
"Mm-hmm" (context: AI giving instructions) → backchannel
WAIT Examples (Explicit stop/pause request):
"Wait, hold on a second." → wait
"Hang on, let me grab a pen." → wait
"Give me a minute." → wait
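Put together, the four classes above map naturally onto a JSONL file, one record per line. The field names here (utterance, context, label) are an assumed schema, not necessarily the exact schema of the downloadable file:

```python
import json

# One JSON object per line, covering all four label classes.
# The field names are an assumed schema for illustration.
jsonl = """\
{"utterance": "What time does the store close?", "context": "", "label": "complete"}
{"utterance": "My phone number is 555", "context": "", "label": "incomplete"}
{"utterance": "Uh-huh", "context": "AI still explaining", "label": "backchannel"}
{"utterance": "Wait, hold on a second.", "context": "", "label": "wait"}
"""

records = [json.loads(line) for line in jsonl.splitlines()]
print(sorted({r["label"] for r in records}))
# ['backchannel', 'complete', 'incomplete', 'wait']
```

JSONL keeps each example independent, so you can stream, shuffle, and split the file without parsing it all at once.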
For audio models, the model learns prosodic patterns (pitch, rhythm, intonation).
Audio Data Requirements:
Format: WAV or FLAC, 16kHz mono recommended
Duration: Last 2-8 seconds of user speech
Quality: Include both clean and noisy samples
Diversity: Multiple speakers, accents, speaking speeds
Key Prosodic Features:
↘ Falling intonation = typically complete
↗ Rising intonation = typically incomplete/question
... Trailing off = typically incomplete
Sharp/loud = typically wait (urgent)
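A rough way to extract the rising/falling cue is an autocorrelation pitch estimate at the start and end of the utterance. The sketch below runs on synthetic chirps standing in for real speech; the window sizes, pitch search range, and the 10% slope threshold are all assumptions:

```python
import numpy as np

# Estimate pitch at the start vs. end of an utterance via autocorrelation,
# then classify the intonation contour. Synthetic chirps stand in for real
# speech; window sizes and the 10% slope threshold are assumptions.

SR = 16000  # sample rate, Hz

def pitch_hz(frame):
    """Crude autocorrelation pitch estimate for a mono frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // 400, SR // 75           # search 75-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    return SR / lag

def contour(audio, win=SR // 4):
    start, end = pitch_hz(audio[:win]), pitch_hz(audio[-win:])
    if end < 0.9 * start:
        return "falling"   # typically complete
    if end > 1.1 * start:
        return "rising"    # typically incomplete / question
    return "flat"

t = np.linspace(0, 1, SR, endpoint=False)
falling = np.sin(2 * np.pi * (200 * t - 30 * t * t))  # chirp 200 Hz -> 140 Hz
rising = np.sin(2 * np.pi * (150 * t + 35 * t * t))   # chirp 150 Hz -> 220 Hz
print(contour(falling))  # falling
print(contour(rising))   # rising
```

Real prosody models learn these patterns end to end rather than hand-coding them, but this is the kind of signal an audio-based detector has access to and a text-based one does not.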
Text-Based: Fine-tune an LLM with a classification head on the last token.
Audio-Based: Fine-tune Wav2Vec2 or Whisper encoder. Add classification head, freeze early layers. Input is last 8 seconds of audio at 16kHz.
Multimodal: Fuse audio + text embeddings. Use Whisper encoder + LLM with projection layer. Example: Vogent uses Whisper encoder → SmolLM with 80M params.
Start with text-only — easier to prototype, then add audio later
Use a small base model — SmolLM (135M) or Qwen2.5-0.5B for low latency
Oversample rare classes — backchannel and wait are typically underrepresented
Include context — models that see the AI's last question perform much better
Test with real users — synthetic data doesn't capture all natural speech patterns
Optimize for false positives first — interrupting users is worse than slight delays
Quantize for production — INT8 quantization can give 2-4x speedup with minimal accuracy loss
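The quantization tip can be illustrated with symmetric INT8 weight quantization: store weights as int8 plus a single float scale and dequantize on the fly. Real deployments use a toolchain such as ONNX Runtime; this sketch only shows why the accuracy loss is small:

```python
import numpy as np

# Symmetric INT8 weight quantization: int8 values plus one float scale.
# A toolchain (e.g. ONNX Runtime) handles this in production; this sketch
# just demonstrates that the round-trip error is bounded and tiny.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(err <= scale / 2 + 1e-6)  # worst case is half a quantization step
```

Each weight moves by at most half a quantization step, which is why accuracy barely drops while the model shrinks to a quarter of its float32 size.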
Turn detection has evolved rapidly in 2024-2026, with several high-quality open-source options now available. The best choice depends on your specific requirements:
Prioritize speed? → Smart Turn v3.1 or Vogent
Prioritize accuracy? → Vogent or TEN
Prioritize language coverage? → UltraVAD or Smart Turn
Prioritize openness? → Smart Turn or Easy Turn
Need production-ready but closed-source? → Krisp
The field is moving fast, with new models and improvements released regularly. We recommend evaluating multiple options on your specific use case using the Easy Turn Testset or TEN TestSet benchmarks.