From Gut Feel to Hard Numbers: Meet the Voice Agent Quality Index (VAQI)

By Dan Mishler

If you’re building a voice agent application, you already know the truth: users don’t hang up or ask for a human because a bot gets a transcription slightly wrong – they quit because the conversation just feels … annoying.

Customer annoyances hide in three places:

– Interruptions – The bot barges in while the caller is mid-thought.

– Long, awkward gaps – The user wonders whether the bot hung up.

– Missed response windows – The customer stops, gives the hint “your turn,” and … nothing.

Individually, each metric tells only part of the story, which is why teams often fall back on qualitative reviews: “That demo sounded pretty good.” But in the world of non-deterministic outputs, relying on subjective analysis for large-scale applications is risky at best. At Deepgram, we decided “pretty good” isn’t good enough for enterprise SLAs – so we built a scoring system that turns “feels right” into a hard, quantifiable metric.

Introducing the Voice Agent Quality Index (VAQI). The VAQI condenses the three key timing pillars – interruptions (I), missed response windows (M), and latency (L) – into a single 0-to-100 score.

Normalization Methodology:

– Interruptions (40% weight): Each provider’s interruption count per conversation was normalized against the highest count among all providers for that same conversation, creating a 0–1 scale. This accounts for differences in conversation difficulty.

– Missed Responses (40% weight): Each provider’s count of missed response windows was normalized against the maximum missed count among all providers for the same conversation. Easier conversations are penalized more heavily for the same number of misses.

– Latency (20% weight): Applied a log transformation (log(1 + latency)) to reduce the impact of outliers, then normalized by maximum latency per conversation. This ensures high-latency outliers don’t skew results.

This approach keeps VAQI scores consistent and comparable across various conversation difficulties, avoiding distortion from edge cases or complex inputs.
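
To make the weighting concrete, below is a minimal sketch (in Python) of how the three normalized penalties might be combined into a 0-to-100 score. The 40/40/20 weights, per-conversation max normalization, and log(1 + latency) transform follow the methodology above; the final combination step and all function and variable names are illustrative assumptions, not the exact production formula.

```python
import math

# Illustrative sketch of the VAQI weighting described above.
# The per-conversation max normalization, 40/40/20 weights, and
# log(1 + latency) transform come from the methodology; combining
# them into a 0-100 score this way is an assumption for illustration.

W_INTERRUPT, W_MISSED, W_LATENCY = 0.40, 0.40, 0.20

def vaqi_score(interruptions, missed, latency_s,
               max_interruptions, max_missed, max_latency_s):
    """Score one provider's run of a single conversation.

    The `max_*` values are the worst observed among all providers on the
    same conversation, so each penalty lands on a 0-1 scale.
    """
    i_norm = interruptions / max_interruptions if max_interruptions else 0.0
    m_norm = missed / max_missed if max_missed else 0.0
    # Log transform dampens extreme latency outliers before normalizing.
    l_norm = (math.log1p(latency_s) / math.log1p(max_latency_s)
              if max_latency_s else 0.0)

    penalty = W_INTERRUPT * i_norm + W_MISSED * m_norm + W_LATENCY * l_norm
    return 100.0 * (1.0 - penalty)

# Example: 1 interruption, 0 missed windows, 0.8 s average latency,
# against worst-case peers of 4 interruptions, 3 misses, 3.5 s latency.
print(round(vaqi_score(1, 0, 0.8, 4, 3, 3.5), 1))
```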

Deepgram’s enterprise customers repeatedly said, “Sub-300ms latency is great, but an agent that talks and listens like a real human is what we really need.”

Speed alone doesn’t ensure a good experience. A fast STT model followed by a slow LLM still results in a poor interaction. Likewise, excellent language understanding is worthless if the agent constantly interrupts the user. VAQI was created as a balanced scoring system that reflects real-world conversational performance.

The Test Bed: Food Ordering, on Purpose

We chose a food-ordering scenario – a deceptively simple but highly challenging dialog type – for testing. It includes natural pauses and fillers (“Hi, um … can I get a …”); contradictions (“Make that large … no, sorry, medium”); background noise (e.g., restaurant kitchen ambiance); and sparse response windows with few ideal “agent should speak now” moments.

This reflects the messy audio enterprises actually face – not pristine, studio-quality recordings.

Methodology at a Glance

– Enterprise Focus: We used 16 kHz PCM pre-recorded calls streamed over secure websockets to five providers, including Deepgram, OpenAI, ElevenLabs, and Azure.

– 50 ms Chunks: Audio was sliced identically and synchronized to the microsecond across providers (see the streaming sketch after this list).

– Multiple Passes: Each call was processed at least 10 times to control for natural LLM and network variance.

– Full-Stack Timestamps: Events were mapped back to the WAV file to calculate I, M, and L (see the second sketch after this list).

– Outlier Control: VAQI flags, but does not discard, extreme latency or disconnects – penalizing brittle systems.
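
For a sense of the streaming setup, here is a rough sketch that slices 16 kHz, 16-bit mono PCM into 50 ms chunks (800 samples, or 1,600 bytes, per chunk) and paces them over a websocket in real time. The endpoint URL, file name, and any provider-specific handshake or message framing are placeholders, not any vendor’s actual API.

```python
import asyncio
import websockets  # pip install websockets

SAMPLE_RATE = 16_000          # 16 kHz PCM, as in the benchmark
BYTES_PER_SAMPLE = 2          # 16-bit mono
CHUNK_MS = 50
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 1,600 bytes

async def stream_call(pcm_path: str, url: str):
    """Stream a pre-recorded call in real time, 50 ms at a time."""
    with open(pcm_path, "rb") as f:
        audio = f.read()

    async with websockets.connect(url) as ws:
        for offset in range(0, len(audio), CHUNK_BYTES):
            await ws.send(audio[offset:offset + CHUNK_BYTES])
            # Pace the stream so each provider receives audio at real-time speed.
            await asyncio.sleep(CHUNK_MS / 1000)

# Placeholder endpoint; each provider's real URL, auth, and message
# framing differ and are omitted here.
# asyncio.run(stream_call("call_001.pcm", "wss://example.invalid/agent"))
```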
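
And here is one possible way timestamped speech events could be reduced to the three raw inputs (I, M, L). The Turn structure, the 3-second miss threshold, and the detection rules are assumptions made for illustration; the post does not publish the exact rules used in the benchmark.

```python
from dataclasses import dataclass

# Hypothetical event records aligned to the source WAV timeline (seconds).
@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    start: float
    end: float

MISS_THRESHOLD_S = 3.0  # assumed cutoff for a "missed response window"

def raw_metrics(turns: list[Turn]):
    """Count interruptions and missed windows, and average response latency."""
    interruptions = missed = 0
    latencies = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gap = cur.start - prev.end
            if gap < 0:
                interruptions += 1   # agent started before the user finished
            else:
                if gap > MISS_THRESHOLD_S:
                    missed += 1      # agent left the response window hanging
                latencies.append(gap)
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return interruptions, missed, avg_latency
```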

What We Learned

– Single metrics can be misleading. One provider had a perfect interruption score but multi-second delays – dragging its VAQI below 70/100.

– Latency over ~3 seconds breaks perception. Even low-interruption, low-miss runs scored in the 50s when latency exceeded 3 seconds.

– Balance wins. Deepgram achieved top scores (70+) with a blend of sub-second latency, low interruptions, and minimal missed cues.

Why VAQI Beats the Back-of-the-Napkin Test

– Actionable Targets: Engineers can improve specific areas – cut interruptions, reduce latency, or tune response handling.

– Procurement Clarity: A VAQI score gives decision-makers clear, comparable benchmarks for evaluating vendors.

– End-to-End Accountability: VAQI doesn’t care where the delay or miss comes from – if the user experience suffers, the score reflects it.

From Qualitative to Quantitative

Until now, “pleasantness” in voice agents was unmeasured. With VAQI, we’ve shown it can be scored, tracked and improved over time. This gives every stakeholder – executives, PMs, engineers – a solid data foundation to act on.

A VAQI score of 71.5 isn’t a perfect 100. Deepgram currently leads our internal leaderboard thanks to fast, accurate STT, natural TTS and precise end-of-thought detection. But we’re far from done. VAQI is a scoreboard, not a trophy.

In the coming months, we’ll release updated rankings, audio comparisons and deeper insights into what’s moving the scores. If you’re building or evaluating voice agents, keep your eye on this space – and demand more from every provider you consider. Because the agents that sound effortless are the ones customers come back to.

Dan Mishler is chief of staff at Deepgram.

This post originally appeared on the Deepgram blog.