● CASE STUDY · VOICE · REAL-TIME AI

Callaquest: cutting AI voice latency until calls stopped dropping

PERFORMANCE ENGINEERING · STREAMING INFERENCE · RIGHT-SIZED INFRA

Callaquest is a real-time AI voice and calling product. In a voice product, latency isn't a metric on a dashboard — it's the conversation. A pause that would be invisible in a chat app becomes an awkward silence on a call, and enough of them in a row becomes a dropped call and a churned user. The pipeline was accurate. It just wasn't fast enough to feel human.

By Yogreet Global Engineering · 10 min read · Updated June 2026

3× faster response time

1,900 → 620 ms turn latency

<1% call drop rate

−40% infra cost at peak

A note on the numbers: the figures on this page illustrate the structure and scale of the engagement; exact metrics are shared with Callaquest's permission on request.

The situation

Every spoken turn ran through a chain: capture audio, transcribe it, run the language model, synthesise a voice reply, stream it back. Each link added its own delay, and the links ran strictly one after another — nothing started until the previous step had completely finished. The user heard the sum of all of it as a single, uncomfortable pause before the AI spoke.

Under a few test users it was tolerable. Under real concurrent load, the servers — provisioned as one fixed pool for the whole pipeline — started queuing requests, and the pause stretched until people simply hung up.

The problem we found

We measured per-turn latency as concurrent calls increased. Before the rebuild, latency climbed past the ~1-second mark where a conversation stops feeling natural, then kept climbing until calls dropped. After the rebuild, it stayed flat and well under that line.

Figure: Per-turn latency as concurrent calls increase (Before vs. After)
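One way to produce a curve like this is to bucket latency samples by how many calls were live when each turn was served, then take a high percentile per bucket. The sketch below is an assumption about how such a measurement could be done, not Callaquest's actual tooling: the function names and the `(concurrent_calls, turn_latency_ms)` sample shape are hypothetical, and the 1,000 ms threshold simply encodes the ~1-second mark mentioned above.

```python
from collections import defaultdict
from statistics import quantiles


def p95_by_concurrency(samples):
    """Group (concurrent_calls, turn_latency_ms) pairs by load level.

    Returns {concurrent_calls: p95 latency in ms}, ordered by load level.
    """
    buckets = defaultdict(list)
    for concurrent_calls, latency_ms in samples:
        buckets[concurrent_calls].append(latency_ms)

    result = {}
    for level in sorted(buckets):
        values = buckets[level]
        if len(values) < 2:
            # quantiles() needs at least two points; fall back to the value itself.
            result[level] = float(values[0])
        else:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            result[level] = quantiles(values, n=20, method="inclusive")[-1]
    return result


def first_level_over(p95_by_level, threshold_ms=1000):
    """Return the lowest load level whose p95 exceeds the threshold, or None."""
    for level in sorted(p95_by_level):
        if p95_by_level[level] > threshold_ms:
            return level
    return None
```

Plotting the returned dictionary for a "before" and an "after" run gives the two lines in the figure; `first_level_over` marks the load at which conversations would start to feel unnatural.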

Root causes

A fully sequential pipeline. Transcription, reasoning and speech synthesis ran strictly in order — the model didn't start until the last word was transcribed, and speech didn't start until the model fully finished.

No streaming. The reply was generated in full before any audio played, so the user waited for the entire sentence instead of hearing it begin.

One fixed server pool for the whole pipeline. Speech synthesis and the language model competed for the same capacity, and nothing scaled to match call volume.

Cold model loads on traffic spikes. When call volume jumped, new instances spent precious seconds warming up — exactly when latency mattered most.

What we rebuilt

The win came from overlapping the pipeline instead of running it end to end, and from sizing each stage independently so no single step became the bottleneck under load.

BEFORE — sequential

Transcribe → Language model → Synthesise speech → Play

The user waits for the sum of every stage: ≈ 1,900 ms

AFTER — streamed & overlapped

Transcribe | Language model | Synthesise + play (running concurrently)

Stages overlap; first audio plays as soon as the model starts: ≈ 620 ms to first sound
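The difference between the two layouts can be put as simple arithmetic. The sketch below is a deliberately simplified model, assuming each stage can be described by two numbers: how long it takes to finish and how long it takes to emit its first usable chunk. The `Stage` type, the function names and the example timings are all hypothetical placeholders; they are not Callaquest's per-stage measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    total_ms: int         # time for the stage to finish completely
    first_output_ms: int  # time until the stage emits its first usable chunk


def sequential_wait_ms(stages):
    """Each stage waits for the previous one to finish, so the user hears the sum."""
    return sum(s.total_ms for s in stages)


def streamed_first_audio_ms(stages):
    """Each stage starts on the previous stage's first chunk, so only each
    stage's time-to-first-output sits on the path to first sound."""
    return sum(s.first_output_ms for s in stages)


# Illustrative placeholder timings, NOT Callaquest measurements.
EXAMPLE_STAGES = [
    Stage("transcribe", total_ms=400, first_output_ms=150),
    Stage("language_model", total_ms=900, first_output_ms=200),
    Stage("synthesise", total_ms=500, first_output_ms=120),
]

if __name__ == "__main__":
    print("sequential wait:", sequential_wait_ms(EXAMPLE_STAGES), "ms")
    print("streamed first audio:", streamed_first_audio_ms(EXAMPLE_STAGES), "ms")
```

The total work per turn is unchanged in this model; what shrinks is the part of it the caller has to sit through before hearing anything.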

Why each change mattered

Streaming, token-by-token: speech synthesis starts on the first words the model produces, so the user hears a reply begin almost immediately instead of waiting for the whole sentence.

Overlapped stages: transcription of the tail of a sentence overlaps with reasoning on its start — the pipeline stops being a strict relay race.

Independent scaling per stage: speech synthesis and the language model now scale on their own curves, with warm capacity kept ready ahead of predictable call peaks — so spikes don't pay a cold-start penalty.
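The streaming and overlap ideas above can be sketched with a small producer/consumer pipeline. This is a minimal illustration using Python's `asyncio`, assuming a model that streams words and a synthesiser that accepts short phrases; `fake_llm_tokens`, `produce_phrases`, `synthesise` and `run_turn` are hypothetical stand-ins, not Callaquest's code or any vendor's API.

```python
import asyncio


async def fake_llm_tokens(text):
    """Stand-in for a streaming language model: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(0)  # give other tasks a turn, as a network stream would
        yield word


async def produce_phrases(tokens, queue, log, max_words=4):
    """Group tokens into short speakable phrases and hand each off immediately."""
    buffer = []
    async for token in tokens:
        buffer.append(token)
        # Flush at punctuation or when the phrase is long enough to speak.
        if token.endswith((".", ",", "?", "!")) or len(buffer) >= max_words:
            await queue.put(" ".join(buffer))
            buffer = []
    if buffer:
        await queue.put(" ".join(buffer))
    log.append("llm:done")
    await queue.put(None)  # sentinel: no more phrases


async def synthesise(queue, log):
    """Stand-in for streaming speech synthesis: speaks phrases as they arrive."""
    while (phrase := await queue.get()) is not None:
        await asyncio.sleep(0)  # placeholder for real synthesis work
        log.append(f"tts:{phrase}")


async def run_turn(reply_text):
    """Run model and synthesiser concurrently and return the order of events."""
    log = []
    queue = asyncio.Queue(maxsize=2)  # small buffer applies backpressure
    await asyncio.gather(
        produce_phrases(fake_llm_tokens(reply_text), queue, log),
        synthesise(queue, log),
    )
    return log


if __name__ == "__main__":
    for event in asyncio.run(run_turn("Sure, I can help with that. What is your account number?")):
        print(event)
```

In the event log, the first `tts:` entry appears before `llm:done`: speech starts while the model is still generating, which is the behaviour the sequential pipeline could not offer. The bounded queue is a design choice so that a slow synthesiser throttles the producer instead of letting phrases pile up.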

What this means going forward

Latency stopped being the thing that capped Callaquest's growth. The product can take on more concurrent calls without the conversation degrading, and it does so on less peak infrastructure than before, because each stage is sized for its own load instead of one pool over-provisioned for the worst case. Faster and cheaper, at the same time, is what right-sizing a real-time pipeline buys you.
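As a rough illustration of sizing each stage for its own load, the sketch below computes a replica count per stage from expected concurrent calls, a per-replica capacity and a warm headroom percentage. Every name and number here is an assumption made for the example (`replicas_needed`, `plan_stages`, the capacities, the 20% headroom); real capacity figures would come from load testing each stage.

```python
def replicas_needed(concurrent_calls, calls_per_replica, warm_headroom_pct=20, min_replicas=1):
    """Replicas required to serve the load plus warm spare capacity.

    Integer arithmetic avoids floating-point rounding surprises in the ceiling.
    """
    if calls_per_replica <= 0:
        raise ValueError("calls_per_replica must be positive")
    target = concurrent_calls * (100 + warm_headroom_pct)
    needed = -(-target // (100 * calls_per_replica))  # ceiling division
    return max(min_replicas, needed)


def plan_stages(concurrent_calls, stage_capacity, warm_headroom_pct=20):
    """Size every stage independently from its own per-replica capacity."""
    return {
        stage: replicas_needed(concurrent_calls, capacity, warm_headroom_pct)
        for stage, capacity in stage_capacity.items()
    }


# Hypothetical capacities (concurrent calls one replica can serve), for illustration only.
EXAMPLE_CAPACITY = {"transcribe": 50, "language_model": 10, "synthesise": 25}

if __name__ == "__main__":
    print(plan_stages(200, EXAMPLE_CAPACITY))
```

Calling `plan_stages` with the forecast for an upcoming peak, before the peak arrives, is one way to keep warm capacity ready so that new instances are not loading models while callers wait.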