Callaquest is a real-time AI voice and calling product. In a voice product, latency isn't a metric on a dashboard — it's the conversation. A pause that would be invisible in a chat app becomes an awkward silence on a call, and enough of them in a row becomes a dropped call and a churned user. The pipeline was accurate. It just wasn't fast enough to feel human.
A note on the numbers: the figures on this page illustrate the structure and scale of the engagement; exact metrics are shared with Callaquest's permission on request.
Every spoken turn ran through a chain: capture audio, transcribe it, run the language model, synthesise a voice reply, stream it back. Each link added its own delay, and the links ran strictly one after another — nothing started until the previous step had completely finished. The user heard the sum of all of it as a single, uncomfortable pause before the AI spoke.
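To make the arithmetic concrete, here is a minimal sketch of why a strictly sequential chain feels slow: the caller hears the sum of every stage. The stage names follow the chain described above, but the millisecond values are hypothetical placeholders chosen for illustration, not Callaquest's measured figures.

```python
# Hypothetical per-stage delays in milliseconds (illustrative only, not measured values).
STAGE_DELAYS_MS = {
    "capture": 100,
    "transcribe": 300,
    "language_model": 600,
    "synthesise": 400,
    "stream_back": 50,
}


def sequential_pause_ms(stage_delays_ms: dict[str, int]) -> int:
    """In a strictly sequential chain, the pause is the sum of all stage delays."""
    return sum(stage_delays_ms.values())


def slowest_stage(stage_delays_ms: dict[str, int]) -> str:
    """Name of the stage that contributes the most delay."""
    return max(stage_delays_ms, key=stage_delays_ms.get)


if __name__ == "__main__":
    print(sequential_pause_ms(STAGE_DELAYS_MS))  # 1450
    print(slowest_stage(STAGE_DELAYS_MS))        # language_model
```

With these placeholder values no single stage exceeds a second, yet the total does, which is the pattern the paragraph above describes.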
With only a few test users, the pause was tolerable. Under real concurrent load, the servers (provisioned as one fixed pool for the whole pipeline) started queuing requests, and the pause stretched until people simply hung up.
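The queuing effect can be sketched with a deliberately simplified model. It assumes a burst of calls arriving at once, a fixed pool that serves a set number of turns at a time, and an identical service time for every turn. The pool size and service time below are hypothetical, and real traffic is far less regular than this.

```python
def turn_latency_ms(position: int, pool_size: int, service_ms: int) -> int:
    """Latency for the request at 0-based `position` in a burst.

    Simplifying assumptions: all requests arrive together, each takes
    `service_ms`, and the pool works on `pool_size` requests at a time.
    """
    waves_ahead = position // pool_size  # full batches served before this one
    return (waves_ahead + 1) * service_ms


def worst_latency_ms(concurrent_calls: int, pool_size: int, service_ms: int) -> int:
    """Latency seen by the last request in a burst of `concurrent_calls`."""
    if concurrent_calls < 1:
        raise ValueError("concurrent_calls must be at least 1")
    return turn_latency_ms(concurrent_calls - 1, pool_size, service_ms)


if __name__ == "__main__":
    # Hypothetical pool of 8 slots and 1000 ms per turn.
    for calls in (4, 8, 9, 32):
        print(calls, worst_latency_ms(calls, pool_size=8, service_ms=1000))
```

In this toy model the pause is flat until the pool is full, then grows by one full service time for every extra batch of callers waiting in line.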
We measured per-turn latency as the number of concurrent calls increased. Before the rebuild, latency climbed past the ~1-second mark where a conversation stops feeling natural, then kept climbing until calls dropped. After the rebuild, it stayed flat and well under that line.
A fully sequential pipeline. Transcription, reasoning and speech synthesis ran strictly in order — the model didn't start until the last word was transcribed, and speech didn't start until the model fully finished.
No streaming. The reply was generated in full before any audio played, so the user waited for the entire sentence instead of hearing it begin.
One fixed server pool for the whole pipeline. Speech synthesis and the language model competed for the same capacity, and nothing scaled to match call volume.
Cold model loads on traffic spikes. When call volume jumped, new instances spent precious seconds warming up — exactly when latency mattered most.
The win came from overlapping the pipeline instead of running it end to end, and from sizing each stage independently so no single step became the bottleneck under load.
Streaming, token-by-token: speech synthesis starts on the first words the model produces, so the user hears a reply begin almost immediately instead of waiting for the whole sentence.
Overlapped stages: transcription of the tail of a sentence overlaps with reasoning on its start — the pipeline stops being a strict relay race.
Independent scaling per stage: speech synthesis and the language model now scale on their own curves, with warm capacity kept ready ahead of predictable call peaks — so spikes don't pay a cold-start penalty.
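The streaming and overlap ideas above can be sketched with Python's `asyncio`. This is an illustration under stated assumptions, not Callaquest's implementation: `model_tokens` and `synthesise` are hypothetical stand-ins that sleep instead of calling real models, and the delays are placeholders. The sketch overlaps the model with speech synthesis; the same producer and consumer pattern would apply between transcription and the model.

```python
import asyncio
import time
from typing import AsyncIterator

TOKEN_DELAY_S = 0.02  # hypothetical time for the model to emit one word
SYNTH_DELAY_S = 0.02  # hypothetical time to synthesise audio for one word


async def model_tokens(reply: str) -> AsyncIterator[str]:
    """Stand-in for a streaming language model: yields one word at a time."""
    for word in reply.split():
        await asyncio.sleep(TOKEN_DELAY_S)
        yield word


async def synthesise(text: str) -> str:
    """Stand-in for speech synthesis: cost grows with the number of words."""
    await asyncio.sleep(SYNTH_DELAY_S * len(text.split()))
    return f"audio({text})"


async def sequential_turn(reply: str) -> tuple[list[str], float]:
    """Generate the whole reply, then synthesise it all, then play."""
    start = time.perf_counter()
    words = [w async for w in model_tokens(reply)]
    audio = await synthesise(" ".join(words))
    return [audio], time.perf_counter() - start


async def streamed_turn(reply: str) -> tuple[list[str], float]:
    """Synthesise each word as soon as the model emits it."""
    start = time.perf_counter()
    queue = asyncio.Queue()

    async def produce() -> None:
        async for word in model_tokens(reply):
            await queue.put(word)
        await queue.put(None)  # sentinel: the model has finished

    producer = asyncio.create_task(produce())  # model keeps running in the background
    chunks: list[str] = []
    first_audio_s = 0.0
    while (word := await queue.get()) is not None:
        chunks.append(await synthesise(word))
        if len(chunks) == 1:
            first_audio_s = time.perf_counter() - start
    await producer
    return chunks, first_audio_s


if __name__ == "__main__":
    reply = "thanks for calling how can I help"
    _, seq_first = asyncio.run(sequential_turn(reply))
    _, str_first = asyncio.run(streamed_turn(reply))
    print(f"sequential first audio: {seq_first:.3f}s, streamed: {str_first:.3f}s")
```

The second value each function returns is the time until the first audio chunk is ready. In the sequential version that wait covers the whole reply; in the streamed version it covers roughly one word of generation plus one word of synthesis, because the producer task keeps generating while earlier words are being synthesised.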
Latency stopped being the thing that capped Callaquest's growth. The product can take on more concurrent calls without the conversation degrading, and it does so on less peak infrastructure than before, because each stage is sized for its own load instead of one pool being over-provisioned for the worst case. Faster and cheaper, at the same time, is what right-sizing a real-time pipeline buys you.
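One common way to reason about sizing each stage for its own load is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the stage. The sketch below applies it per stage. The `Stage` fields, service times, and per-instance slot counts are all hypothetical, and a real deployment would add headroom for bursts on top of these averages.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    service_s: float         # hypothetical mean time one request spends in the stage
    slots_per_instance: int  # hypothetical concurrent requests one instance can serve


def instances_needed(stage: Stage, turns_per_second: float) -> int:
    """Little's law: mean requests in flight = arrival rate x mean time in stage."""
    in_flight = turns_per_second * stage.service_s
    return max(1, math.ceil(in_flight / stage.slots_per_instance))


if __name__ == "__main__":
    stages = [
        Stage("transcribe", service_s=0.3, slots_per_instance=10),
        Stage("language_model", service_s=0.6, slots_per_instance=4),
        Stage("synthesise", service_s=0.4, slots_per_instance=8),
    ]
    for stage in stages:
        print(stage.name, instances_needed(stage, turns_per_second=50))
```

Because each stage has its own service time and per-instance capacity, the same call volume yields a different instance count for each one, which is the reasoning behind scaling stages on their own curves.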
Real-time AI lives or dies on the pipeline behind it. A build audit will pinpoint which stage is costing you the seconds — and the users.