# How to Build a Voice Agent: The STT → LLM → TTS Pipeline

> How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.

A voice agent is a real-time loop: speech-to-text transcribes the user, an LLM picks the reply, and text-to-speech speaks it back. What separates a usable agent from a frustrating one is the latency budget — every stage adds delay, and the round trip must feel conversational. This guide covers the pipeline, the providers per stage, turn-taking, and engineering the latency.

A voice agent sounds simple: the user talks, the agent talks back. Under the hood it's a **real-time pipeline** with three stages and an unforgiving latency budget. Speech-to-text turns the user's audio into text, an LLM decides the reply, and text-to-speech speaks it. Get the architecture right and it feels like a conversation; get the latency wrong and it feels like a bad phone call. This guide walks through the pipeline and the engineering that makes it work.

## The pipeline: STT → LLM → TTS

Three stages, in a loop, ideally all streaming:

- **Speech-to-text (STT)** — transcribe the incoming audio, streaming partial results. A specialized provider like [Deepgram](/tools/deepgram) gives you low-latency streaming transcription with voice-activity detection and endpointing.
- **LLM** — take the transcript (plus history) and generate the reply, streamed token by token. This is an ordinary LLM call — route it through your gateway so you can right-size the model and add fallback (see [Calling Any Model](/guides/concepts/calling-any-model-gateways)).
- **Text-to-speech (TTS)** — turn the reply into audio as the tokens arrive, so playback starts before the reply is finished. [ElevenLabs](/tools/elevenlabs) and Deepgram's Aura are common choices.

The non-negotiable principle: **stream everything**. If you wait for STT to fully finish before calling the LLM, then wait for the whole LLM reply before starting TTS, the delays stack into something unusable. Overlap the stages.
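To make the streaming hand-off concrete, here is a minimal sketch in Python using async generators. Everything in it is a stand-in: `fake_llm`, `sentence_chunks`, `fake_tts`, and `respond` are hypothetical names, the reply tokens are hard-coded, and the "audio" is just the UTF-8 bytes of the text. The point is only the shape: the TTS stage receives a sentence as soon as it is complete, so the first audio is produced before the LLM has finished the reply.

```python
import asyncio
from typing import AsyncIterator, List

# Hypothetical stand-ins: a real agent would call streaming LLM/TTS providers here.


async def fake_llm(transcript: str, events: List[str]) -> AsyncIterator[str]:
    """Yield the reply token by token, as a streaming LLM call would."""
    for token in ["Sure", ".", " Your", " order", " ships", " today", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop, like a network read
        yield token
    events.append("llm_done")  # recorded only once the whole reply has been generated


async def sentence_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentence-sized chunks that TTS can start on early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        # Naive sentence boundary: good enough for a sketch, not for real text.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush a trailing fragment with no closing punctuation
        yield buffer.strip()


async def fake_tts(chunks: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Turn each text chunk into placeholder 'audio' (the UTF-8 bytes of the text)."""
    async for text in chunks:
        yield text.encode("utf-8")


async def respond(transcript: str):
    """Run LLM -> sentence chunking -> TTS as one streamed chain."""
    events: List[str] = []
    audio: List[bytes] = []
    async for frame in fake_tts(sentence_chunks(fake_llm(transcript, events))):
        events.append("audio")  # a real agent would send the frame to the transport here
        audio.append(frame)
    return audio, events


audio, events = asyncio.run(respond("Where is my order?"))
print(audio)   # two audio chunks, one per sentence
print(events)  # both "audio" entries appear before "llm_done"
```

This chain is pull-based, so the stages interleave rather than run truly in parallel. A production pipeline would more likely put each stage in its own task with queues between them, but the ordering guarantee is the same: audio starts before the reply ends.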

## Latency is the product

> [!WARNING]
> The single biggest determinant of whether a voice agent feels good is **mouth-to-ear latency**: the time from the moment the user stops speaking to the first audio of the reply. Natural conversation has gaps of only a couple hundred milliseconds, so a round trip much past a second starts to feel broken, and that budget has to cover endpointing, the LLM's time-to-first-token, and the TTS's time-to-first-byte *combined*. Optimize the round trip, not any single stage in isolation.

The levers, in rough order of impact: stream every stage; use low-latency STT/TTS models; keep the LLM prompt tight and the model right-sized (the [LLM cost and latency engineering](/guides/advanced/llm-cost-latency-engineering) playbook applies directly); and minimize network hops between services.
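One way to keep the round trip honest is to record the per-stage timings for each turn and add them up. The sketch below is illustrative only: the `TurnTimings` class, the 1000 ms ceiling, and the sample numbers are assumptions chosen for the example, not measurements or provider figures.

```python
from dataclasses import asdict, dataclass

BUDGET_MS = 1000.0  # assumption: "about a second" treated as the working ceiling


@dataclass(frozen=True)
class TurnTimings:
    """Latencies for one turn, in milliseconds (hypothetical field names)."""

    endpointing_ms: float  # end of user speech -> end-of-turn decision
    llm_ttft_ms: float     # transcript sent -> first LLM token
    tts_ttfb_ms: float     # first text sent -> first audio byte
    network_ms: float = 0.0  # extra hops between services

    def total_ms(self) -> float:
        """Mouth-to-ear latency: the stages run back to back, so they add."""
        return sum(asdict(self).values())

    def slowest_stage(self) -> str:
        """Name of the stage to look at first."""
        stages = asdict(self)
        return max(stages, key=stages.get)

    def over_budget(self, budget_ms: float = BUDGET_MS) -> bool:
        return self.total_ms() > budget_ms


# Made-up sample turn: each stage looks tolerable alone, the sum does not.
turn = TurnTimings(endpointing_ms=300, llm_ttft_ms=450, tts_ttfb_ms=200, network_ms=120)
print(turn.total_ms(), turn.slowest_stage(), turn.over_budget())
```

In this made-up turn no single stage looks alarming, yet the total lands past the ceiling, which is why the budget is tracked as a sum rather than stage by stage.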

## Turn-taking: the part everyone underestimates

A natural conversation isn't just fast — it has rhythm. Three mechanics matter as much as model quality:

- **Voice-activity detection (VAD)** — knowing when the user is speaking versus silent.
- **Endpointing** — deciding the user has actually *finished*, not just paused. Too eager and you cut them off; too patient and the agent feels slow.
- **Barge-in** — when the user starts talking while the agent is speaking, immediately stop the TTS and the in-flight LLM call. Without barge-in, the agent steamrolls the user and the illusion breaks.
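The endpointing and barge-in mechanics above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular provider does it: `Endpointer` and `demo_barge_in` are hypothetical names, the VAD decisions are assumed to arrive as one boolean per fixed-size audio frame, the 500 ms silence threshold is an arbitrary default, and the agent's turn (LLM call plus TTS playback) is modeled as a single `asyncio` task that is cancelled when the user speaks.

```python
import asyncio
from typing import List, Tuple


class Endpointer:
    """Signal end of turn after `silence_ms` of unbroken silence that follows speech."""

    def __init__(self, silence_ms: int = 500, frame_ms: int = 20) -> None:
        self.frames_needed = max(1, silence_ms // frame_ms)
        self.heard_speech = False
        self.silent_frames = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision per audio frame; returns True when the turn has ended."""
        if is_speech:
            self.heard_speech = True
            self.silent_frames = 0  # a pause that ends in more speech does not count
            return False
        if not self.heard_speech:
            return False  # leading silence: the user has not started yet
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.heard_speech = False  # reset for the next turn
            self.silent_frames = 0
            return True
        return False


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for the agent's turn: streams reply chunks out one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.01)  # placeholder for generating and playing one chunk
        played.append(chunk)


async def demo_barge_in() -> Tuple[List[str], bool]:
    played: List[str] = []
    agent_turn = asyncio.create_task(speak(["a", "b", "c", "d", "e"], played))
    await asyncio.sleep(0.025)  # the user starts talking partway through the reply
    agent_turn.cancel()         # barge-in: stop playback and in-flight generation
    try:
        await agent_turn
    except asyncio.CancelledError:
        pass  # expected: the turn was interrupted on purpose
    return played, agent_turn.cancelled()


if __name__ == "__main__":
    played, cancelled = asyncio.run(demo_barge_in())
    print(played, cancelled)  # only part of the reply was played before the cancel
```

Lowering `silence_ms` makes the agent quicker to respond but more likely to cut the user off mid-pause; raising it does the opposite. That trade-off is the tuning knob the endpointing bullet describes.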

## Orchestrate it with a framework

Building real-time audio transport, streaming hand-offs, and turn-taking by hand is where most voice projects stall. An open-source framework like [Pipecat](/tools/pipecat) gives you the composable STT → LLM → TTS pipeline, WebRTC/WebSocket transports, and integrations with dozens of providers — so you build the agent's behavior, not the plumbing. You can prototype against a single bundled voice-agent API first, then unbundle the stage that's limiting you.

## Putting it together

Stream STT with good endpointing → stream the LLM through your gateway → stream TTS so audio starts early → orchestrate with a framework → add barge-in and tune turn-taking → budget and measure mouth-to-ear latency, then fix the slowest stage. The [voice-agent-engineer](/agents/data-ai/voice-agent-engineer) agent builds and tunes this loop end-to-end. For the model side, whether to call a hosted API or self-host, see [Self-Host vs API](/guides/mlops/self-host-vs-api-llm).

---

_Source: https://agentscamp.com/guides/voice/build-a-voice-agent — Guide on AgentsCamp._
