# Realtime Voice Agents: Build on LiveKit, Buy Vapi, or Pipeline with Pipecat

> The three ways to ship a realtime voice agent in 2026 — open infrastructure, managed platform, or OSS pipeline framework — and how speech-to-speech models change it.

Three postures cover realtime voice in 2026: build on LiveKit (open WebRTC infra + agents framework + telephony — maximum control), assemble with Pipecat (the OSS pipeline framework for custom STT→LLM→TTS flows), or buy Vapi (assistants live in an afternoon at a per-minute platform fee). Speech-to-speech realtime models slot into all three rather than replacing them.

Voice agents crossed the production threshold — a billion-plus calls on the major platforms — and the tooling sorted into three honest postures. The question isn't which is "best"; it's **how much of the realtime stack you want to own**.

## The short list

| Posture | Tool | You own | You get |
| --- | --- | --- | --- |
| **Build** | [LiveKit](/tools/livekit) | Infra + pipeline | Open source, max control, scale economics |
| **Assemble** | [Pipecat](/tools/pipecat) | Pipeline logic | OSS framework, provider freedom |
| **Buy** | [Vapi](/tools/vapi) | Config | Live agents in an afternoon, per-minute fee |

## The three postures

**Build on [LiveKit](/tools/livekit)** when voice is core product. The Apache-2.0 WebRTC server plus the Agents framework covers transport, the STT→LLM→TTS pipeline *or* realtime speech-to-speech models, an open-sourced semantic turn-detection model, and Telephony 1.0 (SIP, transfers, scale) — with LiveKit Cloud as the managed escape hatch. The credential is hard to argue with: per LiveKit, ChatGPT's Voice Mode runs on this stack. Cost: real engineering; payoff: control and unit economics that improve with volume.

**Assemble with [Pipecat](/tools/pipecat)** when the pipeline *is* your differentiation. The open-source Python framework composes voice flows from interchangeable pieces — any [STT](/guides/voice/best-stt-apis-2026), any LLM, any [TTS](/guides/voice/best-tts-apis-2026), custom logic between stages — without also adopting a media-server worldview. It pairs naturally with LiveKit or other transports underneath.
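To make "interchangeable pieces" concrete, here is a minimal plain-Python sketch of the STT→LLM→TTS composition idea. It is not Pipecat's API: `VoicePipeline`, the three `Protocol` interfaces, the `pre_llm` hook, and the toy providers are all hypothetical names invented for illustration, and a real framework would stream audio frames asynchronously rather than process one blocking turn.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Hypothetical composition of three swappable stages (not a real library class)."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech
    # Custom logic between STT and LLM, e.g. redaction or routing.
    pre_llm: Callable[[str], str] = lambda text: text

    def run_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)
        prompt = self.pre_llm(transcript)
        answer = self.llm.reply(prompt)
        return self.tts.synthesize(answer)


# Toy stand-ins for real providers; any object with the right method fits.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio is already text


class UppercaseLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(
    stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS(), pre_llm=str.strip
)
audio_out = pipeline.run_turn(b"  hello  ")
```

Because each stage is typed structurally, swapping a provider means passing a different object to the constructor; nothing else in the pipeline changes.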

**Buy [Vapi](/tools/vapi)** when shipping beats owning. Assistant = prompt + model + voice + tools; attach a number; live. Turn-taking (vendor-claimed sub-600ms), interruptions, telephony, and multi-agent Squads come managed, at a platform fee per minute plus model costs (BYO keys pass through at cost). The 2026 traction — a $50M Series B, Amazon Ring routing all inbound calls through it — says the buy side is no toy. ([Cartesia Line](/tools/cartesia) plays the same posture, vertically integrated on Cartesia's models.)

## What actually decides quality

Whatever posture you pick, the same three system properties make or break the agent.

**The latency budget**: a natural conversational turn completes in well under a second, so the sum of STT endpointing, LLM time-to-first-token, and TTS time-to-first-audio, measured at p95, decides whether the agent feels human or like hold music.

**Turn detection**: knowing when the user *finished* (versus paused) is the hardest perceptual problem in the stack. LiveKit's open semantic model, Ink's native turn events, and platform bundles are all answers to it.

**Interruption handling**: users barge in; the agent must stop talking, cheaply discard in-flight generation, and listen. This is a transport-and-state problem that no model solves alone.
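The latency budget described above can be sketched as a small measurement helper. This is an illustration under assumptions: `TurnTiming`, `percentile`, and `p95_turn_latency_ms` are hypothetical names, the nearest-rank percentile is one of several common definitions, and the timings would come from your own instrumentation. The sketch computes p95 over per-turn totals, since the p95 of the sum is generally not the sum of each stage's p95.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnTiming:
    """Per-turn stage latencies in milliseconds (hypothetical record shape)."""

    stt_endpointing_ms: float
    llm_ttft_ms: float   # LLM time-to-first-token
    tts_ttfa_ms: float   # TTS time-to-first-audio

    @property
    def total_ms(self) -> float:
        return self.stt_endpointing_ms + self.llm_ttft_ms + self.tts_ttfa_ms


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_turn_latency_ms(turns: list[TurnTiming]) -> float:
    """p95 of the end-to-end turn latency, computed on per-turn totals."""
    return percentile([t.total_ms for t in turns], 95)


def within_budget(turns: list[TurnTiming], budget_ms: float) -> bool:
    """True if the p95 turn latency fits the budget you chose."""
    return p95_turn_latency_ms(turns) <= budget_ms
```

Pass recorded turns and a budget of your choosing, for example `within_budget(turns, budget_ms=800)`; the 800 ms figure is a placeholder, not a number from this guide.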

Start by posture (own infra / own pipeline / own nothing), prototype on the buy side if speed matters, and revisit the build math when minutes get expensive. The component-level walkthrough — models, prompts, and the pipeline's failure modes — is [How to Build a Voice Agent](/guides/voice/build-a-voice-agent), and the [voice-agent-engineer](/agents/data-ai/voice-agent-engineer) agent owns exactly this build.
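The "build math" can be sketched as a simple break-even calculation. Every number and name here is a placeholder assumption, not vendor pricing: the sketch assumes the same models on both sides (so model cost per minute cancels out), a fixed monthly engineering and operations cost for the build side, and a flat per-minute platform fee for the buy side.

```python
from typing import Optional


def buy_cost(minutes: float, platform_fee_per_min: float, model_cost_per_min: float) -> float:
    """Monthly cost on a managed platform: per-minute fee plus model costs."""
    return minutes * (platform_fee_per_min + model_cost_per_min)


def build_cost(
    minutes: float,
    fixed_monthly: float,
    infra_cost_per_min: float,
    model_cost_per_min: float,
) -> float:
    """Monthly cost when self-building: fixed engineering/ops plus per-minute infra and models."""
    return fixed_monthly + minutes * (infra_cost_per_min + model_cost_per_min)


def break_even_minutes(
    fixed_monthly: float, platform_fee_per_min: float, infra_cost_per_min: float
) -> Optional[float]:
    """Monthly minutes above which building is cheaper, or None if it never is."""
    saving_per_min = platform_fee_per_min - infra_cost_per_min
    if saving_per_min <= 0:
        return None  # self-hosted infra costs at least as much per minute as the fee
    return fixed_monthly / saving_per_min


# Placeholder inputs only; substitute your own quotes and salaries.
threshold = break_even_minutes(
    fixed_monthly=20_000, platform_fee_per_min=0.05, infra_cost_per_min=0.01
)
```

Below the threshold the platform fee is the cheaper line item; above it, the fixed cost of owning the stack is amortized over enough minutes to win.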

---

_Source: https://agentscamp.com/guides/voice/realtime-voice-apis — Guide on AgentsCamp._