ElevenLabs
A voice-AI platform for high-quality text-to-speech, voice cloning, dubbing, and real-time conversational agents, via API.
ElevenLabs is a voice-AI platform best known for state-of-the-art text-to-speech: natural, expressive voices in many languages, plus voice cloning, dubbing, sound effects, and a speech-to-text model. It also offers conversational AI agents, and everything is available via API under one credit-based plan. It is a common choice for the TTS stage of a voice agent, or for the whole voice stack.
ElevenLabs is a voice-AI platform whose core strength is text-to-speech — among the most natural and expressive synthetic voices available, across many languages, with low-latency models (Flash, Turbo) built for real-time use. Around that it has grown a full voice suite: voice cloning, dubbing, sound effects, music, a speech-to-text model (Scribe), and conversational AI agents — all accessible via API and billed under one credit system.
For building a voice agent, it's most often the TTS stage — the voice your agent speaks with — though its bundled conversational-agent product can cover the whole STT → LLM → TTS loop when you want the simplest path.
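To show where the TTS stage sits, here is a minimal sketch of one turn of the STT → LLM → TTS loop. The three stage functions are hypothetical stand-ins supplied by the caller, not ElevenLabs SDK calls; in a real agent the `synthesize` slot is where a TTS provider such as ElevenLabs would plug in.

```python
from typing import Callable

# Hypothetical stage signatures for illustration; none of these are SDK calls.
Transcribe = Callable[[bytes], str]   # STT: caller audio -> text
Respond = Callable[[str], str]        # LLM: user text -> reply text
Synthesize = Callable[[str], bytes]   # TTS: reply text -> audio


def run_turn(
    audio_in: bytes,
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Synthesize,
) -> bytes:
    """Run one conversational turn through the three stages in order."""
    user_text = transcribe(audio_in)
    reply_text = respond(user_text)
    return synthesize(reply_text)


# Stub stages so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_out = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

This version runs the stages strictly one after another, which is the simplest shape to reason about. A production agent would typically stream between stages to cut latency.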
Highlights
- State-of-the-art TTS — natural, expressive voices in 70+ languages, with low-latency Flash/Turbo models for real-time agents.
- Voice cloning — instant clones from a short sample, or high-fidelity professional voice cloning.
- Conversational AI agents — build real-time voice agents with built-in STT, LLM, and TTS.
- Dubbing, sound effects & music — translate and re-voice audio/video, generate effects and music.
- One API, one credit system — TTS billed per character; speech-to-text per minute; agents per minute.
In an AI-assisted workflow
```python
# Stream TTS audio so playback can start before the full clip is synthesized
from elevenlabs.client import ElevenLabs

client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment
# `reply` is the LLM's response text; the call returns an iterator of audio chunks
audio = client.text_to_speech.stream(voice_id="...", model_id="eleven_flash_v2_5", text=reply)
```
TIP
For voice agents, latency beats fidelity: prefer the low-latency models (Flash/Turbo) and stream the audio so it begins playing as the LLM's tokens arrive — time-to-first-byte is what users feel.
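One way to act on this tip is to cut the LLM's token stream into sentences and send each sentence to TTS as soon as it completes, rather than waiting for the whole reply. The sketch below covers only the chunking step. `sentence_chunks` and the sample token list are hypothetical names for illustration, and the punctuation rule is deliberately naive (it would split on abbreviations such as "Dr." and on decimals).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush when the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the token stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated token stream; a real one would come from the LLM client.
tokens = ["Sure", ",", " I", " can", " help", ".", " What", " time", " works", "?"]
chunks = list(sentence_chunks(tokens))
```

Each yielded sentence could then be passed as the `text` argument of a streaming TTS call, so the first sentence is already playing while the LLM is still generating the rest.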
Good to know
ElevenLabs is a commercial platform with freemium pricing: a free tier (with attribution and limited credits) and paid tiers (including Creator, Pro, Scale, and Enterprise) priced in credits, where credits map to characters of TTS, minutes of speech-to-text, and minutes of agent conversation. It's a hosted API, so your text and audio pass through ElevenLabs' servers; factor in availability and data handling. For the speech-to-text side of a voice agent, compare Deepgram; to orchestrate a custom pipeline, consider Pipecat.
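To make the credit model concrete, here is a small sketch of how a usage estimate could be computed. The rates in the example are made-up placeholders, not ElevenLabs' actual pricing; `CreditRates` and `estimate_credits` are hypothetical helpers, and real per-unit rates should be taken from the current pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditRates:
    """Credits charged per unit of usage (values supplied by the caller)."""
    per_tts_char: float
    per_stt_minute: float
    per_agent_minute: float


def estimate_credits(
    rates: CreditRates,
    tts_chars: int = 0,
    stt_minutes: float = 0.0,
    agent_minutes: float = 0.0,
) -> float:
    """Sum the credits consumed across the three billed dimensions."""
    return (
        tts_chars * rates.per_tts_char
        + stt_minutes * rates.per_stt_minute
        + agent_minutes * rates.per_agent_minute
    )


# Placeholder rates for illustration only; NOT real ElevenLabs pricing.
example_rates = CreditRates(per_tts_char=1.0, per_stt_minute=100.0, per_agent_minute=500.0)
total = estimate_credits(example_rates, tts_chars=2_000, stt_minutes=3, agent_minutes=2)
```

Keeping the rates in a separate object means the estimate can be rerun per plan or per model once the real numbers are known.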
Related
- How to Build a Voice Agent: The STT → LLM → TTS Pipeline
  How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.
- Deepgram
  A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.
- Pipecat
  An open-source Python framework for real-time voice and multimodal conversational AI. It orchestrates streaming STT, LLM, and TTS into composable pipelines.
- Voice Agent Engineer
  Use this agent to build or fix a real-time voice agent: the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples: "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".