Deepgram
A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.
Deepgram is a voice-AI platform whose core strength is speech-to-text — its Nova models offer fast, accurate streaming transcription across 45+ languages, with speaker diarization, smart formatting, and keyterm prompting. It pairs that with text-to-speech (Aura, tuned for very low time-to-first-byte) and a bundled Voice Agent API that wires STT, an LLM, and TTS into one real-time endpoint.
For building a voice agent, Deepgram is most often the STT stage — turning the user's speech into text with the low latency the loop demands — and increasingly a single-vendor option for the entire pipeline via its Voice Agent API.
Highlights
- Streaming speech-to-text (Nova) — low-latency, accurate transcription with interim results, diarization, and 45+ languages.
- Low-latency text-to-speech (Aura): voices with sub-200 ms time-to-first-byte, built for real-time agents.
- Voice Agent API — a bundled STT + LLM + TTS endpoint for building voice agents fast.
- Real-time features — voice-activity detection, endpointing, smart formatting, and keyterm prompting.
- Usage-based API — STT billed per minute, TTS per character, the agent API per hour.
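The real-time features listed above are typically switched on per connection. The sketch below shows one way that could look, assuming the streaming endpoint is `wss://api.deepgram.com/v1/listen` and that options are passed as query parameters named `model`, `language`, `interim_results`, `endpointing`, `smart_format`, and `diarize`. The helper `build_listen_url`, its defaults, and the `nova-3` model name are illustrative choices, not taken from this page; check them against the current API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint for Deepgram's live transcription WebSocket.
DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(model="nova-3", language="en", endpointing_ms=300,
                     interim_results=True, smart_format=True, diarize=False):
    """Return a streaming STT URL with the chosen real-time options (hypothetical helper)."""
    params = {
        "model": model,                                   # assumed model name
        "language": language,
        "interim_results": str(interim_results).lower(),  # partial transcripts on/off
        "endpointing": endpointing_ms,                    # silence (ms) before speech is considered ended
        "smart_format": str(smart_format).lower(),
        "diarize": str(diarize).lower(),
    }
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_listen_url(endpointing_ms=250, diarize=True))
```

Keeping the options in one function makes it easy to experiment with the endpointing window, which is usually the setting that most affects how quickly an agent takes its turn.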
In an AI-assisted workflow
# stream microphone audio to Nova and consume interim transcripts for low-latency endpointing
from deepgram import DeepgramClient
dg = DeepgramClient() # reads DEEPGRAM_API_KEY
# open a streaming connection, send audio chunks, receive partial + final transcripts
TIP
For voice agents, lean on interim transcripts and tuned endpointing rather than waiting for a final transcript — reacting early to "the user has stopped" is what keeps the round trip conversational.
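To make the tip concrete, here is a minimal sketch of turning a stream of transcript messages into user turns without touching the network. It assumes each message is a dictionary shaped like a streaming result, with `is_final`, `speech_final`, and `channel.alternatives[0].transcript` fields; that shape, along with the `TurnAssembler` class and the `msg` helper, is an assumption for illustration and should be verified against the SDK version you use.

```python
from dataclasses import dataclass, field


@dataclass
class TurnAssembler:
    """Collects streaming transcript messages into complete user turns."""
    finalized: list = field(default_factory=list)  # segments already marked is_final
    interim: str = ""                               # latest partial hypothesis

    def feed(self, message: dict):
        """Consume one message; return the full turn text when speech_final is set, else None."""
        alternatives = message.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""
        if message.get("is_final"):
            if text:
                self.finalized.append(text)
            self.interim = ""          # the partial has been superseded
        else:
            self.interim = text        # keep only the newest partial
        if message.get("speech_final"):
            turn = " ".join(self.finalized).strip()
            self.finalized.clear()
            return turn or None
        return None

    def preview(self) -> str:
        """Best current guess at the turn so far (finalized segments plus newest partial)."""
        return " ".join(self.finalized + ([self.interim] if self.interim else []))


def msg(text, is_final=False, speech_final=False):
    """Build a message in the assumed result shape (test helper, hypothetical)."""
    return {"is_final": is_final, "speech_final": speech_final,
            "channel": {"alternatives": [{"transcript": text}]}}


if __name__ == "__main__":
    assembler = TurnAssembler()
    for m in [msg("book a"), msg("book a table", is_final=True),
              msg("for two"), msg("for two tonight", is_final=True, speech_final=True)]:
        turn = assembler.feed(m)
        print("preview:", assembler.preview(), "| turn:", turn)
```

The `preview` text is what an agent could use to start preparing a response early, while the value returned by `feed` marks the point where the user has stopped and the reply should go out.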
Good to know
Deepgram is a commercial platform with a freemium model: free credits to start, then usage-based pay-as-you-go pricing (STT per minute, Aura TTS per character, the Voice Agent API per hour), plus enterprise plans. It is a hosted API, so factor in its availability and the fact that your audio passes through a third party's servers. For the text-to-speech side, compare ElevenLabs; to compose a custom STT → LLM → TTS pipeline, see Pipecat.
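Because the three products are metered in different units, a rough budget is easiest to reason about in code. The sketch below only encodes the billing units named above (minutes, characters, hours); the function `estimate_monthly_cost` is hypothetical, and the rates in the example are made-up placeholders, not Deepgram's prices. Substitute the figures from the current pricing page.

```python
def estimate_monthly_cost(stt_minutes, tts_characters, agent_hours,
                          stt_rate_per_minute, tts_rate_per_character,
                          agent_rate_per_hour):
    """Return a per-product cost breakdown for usage-based billing (hypothetical helper)."""
    stt = stt_minutes * stt_rate_per_minute          # STT is billed per minute
    tts = tts_characters * tts_rate_per_character    # TTS is billed per character
    agent = agent_hours * agent_rate_per_hour        # the agent API is billed per hour
    return {
        "stt": round(stt, 2),
        "tts": round(tts, 2),
        "agent": round(agent, 2),
        "total": round(stt + tts + agent, 2),
    }


if __name__ == "__main__":
    # Placeholder rates for illustration only; they are NOT real prices.
    breakdown = estimate_monthly_cost(
        stt_minutes=10_000, tts_characters=2_000_000, agent_hours=100,
        stt_rate_per_minute=0.005, tts_rate_per_character=0.00003,
        agent_rate_per_hour=4.0,
    )
    print(breakdown)
```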
Related
- How to Build a Voice Agent: The STT → LLM → TTS Pipeline
  How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.
- ElevenLabs
  A voice-AI platform for high-quality text-to-speech, voice cloning, dubbing, and real-time conversational agents, via API.
- Pipecat
  An open-source Python framework for real-time voice and multimodal conversational AI. It orchestrates streaming STT, LLM, and TTS into composable pipelines.
- Voice Agent Engineer
  Use this agent to build or fix a real-time voice agent: the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples: "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".