Best Speech-to-Text APIs in 2026
The STT field, honestly ranked — Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency — and how to pick by workload.
Four answers cover STT in 2026: Deepgram (streaming-first enterprise workhorse), AssemblyAI (promptable Universal-3 Pro plus the understanding stack — summaries, sentiment, PII), Whisper (the open-weights baseline for self-hosting via faster-whisper/whisper.cpp), and Cartesia Ink (the latency newcomer with model-native turn detection). Pick by workload.
Key takeaways
- STT split into workloads: realtime streaming (agents, captions), batch + understanding (calls, meetings, compliance), and self-hosted (privacy, unit cost) — each has a different winner.
- Deepgram and AssemblyAI are the hosted duel: Deepgram's identity is streaming speed at enterprise scale; AssemblyAI's is the intelligence layer — promptable transcription plus summaries, sentiment, speaker ID, and redaction.
- Whisper stays the open baseline: MIT weights, huge ecosystem, but no new generation since turbo — hosted models now beat it on accuracy and features when audio can leave your perimeter.
- Cartesia Ink is the one to watch for agents: turn detection native to the model removes a whole pipeline component.
- Benchmark WER on YOUR audio — accents, jargon, phone-line quality — and watch billing models: one vendor meters streaming by session time, not audio time.
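The billing point is easy to quantify. The sketch below compares session-time and audio-time metering for a single streaming call. The hourly rate and the durations are made-up placeholders for illustration, not any vendor's price list.

```python
def streaming_cost(audio_s: float, idle_s: float,
                   rate_per_hour: float, meter: str = "session") -> float:
    """Cost of one streaming call under a given metering model.

    meter="session": the whole open-socket time is billable (audio + idle).
    meter="audio":   only the seconds of audio actually sent are billable.
    """
    if meter not in ("session", "audio"):
        raise ValueError("meter must be 'session' or 'audio'")
    billable_s = audio_s + idle_s if meter == "session" else audio_s
    return billable_s / 3600 * rate_per_hour


if __name__ == "__main__":
    # Hypothetical call: 2 minutes of speech on a socket held open for 10 minutes.
    AUDIO_S, IDLE_S = 120, 480
    RATE = 0.36  # placeholder price per hour, NOT a real vendor rate
    for meter in ("session", "audio"):
        cost = streaming_cost(AUDIO_S, IDLE_S, RATE, meter)
        print(f"{meter:>7}-time metering: {cost:.4f} per call")
```

With the same nominal hourly rate, the gap between the two models scales with how long sockets sit idle, so it is worth measuring your real idle ratio before comparing price sheets.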
Speech-to-text stopped being one product: realtime streaming (agents, captions), batch with understanding (every call center's analytics), and self-hosted (privacy and unit economics) reward different engines. The 2026 field maps cleanly onto those workloads.
The short list
| Engine | Pick it for | Shape |
|---|---|---|
| Deepgram | Realtime streaming at scale | Hosted, streaming-first |
| AssemblyAI | Accuracy + the understanding stack | Hosted, promptable |
| Whisper | Self-hosting, privacy, batch cost | Open weights (MIT) |
| Cartesia (Ink) | Agent latency, native turn detection | Hosted, realtime specialist |
The picks, by workload
Realtime agents → Deepgram or Ink. Deepgram built its identity on streaming: low latency, robust endpointing, enterprise scale, with Aura TTS alongside for one-vendor stacks. Cartesia Ink is the 2026 challenger — streaming STT with turn detection emitted by the model itself (no external VAD), which deletes one of the voice pipeline's trickiest components; it's English-first at launch.
Batch + understanding → AssemblyAI. Universal-3 Pro's promptability — steer transcription with context and keyterms — is the accuracy story of 2026, and the platform around it (summarization, sentiment, speaker ID, translation, PII redaction in 50+ languages) turns audio archives into queryable, compliant data. When transcription is the input to analysis, this stack is the shortcut.
Self-hosted → Whisper. MIT weights, ~99 languages, and an ecosystem (faster-whisper, whisper.cpp) that runs it from datacenter to laptop. The honest 2026 status: no new generation since turbo, so hosted models lead on accuracy and features — but nothing touches it when audio can't leave your infrastructure or volume makes per-hour pricing sting.
How to actually choose
Three checks beat any leaderboard. WER on your audio: run fifty representative clips (your accents, your jargon, your phone-line quality) through each candidate; the published-benchmark winner loses on somebody's domain every week. Latency where it counts: for agents, measure streaming time-to-first-token and endpointing behavior from your region, and judge by p95, not the median. The billing fine print: AssemblyAI bills streaming by session time, so idle open sockets cost money; self-hosted Whisper is paid for in GPU-hours and engineering time; and add-ons stack per hour everywhere. For the output half of the conversation, see Best TTS APIs; for the architecture that consumes both, see Realtime Voice Agents.
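The WER check can be run with a few lines of code. The sketch below computes corpus-level word error rate (word-level edit distance divided by the number of reference words) in plain Python. The normalization rule and the `engine_a`/`engine_b` labels and transcripts are assumptions for illustration, not any vendor's scoring method; a real evaluation usually needs domain-specific normalization (numbers, spellings, product names).

```python
import re
from typing import Iterable, List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total errors over total reference words across all clips."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words if words else 0.0


if __name__ == "__main__":
    # Hypothetical human-checked references and engine outputs.
    references = ["refill the metformin prescription",
                  "transfer me to billing please"]
    hypotheses = {
        "engine_a": ["refill the metformin prescription",
                     "transfer me to billing"],
        "engine_b": ["refill the met forming prescription",
                     "transfer me to billing please"],
    }
    for engine, outputs in hypotheses.items():
        print(engine, round(corpus_wer(zip(references, outputs)), 3))
```

Pooling errors across clips (rather than averaging per-clip WER) keeps short clips from dominating the score, which matters when the fifty clips vary a lot in length.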
Frequently asked questions
- What's the most accurate speech-to-text API?
  On clean English benchmarks the hosted leaders cluster tightly; on YOUR audio they diverge, because domain terms, accents, and noise decide it. AssemblyAI's promptable Universal-3 Pro takes a real lead when you can feed it context and keyterms (vendor-cited +45% on domain terms); Deepgram counters with streaming performance. Run both on fifty representative clips before believing anyone's chart.
- Is Whisper good enough to skip the APIs?
  For privacy-bound, cost-sensitive, or offline workloads, yes: via faster-whisper or whisper.cpp, with VAD to manage its silence-hallucination habit. You give up streaming-grade latency, diarization, and the understanding layers, and accept a model frozen at the turbo generation. It's the right floor, not the frontier.
- Which STT should a voice agent use?
  A streaming model with great time-to-first-token and solid endpointing: Deepgram and AssemblyAI's streaming tiers are the proven picks, with Cartesia Ink the latency-first newcomer whose native turn detection simplifies the pipeline. If you'd rather not assemble at all, the bundles (AssemblyAI's Voice Agent API, LiveKit Inference, Vapi) make STT a dropdown.
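For agent workloads, time-to-first-token can be collected and summarized at the tail roughly as follows. This is a minimal sketch: `time_to_first_result` works on any Python iterator, `fake_stream` is a hypothetical stand-in for a vendor's streaming client (no real SDK is shown), and the percentile uses the nearest-rank method.

```python
import math
import time
from typing import Iterable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_result(stream: Iterable[T]) -> Tuple[float, T]:
    """Seconds until the stream yields its first item, plus that item.

    In a real test, start the clock when the first audio chunk is sent.
    """
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming STT client."""
    time.sleep(delay_s)
    yield "partial transcript"


if __name__ == "__main__":
    # Simulated runs; replace fake_stream with calls to the engine under test.
    latencies = [time_to_first_result(fake_stream(0.01))[0] for _ in range(20)]
    print(f"p50={percentile(latencies, 50) * 1000:.1f} ms  "
          f"p95={percentile(latencies, 95) * 1000:.1f} ms")
```

The p95 figure is the one that reflects the slow turns users actually notice; a handful of outliers can leave the median untouched while making an agent feel laggy.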
Related
- Deepgram
  A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.
- AssemblyAI
  Speech AI platform: Universal STT models (promptable Universal-3 Pro), a flat-rate Voice Agent API, and speech understanding (summarization, sentiment, PII redaction).
- Whisper
  OpenAI's open-weights speech-to-text: the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.
- Cartesia
  Real-time voice AI on state-space models: Sonic streaming TTS, Ink STT with native turn detection, and Line, a code-first voice-agent platform.
- Best Text-to-Speech APIs in 2026
  The TTS APIs worth building on (ElevenLabs for quality and breadth, Cartesia Sonic for realtime latency) and how to choose for agents vs produced audio.
- Realtime Voice Agents: Build on LiveKit, Buy Vapi, or Pipeline with Pipecat
  The three ways to ship a realtime voice agent in 2026 (open infrastructure, managed platform, or OSS pipeline framework) and how speech-to-speech models change it.
- How to Build a Voice Agent: The STT → LLM → TTS Pipeline
  How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.