# Best Speech-to-Text APIs in 2026

> The STT field, honestly ranked — Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency — and how to pick by workload.

Four answers cover STT in 2026: Deepgram (streaming-first enterprise workhorse), AssemblyAI (promptable Universal-3 Pro plus the understanding stack — summaries, sentiment, PII), Whisper (the open-weights baseline for self-hosting via faster-whisper/whisper.cpp), and Cartesia Ink (the latency newcomer with model-native turn detection). Pick by workload.

Speech-to-text is no longer one product. Three workloads reward different engines: **realtime streaming** (agents, captions), **batch with understanding** (call-center analytics), and **self-hosted** (privacy and unit economics). The 2026 field maps cleanly onto them.

## The short list

| Engine | Pick it for | Shape |
| --- | --- | --- |
| [Deepgram](/tools/deepgram) | Realtime streaming at scale | Hosted, streaming-first |
| [AssemblyAI](/tools/assemblyai) | Accuracy + the understanding stack | Hosted, promptable |
| [Whisper](/tools/whisper) | Self-hosting, privacy, batch cost | Open weights (MIT) |
| [Cartesia](/tools/cartesia) (Ink) | Agent latency, native turn detection | Hosted, realtime specialist |

## The picks, by workload

**Realtime agents → [Deepgram](/tools/deepgram) or Ink.** Deepgram built its identity on streaming: low latency, robust endpointing, enterprise scale, with Aura TTS alongside for one-vendor stacks. [Cartesia Ink](/tools/cartesia) is the 2026 challenger — streaming STT with **turn detection emitted by the model itself** (no external VAD), which deletes one of the [voice pipeline's](/guides/voice/build-a-voice-agent) trickiest components; it's English-first at launch.

**Batch + understanding → [AssemblyAI](/tools/assemblyai).** Universal-3 Pro's *promptability* — steer transcription with context and keyterms — is the accuracy story of 2026, and the platform around it (summarization, sentiment, speaker ID, translation, PII redaction in 50+ languages) turns audio archives into queryable, compliant data. When transcription is the input to analysis, this stack is the shortcut.

**Self-hosted → [Whisper](/tools/whisper).** MIT weights, ~99 languages, and an ecosystem (faster-whisper, whisper.cpp) that runs it from datacenter to laptop. The honest 2026 status: no new generation since turbo, so hosted models lead on accuracy and features — but nothing touches it when audio can't leave your infrastructure or volume makes per-hour pricing sting.
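To see where "per-hour pricing stings", a rough break-even calculation helps. The sketch below is a simplification under stated assumptions: every parameter name is hypothetical, the cost model ignores idle GPU time and storage, and the numbers in the usage lines are placeholders, not real vendor or cloud prices. Plug in your own quotes.

```python
from typing import Optional


def break_even_audio_hours(
    hosted_price_per_audio_hour: float,
    gpu_price_per_hour: float,
    realtime_factor: float,
    fixed_monthly_overhead: float,
) -> Optional[float]:
    """Monthly audio hours above which self-hosting is cheaper than hosted.

    realtime_factor: audio hours transcribed per GPU hour (assumed constant).
    fixed_monthly_overhead: engineering and ops cost that does not scale
    with volume (an assumption; in practice it grows in steps).
    Returns None when self-hosting never wins on marginal cost.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    # Marginal cost of one audio hour on your own GPU.
    self_hosted_per_audio_hour = gpu_price_per_hour / realtime_factor
    saving_per_audio_hour = hosted_price_per_audio_hour - self_hosted_per_audio_hour
    if saving_per_audio_hour <= 0:
        return None
    return fixed_monthly_overhead / saving_per_audio_hour


# Placeholder inputs for illustration only; none of these are real prices.
hours = break_even_audio_hours(
    hosted_price_per_audio_hour=1.0,
    gpu_price_per_hour=2.0,
    realtime_factor=4.0,
    fixed_monthly_overhead=1000.0,
)
print(f"Self-hosting breaks even at about {hours:.0f} audio hours per month")
```

Below the break-even volume, the fixed overhead dominates and a hosted API is usually the cheaper choice, unless the audio cannot leave your infrastructure at all.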

## How to actually choose

Three checks beat any leaderboard.

**WER on your audio.** Run fifty representative clips (your accents, your jargon, your phone-line quality) through each candidate and score the output against human reference transcripts; the published-benchmark winner loses on somebody's domain every week.

**Latency where it counts.** For agents, measure streaming time-to-first-token and endpointing behavior from your region, and compare p95 rather than the median.

**The billing fine print.** AssemblyAI streams bill by *session* time (idle sockets cost), Whisper bills in GPU-hours and engineering time, and add-ons stack per hour everywhere.

The output half of the conversation is [Best TTS APIs](/guides/voice/best-tts-apis-2026); the architecture that consumes both is [Realtime Voice Agents](/guides/voice/realtime-voice-apis).
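The WER and latency checks can be scored with a few lines of standard-library Python. This is a minimal sketch, not any vendor's scoring method: it assumes you already have reference transcripts, each engine's output, and a list of measured latencies. The function names are hypothetical, and the text normalization is deliberately naive (lowercase plus whitespace split).

```python
import math


def word_edits(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    # Naive normalization; real evaluations usually also strip punctuation
    # and normalize numbers before comparing.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs) -> float:
    """WER over many (reference, hypothesis) pairs: total edits / total words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        edits, words = word_edits(reference, hypothesis)
        total_edits += edits
        total_words += words
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return total_edits / total_words


def percentile(samples, pct: int):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative data only: two clips and twenty made-up latencies in ms.
clips = [
    ("turn left at the light", "turn left at the light"),
    ("refill the prescription today", "refill the subscription today"),
]
print(f"WER: {corpus_wer(clips):.3f}")  # 1 error over 9 words -> 0.111

latencies_ms = [200] * 18 + [900, 1500]
print("p50:", percentile(latencies_ms, 50))  # 200
print("p95:", percentile(latencies_ms, 95))  # 900
```

The latency sample shows why the median misleads: p50 looks healthy at 200 ms while p95 exposes the 900 ms tail that callers actually notice. Summing edits across the corpus, rather than averaging per-clip WER, keeps short clips from dominating the score.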

---

_Source: https://agentscamp.com/guides/voice/best-stt-apis-2026 — Guide on AgentsCamp._
