AssemblyAI
Speech AI platform: Universal STT models (promptable Universal-3 Pro), a flat-rate Voice Agent API, and speech understanding — summarization, sentiment, PII redaction.
AssemblyAI packages speech intelligence as one API: the Universal STT family — topped by Universal-3 Pro (February 2026), a promptable speech model you steer with natural-language context and keyterms — streaming for voice agents, a flat-rate Voice Agent API bundling STT+LLM+TTS over one WebSocket, and understanding layers. Freemium with signup credits, then per-hour usage.
AssemblyAI's bet is that transcription is the floor, not the product: the value sits in speech understanding — and increasingly in owning the whole voice-agent loop. Its 2026 lineup runs from promptable STT to a one-WebSocket agent pipeline.
Highlights
- Universal-3 Pro (Feb 2026) — promptable STT: steer with natural-language context and keyterms, capture disfluencies, handle code-switching; six native languages with routing to 99+.
- Streaming STT — the realtime tier voice agents and live captions build on.
- Voice Agent API (Apr 2026) — STT + LLM + TTS + turn detection + interruptions + tool calling over one WebSocket, flat-rate per hour.
- Speech understanding — summarization, sentiment, entities, topics, speaker labels/identification, translation across 89 languages.
- Guardrails — PII redaction, profanity filtering, and moderation in 50+ languages: the compliance layer audio pipelines need.
- LLM Gateway — route understanding workloads across GPT/Claude/Gemini with caching, keeping the audio and reasoning bills in one place.
In an AI-assisted workflow
Sign up, take the key, POST files or open a WebSocket — Python/JS SDKs cover both. In voice-agent stacks it's either the best-in-class STT component or, via the Voice Agent API, the whole pipeline; in data work, it's the "turn 10,000 calls into queryable, redacted, summarized records" machine.
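The "POST files" path can be sketched with nothing but the standard library. This is a minimal sketch, not the SDK: it assumes the v2 REST base URL `https://api.assemblyai.com/v2`, an `authorization` header carrying the key, and the `audio_url`, `speaker_labels`, `redact_pii`, and `redact_pii_policies` request fields. The helper names (`build_transcript_request`, `transcribe`) are hypothetical, and model selection for the Universal family is left out because the exact parameter is not stated here; check the current API reference before relying on any of it.

```python
import json
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"  # assumed REST base URL


def build_transcript_request(audio_url, redact_pii=False, speaker_labels=False):
    """Build the JSON body for a pre-recorded transcription job."""
    body = {"audio_url": audio_url}
    if speaker_labels:
        body["speaker_labels"] = True
    if redact_pii:
        body["redact_pii"] = True
        # Example policy names; treat the exact list as an assumption.
        body["redact_pii_policies"] = ["person_name", "phone_number"]
    return body


def _call(method, path, api_key, payload=None):
    """Send one JSON request and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def transcribe(audio_url, api_key, poll_seconds=3.0, **options):
    """Submit a job, then poll until it completes or fails."""
    body = build_transcript_request(audio_url, **options)
    job = _call("POST", "/transcript", api_key, body)
    while True:
        result = _call("GET", "/transcript/" + job["id"], api_key)
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(poll_seconds)
```

Calling `transcribe("https://example.com/call.mp3", api_key, redact_pii=True)` would return the finished transcript record; for bulk work, a webhook is usually a better fit than polling each job.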
WARNING
Two billing edges: streaming meters session time (close idle connections), and the legacy best/nano model tiers are deprecated — new integrations should target the Universal family.
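One way to act on the "close idle connections" advice is a small watchdog that tracks when audio was last sent and tells the caller when to end the session. The sketch below is independent of any SDK: `IdleWatchdog` and its methods are hypothetical names, and the 30-second threshold in the usage note is an arbitrary example, not a vendor recommendation.

```python
import time


class IdleWatchdog:
    """Track time since the last audio chunk and flag sessions worth closing."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_audio = clock()

    def mark_audio(self):
        """Call whenever an audio chunk is sent over the WebSocket."""
        self._last_audio = self._clock()

    def idle_seconds(self):
        """Seconds elapsed since audio was last sent."""
        return self._clock() - self._last_audio

    def should_close(self):
        """True once the session has been silent for the configured limit."""
        return self.idle_seconds() >= self.max_idle_seconds
```

In a streaming loop, call `mark_audio()` after each chunk and check `should_close()` on a timer, for example with `IdleWatchdog(30)`; when it returns true, close the WebSocket and reconnect when audio resumes.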
Good to know
Hosted and proprietary, with a genuinely useful free-credit start. Against the field: Deepgram competes hardest on enterprise streaming, Whisper is the self-host baseline, and Cartesia Ink is the latency-first newcomer. The decision table lives in Best Speech-to-Text APIs in 2026.
Frequently asked questions
- What's special about Universal-3 Pro?
- It's promptable — a speech language model you steer with context: tell it the domain, feed keyterms (vendor-cited +45% accuracy on domain terms), and it adapts — plus audio tagging, disfluency capture, and code-switching. That contextual steering is the line between 'transcription' and 'transcription that knows what your call is about.'
- What is the Voice Agent API?
- A full speech-to-speech pipeline over a single WebSocket (April 2026): STT, your chosen LLM, TTS, server-side turn detection, interruption handling, and JSON-Schema tool calling — at a flat hourly rate that bundles the whole loop. It collapses the assemble-five-vendors voice stack into one connection.
- How does AssemblyAI pricing work?
- Free credits at signup (no card), then usage: per-hour rates by model for pre-recorded and streaming, with intelligence add-ons (PII redaction, speaker ID, etc.) stacking per hour. One gotcha worth designing around: streaming bills on session duration, not audio duration — idle WebSockets cost money.
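The pricing shape described above (a per-hour base rate, add-ons stacking per hour, streaming billed on session time) can be worked through as arithmetic. Every number below is a made-up placeholder, not an AssemblyAI price, and linear proration by the second is an assumption; `hourly_rate` and `cost` are hypothetical helpers.

```python
from decimal import Decimal


def hourly_rate(base_rate, addon_rates=()):
    """Stack add-on rates on top of the base per-hour rate."""
    return Decimal(base_rate) + sum((Decimal(r) for r in addon_rates), Decimal(0))


def cost(billable_seconds, rate_per_hour):
    """Cost for a billable duration, assuming linear per-second proration."""
    return Decimal(billable_seconds) / Decimal(3600) * rate_per_hour


# Placeholder rates in dollars per hour (NOT real prices).
rate = hourly_rate("0.20", ["0.05", "0.03"])  # base + two add-ons

# Pre-recorded: billed on audio duration.
batch_cost = cost(10 * 3600, rate)  # 10 hours of audio

# Streaming: billed on session duration, even if little audio was sent.
session_cost = cost(1800, rate)       # 30-minute open session
audio_only_cost = cost(600, rate)     # the 10 minutes of actual speech
wasted = session_cost - audio_only_cost
```

With these placeholder rates, the 30-minute session costs three times what the 10 minutes of speech alone would, which is the gap that closing idle sockets recovers.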
Related
- Best Speech-to-Text APIs in 2026: The STT field, honestly ranked (Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency), and how to pick by workload.
- Realtime Voice Agents: Build on LiveKit, Buy Vapi, or Pipeline with Pipecat: The three ways to ship a realtime voice agent in 2026 (open infrastructure, managed platform, or OSS pipeline framework), and how speech-to-speech models change it.
- Deepgram: A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.
- Whisper: OpenAI's open-weights speech-to-text, the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.
- Cartesia: Real-time voice AI on state-space models, with Sonic streaming TTS, Ink STT with native turn detection, and Line, a code-first voice-agent platform.
- How to Build a Voice Agent: The STT → LLM → TTS Pipeline: How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.
- Vapi: The API-first voice-agent platform. Assemble phone and web agents from any STT/LLM/TTS mix, with telephony, squads, and tool calling handled for you.