# Whisper

> OpenAI's open-weights speech-to-text — the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.

Whisper (OpenAI, MIT, ~102k stars) is the open-weights STT baseline: multilingual transcription across ~99 languages, speech-to-English translation, and six model sizes from tiny (runs almost anywhere) to large. Turbo, an optimized large-v3 that runs roughly 8× faster than large at a small cost in accuracy, is the practical default for transcription; it is not trained for translation, so use one of the other multilingual models for that task. Production deployments mostly run Whisper through faster-whisper or whisper.cpp, and many hosted APIs offer it as well.

Website: https://github.com/openai/whisper

Whisper is the model that democratized speech-to-text: open weights, MIT license, and robustness that held up outside the lab. Three-plus years on, it remains the **self-hosted baseline** the whole category is measured against — less because it's unbeatable than because it's *everywhere*, free, and good.

## Highlights

- **Genuinely multilingual**: transcription across ~99 languages (accuracy varies with how well-resourced each language is), plus speech-to-English translation and language identification.
- **Six sizes, one family**: tiny (39M parameters, runs on almost anything) through large (1.5B); **turbo** (809M) gets close to large-v3 accuracy at ~8× the speed of large in ~6GB VRAM, but is not trained for translation.
- **The ecosystem is the product**: faster-whisper (CTranslate2, ~23k stars) and whisper.cpp (ggml/Apple-Silicon-native, ~50k stars) are how production actually runs it, with a long tail of pipelines, GUIs, and integrations built on top.
- **MIT everything**: weights and code; the only bill is compute.
- **Hosted when you want it**: OpenAI and many providers serve Whisper-family inference if self-hosting isn't the point.
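
To make the size trade-off concrete, here is a small sketch that filters the family by a VRAM budget. The parameter counts, VRAM needs, and relative speeds are the approximate figures listed in the openai/whisper README (speeds measured for English on an A100, so expect different numbers on your hardware); the `ModelSpec` type and the `models_that_fit` helper are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    name: str
    params_millions: int
    vram_gb: int         # approximate VRAM requirement
    relative_speed: int  # approximate speed relative to large (= 1)


# Approximate figures as listed in the openai/whisper README.
MODELS = [
    ModelSpec("tiny", 39, 1, 10),
    ModelSpec("base", 74, 1, 7),
    ModelSpec("small", 244, 2, 4),
    ModelSpec("medium", 769, 5, 2),
    ModelSpec("large", 1550, 10, 1),
    ModelSpec("turbo", 809, 6, 8),
]


def models_that_fit(vram_gb: float) -> list[str]:
    """Return the names of models whose approximate VRAM need fits the budget."""
    return [m.name for m in MODELS if m.vram_gb <= vram_gb]
```

Calling `models_that_fit(6)` keeps everything except large, which is one way to see why turbo ends up as the default on mid-range GPUs.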

## In an AI-assisted workflow

```bash
pip install -U openai-whisper        # needs ffmpeg
whisper meeting.mp3 --model turbo
```
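
The same transcription can be driven from Python. This is a minimal sketch: `whisper.load_model` and `model.transcribe` are the package's documented entry points, while `format_timestamp`, `format_segments`, `transcribe_file`, and the `meeting.mp3` path are illustrative. The `whisper` import is deferred into the function so the formatting helpers work without the model installed.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments: list[dict]) -> str:
    """Turn Whisper-style segment dicts into one timestamped line per segment."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def transcribe_file(path: str, model_name: str = "turbo") -> str:
    """Transcribe one audio file and return a timestamped transcript."""
    import whisper  # pip install -U openai-whisper; needs ffmpeg on PATH

    model = whisper.load_model(model_name)
    result = model.transcribe(path)  # dict with "text", "segments", "language"
    return format_segments(result["segments"])


# Example (downloads the model weights on first use):
# print(transcribe_file("meeting.mp3"))
```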

The classic agent-era uses: private transcription pipelines (audio never leaves your infra), batch processing where per-hour API pricing would sting, and the STT layer of self-hosted [voice agents](/guides/voice/build-a-voice-agent) — usually via whisper.cpp on-device or faster-whisper on a modest GPU.
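
For the faster-whisper route, a hedged sketch: `WhisperModel` and its `transcribe` method (which returns a lazy segment generator plus an info object) are the library's public API, and `vad_filter=True` enables its bundled voice-activity filter. The model size, device, compute type, and the helper names `transcribe_fast` and `join_segments` are assumptions to adapt to your own hardware.

```python
from typing import Iterable, Iterator, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def join_segments(segments: Iterable[Segment]) -> str:
    """Join segment texts into one transcript, skipping empty segments."""
    return " ".join(text.strip() for _, _, text in segments if text.strip())


def transcribe_fast(
    path: str,
    model_size: str = "small",      # assumption: pick what fits your GPU
    device: str = "cuda",           # use "cpu" with compute_type="int8" if no GPU
    compute_type: str = "float16",
) -> Iterator[Segment]:
    """Yield (start, end, text) tuples as faster-whisper decodes them."""
    from faster_whisper import WhisperModel  # pip install faster-whisper

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # vad_filter=True drops non-speech audio before it reaches the model.
    segments, info = model.transcribe(path, vad_filter=True)
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    for seg in segments:  # decoding happens lazily while iterating
        yield seg.start, seg.end, seg.text


# Example:
# print(join_segments(transcribe_fast("meeting.mp3", device="cpu", compute_type="int8")))
```

Because segments arrive lazily, a voice agent can start acting on early segments before the whole file is decoded, though this is still not true streaming.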

> [!WARNING]
> Design around the failure modes: add VAD so silence never reaches the model (hallucination lives there), chunk long audio deliberately (30-second windows), and don't expect native streaming or diarization — those are ecosystem add-ons.
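
To illustrate the first two points, here is a dependency-free sketch: a crude energy gate standing in for a real VAD (production setups typically use a trained model such as Silero VAD), followed by splitting the surviving speech into windows of at most 30 seconds. The frame length, the RMS threshold, and all function names are illustrative assumptions; samples are assumed to be mono floats in [-1, 1] at 16 kHz.

```python
import math

SAMPLE_RATE = 16_000            # assumed input rate, mono
FRAME = SAMPLE_RATE // 50       # 20 ms analysis frames
MAX_CHUNK = 30 * SAMPLE_RATE    # 30-second window, in samples


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one non-empty frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def speech_regions(samples: list[float], threshold: float = 0.01) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges whose frames exceed the RMS threshold."""
    regions, start = [], None
    for i in range(0, len(samples), FRAME):
        voiced = rms(samples[i:i + FRAME]) > threshold
        if voiced and start is None:
            start = i                      # speech begins
        elif not voiced and start is not None:
            regions.append((start, i))     # speech ends
            start = None
    if start is not None:                  # audio ended mid-speech
        regions.append((start, len(samples)))
    return regions


def split_region(start: int, end: int, max_len: int = MAX_CHUNK) -> list[tuple[int, int]]:
    """Cut one region into consecutive pieces no longer than max_len samples."""
    return [(s, min(s + max_len, end)) for s in range(start, end, max_len)]


def plan_chunks(samples: list[float], threshold: float = 0.01) -> list[tuple[int, int]]:
    """Silence-free (start, end) ranges, each at most 30 s, ready to transcribe."""
    chunks = []
    for start, end in speech_regions(samples, threshold):
        chunks.extend(split_region(start, end))
    return chunks
```

The hard 30-second cut can split a word in half; a real pipeline would usually prefer to cut at the nearest pause and stitch the resulting transcripts by timestamp.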

## Good to know

The repo is still maintained, but the frontier has moved to hosted services: [AssemblyAI](/tools/assemblyai)'s promptable Universal-3 and [Deepgram](/tools/deepgram)'s streaming stack beat raw Whisper on accuracy and features when the audio can leave your perimeter. The honest decision, open baseline versus hosted specialists, is mapped in [Best Speech-to-Text APIs in 2026](/guides/voice/best-stt-apis-2026).

---

_Source: https://agentscamp.com/tools/whisper — Tool on AgentsCamp._
