# LLM Context Windows Compared (2026)

> Context windows and max output tokens across Claude, GPT, Gemini, DeepSeek, and Grok — the million-token era, what it costs, and what fits in practice.

The frontier standardized on a million tokens in 2026: Claude Fable 5, Opus 4.8, and Sonnet 4.6 (1M, at standard pricing), GPT-5.5 and 5.4 (1M), Gemini's lineup (~1M), DeepSeek V4 and Grok 4.3 (1M). Budget tiers trail: Haiku 4.5 at 200K, GPT-5.4 mini/nano at 400K. Max outputs range 64K–384K. Capacity is now rarely the constraint — cost, latency, and attention quality are.

Specs verified against vendor docs on **June 12, 2026** (same methodology as the [pricing table](/guides/advanced/llm-api-pricing-2026): vendor pages only, unverifiable cells omitted). The headline: **the million-token window became the frontier baseline** — and stopped being the interesting number.

## The table

| Model | Context window | Max output | Long-context pricing |
| --- | --- | --- | --- |
| Claude Fable 5 | 1M | 128K | Standard rates across full window |
| Claude Opus 4.8 | 1M | 128K | Standard rates across full window |
| Claude Sonnet 4.6 | 1M | 64K | Standard rates across full window |
| Claude Haiku 4.5 | 200K | 64K | — |
| GPT-5.5 | 1M | 128K | Standard |
| GPT-5.5-pro | ~1.05M | 128K | Standard (premium model) |
| GPT-5.4 | 1M | 128K | Standard |
| GPT-5.4-mini / nano | 400K | 128K | Standard |
| Gemini 3.1 Pro Preview | ~1.05M | 65K | ~2x per-token beyond 200K |
| Gemini 3.5 Flash / Flash-Lite | ~1.05M | 65K | Flat |
| DeepSeek V4 (Flash/Pro) | 1M | up to 384K | Flat (cache pricing separate) |
| Grok 4.3 | 1M | — | — |

Token rules of thumb for reading it: ~4 characters ≈ 1 [token](/glossary/llm-token); ~0.75 English words per token; a dense 500-page book ≈ 150–200K tokens; codebases run ~5–10 tokens per line.

## What the table doesn't say

**Capacity stopped being the constraint; three other things took its place.** *Cost*: you pay per token sent — a full 1M-token prompt is real money on every call, softened by [prompt caching](/glossary/prompt-caching) only for stable prefixes (and note the pricing-shape difference: Anthropic's flat-rate window vs Google's >200K tiering). *Latency*: prefill scales with input; whole-corpus prompts mean multi-second time-to-first-token. *Attention*: needle-in-haystack benchmarks are near-perfect, but synthesis across a packed window still measurably favors the start and end — a curated 10K context beats a noisy 1M one containing the same answer, which is the entire thesis of [context engineering](/guides/prompting/context-engineering).

**Max output is the sleeper limit.** Reading got huge; writing didn't keep pace — 64–128K output caps mean "translate this book" or "generate the full report" still needs chunked generation, and [reasoning](/glossary/reasoning-model) thinking-tokens spend from the same output budget.

**The practical playbook** follows directly: use big windows to retrieve *generously* rather than to skip retrieval ([RAG vs Long Context](/guides/concepts/rag-vs-long-context) draws the line), cache the stable prefix, and treat window size as budget ceiling — not target. Agents operationalize the same idea with [compaction and memory](/guides/configuration/claude-code-memory-context): the window is working memory, files are the disk.

---

_Source: https://agentscamp.com/guides/advanced/llm-context-windows-compared — Guide on AgentsCamp._
