LLM Context Windows Compared (2026)
Context windows and max output tokens across Claude, GPT, Gemini, DeepSeek, and Grok — the million-token era, what it costs, and what fits in practice.
The frontier standardized on a million tokens in 2026: Claude Fable 5, Opus 4.8, and Sonnet 4.6 (1M, at standard pricing), GPT-5.5 and 5.4 (1M), Gemini's lineup (~1M), DeepSeek V4 and Grok 4.3 (1M). Budget tiers trail: Haiku 4.5 at 200K, GPT-5.4 mini/nano at 400K. Max outputs range 64K–384K. Capacity is now rarely the constraint — cost, latency, and attention quality are.
Key takeaways
- A million tokens is the 2026 frontier baseline (roughly 750k English words, or on the order of 100k–200k lines of code) across Anthropic, OpenAI, Google, DeepSeek, and xAI flagships.
- Pricing models differ: Anthropic includes 1M at flat per-token rates; Google roughly doubles Pro per-token pricing beyond 200K. Same capacity, different bill shapes.
- Max output is the forgotten limit: 64K–128K typical (DeepSeek claims up to 384K) — long generation, not long reading, is where jobs still hit walls.
- Rules of thumb: ~4 characters or ~0.75 English words per token; a 500-page book ≈ 150–200K tokens; a mid-size codebase ≈ 1–5M tokens — still bigger than any window.
- Fitting ≠ using well: cost scales with tokens sent, latency with prefill, and mid-window attention measurably lags — retrieval and context discipline outlive every window increase.
Specs verified against vendor docs on June 12, 2026 (same methodology as the pricing table: vendor pages only, unverifiable cells omitted). The headline: the million-token window became the frontier baseline — and stopped being the interesting number.
The table
| Model | Context window | Max output | Long-context pricing |
|---|---|---|---|
| Claude Fable 5 | 1M | 128K | Standard rates across full window |
| Claude Opus 4.8 | 1M | 128K | Standard rates across full window |
| Claude Sonnet 4.6 | 1M | 64K | Standard rates across full window |
| Claude Haiku 4.5 | 200K | 64K | — |
| GPT-5.5 | 1M | 128K | Standard |
| GPT-5.5-pro | ~1.05M | 128K | Standard (premium model) |
| GPT-5.4 | 1M | 128K | Standard |
| GPT-5.4-mini / nano | 400K | 128K | Standard |
| Gemini 3.1 Pro Preview | ~1.05M | 65K | ~2x per-token beyond 200K |
| Gemini 3.5 Flash / Flash-Lite | ~1.05M | 65K | Flat |
| DeepSeek V4 (Flash/Pro) | 1M | up to 384K | Flat (cache pricing separate) |
| Grok 4.3 | 1M | — | — |
Token rules of thumb for reading it: ~4 characters ≈ 1 token; ~0.75 English words per token; a dense 500-page book ≈ 150–200K tokens; codebases run ~5–10 tokens per line.
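The rules of thumb above can be turned into a quick estimator. This is a minimal sketch: the function names are illustrative, and the ratios are the article's approximations rather than any vendor's tokenizer, so treat the results as order-of-magnitude figures and use a real tokenizer when accuracy matters.

```python
import math

# Rule-of-thumb ratios taken from the text; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75
TOKENS_PER_CODE_LINE = (5, 10)  # low and high estimate per line of code


def tokens_from_chars(n_chars: int) -> int:
    """Estimate tokens from a character count (~4 characters per token)."""
    return math.ceil(n_chars / CHARS_PER_TOKEN)


def tokens_from_words(n_words: int) -> int:
    """Estimate tokens from an English word count (~0.75 words per token)."""
    return math.ceil(n_words / WORDS_PER_TOKEN)


def tokens_from_code_lines(n_lines: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a number of lines of code."""
    low, high = TOKENS_PER_CODE_LINE
    return n_lines * low, n_lines * high


def fits(estimated_tokens: int, window_tokens: int) -> bool:
    """True if the estimate is within the context window."""
    return estimated_tokens <= window_tokens


if __name__ == "__main__":
    book = tokens_from_words(150_000)          # a 500-page book at ~300 words per page
    repo_low, repo_high = tokens_from_code_lines(150_000)
    print(book, fits(book, 1_000_000))
    print(repo_low, repo_high, fits(repo_high, 1_000_000))
```

Checking the high end of a range, as the last line does, is the conservative choice: a repository that only fits under the low estimate may not fit in practice.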
What the table doesn't say
Capacity stopped being the constraint; three other things took its place. Cost: you pay per token sent, so a full 1M-token prompt is real money on every call, softened by prompt caching only for stable prefixes (and note the pricing-shape difference: Anthropic's flat-rate window vs Google's tiering above 200K). Latency: prefill time scales with input length, so whole-corpus prompts mean multi-second time-to-first-token. Attention: needle-in-a-haystack scores are near-perfect, but synthesis across a packed window still measurably favors the start and end. A curated 10K context beats a noisy 1M one containing the same answer, which is the entire thesis of context engineering.
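To make the difference in bill shape concrete, here is a small sketch comparing a flat per-token rate with a tiered one. The price of 2.00 USD per million input tokens is a placeholder, not a vendor quote, and the tiered function assumes the higher rate applies to the whole prompt once it crosses the threshold; check the vendor's pricing page for the actual rule.

```python
def prompt_cost_flat(input_tokens: int, usd_per_mtok: float) -> float:
    """Input cost when one per-token rate applies across the full window."""
    return input_tokens / 1_000_000 * usd_per_mtok


def prompt_cost_tiered(
    input_tokens: int,
    usd_per_mtok: float,
    threshold: int = 200_000,   # tier boundary mentioned in the text
    multiplier: float = 2.0,    # "~2x" from the table, taken as exactly 2 here
) -> float:
    """Input cost when prompts above the threshold pay a higher rate.

    Assumption: the higher rate applies to every token in the prompt,
    not only to the tokens beyond the threshold.
    """
    rate = usd_per_mtok * multiplier if input_tokens > threshold else usd_per_mtok
    return input_tokens / 1_000_000 * rate


if __name__ == "__main__":
    HYPOTHETICAL_PRICE = 2.00  # USD per million input tokens (placeholder)
    for tokens in (100_000, 200_000, 500_000, 1_000_000):
        flat = prompt_cost_flat(tokens, HYPOTHETICAL_PRICE)
        tiered = prompt_cost_tiered(tokens, HYPOTHETICAL_PRICE)
        print(f"{tokens:>9} tokens  flat ${flat:.2f}  tiered ${tiered:.2f}")
```

Multiplying either figure by calls per day shows why a full-window prompt on every request adds up quickly unless most of it is a cached prefix.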
Max output is the sleeper limit. Reading got huge; writing didn't keep pace. Output caps of 64K–128K mean "translate this book" or "generate the full report" still needs chunked generation, and a reasoning model's thinking tokens are spent from the same output budget.
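A rough way to size a chunked generation job is to divide the expected output by the usable part of the output cap. The sketch below assumes you can estimate the total output length up front and that you set aside a fixed reasoning reserve per call; both the function name and the reserve are illustrative, not part of any vendor API.

```python
import math


def plan_chunks(
    total_output_tokens: int,
    max_output_tokens: int,
    reasoning_reserve: int = 0,
) -> int:
    """Return how many generation calls a job needs.

    total_output_tokens: estimated length of the full deliverable.
    max_output_tokens:   the model's per-call output cap.
    reasoning_reserve:   tokens held back per call for thinking, since
                         reasoning spends from the same output budget.
    """
    usable = max_output_tokens - reasoning_reserve
    if usable <= 0:
        raise ValueError("reasoning reserve leaves no room for visible output")
    return math.ceil(total_output_tokens / usable)


if __name__ == "__main__":
    # Translating a ~200K-token book under a 64K cap with 16K reserved for reasoning.
    print(plan_chunks(200_000, 64_000, reasoning_reserve=16_000))
    # The same job under a 128K cap with no reasoning reserve.
    print(plan_chunks(200_000, 128_000))
```

In practice each chunk also needs some carried-over context (glossary, style notes, the tail of the previous chunk), which costs input tokens rather than output tokens.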
The practical playbook follows directly: use big windows to retrieve generously rather than to skip retrieval (RAG vs Long Context draws the line), cache the stable prefix, and treat window size as a budget ceiling, not a target. Agents operationalize the same idea with compaction and memory: the window is working memory, files are the disk.
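One way to treat the window as a ceiling rather than a target is to pack retrieved material against an explicit token budget. The sketch below is a simple greedy selection by relevance score; the `Chunk` type, the scores, and the token counts are hypothetical stand-ins for whatever a retriever and tokenizer would supply.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float   # relevance score from a retriever (hypothetical)
    tokens: int    # token count for this chunk (from a tokenizer or an estimate)


def pack_context(chunks: list[Chunk], budget_tokens: int) -> tuple[list[Chunk], int]:
    """Greedily pick the highest-scoring chunks that fit within the budget.

    Returns the selected chunks (in descending score order) and the tokens used.
    A chunk that would overflow the budget is skipped, and smaller
    lower-scoring chunks may still be added after it.
    """
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            selected.append(chunk)
            used += chunk.tokens
    return selected, used


if __name__ == "__main__":
    candidates = [
        Chunk("design doc", score=0.9, tokens=600),
        Chunk("meeting notes", score=0.8, tokens=500),
        Chunk("changelog", score=0.5, tokens=300),
    ]
    picked, used = pack_context(candidates, budget_tokens=1000)
    print([c.text for c in picked], used)
```

The budget here is a deliberate choice well below the window size; raising it buys recall at the price of cost, latency, and a noisier context.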
Frequently asked questions
- Which LLM has the biggest context window in 2026?
- The frontier clusters at 1M tokens — Claude's Fable 5/Opus 4.8/Sonnet 4.6, GPT-5.5/5.4, Gemini's current lineup, DeepSeek V4, and Grok 4.3 all claim it. Differentiation moved from headline size to quality-at-depth (how well the model uses token 800,000) and to pricing shape across the window.
- How much fits in a million tokens?
- About 750,000 English words: several long novels, a year of meeting notes, or a substantial codebase (rule of thumb: ~1M tokens covers roughly 100k–200k lines of code with comments). What typically doesn't fit: enterprise document corpora and monorepos — which is why retrieval still exists.
- Does a bigger window replace RAG?
- It moved the threshold, not the conclusion. Under a few hundred pages of stable content, stuffing (plus prompt caching) beats building a pipeline. At corpus scale, four walls remain — per-query cost, prefill latency, mid-window attention degradation, and access control — covered honestly in our RAG vs Long Context guide.
Related
- Context Window: The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- LLM API Pricing in 2026: Every Major Model Compared: Per-million-token prices for Claude, GPT, Gemini, DeepSeek, Mistral, and Grok, plus caching and batch discounts, verified against vendor pricing pages.
- RAG vs Long Context: Do Million-Token Windows Kill Retrieval?: Million-token context windows promised the end of RAG. The honest 2026 answer: long context changed where retrieval starts paying, not whether it does.
- Context Engineering: Treating the context window as a finite budget: what to load, what to leave out, and when to reset.
- Prompt Caching: Prompt caching reuses the computed state of a repeated prompt prefix across requests, dramatically cutting cost and time-to-first-token for stable context.
- Managing Claude Code Memory & Context: CLAUDE.md, /compact, and Auto-Memory: How Claude Code remembers: every CLAUDE.md scope and load order, path-scoped rules, the auto-memory system, and the context commands that keep sessions sharp.