Context Window
The context window is the maximum amount of text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
The context window is the maximum number of tokens a language model can process in one request — everything counts against it: the system prompt, conversation history, retrieved documents, tool results, and the response being generated.
It's the defining resource constraint of LLM applications. Frontier context windows grew from 4K tokens (2023) to 200K as standard, with million-token windows on recent Claude models. Yet the window stays a budget, for three durable reasons: cost scales with tokens processed, latency grows with input length, and attention dilutes. Models recall the start and end of long contexts better than the middle, so the right answer buried under noise often goes unused.
That's why the craft of context engineering — load the relevant slice, not the repo — outlives every window-size increase, why RAG retrieves rather than stuffs, and why agents like Claude Code ship compaction and memory machinery to keep long sessions sharp.
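The budget accounting described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `estimate_tokens`, `ContextBudget`, and the 4-characters-per-token ratio are assumptions made for the example, and a real application would count tokens with the tokenizer for its model.

```python
from dataclasses import dataclass

# Assumption: roughly 4 characters per token in English text.
# This is a crude heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


@dataclass
class ContextBudget:
    """Hypothetical helper that treats the context window as a budget."""

    window: int           # total context window, in tokens
    reserved_output: int  # tokens set aside for the generated response

    def remaining(self, *parts: str) -> int:
        """Tokens left after all input parts and the output reserve."""
        used = sum(estimate_tokens(part) for part in parts)
        return self.window - self.reserved_output - used

    def fits(self, *parts: str) -> bool:
        """True if the input parts plus the output reserve fit the window."""
        return self.remaining(*parts) >= 0
```

Passing the system prompt, conversation history, and retrieved documents as `parts` makes the point of the definition concrete: every piece draws on the same budget, and the response needs room reserved before any input is loaded.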
Frequently asked questions
- What happens when the context window fills up?
- Nothing more fits — so something must go. Applications truncate old turns, summarize them (Claude Code's /compact), or retrieve selectively instead of loading everything (RAG). Quality usually degrades before the hard limit: models weight the start and end of a long window more than the middle, so buried facts get missed.
- Bigger context windows keep shipping — does context management still matter?
- Yes. A million-token window changes what's possible (whole codebases, long documents) but not the economics or the way attention behaves: you pay per token processed, latency grows with input size, and signal still competes with noise. A focused window usually beats a stuffed one: capacity is a budget, not a license.
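The "truncate old turns" option from the first answer can be sketched as follows. This is a simplified illustration under stated assumptions: `truncate_history` and `estimate_tokens` are hypothetical names, the token estimate is the rough 4-characters-per-token heuristic rather than a real tokenizer, and production systems often summarize dropped turns instead of discarding them.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about one token per 4 characters, rounded up."""
    return -(-len(text) // 4)


def truncate_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns that fit in the budget.

    The system prompt is always charged first. Turns are then added
    newest-first until the next one would exceed what is left, so the
    oldest turns are the ones dropped.
    """
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if cost > remaining:
            break  # this turn and everything older is dropped
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Walking the history newest-first keeps the turns most likely to matter for the next response. The trade-off is that anything stated early in the conversation is lost once it falls off, which is the gap that summarization and retrieval are meant to cover.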
Related
- Token (LLM)
  A token is the unit LLMs read and write, a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Context Engineering
  Treating the context window as a finite budget: what to load, what to leave out, and when to reset.
- Managing Claude Code Memory & Context: CLAUDE.md, /compact, and Auto-Memory
  How Claude Code remembers: every CLAUDE.md scope and load order, path-scoped rules, the auto-memory system, and the context commands that keep sessions sharp.
- Prompt Caching
  Prompt caching reuses the computed state of a repeated prompt prefix across requests, dramatically cutting cost and time-to-first-token for stable context.
- RAG (Retrieval-Augmented Generation)
  RAG retrieves relevant documents from your own data and injects them into an LLM's prompt at query time, grounding answers in facts the model wasn't trained on.
- LLM Context Windows Compared (2026)
  Context windows and max output tokens across Claude, GPT, Gemini, DeepSeek, and Grok: the million-token era, what it costs, and what fits in practice.
- RAG vs Long Context: Do Million-Token Windows Kill Retrieval?
  Million-token context windows promised the end of RAG. The honest 2026 answer: long context changed where retrieval starts paying, not whether it does.
- Agent Memory
  Agent memory is how an AI agent retains information beyond its context window: working state during a task and persistent knowledge across sessions.
- KV Cache
  The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token; it is the memory that makes generation fast.
- Subagent
  A subagent is a specialist agent a primary agent delegates to, running in its own context window with its own prompt and tools and returning only a summary.
- Token Streaming
  Token streaming delivers model output incrementally as it's generated (via SSE or websockets) so users see text immediately instead of waiting.