Attention Mechanism
Attention lets a model weigh how relevant every other token is to each token, building a context-aware representation as a weighted blend of their values.
An attention mechanism is the operation that, for each token, scores how relevant every token in the sequence is to it and builds a new representation of that token as a weighted sum of their value vectors, so that its meaning is shaped by context rather than by position alone.
The intuition is query/key/value. Each token emits a query ("what am I looking for?") and exposes a key ("what do I offer?") and a value (the content it contributes when selected). The query is matched against all keys, typically by dot product, to produce relevance scores. Those scores are scaled (in the standard formulation, divided by the square root of the key dimension) and normalized with softmax into weights, which are then used to blend the tokens' value vectors. A token attending to its grammatical subject ten words earlier simply puts a high weight on that token. When a sequence attends to itself this way it is called self-attention, the core operation inside a transformer block.
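The query/key/value steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, and the toy token vectors are used directly as queries, keys, and values, whereas a real model would first pass each token through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the normalized weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention (assumption: no learned projections, so the same
# three 2-dimensional vectors act as queries, keys, and values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and the first token puts more weight on itself and on the third token (which both overlap with its query) than on the second.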
Real models run multi-head attention: several attention patterns in parallel, each free to track a different relationship (syntax, coreference, topic), with the per-head results concatenated and projected. This captures long-range dependencies directly — any token can reach any other in one step — and the comparisons are parallelizable across the whole sequence rather than processed left-to-right.
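A rough sketch of the multi-head idea, under simplifying assumptions: each head here simply works on its own slice of the token vector instead of using its own learned query, key, and value projections, and the output projection `w_out` is a hypothetical matrix supplied by the caller (the identity in the usage lines). All function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_rows, k_rows, v_rows):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(k_rows[0]))
    out = []
    for q in q_rows:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in k_rows])
        out.append([sum(wi * v[j] for wi, v in zip(w, v_rows))
                    for j in range(len(v_rows[0]))])
    return out

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def multi_head_self_attention(x, num_heads, w_out):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: a head sees one slice of each token vector
        # rather than its own learned projection of the full vector.
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attend(part, part, part))
    # Concatenate the per-head results for each token, then project.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return [matvec(w_out, row) for row in concat]

# Three tokens with 4-dimensional vectors, split across two heads.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = multi_head_self_attention(x, num_heads=2, w_out=identity)
```

Because each head computes its own softmax over its own slice, the two heads can assign different weights to the same pair of tokens, which is the property the paragraph above describes.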
The catch is cost: comparing every token with every other is quadratic in sequence length, so doubling the context window roughly quadruples the compute and memory. That scaling is exactly what motivates optimizations like the KV cache, which reuses already-computed keys and values during generation, and FlashAttention, which restructures the computation to avoid materializing the full attention matrix.
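The KV cache idea mentioned above can be illustrated with a toy generation loop. This is a sketch under stated assumptions: `project_kv` is a hypothetical stand-in for a model's learned key and value projections (here just a copy and a doubling), each token vector doubles as its own query, and a call counter shows how much projection work the cache avoids. It does not reflect any particular library's cache API.

```python
import math

calls = {"project": 0}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one(query, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(query))
    w = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def project_kv(token):
    """Hypothetical key/value projection; counts how often it is invoked."""
    calls["project"] += 1
    return list(token), [2.0 * x for x in token]

def generate_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        # Every step re-projects the entire prefix seen so far.
        kv = [project_kv(tok) for tok in tokens[:t]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        outputs.append(attend_one(tokens[t - 1], keys, values))
    return outputs

def generate_with_cache(tokens):
    keys, values, outputs = [], [], []  # the cache: one key and value per past token
    for tok in tokens:
        k, v = project_kv(tok)  # only the new token is projected
        keys.append(k)
        values.append(v)
        outputs.append(attend_one(tok, keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

calls["project"] = 0
slow = generate_without_cache(tokens)
calls_without = calls["project"]  # 1 + 2 + 3 + 4 = 10 projections

calls["project"] = 0
fast = generate_with_cache(tokens)
calls_with = calls["project"]  # 4 projections, one per token
```

Both loops produce the same outputs; the cached version trades memory (the stored keys and values) for not redoing the projections of earlier tokens at every step.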
Frequently asked questions
- What is self-attention?
- Self-attention is attention applied within a single sequence: every token attends to every token in the same input, itself included, so each position's representation is recomputed as a relevance-weighted blend of the values at all positions. It is the core operation inside a transformer block, and it is what lets a model resolve a pronoun against a noun fifty words back.
- Why is attention computationally expensive?
- Standard attention compares every token against every other token, so compute and memory grow with the square of the sequence length (O(n^2)). Doubling the context roughly quadruples the work, which is why long context windows are pricey and why optimizations like FlashAttention and the KV cache exist to cut memory traffic and recomputation.
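The quadratic growth in the last answer can be checked with simple arithmetic. The sketch below counts query-key comparisons and the memory needed to materialize one score matrix; the function names are illustrative, and the figure of 2 bytes per score assumes 16-bit floats for a single head and a single sequence.

```python
def score_matrix_entries(n_tokens):
    """Query-key comparisons in full self-attention: one per (query, key) pair."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory to hold one head's full score matrix (assumes 16-bit scores)."""
    return score_matrix_entries(n_tokens) * bytes_per_score

for n in (1024, 2048, 4096):
    mib = score_matrix_bytes(n) / (1024 * 1024)
    print(f"{n:>5} tokens: {score_matrix_entries(n):>10} scores, {mib:.0f} MiB")

# Doubling the sequence length multiplies the comparison count by four.
ratio = score_matrix_entries(2048) / score_matrix_entries(1024)
```

Under these assumptions the score matrix grows from 2 MiB at 1,024 tokens to 8 MiB at 2,048 and 32 MiB at 4,096, per head and per sequence.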
Related
- Context Window: The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- KV Cache: The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token; it is the memory that makes generation fast.
- Token (LLM): A token is the unit LLMs read and write, a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Tokenization: Tokenization splits text into tokens, the sub-word units a model reads and writes, and maps each to an integer ID the model processes.
- Embedding: An embedding is a vector of numbers representing text's meaning, placed so similar texts land close together; it is the foundation of semantic search and RAG.