Transformer
The neural-network architecture (Vaswani et al., 2017) that uses self-attention to process sequences in parallel — the basis of nearly all modern LLMs.
A Transformer is a neural-network architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," that uses self-attention to process an entire sequence in parallel — and it underpins virtually every modern large language model.
The key move was dropping recurrence. Earlier sequence models (RNNs, LSTMs) read tokens one at a time, so each step had to wait for the previous one and training could not be parallelized across a sequence. The Transformer instead uses self-attention, letting each token directly weigh every other token in the input at once. Because those comparisons are independent of one another, they reduce to large matrix multiplications that GPUs execute efficiently, which is what made it practical to train models on enormous datasets.
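To make "each token weighs every other token" concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product self-attention. The query/key/value terminology and the square-root scaling follow the 2017 paper; the function names, the projection matrices `w_q`, `w_k`, `w_v`, and the toy input are assumptions chosen for illustration. Real implementations use tensor libraries and multiple heads.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x (one row per token)."""
    # Project every token into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # scores[i][j]: how strongly token i attends to token j.
    # No row depends on another row, so all rows can be computed in parallel.
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

# Toy example (hypothetical values): 3 tokens, 2 dimensions, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and describes how one token distributes its attention across the whole sequence. In this toy input the third token attends most strongly to itself, because its query overlaps most with its own key.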
Architecturally, a Transformer stacks repeated blocks, each combining an attention layer with a feed-forward layer. Because attention itself is order-agnostic — it has no built-in sense of sequence — the model adds positional information so it knows which token came where.
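One way to supply that positional information is the fixed sinusoidal encoding described in the 2017 paper, sketched below. This is only one scheme; many later models use learned or rotary position embeddings instead. The function name and the small sizes are illustrative assumptions.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes for illustration: 4 positions, 4-dimensional embeddings.
positions = sinusoidal_positions(seq_len=4, d_model=4)
```

Each row is added element-wise to the token embedding at that position, so two identical tokens at different positions enter the attention layers with different vectors.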
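Variants differ in how they consume text. The original 2017 design paired an encoder with a decoder for translation. Decoder-only models (the GPT- and Claude-style designs) predict the next token from the tokens before it and power generative chat; encoder-only variants (like BERT) read the full input in both directions for understanding tasks. Scaling the Transformer (more parameters, more data, more compute) is what turned this 2017 architecture into modern LLMs. Its design also shapes practical limits you hit at inference time: because every token attends to every other token, the cost of standard attention grows quadratically with sequence length, which is one reason the model's context window is bounded.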
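The difference between decoder-only and encoder-style attention can be sketched with a causal mask: a decoder-only model lets each token attend only to itself and earlier tokens, so it cannot peek at the token it is trying to predict. The snippet below is a minimal illustration under that assumption; the function name and the all-zero scores are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a square score matrix over positions j <= i only."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # token i may see tokens 0..i
        m = max(visible)        # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# With equal raw scores, each token spreads attention evenly over its past.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
```

An encoder-style model would skip the mask and apply the softmax over the full row, letting every token see the entire input.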
Frequently asked questions
- Why was the Transformer such a breakthrough?
- Earlier sequence models (RNNs, LSTMs) processed tokens one after another, which made training slow and hard to parallelize. The Transformer replaced that recurrence with self-attention, letting it look at every token in a sequence at once. That parallelism scales cleanly on GPUs, so models could be trained on far more data and parameters — and scaling the Transformer is exactly what produced today's large language models.
- Are all LLMs Transformers?
- Almost all of the well-known ones are. GPT, Claude, Gemini, and Llama are decoder-only Transformers; older models like BERT use the encoder variant. Some newer architectures (state-space models such as Mamba) explore alternatives, but the Transformer remains the dominant design for frontier LLMs as of 2026.
Related
- Tokenization: Tokenization splits text into tokens (the sub-word units a model reads and writes) and maps each to an integer ID the model processes.
- Inference: Inference is running a trained model to produce output; for LLMs, that means generating tokens one at a time. Its cost and latency define the economics of AI products.
- Context Window: The context window is the maximum text, measured in tokens, that an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- Mixture of Experts (MoE): MoE is a model architecture where a router activates only a few expert subnetworks per token, giving huge total capacity at a fraction of the compute per token.
- Reasoning Model: A reasoning model is an LLM trained to think before answering, generating internal reasoning tokens it can spend adaptively on hard problems.