# Transformer

> The neural-network architecture (Vaswani et al., 2017) that uses self-attention to process sequences in parallel — the basis of nearly all modern LLMs.

**A Transformer is a neural-network architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," that uses self-attention to process an entire sequence in parallel — and it underpins virtually every modern large language model.**

The key move was dropping recurrence. Earlier sequence models, such as RNNs and LSTMs, read tokens one at a time, so each step had to wait for the previous one and training could not be parallelized across the sequence. The Transformer instead uses [self-attention](/glossary/attention-mechanism) so each [token](/glossary/tokenization) can directly weigh every other token in the input at once. That parallelism is the whole point: the computation reduces to large matrix multiplications that map well onto GPUs, which is what made it practical to train models on enormous datasets.
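To make "every token weighs every other token at once" concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The names (`self_attention`, `w_q`, `w_k`, `w_v`), the tiny dimensions, and the random weights are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x has shape (seq_len, d_model); each weight matrix is (d_model, d_model).
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content that gets mixed together
    # One matrix product scores every token against every other token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical toy input: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Nothing here loops over positions: all pairwise scores come from a single `q @ k.T`, which is the kind of operation GPUs handle well.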

Architecturally, a Transformer stacks repeated blocks, each combining an attention layer with a feed-forward layer that is applied to every token independently. Because attention itself is order-agnostic (it has no built-in sense of sequence), the model adds positional information to the token representations so it knows which token came where.
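The order-agnostic claim can be checked directly. The sketch below stacks two simplified blocks (attention plus a feed-forward layer, each with a residual connection) and shows that shuffling the input only shuffles the output, until positional information is added. Several simplifications are assumptions for brevity: queries, keys, and values are the raw inputs, layer normalization is omitted, and the sinusoidal scheme from the 2017 paper stands in for whatever positional method a given model actually uses.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    # Simplification: no learned projections, so q = k = v = x.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def feed_forward(x, w1, w2):
    # Two-layer ReLU network applied to each token (row) independently.
    return np.maximum(0.0, x @ w1) @ w2

def block(x, w1, w2):
    x = x + attention(x)                # attention sublayer + residual
    return x + feed_forward(x, w1, w2)  # feed-forward sublayer + residual

def stack(x, layers):
    for w1, w2 in layers:
        x = block(x, w1, w2)
    return x

def sinusoidal_positions(seq_len, d_model):
    # Sine on even dimensions, cosine on odd ones; assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical toy setup: 5 tokens, 8-dimensional embeddings, 2 blocks.
rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 5, 8, 32
layers = [
    (0.1 * rng.normal(size=(d_model, d_hidden)),
     0.1 * rng.normal(size=(d_hidden, d_model)))
    for _ in range(2)
]
x = rng.normal(size=(seq_len, d_model))
perm = np.array([4, 2, 0, 3, 1])  # an arbitrary reordering of the tokens
pe = sinusoidal_positions(seq_len, d_model)

# Without positions, reordering the tokens just reorders the outputs.
order_blind = bool(np.allclose(stack(x, layers)[perm], stack(x[perm], layers)))

# With positions, a token that moves to a new slot gets a different output.
order_aware = not np.allclose(
    stack(x + pe, layers)[perm], stack(x[perm] + pe, layers)
)
print(order_blind, order_aware)
```

Both flags should come out true: the stack cannot tell a sentence from its shuffle unless something like `pe` is added to the input.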

Variants differ in how they consume text. The original 2017 model paired an encoder with a decoder for machine translation. Decoder-only models (the GPT- and Claude-style designs) predict the next token from the tokens before it and power generative chat; encoder-only variants (like BERT) read the full input in both directions for understanding tasks. Crucially, "scaling the Transformer" (more parameters, more data, more compute) is what turned this 2017 architecture into modern LLMs. Its design also shapes practical limits you hit at [inference](/glossary/inference) time, including the model's [context window](/glossary/context-window).
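In practice, the difference between decoder-style and encoder-style attention comes down to a mask. The sketch below is a simplified illustration with made-up scores: the function name `attention_weights` and the 4-token example are assumptions, and real models apply the same idea per head and per layer.

```python
import numpy as np

def attention_weights(scores, causal):
    """Turn raw (seq_len, seq_len) scores into attention weights.

    With causal=True, position i may only attend to positions <= i,
    which is what lets a decoder-only model be trained to predict the
    next token without seeing it.
    """
    if causal:
        seq_len = scores.shape[0]
        # True strictly above the diagonal marks "future" positions.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) is 0, so masked positions get no weight
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw scores for a 4-token sequence.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

encoder_style = attention_weights(scores, causal=False)  # sees the full input
decoder_style = attention_weights(scores, causal=True)   # sees only the past

print(np.round(decoder_style, 2))
```

The decoder-style matrix is lower triangular: the first token can attend only to itself, while the last token attends to the whole prefix. The encoder-style matrix has no zeros, since every token sees the full input.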

---

_Source: https://agentscamp.com/glossary/transformer — Term on AgentsCamp._