# How RAG Actually Works: Ingestion, Chunking, Retrieval & Reranking

> A clear, practical walkthrough of the retrieval-augmented generation pipeline — what each stage does, where it fails, and how the pieces fit together.

RAG works by retrieving relevant passages from your data and putting them in the model's context before it answers. The quality of the answer is capped by the quality of retrieval — so most of the engineering is in ingestion, chunking, embeddings, indexing, retrieval, and reranking, not in the prompt.

Retrieval-augmented generation (RAG) is the most common way to make a language model answer questions about *your* data — private docs, a codebase, support tickets, contracts — instead of only what it absorbed in training. The idea is simple: **retrieve the relevant passages, then ask the model to answer using them.** The engineering is in making retrieval good, because **the answer can only be as good as what you retrieve.**

This guide walks the whole pipeline, what each stage is for, and where it tends to break.

## The one principle that explains everything

RAG is a pipeline of stages, and **a failure in an early stage cannot be repaired by a later one.** If the chunk containing the answer is never retrieved, neither prompt engineering nor a bigger model will produce a correct, grounded answer, because the model simply doesn't have the information. So the order of priority is: get retrieval right first, then improve generation. Teams that invert this, polishing the prompt while retrieval quietly fails, ship confident, wrong answers.

## The pipeline, stage by stage

### Ingestion

You load the source material and clean it. The unglamorous part matters: stripping navigation, repeated headers/footers, and boilerplate prevents that text from later dominating retrieval and crowding out real answers.
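
One way to catch repeated headers and footers is to drop any line that recurs on most pages. The sketch below is a minimal illustration under that assumption; the function name, the `min_fraction` threshold, and the sample pages are all hypothetical, and real ingestion usually needs format-specific parsing as well.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_fraction` of pages (likely boilerplate)."""
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return list(pages)

    # Count each distinct line once per page, so a line repeated inside one
    # page does not look like cross-page boilerplate.
    page_counts: Counter[str] = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = min_fraction * len(pages)
    boilerplate = {line for line, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    return cleaned


# Toy pages: the first and last line of each page are navigation and footer text.
pages = [
    "Acme Docs | Home | Pricing\nRotate API keys on a regular schedule.\n(c) Acme Inc.",
    "Acme Docs | Home | Pricing\nRevoked keys stop working immediately.\n(c) Acme Inc.",
    "Acme Docs | Home | Pricing\nKeys are scoped per project.\n(c) Acme Inc.",
]
cleaned_pages = strip_repeated_lines(pages)
```

After this pass each page keeps only its one line of real content, so the navigation bar and footer never reach the chunker.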

### Chunking

You split documents into passages, the units you'll embed and retrieve. This is the **highest-leverage and most overlooked** stage. Chunks that are too large dilute meaning (and retrieve "sort of relevant" pages); chunks that are too small lose the context that makes them answerable. There's no universal best size, because it depends on your documents and embedding model, so you measure it rather than guess. (The [chunking-strategy-optimizer](/skills/data/chunking-strategy-optimizer) skill sweeps configurations against an eval set.)
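
As a baseline to measure against, a fixed-size chunker with overlap might look like the sketch below. It counts whitespace-separated words rather than model tokens, and the `size` and `overlap` defaults are arbitrary starting points for a sweep, not recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
# chunks == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.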

### Embedding

Each chunk becomes a **vector** — a list of numbers positioning it in a semantic space where similar meanings sit close together. A retrieval-tuned embedding model is what makes "how do I rotate keys?" land near a passage titled "Credential rotation." Picking the model is a real decision with lock-in (changing it means re-embedding everything) — see [Choosing Embeddings in 2026](/guides/concepts/choosing-embeddings-2026).
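
"Close together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions, and the passage labels in the comments are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
query_vec = [0.9, 0.1, 0.0]     # "how do I rotate keys?"
rotation_vec = [0.8, 0.2, 0.1]  # passage titled "Credential rotation"
billing_vec = [0.0, 0.2, 0.9]   # passage about invoices

sim_rotation = cosine_similarity(query_vec, rotation_vec)
sim_billing = cosine_similarity(query_vec, billing_vec)
```

Here `sim_rotation` comes out far higher than `sim_billing`, which is the behaviour a retrieval-tuned model is trained to produce for real text.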

> [!NOTE]
> Many embedding models are *asymmetric*: they encode documents and queries differently, so embed your corpus with the "document" input type and the question with the "query" input type. Getting this wrong silently hurts retrieval.

### Indexing

Vectors (plus metadata like source, date, and tenant) go into a **vector database** built for fast nearest-neighbour search with filtering — for example [Qdrant](/tools/qdrant). The metadata lets you constrain retrieval ("only this product's docs") without losing recall.

### Retrieval

At query time you embed the question and pull the nearest chunks. The key move is to **over-retrieve**, grabbing a wide candidate set (roughly the top 25 to 50), and to use **hybrid search** (dense vectors plus sparse/keyword matching) so that exact terms like error codes, IDs, and product names aren't missed by pure semantic similarity. The next guide, [Hybrid Search & Reranking](/guides/concepts/hybrid-search-reranking), covers this in depth.
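
The dense and sparse result lists have to be merged somehow. One common choice, which this guide does not prescribe, is reciprocal rank fusion (RRF): each chunk earns `1 / (k + rank)` from every list it appears in. The sketch below assumes you already have two ranked lists of chunk ids; the ids and the `top_n` parameter are hypothetical, and `k = 60` is a conventional default for the constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 50) -> list[str]:
    """Fuse several ranked id lists into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


dense_hits = ["c7", "c2", "c9"]   # from vector similarity
sparse_hits = ["c4", "c7", "c2"]  # from keyword matching, e.g. an exact error code
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["c7", "c2", "c4", "c9"]
```

Chunks found by both retrievers rise to the top, while `c4`, which only the keyword side found, still makes it into the candidate set instead of being lost.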

### Reranking

A first-stage retriever is fast but approximate. A **reranker** is a slower, more accurate cross-encoder that reads the query and each candidate *together* and reorders them by true relevance. You rerank the wide candidate set down to the few passages you actually put in the prompt. It's one of the cheapest, highest-impact upgrades to RAG quality.
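
The reranking step itself is small once you have a scoring function. In the sketch below, `overlap_score` is a toy stand-in based on word overlap so the example runs without a model; it is not a cross-encoder. In practice you would pass a function that wraps your cross-encoder's query-passage score. All names here are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep the `keep` best passages."""
    scored = [(score(query, passage), passage) for passage in candidates]
    # list.sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "billing cycles and invoices",
    "how to rotate api keys safely",
    "api rate limits",
]
top = rerank("rotate api keys", candidates, overlap_score, keep=2)
# top == ["how to rotate api keys safely", "api rate limits"]
```

Because the scorer sees the query and the passage together, it can be swapped for a stronger model without changing the retrieval code around it.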

### Generation

Finally, the model gets the question plus the top passages. Instruct it to **answer only from the provided context, cite the sources** it used, and say "I don't have enough information" when the context doesn't contain the answer. That grounding — not a clever system prompt — is your primary defense against hallucination.
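
A grounded prompt along those lines could be assembled as in the sketch below. The wording of the instructions, the numbered-citation format, and the sample passages are illustrative assumptions, not a prescribed template.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Build a prompt from (source, text) pairs, numbered so the model can cite them."""
    context = "\n\n".join(
        f"[{i}] ({source})\n{text}"
        for i, (source, text) in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the passages you used by their bracketed number, e.g. [1].\n"
        "If the context does not contain the answer, reply exactly: "
        "\"I don't have enough information.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


prompt = build_grounded_prompt(
    "How often should keys be rotated?",
    [
        ("security.md", "Rotate API keys on a regular schedule."),
        ("faq.md", "Revoked keys stop working immediately."),
    ],
)
```

Numbering the passages gives the model something concrete to cite and gives you something to check: an answer that cites `[3]` when only two passages were supplied is an easy failure to detect.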

## Where RAG goes wrong (and which stage to blame)

- **Wrong or vague answers** → usually **retrieval**: the right chunk wasn't in context. Measure recall@k before touching the prompt.
- **Misses exact terms** (codes, IDs, names) → add a **sparse/keyword** component (hybrid search).
- **Relevant chunk retrieved but ignored** → improve **reranking** or reduce the number of low-quality passages in the prompt.
- **Confident hallucinations** → tighten **generation**: enforce grounding, citations, and a valid "not found" path.
- **"Invisible" documents** → an **ingestion/chunking/embedding** bug (empty chunks, boilerplate domination, normalization mismatch). The [embedding-set-inspector](/skills/data/embedding-set-inspector) skill catches these.
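
Recall@k, mentioned in the first item above, is the fraction of gold passages that appear among the top k retrieved chunks, averaged over the eval questions. A minimal sketch, with hypothetical chunk ids and a toy three-question eval set:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold chunk ids found in the first k retrieved ids."""
    if not gold:
        raise ValueError("each eval question needs at least one gold chunk")
    return len(set(retrieved[:k]) & gold) / len(gold)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved ids, gold ids) pairs."""
    return sum(recall_at_k(retrieved, gold, k) for retrieved, gold in runs) / len(runs)


runs = [
    (["c1", "c5", "c3"], {"c3"}),        # gold chunk found at rank 3
    (["c8", "c2", "c9"], {"c4"}),        # gold chunk never retrieved
    (["c6", "c7", "c1"], {"c6", "c1"}),  # two gold chunks, at ranks 1 and 3
]
recall_2 = mean_recall_at_k(runs, k=2)
recall_3 = mean_recall_at_k(runs, k=3)
```

On this toy set recall rises from about 0.17 at k=2 to about 0.67 at k=3, and the second question stays at zero however large k grows within the list, which points at retrieval rather than the prompt.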

## How to build it well

The throughline: treat RAG as a measured system. Build a small eval set of real questions with their gold passages, get **retrieval** right against it first, then layer reranking and grounded generation, and re-run the eval as a regression gate. For the end-to-end build, the [rag-pipeline-engineer](/agents/data-ai/rag-pipeline-engineer) agent owns exactly this workflow; for tuning the retrieval half in isolation, the [retrieval-engineer](/agents/data-ai/retrieval-engineer) does.

---

_Source: https://agentscamp.com/guides/concepts/how-rag-works — Guide on AgentsCamp._