How RAG Actually Works: Ingestion, Chunking, Retrieval & Reranking
A clear, practical walkthrough of the retrieval-augmented generation pipeline — what each stage does, where it fails, and how the pieces fit together.
RAG works by retrieving relevant passages from your data and putting them in the model's context before it answers. The quality of the answer is capped by the quality of retrieval — so most of the engineering is in ingestion, chunking, embeddings, indexing, retrieval, and reranking, not in the prompt.
Steps at a glance
- Ingest. Load source documents (docs, tickets, code, PDFs) and clean them — strip boilerplate, navigation, and duplicated headers/footers that would otherwise crowd retrieval.
- Chunk. Split documents into passages sized to your embedding model and retrieval granularity. This is the highest-leverage knob; sweep strategies rather than guessing.
- Embed. Turn each chunk into a vector with a retrieval-tuned embedding model, using the document input type for the corpus and the query input type at search time.
- Index. Store vectors plus metadata in a vector database so you can run fast nearest-neighbour search with payload filtering.
- Retrieve. Embed the question and pull a wide candidate set (the top 25 to 50 chunks), ideally with hybrid dense + sparse search so exact terms aren't missed.
- Rerank. Reorder the candidates with a cross-encoder and keep only the few most relevant passages to put in the prompt.
- Generate. Give the model the question plus the top passages, instruct it to answer only from that context and cite sources, and allow an honest 'not enough information.'
Key takeaways
- RAG = retrieve relevant context, then generate an answer grounded in it. The model only knows what you retrieve.
- Answer quality is capped by retrieval quality — a failure in an early stage can't be fixed by a later one.
- The pipeline is ingestion → chunking → embedding → indexing → retrieval → reranking → grounded generation.
- Chunking quietly sets the ceiling: if the answer never lands in one chunk, nothing downstream recovers it.
- Force citations and allow 'not found' — grounding the model in retrieved context is the main hallucination defense.
Retrieval-augmented generation (RAG) is the most common way to make a language model answer questions about your data — private docs, a codebase, support tickets, contracts — instead of only what it absorbed in training. The idea is simple: retrieve the relevant passages, then ask the model to answer using them. The engineering is in making retrieval good, because the answer can only be as good as what you retrieve.
This guide walks the whole pipeline, what each stage is for, and where it tends to break.
The one principle that explains everything
RAG is a pipeline of stages, and a failure in an early stage cannot be repaired by a later one. If the chunk containing the answer is never retrieved, no amount of prompt engineering or a bigger model will produce a correct, grounded answer — the model simply doesn't have the information. So the order of priority is: get retrieval right first, then improve generation. Teams who invert this — polishing the prompt while retrieval quietly fails — ship confident, wrong answers.
The pipeline, stage by stage
Ingestion
You load the source material and clean it. The unglamorous part matters: stripping navigation, repeated headers/footers, and boilerplate prevents that text from later dominating retrieval and crowding out real answers.
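One way to strip repeated headers and footers is to drop any line that recurs across most pages. The following is a minimal sketch, not a full cleaner: the function name `strip_repeated_lines` and the 60% threshold are assumptions chosen for illustration, and real corpora usually need format-specific parsing on top of this.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that appear on at least `min_fraction` of pages (likely nav, header, footer)."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against, so leave the input alone

    # Count each distinct non-empty line once per page.
    page_counts: Counter = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = min_fraction * len(pages)
    boilerplate = {line for line, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln.strip() for ln in page.splitlines()
                if ln.strip() and ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned


# Made-up pages for illustration only.
pages = [
    "Acme Docs | Home | Search\nRotate API keys every 90 days.\n(c) Acme Inc.",
    "Acme Docs | Home | Search\nWebhooks retry up to five times.\n(c) Acme Inc.",
    "Acme Docs | Home | Search\nInvoices are issued monthly.\n(c) Acme Inc.",
]
print(strip_repeated_lines(pages))
```

The threshold is worth tuning per corpus: set it too low and legitimately repeated sentences (a standard warning, say) disappear along with the navigation.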
Chunking
You split documents into passages — the units you'll embed and retrieve. This is the highest-leverage and most-overlooked stage. Chunks that are too large dilute meaning (and retrieve "sort of relevant" pages); too small and they lose the context that makes them answerable. There's no universal best size — it depends on your documents and embedding model — so you measure it rather than guess. (The chunking-strategy-optimizer skill sweeps configurations against an eval set.)
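As a baseline to sweep against, here is a sketch of a fixed-size chunker with overlap. It splits on whitespace, which is an assumption made to keep the example dependency-free; a real pipeline would more likely count tokens with the embedding model's tokenizer and respect headings or sentence boundaries. The name `chunk_words` and the default sizes are illustrative, not recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks


# A sweep is then just a loop over candidate configurations.
doc = " ".join(f"w{i}" for i in range(10))
for size, overlap in [(4, 1), (5, 2)]:
    print(size, overlap, chunk_words(doc, size=size, overlap=overlap))
```

In a real sweep each configuration would be embedded, indexed, and scored against the eval set; the loop above only shows where the knobs are.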
Embedding
Each chunk becomes a vector — a list of numbers positioning it in a semantic space where similar meanings sit close together. A retrieval-tuned embedding model is what makes "how do I rotate keys?" land near a passage titled "Credential rotation." Picking the model is a real decision with lock-in (changing it means re-embedding everything) — see Choosing Embeddings in 2026.
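To make "sit close together" concrete: cosine similarity is one common way to measure closeness between two vectors, though this guide does not prescribe a metric. The sketch below uses tiny hand-written vectors as stand-ins for real embeddings, so the numbers are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Hand-written 3-dimensional vectors, NOT output from a real embedding model.
query = [0.9, 0.1, 0.0]                 # "how do I rotate keys?"
credential_rotation = [0.8, 0.2, 0.1]   # passage titled "Credential rotation"
billing = [0.0, 0.1, 0.9]               # an unrelated billing passage

print(cosine_similarity(query, credential_rotation) > cosine_similarity(query, billing))
```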
Note: Many embedding models are asymmetric: embed your corpus with the "document" input type and the question with the "query" input type. Mixing them up raises no error; retrieval quality just gets worse.
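Because the wrong input type fails silently, it can help to make the choice explicit at every call site. The sketch below is one hypothetical way to do that. The helper `prepare_for_embedding` and the prefix strings are assumptions for illustration: some model families expect text prefixes along these lines, while many hosted APIs take an input-type parameter instead, so check your model's documentation for the exact mechanism.

```python
from typing import Literal

InputType = Literal["document", "query"]

# Hypothetical prefixes. Your model may use different markers or an API parameter instead.
PREFIXES = {"document": "passage: ", "query": "query: "}


def prepare_for_embedding(texts: list[str], input_type: InputType) -> list[str]:
    """Tag each text with its input type so corpus and query text cannot be mixed up silently."""
    if input_type not in PREFIXES:
        raise ValueError(f"input_type must be one of {sorted(PREFIXES)}, got {input_type!r}")
    return [PREFIXES[input_type] + text.strip() for text in texts]


# Corpus side (at indexing time) and query side (at search time).
print(prepare_for_embedding(["Credential rotation"], "document"))
print(prepare_for_embedding(["how do I rotate keys?"], "query"))
```

Requiring the argument, with no default, is the point of the design: a forgotten input type becomes an immediate error instead of a quiet drop in recall.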
Indexing
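Vectors (plus metadata like source, date, and tenant) go into a vector database built for fast nearest-neighbour search with filtering, for example Qdrant. The metadata lets you constrain retrieval ("only this product's docs") as part of the search itself, so the filter narrows the candidates instead of discarding results after the top matches have already been chosen, which is where recall would otherwise be lost.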
Retrieval
At query time you embed the question and pull the nearest chunks. The key move is to over-retrieve: grab a wide candidate set (the top 25 to 50 chunks) and use hybrid search (dense vectors plus sparse/keyword matching) so that exact terms like error codes, IDs, and product names aren't missed by pure semantic similarity. The next guide, Hybrid Search & Reranking, covers this in depth.
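Hybrid search needs a way to merge the dense and sparse result lists. Reciprocal rank fusion (RRF), which the Hybrid Search & Reranking guide mentions, is one option because it uses only ranks and so avoids comparing scores on different scales. A minimal sketch, assuming each retriever returns an ordered list of chunk ids; the constant `k = 60` is a commonly used default, not a value taken from this guide.

```python
from typing import Optional


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: Optional[int] = None) -> list[str]:
    """Fuse several ranked id lists: each list adds 1 / (k + rank) to the ids it contains."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


# Made-up result lists for illustration.
dense_hits = ["c1", "c2", "c3"]    # from vector search
sparse_hits = ["c4", "c1", "c5"]   # from keyword search
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Here `c1` comes out on top because both retrievers found it, while `c4`, an exact-term match that dense search missed entirely, still makes the candidate set.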
Reranking
A first-stage retriever is fast but approximate. A reranker is a slower, more accurate cross-encoder that reads the query and each candidate together and scores how well that passage answers the query. Because it runs once per candidate, you apply it only to the wide candidate set and keep the few top-scoring passages you actually put in the prompt. It's one of the cheapest, highest-impact upgrades to RAG quality.
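The rerank step itself is small once you have a scoring function. In the sketch below, `score_fn` is where a real cross-encoder would plug in; `token_overlap_score` is a deliberately crude stand-in so the example runs without a model, and it should not be mistaken for a cross-encoder. Both names are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep the `top_n` highest-scoring passages."""
    scored = [(score_fn(query, passage), position, passage)
              for position, passage in enumerate(candidates)]
    # Sort by score descending; original position breaks ties to keep the order stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:top_n]]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy scorer: the fraction of query words that also appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    return len(query_tokens & passage_tokens) / len(query_tokens) if query_tokens else 0.0


candidates = [
    "Billing cycles and invoices",
    "How to rotate API keys safely",
    "API rate limits",
]
print(rerank("rotate api keys", candidates, token_overlap_score, top_n=2))
```

Keeping the scorer as a parameter means the surrounding pipeline does not change when you swap the toy function for a real model.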
Generation
Finally, the model gets the question plus the top passages. Instruct it to answer only from the provided context, cite the sources it used, and say "I don't have enough information" when the context doesn't contain the answer. That grounding — not a clever system prompt — is your primary defense against hallucination.
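Those three instructions (answer only from context, cite sources, allow "not found") can be assembled into a prompt like the sketch below. The exact wording, the numbered-citation format, and the function name `build_grounded_prompt` are assumptions for illustration; treat the template as a starting point to test against your eval set.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Build a prompt from (source_id, text) passages, numbered so the model can cite them."""
    context = "\n\n".join(
        f"[{number}] (source: {source_id})\n{text}"
        for number, (source_id, text) in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the passages you used by number, like [1].\n"
        "If the context does not contain the answer, reply exactly: "
        "\"I don't have enough information.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Made-up passage for illustration.
prompt = build_grounded_prompt(
    "How often should API keys be rotated?",
    [("security.md", "Rotate API keys every 90 days.")],
)
print(prompt)
```

Numbering the passages gives you something checkable afterwards: a cited number that does not exist in the context is an easy signal that the answer was not grounded.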
Where RAG goes wrong (and which stage to blame)
- Wrong or vague answers → usually retrieval: the right chunk wasn't in context. Measure recall@k (the share of known-correct passages that appear in the top k results) before touching the prompt.
- Misses exact terms (codes, IDs, names) → add a sparse/keyword component (hybrid search).
- Relevant chunk retrieved but ignored → improve reranking or reduce the number of low-quality passages in the prompt.
- Confident hallucinations → tighten generation: enforce grounding, citations, and a valid "not found" path.
- "Invisible" documents → an ingestion/chunking/embedding bug (empty chunks, boilerplate domination, normalization mismatch). The embedding-set-inspector skill catches these.
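Several of the diagnoses above start from recall@k, so it is worth seeing how little code it takes. A minimal sketch, assuming you have the retriever's ranked chunk ids and a set of gold (known-correct) chunk ids for the question; the ids below are made up.

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold chunk ids that appear among the first k retrieved ids."""
    if not gold:
        raise ValueError("gold must contain at least one chunk id")
    hits = gold & set(retrieved[:k])
    return len(hits) / len(gold)


retrieved = ["c7", "c2", "c9", "c4"]  # ranked output of the retriever
gold = {"c2", "c4"}                   # passages that actually contain the answer

print(recall_at_k(retrieved, gold, k=3))  # 0.5: only c2 is in the top 3
print(recall_at_k(retrieved, gold, k=4))  # 1.0: both gold passages are in the top 4
```

If recall@k is low at the k you actually retrieve, the problem is upstream of the prompt, and no amount of generation tuning will fix it.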
How to build it well
The throughline: treat RAG as a measured system. Build a small eval set of real questions with their gold passages, get retrieval right against it first, then layer reranking and grounded generation, and re-run the eval as a regression gate. For the end-to-end build, the rag-pipeline-engineer agent owns exactly this workflow; for tuning the retrieval half in isolation, the retrieval-engineer does.
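The "regression gate" idea can be sketched as two small functions: one averages recall over the eval set, the other compares the result to a stored baseline. Everything here is hypothetical scaffolding: `retriever` is any callable that maps a question to ranked chunk ids, and the baseline and tolerance values are placeholders you would set from your own measurements.

```python
from typing import Callable


def mean_recall_at_k(retriever: Callable[[str], list[str]],
                     eval_set: list[tuple[str, list[str]]], k: int = 5) -> float:
    """Average recall@k over (question, gold_chunk_ids) pairs."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = 0.0
    for question, gold_ids in eval_set:
        top_k = set(retriever(question)[:k])
        total += len(set(gold_ids) & top_k) / len(gold_ids)
    return total / len(eval_set)


def passes_gate(current: float, baseline: float, tolerance: float = 0.01) -> bool:
    """True if the current score has not dropped more than `tolerance` below the baseline."""
    return current >= baseline - tolerance


# Canned results standing in for a real retriever, for illustration only.
canned = {
    "how do I rotate keys?": ["c2", "c5"],
    "what is the refund window?": ["c1", "c3"],
}
eval_set = [
    ("how do I rotate keys?", ["c2"]),
    ("what is the refund window?", ["c8"]),
]

score = mean_recall_at_k(lambda q: canned[q], eval_set, k=5)
print(score, passes_gate(score, baseline=0.5))
```

Run in CI, a failed gate blocks a change to chunking, embeddings, or retrieval settings before it reaches users, which is the practical meaning of treating RAG as a measured system.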
Frequently asked questions
- What is retrieval-augmented generation (RAG)?
  RAG is a technique where, instead of relying only on what a model learned in training, you retrieve relevant passages from your own data at query time and put them in the model's context so it can answer grounded in current, specific information. It's how you make an LLM answer questions about your private docs, code, or knowledge base.
- Why does my RAG system hallucinate or miss obvious answers?
  Almost always because retrieval failed: the right passage never made it into the model's context. The model can only use what you retrieve, so if chunking, embeddings, or retrieval are weak, the model fills the gap by guessing. Fix retrieval (measure recall@k) before touching the prompt.
- Do I still need RAG if models have huge context windows?
  Often yes. Large context windows let you stuff more in, but cost, latency, and the 'lost in the middle' effect all grow with context size, and you still can't fit a whole knowledge base. RAG retrieves just the relevant slice, which is usually cheaper, faster, and more accurate than dumping everything in.
- What's the difference between RAG and fine-tuning?
  RAG injects knowledge at query time via retrieval; fine-tuning bakes behavior or style into the model's weights via training. RAG is the right tool for changing, factual knowledge you need cited; fine-tuning is for teaching a consistent format, tone, or task. They're complementary, not competitors.
Related
- Hybrid Search & Reranking: From Top-50 Recall to Top-5 Precision. How production RAG combines dense and sparse search, fuses with RRF, and reranks, turning a wide candidate set into the few passages that actually answer.
- Choosing Embeddings in 2026: OpenAI vs Cohere vs Voyage vs Open-Source. A decision guide for picking an embedding model for retrieval: accuracy, dimensions, cost, multilingual and domain fit, self-hosting, and lock-in.
- Rag Pipeline Engineer. Use this agent to design, build, and harden a production retrieval-augmented generation (RAG) pipeline end to end (ingestion, chunking, embeddings, indexing, retrieval, reranking, and grounded generation), with evals that prove each stage works. Examples: "stand up RAG over our docs", "our RAG hallucinates and misses obvious answers, fix the pipeline", "take our prototype RAG to production with evals and citations".
- Chunking Strategy Optimizer. Find the chunking strategy and size that maximizes retrieval quality for a specific corpus, by sweeping configurations against a fixed eval set instead of guessing. Use when RAG answers miss obvious content, when standing up a new corpus, or when picking chunk size/overlap.
- Building an MCP Server. An accurate introduction to the Model Context Protocol: server anatomy, transports, and connecting a tool to Claude Code.
- ML Engineer. Use this agent for production ML: pipelines, training, serving, evaluation, and MLOps. Examples: building a training pipeline, deploying a model, setting up evaluation.
- Best Vector Database in 2026: pgvector vs Pinecone vs Qdrant vs Weaviate vs Milvus vs Chroma vs LanceDB. A decision guide to vector databases: embedded, server, or managed; whether you already run Postgres; and which fits your scale, filtering, and RAG needs.
- Fine-Tune vs RAG vs Prompt vs Distill: The 2026 Decision Tree. When to reach for prompt engineering, RAG, fine-tuning, or distillation: what each actually changes, where each fails, and how to combine them.
- Chonkie. A lightweight, fast chunking library for RAG with many splitting strategies in one API.
- Chroma. An open-source, Python-first vector database that runs in-process; the fastest path from pip install to a working retrieval prototype.
- RAGAS. An open-source framework for evaluating retrieval-augmented generation with reference-free RAG metrics.