RAG (Retrieval-Augmented Generation)
RAG retrieves relevant documents from your own data and injects them into an LLM's prompt at query time, grounding answers in facts the model wasn't trained on.
RAG (retrieval-augmented generation) is the technique of fetching relevant documents from your own data and inserting them into a language model's prompt at query time, so the model answers from retrieved facts instead of training-data memory alone.
The pipeline has two halves. Offline, your documents are split into chunks, converted to embeddings, and stored in a vector database. Online, the user's question is embedded the same way, the most similar chunks are retrieved (often refined by reranking), and those chunks are placed in the prompt alongside the question. The model then generates an answer grounded in what was retrieved.
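The two halves described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy word-count stand-in for a real embedding model, the "vector database" is a plain Python list, reranking is omitted, and every function name here (`chunk`, `build_index`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document: str, max_words: int = 12) -> list[str]:
    """Split a document into fixed-size word windows (the simplest chunking strategy)."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Offline half: chunk every document and store (chunk text, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Online half, step 1: embed the question the same way and take the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Online half, step 2: place the retrieved chunks in the prompt alongside the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Example data is made up for illustration.
documents = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
]
index = build_index(documents)
question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(index, question, k=1))
print(prompt)  # this string is what would be sent to the language model
```

The model call itself is left out on purpose: everything that makes this RAG happens before it, in what gets indexed and what gets retrieved.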
RAG became the default architecture for "chat with your data" because it covers the two things a model cannot do alone, know private information and know current information, without the cost of retraining. Its quality ceiling is retrieval quality: if the right chunk isn't fetched, even the best model answers wrong. That is why most RAG engineering effort goes into chunking, search, and reranking rather than the model call.
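One common way to make that ceiling visible is to measure retrieval on its own, before any model call. The sketch below computes a hit rate at k: the fraction of test questions whose known-correct chunk appears in the top k results. The function name `hit_rate_at_k`, the chunk ids, and the example data are all hypothetical, and it assumes you have a small labeled set with one correct chunk per question.

```python
def hit_rate_at_k(retrieved_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold chunk id appears among the first k retrieved ids."""
    if len(retrieved_ids) != len(gold_ids):
        raise ValueError("need exactly one gold chunk id per query")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(retrieved_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Made-up results for three test questions: ranked chunk ids per question.
retrieved = [
    ["c3", "c1", "c7"],
    ["c2", "c9", "c4"],
    ["c5", "c8", "c6"],
]
gold = ["c1", "c4", "c0"]  # the chunk that actually answers each question

for k in (1, 2, 3):
    print(k, hit_rate_at_k(retrieved, gold, k))
```

A question whose gold chunk never appears in the top k (the third one here) cannot be answered correctly from retrieved context, whichever model is used downstream.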
For the full pipeline, stage by stage, see How RAG Actually Works.
Frequently asked questions
- What problem does RAG solve?
  Models only know their training data: nothing about your codebase, your docs, or anything after their cutoff. RAG closes that gap at query time by fetching the relevant slice of your data and putting it in the prompt, which grounds answers in real sources and sharply reduces hallucination on private or fresh information.
- Is RAG the same as fine-tuning?
  No. RAG supplies knowledge at query time without changing the model; fine-tuning changes the model's weights to teach behavior or style. Knowledge that changes often belongs in RAG; durable behavior belongs in fine-tuning. Many production systems use both.
Related
- How RAG Actually Works: Ingestion, Chunking, Retrieval & Reranking
  A clear, practical walkthrough of the retrieval-augmented generation pipeline: what each stage does, where it fails, and how the pieces fit together.
- Hybrid Search & Reranking: From Top-50 Recall to Top-5 Precision
  How production RAG combines dense and sparse search, fuses with RRF, and reranks, turning a wide candidate set into the few passages that actually answer.
- Embedding
  An embedding is a vector of numbers representing text's meaning, placed so similar texts land close together: the foundation of semantic search and RAG.
- Vector Database
  A vector database stores embeddings and answers nearest-neighbor queries fast. It is the retrieval layer under RAG and semantic search, using ANN indexes like HNSW.
- Hallucination
  A hallucination is fluent, confident output that is factually wrong or fabricated: plausible text unsupported by any source, the signature LLM failure mode.
- Rag Pipeline Engineer
  Use this agent to design, build, and harden a production retrieval-augmented generation (RAG) pipeline end to end (ingestion, chunking, embeddings, indexing, retrieval, reranking, and grounded generation) with evals that prove each stage works. Examples: "stand up RAG over our docs", "our RAG hallucinates and misses obvious answers, fix the pipeline", "take our prototype RAG to production with evals and citations".
- GraphRAG Explained: When Knowledge Graphs Beat Vector Search
  What GraphRAG is, how graph-based retrieval differs from vector RAG, the query shapes where it wins, and the honest costs before you build one.
- RAG vs Long Context: Do Million-Token Windows Kill Retrieval?
  Million-token context windows promised the end of RAG. The honest 2026 answer: long context changed where retrieval starts paying, not whether it does.
- Getting Web Data into AI Agents: Search & Scraping APIs Compared
  The agent web-data layer (Exa for semantic search, Firecrawl for extraction at scale, Tavily for all-in-one, Jina Reader for zero-setup) and how they compose.
- Jina Reader
  Prepend r.jina.ai/ to any URL and get LLM-ready markdown: JS rendering, PDFs and Office docs, image captioning, and s.jina.ai for read-the-results search.
- Tavily
  The web-access layer for agents: Search, Extract, Crawl, Map, and Research APIs purpose-built for LLMs, behind one key, with a hosted MCP server.
- Agent Memory
  Agent memory is how an AI agent retains information beyond its context window: working state during a task and persistent knowledge across sessions.
- Chunking
  Chunking splits documents into retrievable pieces before embedding. It is the RAG design decision that quietly determines retrieval quality.
- Context Window
  The context window is the maximum text, measured in tokens, an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- Fine-Tuning
  Fine-tuning continues training a pretrained model on your own examples, changing its weights to teach durable behavior, format, or domain style.
- Grounding
  Grounding ties a model's output to verifiable sources (retrieved documents, tool results, citations) instead of training-data memory.
- Hybrid Search
  Hybrid search runs keyword (BM25) and semantic (vector) retrieval together and merges the results, catching both exact terms and paraphrases.
- Reranking
  Reranking is a second-pass scoring step: a cross-encoder model re-orders the top results from fast retrieval so the truly relevant few rise to the top.
- Semantic Search
  Semantic search retrieves results by meaning rather than keyword overlap, embedding queries and documents in one vector space and matching by similarity.