Scaffold RAG Pipeline
Scaffold a Retrieval-Augmented Generation pipeline — ingestion (load, chunk, embed, upsert) and retrieval (search, rerank, grounded prompt with citations) — fitted to the project's stack.
/scaffold-rag-pipeline<data source and use case>npx agentscamp add commands/scaffold-rag-pipelineInstall to ~/.claude/commands/scaffold-rag-pipeline.md
Scaffolds a Retrieval-Augmented Generation pipeline fitted to the project's stack: an idempotent ingestion half (load, clean, chunk, embed, upsert) and a retrieval half (embed query, vector search, optional rerank, assemble a grounded prompt with source citations). States the chunking, embedding, vector-store, and top-k choices up front and leaves a slot for evaluation.
Scope
Treat $ARGUMENTS as the data source(s) and the use case — e.g. "our markdown docs, for an in-app Q&A assistant" or "support tickets in Postgres, for answer suggestions". Restate it in one sentence to confirm before scaffolding.
If $ARGUMENTS is empty, ask one focused question: "What are you retrieving over, and what's the use case?" Do not scaffold a generic pipeline against an imagined corpus.
WARNING
Chunking quality dominates retrieval quality. A great embedding model and a great vector store cannot rescue chunks that split a sentence in half or merge three unrelated sections. Spend your attention on Step 3, not on picking a fancier model.
Step 1 — Detect the stack and existing AI dependencies
Before writing anything, ground the scaffold in what's already here:
- Identify the language/runtime —
Globforpackage.json,pyproject.toml,requirements.txt,go.mod, etc. Grepfor AI/RAG deps already in use:openai,@anthropic-ai/sdk,anthropic,langchain,llamaindex,@ai-sdk, and any vector store client (pinecone,weaviate,chromadb,qdrant,pgvector,@supabase).Grepfor an existing embeddings/vector call so you extend the project's conventions instead of introducing a parallel one.
Match the scaffold to what you find. If the project already has a vector store or an LLM client, build on it rather than adding a competing dependency.
Step 2 — Decide and state the key choices
Write these decisions at the top of the generated code as a comment block, so they're reviewable and tunable. Pick concrete defaults — don't punt to "configurable":
- Chunking — split on natural boundaries (headings, paragraphs, code blocks), not a blind character count. Default: ~400-800 tokens per chunk, 10-15% overlap. Attach metadata to every chunk:
source,title,heading, and a line/char range for citation. - Embedding model — use the project's existing provider if one is present; otherwise pick a current general-purpose embedding model and pin the dimension. State it explicitly so ingestion and retrieval can never drift apart.
- Vector store — reuse what's installed; if nothing exists, default to whatever the deployment already runs (e.g.
pgvectorif there's a Postgres, otherwise a local store). Store the chunk text alongside the vector and metadata. - Retrieval — default top-k of 8-12 candidates, then an optional rerank pass down to the 3-5 chunks actually placed in the prompt.
- Generation — when a generation model is needed (answer synthesis, rerank-by-LLM), default to Anthropic's latest, most capable model:
claude-opus-4-8.
NOTE
Pin the embedding model and dimension in one shared constant imported by both halves. If ingestion embeds with one model and retrieval queries with another, every search silently returns noise — and there's no error to catch it.
Step 3 — Scaffold ingestion (idempotent, re-runnable)
Generate the ingestion path as: load → clean → chunk → embed → upsert.
- Load the source(s) from
$ARGUMENTS. - Clean — strip boilerplate, normalize whitespace, drop empty fragments.
- Chunk per the Step 2 strategy, carrying source metadata into each chunk.
- Embed each chunk in batches with retry/backoff.
- Upsert by a stable content-derived ID (e.g. a hash of
source+ chunk index + chunk text) so re-running the pipeline replaces changed chunks and skips unchanged ones instead of duplicating them.
Make it safe to run repeatedly against a partially-populated store — that's the whole point of a content-derived key.
Step 4 — Scaffold retrieval (grounded, with citations)
Generate the query path as: embed query → vector search → optional rerank → assemble grounded prompt.
- Embed the incoming query with the same pinned model from Step 2.
- Vector-search for top-k candidates.
- Optionally rerank (cross-encoder or LLM-as-reranker) down to the few chunks that go in the prompt.
- Assemble a prompt that includes the selected chunks and their source attributions, instructing the model to answer only from the provided context, cite each claim by source, and say it doesn't know when the context doesn't cover the question.
- Return the answer with the source list, so the caller can render citations.
WARNING
Never return an ungrounded answer. If retrieval finds nothing relevant, the pipeline must surface "I don't have information on that" — not let the model answer from parametric memory. An unsourced answer in a RAG system is a bug, not a fallback.
Step 5 — Leave a slot for evaluation
Stub an evaluation entry point next to retrieval — a small harness that takes question/expected-source pairs and reports retrieval hit-rate and answer faithfulness. Leave it empty but wired in, with a comment on what to measure. Don't fabricate eval data; let the user supply it.
Report
List every file you created and what each one does (ingestion, retrieval, shared config, eval stub). Then give the exact next steps to make it live:
- Which credentials/env vars to set (embedding + generation API keys, vector-store connection).
- The command to run ingestion against the real
$ARGUMENTSsource. - The single first query to verify retrieval returns grounded, cited results.
End with the one decision most worth revisiting after a first run — almost always the chunking strategy.
Frequently asked questions
- What chunking strategy should a RAG pipeline use?
- Chunk on natural boundaries (headings, paragraphs, code blocks) rather than a fixed character count, target roughly 400-800 tokens per chunk with 10-15% overlap, and attach source metadata (path, title, heading, line range) to every chunk. Chunk quality dominates retrieval quality — get this right before tuning anything else.
- How does the pipeline avoid hallucinated answers?
- Retrieval assembles a prompt that includes only the retrieved chunks and instructs the model to answer strictly from that context, cite each claim by source, and say it doesn't know when the context is insufficient — never falling back to parametric knowledge for grounded questions.
Related
- Add a Streaming LLM EndpointScaffold a token-streaming LLM endpoint — server-side streaming plus the client handler — so responses render incrementally instead of after a long wait.
- Agent Memory DesignerDesign a project's CLAUDE.md and memory hierarchy by exploring the repo to learn its real build/test/lint commands, architecture, and non-obvious gotchas, then writing a concise, skimmable memory that keeps only what belongs in context. Use when onboarding a repo to Claude Code with no CLAUDE.md, or when an existing one is bloated, stale, or being ignored.
- Prompt Regression TesterBuild a regression test harness for an LLM prompt so a prompt edit or model upgrade can't silently degrade quality — a fixed eval set, checkable assertions, and a diff against a committed baseline. Use when changing a production prompt, migrating model versions, or any time 'I tweaked the prompt' needs to be backed by evidence instead of eyeballing two outputs.