Chunking Strategy Optimizer
Find the chunking strategy and size that maximize retrieval quality for a specific corpus, by sweeping configurations against a fixed eval set instead of guessing. Use when RAG answers miss obvious content, when standing up a new corpus, or when picking chunk size/overlap.
Install to ~/.claude/skills/chunking-strategy-optimizer/SKILL.md
Chunking quietly sets the ceiling on RAG quality. This skill turns 'pick a chunk size' from a guess into a measured decision: build a small retrieval eval set, sweep chunk strategies and sizes, score recall@k, and recommend the smallest config that hits the target.
Chunking is the highest-leverage, most-overlooked knob in retrieval: if the right passage never lands in a single chunk, no reranker or bigger model recovers it. This skill replaces "512 tokens with 50 overlap, because that's what the tutorial said" with a measured choice — sweep candidate strategies over a fixed eval set and pick the one that actually retrieves the answers.
When to use this skill
- Standing up retrieval for a new corpus and you need a defensible chunking default.
- RAG answers miss content you can see exists in the source documents.
- Deciding chunk size, overlap, or strategy (token vs. sentence vs. recursive vs. semantic).
- Migrating embedding models and want to re-confirm chunking still holds up.
Instructions
- Build a retrieval eval set first. Collect 20–50 real questions and, for each, the passage(s) that contain the answer (the "gold" spans). Hand-label if needed — even 20 cases beat eyeballing. This set is the ground truth every configuration is scored against; freeze it.
- Define the candidate configurations. A small grid, not a search of everything: 2–3 strategies (e.g. recursive, sentence, semantic) × 2–3 sizes (e.g. 256 / 512 / 1024 tokens) × overlap (0 / 10–15%). Hold the embedding model and retriever fixed so chunking is the only variable.
- Run each configuration end to end. For each config: chunk the corpus (e.g. with Chonkie), embed the chunks with the fixed model, index them, and run the eval queries.
- Score retrieval, not generation. Report recall@k (does a gold passage appear in the top-k?) and a rank-aware metric like nDCG@k for k ∈ {5, 10, 20}. Generation quality is downstream noise here — measure whether the right chunk is retrieved at all.
- Pick the smallest config that clears the bar. Among the configurations that hit your recall target, prefer the cheapest one. Chunk size cuts both ways: smaller chunks give tighter prompts but produce more vectors to embed, store, and search, while larger chunks shrink the index but spend more context per retrieved hit. Report the full table so the trade-off is visible.
- Re-check after any upstream change. New embedding model, new document types, or a corpus that grew in a new direction all invalidate the result — re-run the sweep.
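The scoring step above can be sketched in a few lines. This is a minimal illustration, not the skill's own implementation: the `Span` type, the "a chunk counts as relevant only if it fully contains a gold span" rule, and the binary-relevance nDCG are all assumptions made for the example, and the grid values are simply the examples from the list above.

```python
from dataclasses import dataclass
from itertools import product
from math import log2


@dataclass(frozen=True)
class Span:
    """A character range in one document; used for chunks and gold passages (hypothetical type)."""
    doc_id: str
    start: int
    end: int  # exclusive

    def contains(self, other: "Span") -> bool:
        return (self.doc_id == other.doc_id
                and self.start <= other.start
                and other.end <= self.end)


def relevance(ranked, gold):
    """Binary relevance per rank. Assumption: a chunk is relevant only if it fully
    contains a gold span, and each gold span is credited once (overlap can duplicate it)."""
    seen, rels = set(), []
    for chunk in ranked:
        newly = {i for i, g in enumerate(gold) if chunk.contains(g)} - seen
        rels.append(1 if newly else 0)
        seen |= newly
    return rels


def recall_at_k(ranked, gold, k):
    """1.0 if any gold passage is covered by a chunk in the top-k, else 0.0."""
    return 1.0 if any(relevance(ranked[:k], gold)) else 0.0


def ndcg_at_k(ranked, gold, k):
    """nDCG with binary gains; the ideal ranking puts one hit per gold span at the top."""
    rels = relevance(ranked[:k], gold)
    dcg = sum(r / log2(rank + 2) for rank, r in enumerate(rels))
    idcg = sum(1 / log2(rank + 2) for rank in range(min(len(gold), k)))
    return dcg / idcg if idcg else 0.0


# Candidate grid: strategy x chunk size (tokens) x overlap fraction.
GRID = list(product(["recursive", "sentence", "semantic"], [256, 512, 1024], [0.0, 0.15]))

# One eval query: the retriever returned three chunks, the gold span sits in the second.
ranked = [Span("a", 0, 100), Span("a", 100, 200), Span("b", 0, 100)]
gold = [Span("a", 120, 180)]
print(recall_at_k(ranked, gold, 1), recall_at_k(ranked, gold, 5))  # 0.0 1.0
print(round(ndcg_at_k(ranked, gold, 5), 3))                        # 0.631
```

Averaging these per-query scores over the frozen eval set gives one row per configuration in `GRID`. The containment rule is deliberately strict: a gold passage split across two chunks scores zero, which is exactly the failure the sweep is meant to expose.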
WARNING
Never tune chunking without a frozen eval set and a baseline number. "The answers look better" is how silent recall regressions ship. If no eval set exists, building one is your first deliverable.
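One way to make "frozen eval set plus baseline number" enforceable is to fingerprint the eval set and refuse to compare scores when the fingerprint differs. The sketch below is an assumption about how this might be wired up; the function names and the run-record layout (`eval_fingerprint`, `metrics`) are hypothetical.

```python
import hashlib
import json


def eval_set_fingerprint(cases):
    """SHA-256 over a canonical JSON dump, so any edit to the eval set changes the hash."""
    canonical = json.dumps(cases, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def regressions(baseline, candidate, tolerance=0.0):
    """Return the metrics where the candidate falls below the baseline by more than tolerance.

    Raises if the two runs were scored on different eval sets, because those
    numbers are not comparable.
    """
    if baseline["eval_fingerprint"] != candidate["eval_fingerprint"]:
        raise ValueError("eval set changed between runs; scores are not comparable")
    return [
        name
        for name, base_score in baseline["metrics"].items()
        if candidate["metrics"].get(name, 0.0) < base_score - tolerance
    ]


# Illustrative eval cases and placeholder scores only, not measurements.
cases = [{"question": "What is the refund window?", "gold": [["policy.md", 120, 180]]}]
fp = eval_set_fingerprint(cases)
baseline = {"eval_fingerprint": fp, "metrics": {"recall@10": 0.90, "ndcg@10": 0.70}}
candidate = {"eval_fingerprint": fp, "metrics": {"recall@10": 0.86, "ndcg@10": 0.72}}
print(regressions(baseline, candidate, tolerance=0.01))  # ['recall@10']
```

Storing the fingerprint alongside every results table keeps later sweeps honest: a changed hash means the baseline has to be re-measured before any comparison.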
TIP
Semantic chunking often wins on heterogeneous prose but costs embeddings at ingestion time; fixed-size recursive chunking is cheaper and frequently close. Let the numbers, not the brochure, decide.
Output
A ranked table of configurations with recall@k and nDCG@k, the recommended configuration with its rationale, and the eval set itself (so the decision is reproducible and re-runnable).
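As a rough sketch of how that output could be produced from sweep results: rank by recall, then pick the cheapest configuration that clears the target. The `Result` record, the use of indexed chunk count as the cost proxy, and every number below are assumptions for illustration; the numbers are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One row of the sweep (hypothetical record layout)."""
    strategy: str
    size: int          # chunk size in tokens
    overlap: float     # overlap as a fraction of chunk size
    num_chunks: int    # cost proxy: vectors in the index (an assumption)
    recall_at_10: float
    ndcg_at_10: float


def rank(results):
    """Best retrieval quality first; ties broken by the smaller index."""
    return sorted(results, key=lambda r: (-r.recall_at_10, -r.ndcg_at_10, r.num_chunks))


def recommend(results, target_recall):
    """Cheapest configuration whose recall@10 meets the target, or None if none does."""
    passing = [r for r in results if r.recall_at_10 >= target_recall]
    if not passing:
        return None
    return min(passing, key=lambda r: (r.num_chunks, -r.recall_at_10))


def format_table(results):
    header = (f"{'strategy':<10} {'size':>5} {'overlap':>7} {'chunks':>7} "
              f"{'recall@10':>9} {'nDCG@10':>8}")
    rows = [
        f"{r.strategy:<10} {r.size:>5} {r.overlap:>7.2f} {r.num_chunks:>7} "
        f"{r.recall_at_10:>9.2f} {r.ndcg_at_10:>8.2f}"
        for r in rank(results)
    ]
    return "\n".join([header, *rows])


# Placeholder values to exercise the functions; replace with your measured sweep.
results = [
    Result("recursive", 256, 0.15, 4200, 0.92, 0.71),
    Result("recursive", 512, 0.15, 2100, 0.90, 0.68),
    Result("semantic", 512, 0.00, 2300, 0.94, 0.74),
    Result("sentence", 1024, 0.00, 1000, 0.82, 0.60),
]
print(format_table(results))
print(recommend(results, target_recall=0.90))
```

With these placeholder rows the top-ranked configuration and the recommended one differ, which is the point of reporting the whole table: the best scorer is not necessarily the one worth paying for.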
Related
- How RAG Actually Works: Ingestion, Chunking, Retrieval & Reranking
  A clear, practical walkthrough of the retrieval-augmented generation pipeline: what each stage does, where it fails, and how the pieces fit together.
- Chonkie
  A lightweight, fast chunking library for RAG with many splitting strategies in one API.
- Rag Pipeline Engineer
  Use this agent to design, build, and harden a production retrieval-augmented generation (RAG) pipeline end to end (ingestion, chunking, embeddings, indexing, retrieval, reranking, and grounded generation), with evals that prove each stage works. Examples: "stand up RAG over our docs", "our RAG hallucinates and misses obvious answers, fix the pipeline", "take our prototype RAG to production with evals and citations".
- Embedding Set Inspector
  Diagnose the health of an embedding set before blaming the retriever by checking normalization, dimensionality, near-duplicates, degenerate vectors, and corpus/query distribution mismatch. Use when retrieval quality is poor, after a re-embed, or before shipping a new index.