Benchmark Rerankers
Measure whether adding a reranker actually improves retrieval, by scoring reranked vs. un-reranked results on a labeled query set.
/benchmark-rerankers <path to eval set / retrieval results, or a description of the pipeline>
Install to ~/.claude/commands/benchmark-rerankers.md
Reranking usually helps RAG — but it adds latency and cost, so prove the lift before you ship it. This command scores first-stage retrieval against reranked retrieval on a labeled query set (recall@k, nDCG@k) and reports whether the reranker earns its place.
Scope
Treat $ARGUMENTS as the retrieval setup to benchmark — a path to an eval set and retrieval results, or a description of the pipeline (retriever, candidate count, reranker). Restate what you're comparing in one sentence before running.
Goal: quantify the lift a reranker adds over first-stage retrieval, so the decision to ship it (and pay its latency/cost) is measured, not assumed.
NOTE
A reranker reorders candidates the retriever already found — it cannot recover an answer that first-stage retrieval missed. So always over-retrieve (top-25–50) before reranking, and measure recall at the first stage too.
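To make the ceiling concrete, here is a minimal sketch that measures how often a gold passage appears anywhere in the candidate pool at a given depth. The function name `pool_recall` and the toy passage IDs are illustrative assumptions, not part of the command.

```python
def pool_recall(candidates_by_query, gold_by_query, depth):
    """Fraction of queries with at least one gold passage in the first `depth` candidates."""
    if not gold_by_query:
        return 0.0
    hits = 0
    for qid, gold in gold_by_query.items():
        pool = candidates_by_query.get(qid, [])[:depth]
        if any(pid in gold for pid in pool):
            hits += 1
    return hits / len(gold_by_query)


# Toy data (made up for illustration): passage IDs in raw retriever order.
candidates = {
    "q1": ["p9", "p4", "p1", "p7"],
    "q2": ["p3", "p8", "p2", "p6"],
}
gold = {"q1": {"p1"}, "q2": {"p5"}}  # q2's gold passage was never retrieved

for depth in (2, 4):
    print(f"pool recall at depth {depth}: {pool_recall(candidates, gold, depth):.2f}")
```

Whatever this number is at the chosen candidate depth, no reranker can push recall@k above it, so a low value points at the retriever rather than the reranker.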
Step 1 — Establish the eval set
Use a labeled set of queries with known-relevant passages (gold spans). If none exists, say so and help build a small one (20–50 queries) before benchmarking — a benchmark without ground truth is theater.
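One possible shape for such an eval set is a JSONL file with one labeled query per line. The field names below (`query_id`, `query`, `gold_ids`) and the sample records are assumptions chosen for illustration; any schema works as long as every query carries at least one gold passage.

```python
import json


def load_eval_set(lines):
    """Parse JSONL records and reject entries that lack ground truth."""
    records = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if not rec.get("gold_ids"):
            raise ValueError(f"line {n}: query has no gold passages")
        if rec["query_id"] in seen:
            raise ValueError(f"line {n}: duplicate query_id {rec['query_id']}")
        seen.add(rec["query_id"])
        records.append(rec)
    return records


# Hypothetical sample records; in practice these come from a file on disk.
sample = [
    '{"query_id": "q1", "query": "how do I rotate an API key?", "gold_ids": ["doc12#3"]}',
    '{"query_id": "q2", "query": "max upload size", "gold_ids": ["doc40#1", "doc40#2"]}',
]
eval_set = load_eval_set(sample)
print(len(eval_set), "labeled queries")
```

Failing loudly on a query with no gold passages keeps unlabeled rows from silently dragging every metric toward zero.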
Step 2 — Produce two result sets
For each query, capture the top-k before reranking (raw retriever order) and the top-k after reranking (e.g. via Cohere Rerank or another cross-encoder), over the same candidate pool.
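A sketch of capturing both result sets from one candidate pool follows. The reranker is represented by a generic `score_fn(query, passage_id)` callable, which is an assumption standing in for whatever cross-encoder or hosted API is actually used; the toy scores are made up.

```python
from typing import Callable, List, Sequence, Tuple


def two_result_sets(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int,
) -> Tuple[List[str], List[str]]:
    """Return (top-k in retriever order, top-k after reranking) from the same pool."""
    before = list(candidates[:k])
    # sorted() is stable, so ties keep the retriever's original order.
    reranked = sorted(candidates, key=lambda pid: score_fn(query, pid), reverse=True)
    return before, reranked[:k]


# Toy stand-in for a reranker: a fixed relevance score per passage ID.
toy_scores = {"p9": 0.1, "p4": 0.3, "p1": 0.9, "p7": 0.2, "p2": 0.5}


def toy_score_fn(query: str, passage_id: str) -> float:
    return toy_scores[passage_id]


pool = ["p9", "p4", "p1", "p7", "p2"]  # over-retrieved pool, retriever order
before, after = two_result_sets("example query", pool, toy_score_fn, k=3)
print("before:", before)
print("after: ", after)
```

The key property is that the reranker scores the whole over-retrieved pool and only then truncates to k, while the baseline truncates the same pool without reordering.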
Step 3 — Score both
Compute, for k ∈ {3, 5, 10}:
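- recall@k — fraction of queries with at least one gold passage in the top-k (also called hit rate@k). If queries have several gold passages, also report the fraction of gold passages retrieved.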
- nDCG@k — rank-aware quality (rewards putting the right passage higher).
- MRR — mean reciprocal rank of the first gold passage.
Report a side-by-side table: metric | retriever-only | + reranker | delta.
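The three metrics and the side-by-side table can be sketched as below, assuming binary relevance (a passage is either gold or not) and recall@k as defined in this command, meaning a per-query hit averaged over queries. Function names and the toy rankings are illustrative assumptions.

```python
import math


def recall_at_k(ranked, gold, k):
    """1.0 if any gold passage is in the top-k, else 0.0 (per-query hit)."""
    return 1.0 if any(pid in gold for pid in ranked[:k]) else 0.0


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, pid in enumerate(ranked[:k]) if pid in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold passage, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in gold:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def score(runs, gold_by_query, ks=(3, 5, 10)):
    """Average each metric over every query in the eval set."""
    pairs = [(runs.get(q, []), g) for q, g in gold_by_query.items()]
    out = {}
    for k in ks:
        out[f"recall@{k}"] = mean(recall_at_k(r, g, k) for r, g in pairs)
        out[f"nDCG@{k}"] = mean(ndcg_at_k(r, g, k) for r, g in pairs)
    out["MRR"] = mean(reciprocal_rank(r, g) for r, g in pairs)
    return out


def print_table(base, reranked):
    print(f"{'metric':<10} | {'retriever-only':>14} | {'+ reranker':>10} | {'delta':>7}")
    for name in base:
        b, r = base[name], reranked[name]
        print(f"{name:<10} | {b:>14.3f} | {r:>10.3f} | {r - b:>+7.3f}")


# Toy rankings, made up for illustration.
gold = {"q1": {"p1"}, "q2": {"p5"}}
retriever_runs = {"q1": ["p9", "p4", "p7", "p1"], "q2": ["p5", "p3", "p8", "p2"]}
reranked_runs = {"q1": ["p1", "p9", "p4", "p7"], "q2": ["p3", "p5", "p8", "p2"]}

base_scores = score(retriever_runs, gold)
rerank_scores = score(reranked_runs, gold)
print_table(base_scores, rerank_scores)
```

Note that in the toy data the reranker helps q1 and slightly hurts q2, which is why the table reports every metric rather than a single headline number.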
Step 4 — Weigh the cost
State the added per-query latency and cost of the rerank call. Reranking only the top candidates keeps both modest, but make the trade-off explicit.
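Added latency can be measured by timing the rerank call in isolation over the eval queries. The sketch below uses wall-clock timing and a nearest-rank p95; `fake_rerank` is a placeholder assumption standing in for the real rerank call, and cost has to come from the provider's own pricing rather than from code like this.

```python
import math
import statistics
import time


def time_calls(fn, inputs, repeats=3):
    """Return per-call wall-clock latencies in milliseconds."""
    samples = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Median and nearest-rank 95th percentile of the latency samples."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50_ms": statistics.median(ordered), "p95_ms": ordered[p95_index]}


def fake_rerank(query):
    # Placeholder for the real rerank call; sleeps briefly to simulate work.
    time.sleep(0.002)


latencies = time_calls(fake_rerank, ["query a", "query b"], repeats=2)
print(summarize(latencies))
```

Reporting p95 alongside the median matters because a rerank step with an occasional slow call can look cheap on average and still hurt tail latency.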
Step 5 — Recommend
Give a clear verdict: ship the reranker, skip it, or change candidate depth / rerank model. Justify it from the numbers — e.g. "+0.14 nDCG@5 for +90ms is worth it" or "negligible lift, not worth the latency here."
WARNING
Don't tune the reranker against the same handful of queries you eyeball. Use the frozen eval set, and report all metrics, not just the one that improved.
Related
- Hybrid Search & Reranking: From Top-50 Recall to Top-5 Precision
  How production RAG combines dense and sparse search, fuses with RRF, and reranks, turning a wide candidate set into the few passages that actually answer.
- Cohere Rerank
  A hosted reranking API that reorders retrieved passages by true relevance to a query.
- Retrieval Engineer
  Use this agent to raise the retrieval quality of a search or RAG system: recall and precision, hybrid (dense + sparse) search, reranking, query transformation, and metadata filtering, all measured against a labeled eval set. Examples: "our RAG retrieves irrelevant chunks, fix recall", "add hybrid search and reranking and prove it helps", "queries with acronyms/IDs return nothing, fix it".