LLM Evals — AI Agents, Skills & Tools
Agents, skills, guides, tools, and commands for LLM evals: 14 curated resources for building with AI coding agents.
LLM Evaluation Engineer
Use this agent to make an LLM feature's quality measurable — building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples — "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".
LLM Observability Engineer
Use this agent to make a production LLM app observable — tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency — so you can debug agent runs and catch regressions in production. Examples — "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".
LLM-as-Judge Scorer
Design a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.
LLM Eval Suite Scaffolder
Stand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.
Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo
A decision guide to the LLM eval landscape — code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
Write Evals for an LLM App: From Zero to a CI Gate
How to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
Arize Phoenix
An open-source LLM observability and evaluation tool built on OpenTelemetry, runnable anywhere.
Braintrust
An end-to-end platform for evaluating, iterating on, and observing LLM apps, with a prompt playground.
DeepEval
An open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.
Langfuse
An open-source LLM engineering platform for tracing, evals, prompt management, and metrics.
LangSmith
LangChain's platform for tracing, evaluating, and monitoring LLM apps; it is framework-agnostic and can be used without LangChain.
promptfoo
An open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.
RAGAS
An open-source framework for evaluating retrieval-augmented generation with reference-free RAG metrics.
Run Evals
Run the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.
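
To make "report scores against thresholds" concrete, here is a minimal sketch of a threshold gate in plain Python. It does not use the DeepEval, promptfoo, or RAGAS APIs. The metric names, the threshold values, and the example scores are all hypothetical, and a real suite would read its scores from whichever tool it runs.

```python
# Hypothetical minimum scores per metric; names and values are illustrative only.
THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.75}


def check_thresholds(scores, thresholds):
    """Return (metric, score, minimum) for every metric below its minimum.

    A metric missing from `scores` counts as a failure with score None.
    """
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append((metric, score, minimum))
    return failures


def main(scores):
    """Print one line per failing metric and return a process exit code."""
    failures = check_thresholds(scores, THRESHOLDS)
    for metric, score, minimum in failures:
        print(f"FAIL {metric}: {score} < {minimum}")
    return 1 if failures else 0


# Hypothetical scores, standing in for the output of an eval run.
example_scores = {"faithfulness": 0.86, "answer_relevancy": 0.71}
exit_code = main(example_scores)
```

In a CI job, a script along these lines would pass the return value of `main` to `sys.exit`, so that a non-zero code blocks the merge.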