LLM Evals — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for LLM evals: 29 curated resources for building with AI coding agents.

Agent

LLM Evaluation Engineer

Use this agent to make an LLM feature's quality measurable — building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples — "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".

sonnet6

Agent

LLM Observability Engineer

Use this agent to make a production LLM app observable — tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency — so you can debug agent runs and catch regressions in production. Examples — "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".

sonnet6

Agent

Eval Driven Developer

Use this agent to drive AI feature development with evals the way TDD drives code with tests — define success criteria and a representative eval set BEFORE iterating on prompts/models, then optimize against measured scores instead of vibes. Examples — "make the summarizer better" (turn it into measurable criteria first), "our prompt change keeps regressing quality, set up a loop that catches it", "add an eval gate to CI so a model swap can't silently degrade output", "we tweak prompts and pray — give us a baseline and a change-by-change scoreboard".

opus5

Skill

Agent Trajectory Evaluator

Evaluate a multi-step AI agent's whole run — tool calls, intermediate steps, and final result — not just final-answer correctness, so you can pinpoint WHERE it went wrong. Use when building or debugging a tool-using or multi-step agent, when final-answer-only evals can't explain failures, or when a prompt/model change quietly makes the agent less efficient or more error-prone even though the answer still looks right.

invocable · v1.0.0

Skill

Hallucination Evaluator

Detect and measure ungroundedness in LLM and RAG outputs — claims the source doesn't support — by decomposing answers into atomic claims and checking each for entailment, so you can quantify faithfulness and gate on it instead of eyeballing it. Use when a RAG/LLM feature makes confident wrong claims, before shipping anything that must be factual, or to add a groundedness gate to evals/CI.

invocable · v1.0.0
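As a rough illustration of the claim-level scoring this skill describes, here is a minimal sketch. It is not the skill's implementation: `split_claims` and `is_supported` are hypothetical toy stand-ins (sentence splitting and word overlap) for what would normally be an LLM- or NLI-based claim extractor and entailment check, and the 0.8 threshold is an arbitrary assumption.

```python
import re


def split_claims(answer: str) -> list[str]:
    """Toy stand-in for claim decomposition: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def is_supported(claim: str, source: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for entailment: share of the claim's words found in the source.

    A real check would call an NLI model or an LLM judge instead.
    """
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    if not claim_words:
        return True
    return len(claim_words & source_words) / len(claim_words) >= threshold


def faithfulness(answer: str, source: str) -> float:
    """Fraction of claims in the answer that the source supports."""
    claims = split_claims(answer)
    if not claims:
        return 1.0
    supported = sum(is_supported(c, source) for c in claims)
    return supported / len(claims)


source = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was designed by Leonardo da Vinci."
score = faithfulness(answer, source)
print(f"faithfulness = {score:.2f}")
```

A gate would then compare `score` to a minimum agreed for the feature and fail the run when it falls below that value.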

Skill

LLM As Judge Scorer

Design a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.

invocable · v1.0.0
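One way to validate a judge against human labels is to measure chance-corrected agreement on a small labelled sample. The sketch below computes Cohen's kappa by hand; the pass/fail labels are made-up illustration data, and the choice of kappa as the agreement statistic is an assumption rather than something the skill prescribes.

```python
from collections import Counter


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the raters labelled independently at their own rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(a) | set(b)
    )
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = pass, 0 = fail, for the same eight outputs.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

Raw agreement here is 6 of 8, but kappa is noticeably lower because both raters pass most items, which is why a chance-corrected statistic is a safer basis for trusting a judge.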

Skill

LLM Eval Suite Scaffolder

Stand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.

invocable · v1.0.0

Skill

Prompt Regression Tester

Build a regression test harness for an LLM prompt so a prompt edit or model upgrade can't silently degrade quality — a fixed eval set, checkable assertions, and a diff against a committed baseline. Use when changing a production prompt, migrating model versions, or any time 'I tweaked the prompt' needs to be backed by evidence instead of eyeballing two outputs.

invocable · v1.0.0
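The "diff against a committed baseline" idea can be sketched in a few lines. This is an assumption-laden illustration, not the skill's harness: the case ids and scores are invented, the baseline is an in-memory dict where a real setup would load a JSON file committed to the repository, and the 0.02 tolerance is arbitrary.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return cases whose score dropped by more than `tolerance` versus the baseline."""
    regressions = {}
    for case_id, old_score in baseline.items():
        # A case missing from the current run is scored 0.0, so it shows up as a regression.
        new_score = current.get(case_id, 0.0)
        if old_score - new_score > tolerance:
            regressions[case_id] = (old_score, new_score)
    return regressions


# Hypothetical per-case scores from the fixed eval set.
baseline = {"invoice-01": 1.0, "invoice-02": 0.9, "invoice-03": 0.75}
current = {"invoice-01": 1.0, "invoice-02": 0.7, "invoice-03": 0.76}

regressions = find_regressions(baseline, current)
for case_id, (old, new) in regressions.items():
    print(f"{case_id}: {old:.2f} -> {new:.2f}")
```

In CI, a non-empty result would fail the job, and the baseline would be updated only through a reviewed commit.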

Guide

DeepEval vs RAGAS: LLM Evaluation Frameworks Compared (2026)

DeepEval vs RAGAS — pytest-style general LLM testing vs RAG-specialized metrics. Which open-source eval framework fits your pipeline, or whether you need both.

2m read · AgentsCamp

Guide

Langfuse vs LangSmith: LLM Observability Compared (2026)

Langfuse vs LangSmith — open-source self-hostable observability vs LangChain's first-party platform. Tracing, evals, prompt management, and which to adopt.

2m read · AgentsCamp

Guide

Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo

A decision guide to the LLM eval landscape — code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.

3m read · AgentsCamp

Guide

LLM Evaluation Metrics Explained: Which One to Use and When

A practical map of LLM and RAG evaluation metrics — why BLEU/ROUGE fail open-ended text, how LLM-as-judge and RAG metrics work, and which to pick per task.

6m read · AgentsCamp

Guide

Write Evals for an LLM App: From Zero to a CI Gate

How to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.

3m read · AgentsCamp

Guide

Testing LLM Applications: How to Test Non-Deterministic Software

How to test software that calls LLMs when outputs are non-deterministic — the testing pyramid, assertion strategies, golden datasets, and CI gating.

6m read · AgentsCamp

Tool

Arize Phoenix

An open-source LLM observability and evaluation tool built on OpenTelemetry, runnable anywhere.

open source · observability

Tool

Braintrust

An end-to-end platform for evaluating, iterating on, and observing LLM apps, with a prompt playground.

freemium · evaluation

Tool

DeepEval

An open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.

open source · evaluation

Tool

Langfuse

An open-source LLM engineering platform for tracing, evals, prompt management, and metrics.

open source · observability

Tool

LangSmith

LangChain's platform for tracing, evaluating, and monitoring LLM apps — framework-agnostic.

freemium · observability

Tool

promptfoo

An open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.

open source · evaluation

Tool

RAGAS

An open-source framework for evaluating retrieval-augmented generation with reference-free RAG metrics.

open source · evaluation

Tool

SWE-agent

Open-source autonomous coding agent from Princeton/Stanford that turns an LLM into a software engineer to fix real GitHub issues.

open source · agent

Command

Run Evals

Run the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.

/run-evals <eval suite path / config, or the feature to evaluate>

Term

Eval Dataset

Hallucination

LLM-as-Judge

Needle in a Haystack

Perplexity

Tracing (LLM)