agentscamp

LLM Evals — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for LLM evals: 14 curated resources for building with AI coding agents.

Agent

LLM Evaluation Engineer

Use this agent to make an LLM feature's quality measurable — building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples — "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".

sonnet · 6
Agent

LLM Observability Engineer

Use this agent to make a production LLM app observable — tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency — so you can debug agent runs and catch regressions in production. Examples — "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".

sonnet · 6
Skill

LLM As Judge Scorer

Design a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.

invocable · v1.0.0
Skill

LLM Eval Suite Scaffolder

Stand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.

invocable · v1.0.0
Guide

Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo

A decision guide to the LLM eval landscape — code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.

3m read · AgentsCamp
Guide

Write Evals for an LLM App: From Zero to a CI Gate

How to evaluate an LLM feature — build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.

3m read · AgentsCamp
Tool

Arize Phoenix

An open-source LLM observability and evaluation tool built on OpenTelemetry, runnable anywhere.

open source · observability
Tool

Braintrust

An end-to-end platform for evaluating, iterating on, and observing LLM apps, with a prompt playground.

freemium · evaluation
Tool

DeepEval

An open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.

open source · evaluation
Tool

Langfuse

An open-source LLM engineering platform for tracing, evals, prompt management, and metrics.

open source · observability
Tool

LangSmith

LangChain's framework-agnostic platform for tracing, evaluating, and monitoring LLM apps.

freemium · observability
Tool

promptfoo

An open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.

open source · evaluation
Tool

RAGAS

An open-source framework for evaluating retrieval-augmented generation with reference-free RAG metrics.

open source · evaluation
Command

Run Evals

Run the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.

/run-evals <eval suite path / config, or the feature to evaluate>
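
Several entries above describe scoring an eval run against thresholds before a merge. As a rough illustration of that idea only, here is a minimal Python sketch. It does not use DeepEval, promptfoo, or RAGAS, and the names `check_thresholds` and `GateResult` and the metric names in the usage note are hypothetical, chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    """Outcome of comparing eval scores with minimum thresholds (hypothetical type)."""
    passed: bool
    failures: list = field(default_factory=list)


def check_thresholds(scores: dict, thresholds: dict) -> GateResult:
    """Return a failing result if any metric is missing or below its minimum.

    `scores` maps metric name -> score produced by whatever eval tool is in use;
    `thresholds` maps metric name -> minimum acceptable score.
    """
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None:
            # A metric with a threshold but no score is treated as a failure.
            failures.append(f"{metric}: missing score")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return GateResult(passed=not failures, failures=failures)
```

In a CI job, a wrapper script could call `check_thresholds({"faithfulness": 0.91, "answer_relevancy": 0.78}, {"faithfulness": 0.85, "answer_relevancy": 0.80})`, print each entry in `failures`, and exit non-zero when `passed` is false so the merge is blocked.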