# LLM-as-Judge

> LLM-as-judge uses a language model to score AI outputs against a rubric — evaluating quality at scale where exact-match metrics fail and humans don't scale.

**LLM-as-judge is the evaluation technique of using a language model, given a rubric, to score another model's outputs — the workhorse for measuring quality that's too subjective for string matching and too voluminous for human review.**

It exists because most questions about LLM quality have no exact-match answer: "Is this summary faithful to the source?" "Did the agent's answer actually resolve the ticket?" A judge prompt encodes the rubric, the judge model applies it across thousands of cases, and you get numbers you can track, compare, and gate releases on. That makes it the backbone of modern [eval suites](/guides/evaluation/write-llm-evals) and the feature every eval platform ([compared here](/guides/evaluation/best-llm-eval-tools-2026)) builds around.
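
The loop described above (rubric in a prompt, judge applied per case, aggregate number used as a gate) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the rubric wording, the `SCORE:` reply format, the `run_judge` helper, and the 4.0 threshold are all assumptions made up for the example, and `call_judge` stands in for whatever model client you actually use.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Hypothetical rubric and reply format, invented for this sketch.
JUDGE_PROMPT = """You are grading a summary for faithfulness to its source.
Rubric:
1 = contradicts the source or invents facts
3 = mostly faithful, with minor unsupported details
5 = every claim is supported by the source

Source:
{source}

Summary:
{summary}

Reply with a single line: SCORE: <1-5>"""


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score from a judge reply, or None if it is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def run_judge(
    cases: list[dict],
    call_judge: Callable[[str], str],  # assumed: takes a prompt, returns reply text
    threshold: float = 4.0,            # assumed release gate, pick your own
) -> dict:
    """Score every case with the judge and summarize the run."""
    scores = []
    unparsed = 0
    for case in cases:
        prompt = JUDGE_PROMPT.format(source=case["source"], summary=case["summary"])
        score = parse_score(call_judge(prompt))
        if score is None:
            unparsed += 1  # count malformed replies rather than guessing a score
        else:
            scores.append(score)
    avg = mean(scores) if scores else 0.0
    return {
        "mean_score": avg,
        "unparsed": unparsed,
        "passed": bool(scores) and avg >= threshold,
    }
```

Counting unparsed replies separately, instead of silently dropping them, keeps a flaky judge from looking healthier than it is.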

The craft is making the judge *trustworthy*. Known biases — position, verbosity, self-preference — have known mitigations (randomized ordering, pairwise comparison, anchored rubrics), and the non-negotiable step is **calibration**: validate the judge against human labels on a sample before believing it at scale. An uncalibrated judge is a random number generator with confidence. Designing one well is exactly what the [llm-as-judge-scorer](/skills/data/llm-as-judge-scorer) skill walks through.
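
Two of these ideas are easy to sketch in code: calibration against human labels, and randomized ordering for pairwise comparisons. The example below is illustrative only. The function names, the `"first"`/`"second"` verdict convention, and the sample labels are assumptions for this sketch. Cohen's kappa is used here as one common chance-corrected agreement measure; the text above does not prescribe a specific statistic.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of cases where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (observed - expected) / (1.0 - expected)


def pairwise_randomized(
    a: str,
    b: str,
    compare: Callable[[str, str], str],  # assumed to return "first" or "second"
    rng: random.Random,
) -> str:
    """Show the two answers in random order, then map the verdict back to 'a' or 'b'."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    first_won = compare(first, second) == "first"
    if swapped:
        return "b" if first_won else "a"
    return "a" if first_won else "b"


# Made-up labels for eight cases, purely for illustration.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
print(agreement_rate(human, judge))  # 0.75
print(cohens_kappa(human, judge))    # 0.5
```

Raw agreement alone can flatter a judge when one label dominates, which is why a chance-corrected number is worth reporting next to it. What counts as "good enough" agreement depends on the task and on how much human annotators disagree with each other.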

---

_Source: https://agentscamp.com/glossary/llm-as-judge — Term on AgentsCamp._
