LLM-as-Judge
LLM-as-judge uses a language model to score AI outputs against a rubric — evaluating quality at scale where exact-match metrics fail and humans don't scale.
LLM-as-judge is the evaluation technique of using a language model, given a rubric, to score another model's outputs — the workhorse for measuring quality that's too subjective for string matching and too voluminous for human review.
It exists because most questions about LLM quality have no exact-match answer. "Is this summary faithful to the source?" "Did the agent's answer actually resolve the ticket?" A judge prompt encodes the rubric, the judge model applies it across thousands of cases, and you get numbers you can track, compare, and gate releases on. That makes it the backbone of modern eval suites and the feature every eval platform builds around (see the evaluation tools comparison under Related).
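As a rough illustration of that loop (rubric prompt, judge call, aggregate score, release gate), here is a minimal sketch. The prompt wording, the 1-5 scale, the `SCORE:` reply format, and the `tolerance` default are all assumptions made for the example, and `fake_judge` is a hypothetical stand-in for whatever model client you actually use.

```python
import re
from statistics import mean
from typing import Callable

# Hypothetical rubric prompt: the wording and the 1-5 scale are illustrative only.
JUDGE_PROMPT = """You are grading a summary for faithfulness to its source.
Score 1-5, where 1 = contradicts the source and 5 = every claim is supported.
Source:
{source}
Summary:
{summary}
Reply with a line of the form "SCORE: <n>"."""


def parse_score(reply: str) -> int:
    """Extract the integer score from the judge's reply; fail loudly if absent."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))


def run_eval(cases: list[dict], call_judge: Callable[[str], str]) -> float:
    """Apply the rubric to every case and return the mean score."""
    scores = [
        parse_score(call_judge(JUDGE_PROMPT.format(**case)))
        for case in cases
    ]
    return mean(scores)


def gate(mean_score: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Pass unless the score dropped more than `tolerance` below the baseline."""
    return mean_score >= baseline - tolerance


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call (assumption): always answers 4.
    return "Claims are mostly supported.\nSCORE: 4"
```

In a CI job, `gate(run_eval(cases, call_judge), baseline)` would decide whether the build passes. Raising on an unparseable reply, rather than silently skipping the case, keeps a broken judge from inflating the average.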
The craft is making the judge trustworthy. Known biases — position, verbosity, self-preference — have known mitigations (randomized ordering, pairwise comparison, anchored rubrics), and the non-negotiable step is calibration: validate the judge against human labels on a sample before believing it at scale. An uncalibrated judge is a random number generator with confidence. Designing one well is exactly what the llm-as-judge-scorer skill walks through.
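One way to make the calibration step concrete is to measure chance-corrected agreement between the judge and human labels on the same sample. The sketch below computes Cohen's kappa by hand; the `pass`/`fail` labels are illustrative, and what kappa counts as "good enough" is a judgment call this example does not settle.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of cases where judge and human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5 for this sample
```

Raw agreement here is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance given how often each rater says `pass`. That gap is why plain accuracy can flatter a judge on imbalanced data.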
Frequently asked questions
- Why use a model to judge a model?
  Because most LLM output quality is subjective but describable: helpfulness, faithfulness to sources, tone. Exact-match metrics can't score an open-ended answer, and humans don't scale to thousands of cases per release. A judge model with a precise rubric gets you scalable evaluation that correlates with human judgment, provided it is built carefully.
- What are the known biases of LLM judges?
  Position bias (favoring the first answer in pairwise comparisons), verbosity bias (longer reads as better), self-preference (favoring outputs from its own model family), and score clustering (scores bunching in a narrow band of the scale). Mitigations are standard: randomize order, compare pairwise instead of absolute-scoring, use rubrics with anchored examples, and calibrate the judge against a slice of human labels before trusting it.
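The order-randomization mitigation can be sketched as follows. The judge is modeled as a function that sees two answers and returns `"first"` or `"second"`; that interface, and the two toy judges, are assumptions for illustration rather than any particular library's API.

```python
import random
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def compare_randomized(a: str, b: str, judge: Judge, rng: random.Random) -> str:
    """Show the pair in random order and map the verdict back to "a" or "b"."""
    if rng.random() < 0.5:
        return "a" if judge(a, b) == "first" else "b"
    return "b" if judge(b, a) == "first" else "a"


def win_rate_a(pairs: list[tuple[str, str]], judge: Judge, seed: int = 0) -> float:
    """Fraction of pairs where answer "a" wins under randomized presentation."""
    rng = random.Random(seed)
    wins = sum(compare_randomized(a, b, judge, rng) == "a" for a, b in pairs)
    return wins / len(pairs)


def always_first(first: str, second: str) -> str:
    # Toy judge with pure position bias: ignores content entirely.
    return "first"


def prefers_longer(first: str, second: str) -> str:
    # Toy judge with a content-based preference (here, length).
    return "first" if len(first) >= len(second) else "second"
```

With a fixed order, `always_first` would hand answer "a" every win. Under randomization its win rate drifts toward one half, so position bias shows up as noise instead of a systematic advantage, while a content-based preference such as `prefers_longer` is unaffected. Randomization does not remove that second kind of bias: verbosity still needs a rubric-level fix.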
Related
- Write Evals for an LLM App: From Zero to a CI Gate
  How to evaluate an LLM feature: build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
- LLM As Judge Scorer
  Design a reliable LLM-as-judge metric (a calibrated rubric, a clear scoring scale, and bias controls) and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.
- Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo
  A decision guide to the LLM eval landscape: code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
- Hallucination
  A hallucination is fluent, confident output that is factually wrong or fabricated: plausible text unsupported by any source, the signature LLM failure mode.
- LLM Evaluation Engineer
  Use this agent to make an LLM feature's quality measurable: building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples: "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".
- DeepEval vs RAGAS: LLM Evaluation Frameworks Compared (2026)
  DeepEval vs RAGAS: pytest-style general LLM testing vs RAG-specialized metrics. Which open-source eval framework fits your pipeline, or whether you need both.
- Langfuse vs LangSmith: LLM Observability Compared (2026)
  Langfuse vs LangSmith: open-source self-hostable observability vs LangChain's first-party platform. Tracing, evals, prompt management, and which to adopt.
- Eval Dataset
  An eval dataset is the curated set of test cases (inputs with expected outcomes) that an LLM application's quality is measured against.