DeepEval
An open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.
DeepEval is an open-source Python framework that brings unit-testing ergonomics to LLM evaluation. It ships research-backed metrics (G-Eval, faithfulness, answer relevancy, hallucination, RAG and agent metrics) you assert on like pytest, so eval becomes a CI gate instead of a vibe check.
DeepEval is an open-source evaluation framework that makes testing an LLM application feel like writing unit tests. If you know pytest, you already know the shape: you write test cases with inputs and expected behavior, attach metrics, and assert that the scores clear a threshold — except the "assertions" are research-backed LLM metrics rather than exact-match checks.
It is aimed at engineers who want evaluation to be a repeatable, automatable gate rather than a one-off spreadsheet. DeepEval runs locally, integrates with CI, and pairs with the Confident AI platform if you want hosted dashboards and dataset management.
Highlights
- Pytest-style API: define test cases, attach metrics, and call `assert_test`; run the whole suite from the CLI or CI.
- Ready-made metrics: G-Eval (LLM-as-judge with custom rubrics), faithfulness, answer relevancy, hallucination, plus RAG metrics (contextual precision/recall) and agent/tool-use metrics.
- Custom metrics: define your own LLM-as-judge or deterministic metrics when the built-ins don't fit.
- Synthetic data & datasets: generate test cases and manage evaluation datasets.
- CI-native: fail a build when a metric regresses, so prompt or model changes are scored, not guessed.
In an AI-assisted workflow
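```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_is_relevant():
    query = "How do I rotate API keys?"
    # `app` is a placeholder for your own application's entry point, not part of DeepEval.
    case = LLMTestCase(input=query, actual_output=app(query))
    # The test fails if the relevancy score falls below the 0.7 threshold.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```

TIP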
Start with 15–30 representative cases (include the adversarial ones that broke before), pick the two or three metrics your feature is actually graded on, and wire `deepeval test run` into CI before tuning prompts.
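The idea behind gating on a regression can be sketched in plain Python. This is not DeepEval's API (DeepEval gates through each metric's `threshold`); it is an illustration of comparing a run's scores against a stored baseline. The names `GateResult`, `check_regressions`, `failing_metrics`, the `tolerance` parameter, and all scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    metric: str
    baseline: float
    current: float
    passed: bool


def check_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[GateResult]:
    """Compare current scores to baseline scores, allowing a small tolerance."""
    results = []
    for metric, base_score in baseline.items():
        # A metric missing from the current run is scored 0.0, so it fails the gate.
        score = current.get(metric, 0.0)
        passed = score >= base_score - tolerance
        results.append(GateResult(metric, base_score, score, passed))
    return results


def failing_metrics(results: list[GateResult]) -> list[str]:
    """Names of metrics that regressed beyond the tolerance."""
    return [r.metric for r in results if not r.passed]
```

A CI script could call `check_regressions` with the baseline recorded before a prompt change and exit non-zero whenever `failing_metrics` returns a non-empty list.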
Good to know
DeepEval is free and open source (Apache-2.0); you bring an LLM provider for the judge metrics, so those incur token cost. The optional Confident AI cloud adds hosted reporting and collaboration. For a RAG-specific metric set, compare with RAGAS; for the full landscape see Best LLM & RAG Evaluation Tools in 2026.
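Since judge metrics spend tokens, a rough budget is easy to work out ahead of time. The sketch below is back-of-the-envelope arithmetic only; the function name, the tokens-per-call figure, the price, and the assumption of one judge call per metric are all hypothetical. Real metrics may make several judge calls per test case, and provider prices vary.

```python
def estimate_judge_cost(
    num_cases: int,
    num_metrics: int,
    tokens_per_judge_call: int,
    usd_per_million_tokens: float,
    judge_calls_per_metric: int = 1,
) -> float:
    """Rough USD cost of one eval run: calls x tokens per call x price per token."""
    total_calls = num_cases * num_metrics * judge_calls_per_metric
    total_tokens = total_calls * tokens_per_judge_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical numbers: 30 cases, 3 judge metrics, 2,000 tokens per judge call,
# and a price of 1 USD per million tokens.
cost = estimate_judge_cost(
    num_cases=30,
    num_metrics=3,
    tokens_per_judge_call=2_000,
    usd_per_million_tokens=1.0,
)
print(f"Estimated cost per run: ${cost:.2f}")
```

With these made-up inputs the run uses 180,000 judge tokens. Multiplying by the number of CI runs per day gives a daily figure to compare against your budget.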
Related
- RAGAS
  An open-source framework for evaluating retrieval-augmented generation with reference-free RAG metrics.
- Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo
  A decision guide to the LLM eval landscape: code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
- Write Evals for an LLM App: From Zero to a CI Gate
  How to evaluate an LLM feature: build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
- LLM Eval Suite Scaffolder
  Stand up an evaluation suite for an LLM feature from scratch (a representative dataset, the right metrics, a baseline score, and a CI gate) using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.
- Run Evals
  Run the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.
- LLM Evaluation Engineer
  Use this agent to make an LLM feature's quality measurable: building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples: "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".
- LLM As Judge Scorer
  Design a reliable LLM-as-judge metric (a calibrated rubric, a clear scoring scale, and bias controls) and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score.
- Braintrust
  An end-to-end platform for evaluating, iterating on, and observing LLM apps, with a prompt playground.
- promptfoo
  An open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.