DeepEval

An open-source evaluation framework for LLM apps — 'Pytest for LLMs' with ready-made metrics and CI integration.

open source, evaluation
Updated Jun 3, 2026
evals, testing, rag, open-source, python

DeepEval is an open-source Python framework that brings unit-testing ergonomics to LLM evaluation. It ships research-backed metrics (G-Eval, faithfulness, answer relevancy, hallucination, RAG and agent metrics) you assert on like pytest, so eval becomes a CI gate instead of a vibe check.

DeepEval is an open-source evaluation framework that makes testing an LLM application feel like writing unit tests. If you know pytest, you already know the shape: you write test cases with inputs and expected behavior, attach metrics, and assert that the scores clear a threshold — except the "assertions" are research-backed LLM metrics rather than exact-match checks.

It is aimed at engineers who want evaluation to be a repeatable, automatable gate rather than a one-off spreadsheet. DeepEval runs locally, integrates with CI, and pairs with the Confident AI platform if you want hosted dashboards and dataset management.

Highlights

  • Pytest-style API — define test cases, attach metrics, and assert_test; run the whole suite from the CLI or CI.
  • Ready-made metrics — G-Eval (LLM-as-judge with custom rubrics), faithfulness, answer relevancy, hallucination, plus RAG metrics (contextual precision/recall) and agent/tool-use metrics.
  • Custom metrics — define your own LLM-as-judge or deterministic metrics when the built-ins don't fit.
  • Synthetic data & datasets — generate test cases and manage evaluation datasets.
  • CI-native — fail a build when a metric regresses, so prompt or model changes are scored, not guessed.
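
As a rough illustration of the "deterministic custom metric" idea in the list above, here is a minimal, library-independent sketch in plain Python. It does not use DeepEval's own metric base class; KeywordCoverageMetric, measure, and is_successful are hypothetical names chosen for this example, and the keywords are made up. It shows only the core pattern: compute a score between 0 and 1, then compare it with a threshold.

```python
from dataclasses import dataclass


@dataclass
class KeywordCoverageMetric:
    """Hypothetical deterministic metric: fraction of required keywords present."""

    required: tuple[str, ...]
    threshold: float = 0.7

    def measure(self, actual_output: str) -> float:
        # Case-insensitive substring match for each required keyword.
        text = actual_output.lower()
        hits = sum(1 for kw in self.required if kw.lower() in text)
        # With no required keywords there is nothing to miss, so score 1.0.
        return hits / len(self.required) if self.required else 1.0

    def is_successful(self, actual_output: str) -> bool:
        # Pass when the score meets or exceeds the threshold.
        return self.measure(actual_output) >= self.threshold


# Illustrative usage with made-up keywords and output.
metric = KeywordCoverageMetric(required=("rotate", "api key", "revoke"))
answer = "Rotate the API key in settings, then revoke the old one."
score = metric.measure(answer)
```

A deterministic check like this costs no judge tokens, so it can sit alongside LLM-as-judge metrics for properties that are cheap to verify directly.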

In an AI-assisted workflow

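from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

from my_app import app  # hypothetical: your app's entry point, returns the answer as a string

def test_answer_is_relevant():
    query = "How do I rotate API keys?"
    # Pair the query with the app's actual response.
    case = LLMTestCase(input=query, actual_output=app(query))
    # Fails if the judged relevancy score is below 0.7.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])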

TIP

Start with 15–30 representative cases (include the adversarial ones that broke before), pick the two or three metrics your feature is actually graded on, and wire the deepeval test run command into CI before tuning prompts.
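
The CI gate described in the tip boils down to three steps: score every case, compare each score with a threshold, and return a non-zero exit status if anything falls short. The sketch below shows that logic in plain Python under stated assumptions. It is not how deepeval test run is implemented; gate, example_scores, and the 0.7 threshold are illustrative names and values, and the scores are made up.

```python
def gate(scores: dict[str, float], threshold: float = 0.7) -> int:
    """Return a process exit code: 0 if every case clears the threshold, else 1."""
    failures = {name: s for name, s in scores.items() if s < threshold}
    for name, s in sorted(failures.items()):
        # Print each failing case so the CI log shows what regressed.
        print(f"FAIL {name}: {s:.2f} < {threshold:.2f}")
    return 1 if failures else 0


# Hypothetical per-case scores; in practice these would come from your eval run.
example_scores = {"rotate_api_keys": 0.91, "adversarial_prompt": 0.55}
exit_code = gate(example_scores)
```

In a CI script, passing exit_code to sys.exit would make the job fail whenever a case drops below the threshold.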

Good to know

DeepEval is free and open source (Apache-2.0); you bring an LLM provider for the judge metrics, so those incur token cost. The optional Confident AI cloud adds hosted reporting and collaboration. For a RAG-specific metric set, compare with RAGAS; for the full landscape see Best LLM & RAG Evaluation Tools in 2026.
