promptfoo

promptfoo is an open-source, config-driven CLI for evaluating and comparing LLM prompts and models side by side, plus a red-teaming mode that probes apps for prompt injection, jailbreaks, and unsafe output. Declarative YAML test cases make it CI-friendly and provider-agnostic.

promptfoo is an open-source, developer-first tool for evaluating LLM outputs. You declare test cases and assertions in a YAML config, point it at one or more prompts, models, or providers, and it runs every combination against the same cases in a side-by-side matrix, so you can compare pass rates and scores instead of eyeballing outputs. It also ships a red-teaming mode that automatically probes an app for vulnerabilities such as prompt injection and jailbreaks.

It is aimed at engineers who want eval to feel like a fast, config-driven CLI step rather than a platform. Because tests are declarative and provider-agnostic, promptfoo drops cleanly into CI and works across OpenAI, Anthropic, open models, and custom endpoints.
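As a rough illustration of what provider-agnostic can look like in a config, the sketch below lists a hosted model, a local model, and a custom HTTP endpoint side by side. The model names, the `ollama:chat:` prefix, the example URL, and the `body` and `transformResponse` fields are assumptions based on a general understanding of promptfoo's provider options, so check them against the current provider docs before relying on them.

```yaml
# Illustrative only: provider IDs, URL, and response shape are assumptions.
providers:
  # Hosted API (expects OPENAI_API_KEY in the environment)
  - openai:gpt-4o-mini
  # Local model served by Ollama (assumed model name)
  - ollama:chat:llama3
  # Custom HTTP endpoint (hypothetical URL and payload)
  - id: https
    config:
      url: https://example.com/api/chat
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: "{{prompt}}"
      # Assumes the endpoint returns JSON like {"output": "..."}
      transformResponse: json.output
```

Because all three sit in the same `providers` list, the same tests run against each one without changes to the test cases.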

Highlights

Side-by-side matrix — compare prompts × models × providers on the same cases and view results in a web UI or CI output.
Declarative tests — assertions in YAML (exact match, similarity, LLM-as-judge, JSON schema, custom), kept in version control.
Red teaming — automated adversarial probes for prompt injection, jailbreaks, PII leakage, and unsafe content.
Provider-agnostic — works with hosted APIs, local models, and custom HTTP endpoints.
CI-native — run headlessly and fail the build on a regression or a failed safety probe.
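The assertion styles listed under declarative tests can be combined on a single test case. The following is a minimal sketch, assuming an inline prompt and one OpenAI provider; the prompt text, the expected strings, the threshold, and the length limit are made up for illustration.

```yaml
# Illustrative config: prompt, values, and thresholds are assumptions.
prompts:
  - "Answer the support question as JSON with an `answer` field: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "How do I rotate API keys?"
    assert:
      # JSON schema: output must parse as JSON and match this shape
      - type: is-json
        value:
          type: object
          required: [answer]
          properties:
            answer: { type: string }
      # Case-insensitive substring match
      - type: icontains
        value: "api key"
      # Embedding similarity against a reference answer
      - type: similar
        value: "Create a new key, switch clients over, then revoke the old key."
        threshold: 0.8
      # LLM-as-judge
      - type: llm-rubric
        value: "explains the rotation steps in order"
      # Custom check written as a JavaScript expression
      - type: javascript
        value: "output.length < 1200"
```

The `similar` and `llm-rubric` assertions call an embedding model and a judge model respectively, so they add token cost on top of the generation itself; the other three are evaluated locally.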

In an AI-assisted workflow

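# promptfooconfig.yaml
# Each prompt file should reference the {{question}} variable.
prompts: [file://prompt_a.txt, file://prompt_b.txt]
# Provider IDs must name a concrete model; these two are examples, swap in your own.
providers: [anthropic:messages:claude-3-5-sonnet-20241022, openai:gpt-4o-mini]
tests:
  - vars: { question: "How do I rotate API keys?" }
    assert:
      # Graded by a judge model, so this assertion incurs token cost.
      - type: llm-rubric
        value: "answers accurately and cites the docs"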

npx promptfoo@latest eval && npx promptfoo@latest view

TIP

promptfoo straddles evaluation and security: use the eval matrix to pick prompts/models, and the red-team mode as a pre-ship safety gate against prompt injection.
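A red-team run is configured in the same YAML file. The fragment below is a hedged sketch of what a pre-ship safety gate might look like: the `targets` and `redteam` keys, the plugin and strategy names, and the purpose string are assumptions drawn from a general understanding of promptfoo's red-team config and should be verified against the current documentation.

```yaml
# Illustrative red-team config: key names and plugin/strategy IDs are assumptions.
targets:
  - id: openai:gpt-4o-mini
    label: support-bot          # hypothetical label for the app under test
redteam:
  # Describes the app so generated attacks are relevant to it
  purpose: "Customer support assistant for an API product"
  plugins:
    - pii                       # probes for PII leakage
    - harmful                   # probes for unsafe content
  strategies:
    - jailbreak
    - prompt-injection
```

With a config along these lines, the probes would typically be generated and executed with `npx promptfoo@latest redteam run`; since each probe calls an LLM, keep the plugin list small while iterating.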

Good to know

promptfoo is free and open source (MIT); judge-based assertions and red-team probes call an LLM, so they incur token cost. For a Python, pytest-style framework instead of a YAML CLI, compare DeepEval; for the broader landscape see Best LLM & RAG Evaluation Tools in 2026.

Frequently asked questions

What is promptfoo?

promptfoo is an open-source, developer-first CLI for evaluating LLM outputs. You declare test cases and assertions in YAML, point it at one or more prompts, models, or providers, and it runs a side-by-side matrix showing which combination wins. It also ships a red-teaming mode that automatically probes an app for prompt injection, jailbreaks, PII leakage, and unsafe content.

Is promptfoo free?

Yes — free and open source under MIT. Judge-based assertions and red-team probes call an LLM, so those incur token cost.

How do I run promptfoo?

Define prompts, providers, and tests in promptfooconfig.yaml, then run npx promptfoo@latest eval and npx promptfoo@latest view to see the results matrix. It runs headlessly in CI and can fail the build on a regression or a failed safety probe.
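For the CI case, a minimal GitHub Actions sketch might look like the following. The workflow file name, trigger paths, Node version, and secret names are hypothetical, and the gate relies on the assumption that `promptfoo eval` exits non-zero when an assertion fails.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval
on:
  pull_request:
    paths:
      - "promptfooconfig.yaml"
      - "prompt_*.txt"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A failed assertion is assumed to make this step, and so the job, fail.
      - name: Run promptfoo eval
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Restricting the trigger to the config and prompt files keeps token spend down, since the eval only runs when something that affects the results has changed.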

promptfoo vs DeepEval?

promptfoo is a config-driven YAML CLI; DeepEval is a Python, pytest-style framework. Pick promptfoo for declarative, provider-agnostic matrix comparisons and red teaming; pick DeepEval if you want evals written as Python tests.
