promptfoo
An open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.
promptfoo is an open-source, config-driven CLI for evaluating and comparing LLM prompts and models side by side, plus a red-teaming mode that probes apps for prompt injection, jailbreaks, and unsafe output. Declarative YAML test cases make it CI-friendly and provider-agnostic.
promptfoo is an open-source, developer-first tool for evaluating LLM outputs. You declare test cases and assertions in a YAML config, point it at one or more prompts, models, or providers, and it runs a side-by-side matrix so you can see — quantitatively — which combination wins. It also ships a red-teaming mode that automatically probes an app for vulnerabilities like prompt injection and jailbreaks.
It is aimed at engineers who want eval to feel like a fast, config-driven CLI step rather than a platform. Because tests are declarative and provider-agnostic, promptfoo drops cleanly into CI and works across OpenAI, Anthropic, open models, and custom endpoints.
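As a rough sketch of the CI step, assuming GitHub Actions: the workflow file name, job name, Node version, and secret names below are illustrative choices, not something promptfoo prescribes. It relies on `promptfoo eval` exiting with a non-zero status when a test case fails, which is what fails the job.

```yaml
# .github/workflows/llm-eval.yml (hypothetical file name)
name: llm-eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22   # assumed; any Node version promptfoo supports
      - name: Run promptfoo eval
        # A failing assertion makes this command exit non-zero and fails the job
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          # Secret names are assumptions; set only the keys your providers need
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Keeping the config and prompt files in the same repository means a prompt change and its eval result show up in the same pull request.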
Highlights
- Side-by-side matrix — compare prompts × models × providers on the same cases and view results in a web UI or CI output.
- Declarative tests — assertions in YAML (exact match, similarity, LLM-as-judge, JSON schema, custom), kept in version control.
- Red teaming — automated adversarial probes for prompt injection, jailbreaks, PII leakage, and unsafe content.
- Provider-agnostic — works with hosted APIs, local models, and custom HTTP endpoints.
- CI-native — run headlessly and fail the build on a regression or a failed safety probe.
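To make the assertion types above concrete, here is a minimal sketch of one test case that mixes several of them. The prompt text, provider ID, reference answer, and thresholds are made up for illustration; `equals` (exact match) and `is-json` (JSON validation, optionally against a schema) are used the same way but are left out to keep the example short.

```yaml
# promptfooconfig.yaml (illustrative values throughout)
prompts:
  - "Answer this support question in plain text: {{question}}"
providers:
  - openai:gpt-4o-mini   # example provider ID; swap in your own
tests:
  - vars:
      question: "How do I rotate API keys?"
    assert:
      # Case-insensitive substring check
      - type: icontains
        value: "api key"
      # Embedding similarity against a reference answer (needs an embedding provider)
      - type: similar
        value: "Create a new key, update your clients, then revoke the old key."
        threshold: 0.8
      # LLM-as-judge: a grader model scores the output against this rubric
      - type: llm-rubric
        value: "gives step-by-step rotation instructions without inventing endpoints"
      # Custom check written as a JavaScript expression over the output
      - type: javascript
        value: "output.length < 1200"
```

Cheap deterministic checks such as `icontains` cost nothing to run, so it can make sense to keep judge-based assertions for the qualities that string checks cannot express.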
In an AI-assisted workflow
# promptfooconfig.yaml
prompts: [file://prompt_a.txt, file://prompt_b.txt]
providers: [anthropic:claude, openai:gpt]
tests:
  - vars: { question: "How do I rotate API keys?" }
    assert:
      - type: llm-rubric
        value: "answers accurately and cites the docs"

npx promptfoo@latest eval && npx promptfoo@latest view

TIP
promptfoo straddles evaluation and security: use the eval matrix to pick prompts/models, and the red-team mode as a pre-ship safety gate against prompt injection.
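A red-team gate might be configured roughly as follows. This is a sketch under assumptions: the target, label, purpose string, and test count are invented, and the plugin and strategy names are examples that should be checked against the current promptfoo red-team documentation before use.

```yaml
# promptfooconfig.yaml, red-team section (illustrative)
targets:
  - id: openai:gpt-4o-mini   # example target; in practice, usually your app's endpoint
    label: support-bot       # hypothetical label
redteam:
  purpose: "Customer support assistant for an API product"
  numTests: 5                # probes generated per plugin (assumed value)
  plugins:
    - pii                    # PII leakage probes
    - harmful                # unsafe-content probes
  strategies:
    - jailbreak
    - prompt-injection
```

With a config along these lines, `npx promptfoo@latest redteam run` generates the adversarial cases and runs them against the target. As with judge-based assertions, generating and grading probes calls an LLM and so uses tokens.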
Good to know
promptfoo is free and open source (MIT); judge-based assertions and red-team probes call an LLM, so they incur token cost. For a Python, pytest-style framework instead of a YAML CLI, compare DeepEval; for the broader landscape see Best LLM & RAG Evaluation Tools in 2026.
Related
- Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo
  A decision guide to the LLM eval landscape: code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
- DeepEval
  An open-source evaluation framework for LLM apps: 'Pytest for LLMs' with ready-made metrics and CI integration.
- LLM Evaluation Engineer
  Use this agent to make an LLM feature's quality measurable: building the dataset, choosing metrics, setting a baseline, and turning evals into a CI gate so prompt and model changes are scored, not guessed. Examples: "we changed the prompt and don't know if it's better, set up evals", "add a regression gate for our extraction feature", "our RAG quality is drifting, build an eval suite".
- Defending Against Prompt Injection: A Practical Guide for LLM Apps
  Prompt injection can't be solved at the model layer, so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- Red Team LLM
  Red-team an LLM app or agent for prompt injection, jailbreaks, and data leakage; probe the real attack surface (input, RAG, tools, system prompt) with adversarial inputs and report what got through and how to fix it.
- LLM Eval Suite Scaffolder
  Stand up an evaluation suite for an LLM feature from scratch (a representative dataset, the right metrics, a baseline score, and a CI gate) using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI.
- LLM Guard
  An open-source security toolkit of input and output scanners for LLM apps (prompt injection, PII/anonymize, secrets, toxicity, and more) from Protect AI.
- Run Evals
  Run the project's LLM evaluation suite (DeepEval, promptfoo, or RAGAS) and report scores against thresholds before a merge.