Red-Teaming (AI)

AI red-teaming is adversarial testing: deliberately attacking your own model or agent — jailbreaks, injections, exfiltration, tool abuse — to surface failures before attackers and users find them in production.

Borrowed from security practice, it became standard at two levels. Model-level red-teaming (the labs' discipline) probes frontier models for dangerous capabilities and policy bypasses pre-release. Application-level red-teaming — the kind every team shipping LLM features owns — attacks the system: can prompt injection ride in through retrieved documents or fetched pages? Can a jailbreak defeat the persona? Can an agent's tools be steered into exfiltration or destructive calls — the scenarios the OWASP agentic top 10 catalogs?

The discipline that separates it from poking around: coverage across every untrusted input channel, escalation from obvious to creative attacks, and findings → fixes → regression tests so resilience compounds instead of resetting. Tooling automates the grind (promptfoo's adversarial suites, scanners like LLM Guard for the runtime side), and the red-team-llm command packages the workflow for any app in reach.

Frequently asked questions

What does red-teaming an LLM app actually involve?

Systematically playing the attacker against your own system: jailbreak attempts against its policies, prompt injection through every content channel it reads, data-exfiltration probes, tool-abuse scenarios for agents, and domain-specific misuse. Findings become fixes (guardrails, scoping, gates) and then regression tests, so the same hole can't reopen.

How is red-teaming different from normal evals?

Evals measure expected behavior on representative inputs; red-teaming hunts unexpected behavior under adversarial ones. An app can score perfectly on its eval suite and fall to the first 'ignore previous instructions' — the two are complements: evals for quality, red-teaming for resilience.

Frequently asked questions

Related