Red Team LLM
Red-team an LLM app or agent for prompt injection, jailbreaks, and data leakage — probe the real attack surface (input, RAG, tools, system prompt) with adversarial inputs and report what got through and how to fix it.
/red-team-llm<the app/endpoint/agent to test, or a description of its inputs, tools, and data>Install to ~/.claude/commands/red-team-llm.md
Scope
Treat $ARGUMENTS as the LLM app/agent to red-team — an endpoint, an agent, or a description of its inputs, tools, retrieved sources, and data. Restate the target and its attack surface in one sentence before probing.
WARNING
Red-team only systems you are authorized to test. This command runs adversarial attacks; confirm you have permission for the target and use a non-production or isolated environment where possible. The aim is to find holes before an attacker does — on your own system.
Goal: probe the real attack surface with adversarial inputs, record what succeeds and its blast radius, and return prioritized fixes — an active attack campaign, complementary to the design review the prompt-injection-auditor performs.
Step 1 — Map the attack surface
Enumerate every channel that reaches the model: direct user input, retrieved/RAG content, tool outputs, browsed pages or ingested files, and the system prompt. The indirect channels (content the system reads while working) are the ones most worth attacking.
Step 2 — Choose attack categories
Cover the categories that matter for this target:
- Direct prompt injection — instruction-override in user input.
- Indirect injection — payloads planted in a document, tool result, or page the system ingests.
- Jailbreak — bypassing safety/policy constraints.
- System-prompt leakage — extracting the hidden instructions (LLM07).
- Data exfiltration — making the model reveal data or secrets it shouldn't.
- Tool misuse — inducing a harmful or out-of-scope tool call (for agents).
Step 3 — Run the probes
Execute adversarial inputs for each category — automated with a red-teaming tool like promptfoo (injection/jailbreak suites) and/or targeted manual probes, including the indirect path (seed a poisoned document/tool result and see if the agent obeys it). Vary phrasings; a single failed attempt proves nothing.
Step 4 — Record what got through and its blast radius
For each successful attack, capture the input, what the model did, and — critically — the impact: data leaked, action taken, constraint bypassed. Rank by blast radius (what it could actually cause), not by novelty.
Step 5 — Recommend fixes and re-test
Map each finding to a mitigation — least privilege, human approval, trust boundaries, input/output guardrails, secrets out of context (see Defending Against Prompt Injection) — then re-run the successful attacks to confirm the fix contains them. An attack that now achieves nothing is the success criterion, not one you believe you blocked.
NOTE
Report negatives honestly: state which attack categories you ran, which you didn't, and that passing today's probes is not proof of safety — red-teaming is continuous, because new bypasses appear. Gate releases on it, don't treat it as a one-time sign-off.
Related
- Defending Against Prompt Injection: A Practical Guide for LLM AppsPrompt injection can't be solved at the model layer — so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- Prompt Injection AuditorUse this agent to audit an LLM app or agent for prompt-injection exposure — mapping where untrusted content enters the model's context (user, RAG, tools, web), assessing the blast radius if an injection succeeds, probing with adversarial inputs, and recommending architectural mitigations. Examples — "audit our RAG agent for indirect prompt injection", "what's the blast radius if our agent gets injected — which tools and credentials are exposed?", "review our LLM app's trust boundaries and tell us what to fix".
- promptfooAn open-source CLI for testing, comparing, and red-teaming LLM prompts, models, and apps.
- Securing AI Agents: The OWASP Agentic Top 10 in PracticeAgents add risks LLM-app security misses — autonomy, tools, memory, multi-agent trust. The key OWASP agentic threats and how to mitigate each in practice.