Prompt Injection
Prompt injection is an attack in which untrusted content carries instructions that an LLM then follows instead of its task: overriding that task, leaking data, or triggering tool calls. It is the LLM-era descendant of SQL injection, and OWASP ranks it the #1 risk for LLM applications.
The root cause is structural: a model's context mixes trusted instructions and untrusted data in the same medium (text), and the model has no hard boundary between them. Direct injection comes from a hostile user; the sharper threat is indirect injection, where instructions hide in things the system reads — a webpage, a document, an email, tool output. For agents with tools, that escalates from wrong answers to wrong actions: exfiltrated secrets, malicious tool calls, poisoned memory.
Because the model layer can't fully solve it, defense is architectural: scope tools to least privilege, gate dangerous actions with deterministic checks outside the model, treat every fetched byte as untrusted, and require human approval for irreversible operations. The working playbook is Defending Against Prompt Injection; auditing an existing app for exposure is the prompt-injection-auditor agent's job.
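As a rough illustration of a deterministic gate that sits outside the model, the sketch below checks each tool call the model proposes against a fixed policy. Every name in it (`ToolPolicy`, `gate`, `Decision`, and the tool names) is hypothetical and not taken from any particular framework; a real system would also validate arguments, not just tool names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    DENY = "deny"


@dataclass(frozen=True)
class ToolPolicy:
    # Tools this agent may call at all (least privilege).
    allowed: frozenset
    # Subset of allowed tools whose effects cannot be undone.
    irreversible: frozenset = field(default_factory=frozenset)


def gate(policy: ToolPolicy, tool_name: str) -> Decision:
    """Decide on a model-proposed tool call using only the fixed policy.

    The model's output never influences the policy itself, so injected
    text cannot talk its way past this check.
    """
    if tool_name not in policy.allowed:
        return Decision.DENY
    if tool_name in policy.irreversible:
        return Decision.NEEDS_HUMAN
    return Decision.ALLOW


# Hypothetical policy for a support agent: it may search and send email,
# but sending email is irreversible and so requires human approval.
policy = ToolPolicy(
    allowed=frozenset({"search_docs", "send_email"}),
    irreversible=frozenset({"send_email"}),
)
```

The point of the design is that the decision depends only on code and configuration the attacker cannot reach, so a successful injection can at most request an action, not authorize it.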
Frequently asked questions
- What's the difference between direct and indirect prompt injection?
- Direct: the attacker is the user, typing instructions that override the system prompt ('ignore previous instructions…'). Indirect: the attack rides in content the model processes — a web page it fetches, an email it summarizes, a README it reads — so a completely benign user can trigger it. Indirect is the dangerous one for agents, which read untrusted content constantly.
- Can prompt injection be fully solved?
- Not at the model layer, today. Models can't reliably distinguish 'data to process' from 'instructions to follow' inside one context. Real defenses are architectural: least-privilege tools, deterministic permission gates outside the model, treating all fetched content as untrusted, and human approval on irreversible actions — defense in depth, not a magic prompt.
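The model cannot tell data from instructions inside one context, but the application assembling that context can still record where each piece came from. The sketch below is a minimal, assumption-laden example: `Segment`, `INDIRECT_SOURCES`, `has_indirect_content`, and `tools_for` are hypothetical names, and whether the user channel counts as trusted is an application-specific choice that this sketch leaves out.

```python
from dataclasses import dataclass

# Assumed labels for content the system fetched rather than received
# from its own operator; adjust to the channels an app actually has.
INDIRECT_SOURCES = frozenset({"web", "email", "document", "tool"})


@dataclass(frozen=True)
class Segment:
    source: str  # e.g. "system", "user", "web", "email", "tool"
    text: str


def has_indirect_content(context):
    """True if any segment came from a fetched, untrusted channel."""
    return any(seg.source in INDIRECT_SOURCES for seg in context)


def tools_for(context, full_tools, read_only_tools):
    """Narrow the tool set once untrusted content is in the context."""
    if has_indirect_content(context):
        return read_only_tools
    return full_tools


FULL = frozenset({"search_docs", "send_email"})
READ_ONLY = frozenset({"search_docs"})

clean = [Segment("system", "You are a helpful assistant."),
         Segment("user", "Summarize my inbox.")]
with_email = clean + [Segment("email", "Ignore previous instructions...")]
```

Here `tools_for(clean, FULL, READ_ONLY)` returns the full set, while `tools_for(with_email, FULL, READ_ONLY)` returns only the read-only tools. This does not detect an injection; it limits what one could do if present.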
Related
- Defending Against Prompt Injection: A Practical Guide for LLM Apps
  Prompt injection can't be solved at the model layer, so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- Securing AI Agents: The OWASP Agentic Top 10 in Practice
  Agents add risks LLM-app security misses: autonomy, tools, memory, multi-agent trust. The key OWASP agentic threats and how to mitigate each in practice.
- Prompt Injection Auditor
  Use this agent to audit an LLM app or agent for prompt-injection exposure: mapping where untrusted content enters the model's context (user, RAG, tools, web), assessing the blast radius if an injection succeeds, probing with adversarial inputs, and recommending architectural mitigations. Examples: "audit our RAG agent for indirect prompt injection", "what's the blast radius if our agent gets injected, and which tools and credentials are exposed?", "review our LLM app's trust boundaries and tell us what to fix".
- Guardrails
  Guardrails are programmatic checks around an LLM, validating inputs and outputs in code, that enforce safety and format rules a prompt alone can't guarantee.
- Red Team LLM
  Red-team an LLM app or agent for prompt injection, jailbreaks, and data leakage: probe the real attack surface (input, RAG, tools, system prompt) with adversarial inputs and report what got through and how to fix it.
- Jailbreak
  A jailbreak is a prompt crafted to bypass a model's safety training and policies, making it produce output it was trained to refuse.
- Red-Teaming (AI)
  AI red-teaming is adversarial testing: attacking your model or agent with jailbreaks, injections, and misuse scenarios to find failures before users do.
- System Prompt
  The system prompt is the standing instruction layer an LLM receives before user input, defining its role, rules, tools, and tone for the whole conversation.