LLM Guardrails Designer
Design input and output guardrails for an LLM app — decide what to check (injection patterns, PII, secrets, policy, schema, leakage, toxicity), place them as input vs. output rails, implement with a library like NeMo Guardrails or LLM Guard, and fail closed. Use when adding a safety/validation layer around an LLM, not relying on the prompt alone.
Install to ~/.claude/skills/llm-guardrails-designer/SKILL.md
A guardrail is the validation layer around an LLM that a system prompt can't be: programmatic checks on what goes into the model and what comes out, enforced in code rather than requested in text. This skill designs that layer — deciding which checks matter for your app, placing them as input or output rails, implementing them with a guardrails library, and making them fail closed — as defense in depth, not a wall.
When to use this skill
- Adding a safety/validation layer to an LLM app instead of trusting the prompt to police itself.
- Enforcing output structure, policy, or PII/secret-leakage checks before responses reach users or downstream systems.
- Hardening a RAG or agent app against injection and unsafe actions as part of defending against prompt injection.
Instructions
- Threat-model the app first. Identify the untrusted inputs (user, retrieved content, tool output), the sensitive data/actions to protect, and the unacceptable outputs (leaked secrets, policy violations, malformed structure). Guardrails follow the threats — don't add checks with no threat behind them.
- Choose input rails. On the way in, decide what to scan and reject/sanitize: prompt-injection patterns, PII/secret stripping (often via the prompt-pii-redactor), banned topics, and input size/token limits. Input rails reduce what reaches the model.
- Choose output rails. On the way out, validate before the response is trusted: schema/structure conformance, policy and safety (toxicity, disallowed content), leakage (PII, secrets, system-prompt disclosure), and grounding/relevance for RAG. Output rails are your last line before a user or a tool acts on the response.
- Implement with a library, not from scratch. Use NeMo Guardrails (programmable rails, Colang) or LLM Guard (ready-made input/output scanners) rather than hand-rolling detectors. Match the choice to the stack and the checks you need.
- Fail closed and make it observable. When a guardrail trips, default to the safe action (block, sanitize, or escalate to a human) rather than passing through. Log every trigger with enough context to tune it — guardrails you can't see are guardrails you can't trust.
- Acknowledge the limits. State plainly that guardrails are defense in depth, not prevention — they raise the cost of an attack and catch known patterns, but they don't replace least privilege and human approval for high-impact actions. Don't let a guardrail create false confidence.
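To show where input and output rails sit, here is a minimal sketch that uses only the Python standard library. The regex patterns, the `sk-` key shape, the size limit, and the names `input_rail`, `output_rail`, and `RailResult` are all assumptions made for this example. In a real app these hand-written detectors would be replaced by NeMo Guardrails rails or LLM Guard scanners, as the instructions above recommend; the point here is only the placement and the block-or-sanitize decision.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; real detectors need far broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # assumed API-key shape
MAX_INPUT_CHARS = 4000  # assumed size limit


@dataclass
class RailResult:
    allowed: bool        # False means the caller must not pass the text on
    text: str            # possibly sanitized text
    reasons: list[str]   # which checks tripped, for logging and tuning


def input_rail(prompt: str) -> RailResult:
    """Check a prompt before it reaches the model: reject or sanitize."""
    reasons = []
    if len(prompt) > MAX_INPUT_CHARS:
        reasons.append("input_too_long")
    if any(p.search(prompt) for p in INJECTION_PATTERNS):
        reasons.append("injection_pattern")
    # PII is sanitized rather than rejected in this sketch.
    sanitized = EMAIL.sub("[EMAIL]", prompt)
    return RailResult(allowed=not reasons, text=sanitized, reasons=reasons)


def output_rail(response: str, required_keys: set[str]) -> RailResult:
    """Check a model response before a user or tool acts on it."""
    reasons = []
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        data = None
    # Schema check: the response must be a JSON object with the required keys.
    if not isinstance(data, dict) or not required_keys.issubset(data):
        reasons.append("schema_invalid")
    # Leakage check: no secrets or email addresses in the raw response.
    if SECRET.search(response) or EMAIL.search(response):
        reasons.append("leakage")
    return RailResult(allowed=not reasons, text=response, reasons=reasons)
```

A caller would run `input_rail` first, send `result.text` to the model only when `result.allowed` is true, and then run `output_rail` on the response before returning it.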
WARNING
Guardrails are probabilistic and bypassable — a detector for injection or toxicity will miss novel phrasings. Layer them with architectural controls (least privilege, approvals, output validation), and never let "we have guardrails" substitute for limiting what the model can actually do.
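One way to picture the architectural controls mentioned in this warning is a tool gate that sits outside any detector. The sketch below is an assumption-laden illustration: the tool names, the `ToolPolicy` type, and `authorize_tool_call` are hypothetical and do not come from any library. Unknown tools are denied, and high-impact tools need an explicit human approval flag regardless of what the model asked for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: bool
    needs_approval: bool  # True for high-impact actions


# Hypothetical tool names; anything not listed here is denied.
TOOL_POLICIES = {
    "search_docs": ToolPolicy(allowed=True, needs_approval=False),
    "send_email": ToolPolicy(allowed=True, needs_approval=True),
    "delete_records": ToolPolicy(allowed=False, needs_approval=True),
}


def authorize_tool_call(tool_name: str, approved_by_human: bool = False) -> bool:
    """Least-privilege gate: deny by default, require approval for risky tools."""
    policy = TOOL_POLICIES.get(tool_name)
    if policy is None or not policy.allowed:
        return False  # unknown or disabled tools never run
    if policy.needs_approval and not approved_by_human:
        return False  # high-impact action without a human sign-off
    return True
```

Because this gate does not depend on detecting an attack, it still limits what the model can do when an injection slips past the rails.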
TIP
Fail closed by default. A guardrail that, on error or uncertainty, lets the request through is worse than none — it gives you confidence without protection. The safe default when a check can't run or is unsure is to block or route to a human.
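A minimal sketch of that default, assuming a check is any callable that returns a risk score between 0 and 1. The `Decision` enum, the `decide` function, and both thresholds are hypothetical names and values chosen for this example. A check that raises, returns nothing, or returns a non-numeric score results in a block, and a score in the uncertain middle band is routed to a human.

```python
import logging
import math
from enum import Enum
from typing import Callable

logger = logging.getLogger("guardrails")


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # route to a human reviewer


def decide(
    check: Callable[[str], float],
    text: str,
    allow_below: float = 0.2,  # assumed threshold: clearly safe
    block_at: float = 0.8,     # assumed threshold: clearly unsafe
) -> Decision:
    """Run one check and fail closed on any error or unusable score."""
    name = getattr(check, "__name__", "unknown")
    try:
        score = float(check(text))
    except Exception:
        # The check could not run or returned a non-number: block, never pass through.
        logger.exception("guardrail_error check=%s", name)
        return Decision.BLOCK
    if math.isnan(score):
        logger.warning("guardrail_nan check=%s", name)
        return Decision.BLOCK
    if score < allow_below:
        return Decision.ALLOW
    if score >= block_at:
        logger.warning("guardrail_block check=%s score=%.2f", name, score)
        return Decision.BLOCK
    # The check ran but is unsure: escalate instead of guessing.
    logger.warning("guardrail_escalate check=%s score=%.2f", name, score)
    return Decision.ESCALATE
```

The only path to `Decision.ALLOW` is a check that ran successfully and scored the text as clearly safe; every other outcome is logged and ends in a block or an escalation.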
Output
A guardrail design and implementation: the threat model it addresses, the input and output rails with what each checks and its fail-closed behavior, the library wiring (NeMo Guardrails or LLM Guard), logging for each trigger, and an explicit statement of what the guardrails do and do not cover — so they're treated as one layer of defense, not the whole defense.
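The design part of this output can be captured as data and checked for gaps. The sketch below is one possible shape, not a prescribed format: `Rail`, `GuardrailDesign`, and `problems` are hypothetical names, and the allowed `on_trigger` values mirror the fail-closed actions named in the instructions (block, sanitize, escalate). It flags rails with no threat behind them and threats that are neither covered nor explicitly listed as out of scope.

```python
from dataclasses import dataclass, field

SAFE_ACTIONS = {"block", "sanitize", "escalate"}
STAGES = {"input", "output"}


@dataclass(frozen=True)
class Rail:
    name: str
    stage: str       # "input" or "output"
    threat: str      # threat-model entry this rail addresses
    on_trigger: str  # fail-closed action taken when the rail trips


@dataclass
class GuardrailDesign:
    threats: list[str]
    rails: list[Rail]
    not_covered: list[str] = field(default_factory=list)  # stated limits

    def problems(self) -> list[str]:
        """Return human-readable gaps in the design; an empty list means none found."""
        issues = []
        for rail in self.rails:
            if rail.stage not in STAGES:
                issues.append(f"rail '{rail.name}' has unknown stage '{rail.stage}'")
            if rail.threat not in self.threats:
                issues.append(f"rail '{rail.name}' has no threat behind it")
            if rail.on_trigger not in SAFE_ACTIONS:
                issues.append(f"rail '{rail.name}' does not fail closed")
        covered = {rail.threat for rail in self.rails}
        for threat in self.threats:
            if threat not in covered and threat not in self.not_covered:
                issues.append(f"threat '{threat}' is neither covered nor declared uncovered")
        return issues
```

For example, a design with threats "prompt injection" and "secret leakage", one input rail and one output rail addressing them, and "unsafe tool use" listed under `not_covered` would report no problems, while a rail whose `on_trigger` is "allow" would be flagged.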
Related
- Defending Against Prompt Injection: A Practical Guide for LLM Apps. Prompt injection can't be solved at the model layer, so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- NeMo Guardrails. NVIDIA's open-source toolkit for adding programmable guardrails to LLM apps: input, dialog, retrieval, and output rails defined in the Colang language.
- LLM Guard. An open-source security toolkit of input and output scanners for LLM apps (prompt injection, PII/anonymize, secrets, toxicity, and more) from Protect AI.
- Prompt PII Redactor. Detect and redact PII and secrets from prompts (and logs/traces) before they reach an LLM provider: mask or tokenize emails, phone numbers, names, IDs, and API keys, reversibly where the response needs the real values back. Use when sending user or document data to a third-party model, or when LLM request logs may capture sensitive data.
- Prompt Injection Auditor. Use this agent to audit an LLM app or agent for prompt-injection exposure: mapping where untrusted content enters the model's context (user, RAG, tools, web), assessing the blast radius if an injection succeeds, probing with adversarial inputs, and recommending architectural mitigations. Examples: "audit our RAG agent for indirect prompt injection", "what's the blast radius if our agent gets injected, and which tools and credentials are exposed?", "review our LLM app's trust boundaries and tell us what to fix".
- Securing AI Agents: The OWASP Agentic Top 10 in Practice. Agents add risks LLM-app security misses: autonomy, tools, memory, multi-agent trust. The key OWASP agentic threats and how to mitigate each in practice.