Guardrails
Guardrails are deterministic checks wrapped around a language model — code that validates what goes in and what comes out, enforcing the rules a prompt can only request.
The distinction that matters is ask versus enforce. Everything inside the model is probabilistic: instructions usually hold, until a prompt injection or an odd input bends them. Guardrails sit outside that uncertainty: an input scanner that strips PII before the model sees it, an output validator that rejects malformed JSON, a policy classifier that blocks disallowed content, a permission gate that stops a dangerous tool call. The model proposes; the rails dispose.
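As a minimal sketch of ask versus enforce, the snippet below wraps a model with one input rail and one output rail. It assumes the model is any callable that takes a string and returns a string; the `model` parameter, the email pattern, and all function names are illustrative and do not come from any particular library.

```python
import json
import re

# Hypothetical pattern for illustration; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_input(text: str) -> str:
    """Input rail: replace email addresses before the model sees the text."""
    return EMAIL_RE.sub("[EMAIL]", text)


def validate_output(raw: str, required_keys: set) -> dict:
    """Output rail: reject anything that is not a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    missing = set(required_keys) - set(data)
    if missing:
        raise ValueError(f"model output is missing keys: {sorted(missing)}")
    return data


def guarded_call(model, user_text: str, required_keys: set) -> dict:
    """Wrap a model callable (str -> str) with an input rail and an output rail."""
    safe_text = redact_input(user_text)      # enforce before the model
    raw = model(safe_text)                   # the probabilistic part
    return validate_output(raw, required_keys)  # enforce after the model
```

Whatever the model returns, the caller either gets a dict with the required keys or an exception. The prompt can still ask for JSON, but the code is what guarantees it.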
In practice they're layered at three chokepoints (input, output, and around tool or action execution) using a mix of plain validators (structured-output schemas), specialized scanners (LLM Guard), and rule engines (NeMo Guardrails). Agentic systems lean hardest on the third chokepoint, in the form of deterministic action gates around tool calls, which is what Claude Code hooks implement. Designing the right set for an app, without strangling it, is the llm-guardrails-designer skill's job.
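A deterministic action gate can be sketched in a few lines. This is only the general shape, not the Claude Code hooks API or any library's interface: the tool names, the blocked fragments, and the `Decision` type are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical policy for illustration: tool names and blocked fragments are assumptions.
ALLOWED_TOOLS = {"read_file", "run_shell"}
BLOCKED_SHELL_FRAGMENTS = ("rm -rf", "sudo ", "| sh")


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate_tool_call(tool: str, args: dict) -> Decision:
    """Decide deterministically whether a proposed tool call may run."""
    if tool not in ALLOWED_TOOLS:
        # Fail closed: anything not explicitly allowed is denied.
        return Decision(False, f"tool {tool!r} is not on the allowlist")
    if tool == "run_shell":
        command = str(args.get("command", ""))
        for fragment in BLOCKED_SHELL_FRAGMENTS:
            if fragment in command:
                return Decision(False, f"command contains blocked fragment {fragment!r}")
    return Decision(True, "ok")


def execute_guarded(tool: str, args: dict, registry: Dict[str, Callable[..., object]]) -> object:
    """Run the tool only if the gate allows it; otherwise raise PermissionError."""
    decision = gate_tool_call(tool, args)
    if not decision.allowed:
        raise PermissionError(decision.reason)
    return registry[tool](**args)
```

The allowlist does most of the work here. A substring denylist like the one above is easy to evade, so treat it as a backstop rather than the primary control.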
Frequently asked questions
- How are guardrails different from the system prompt?
  A system prompt asks; a guardrail enforces. Instructions shape model behavior probabilistically and can be overridden or ignored. Guardrails run as code outside the model (schema validators, PII scanners, policy classifiers, permission gates) and deterministically block, redact, or rewrite what violates the rules, no matter what the model 'wants'.
- What do guardrails typically check?
  Inbound: prompt-injection patterns, PII and secrets, jailbreak attempts, off-topic abuse. Outbound: format and schema validity, toxicity and policy compliance, leaked secrets, hallucinated claims against sources, and unsafe tool calls. Each check sits at a chokepoint: before the model, after it, or around a tool invocation.
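One way to organize inbound and outbound checks is as plain functions grouped per chokepoint. The sketch below is illustrative only: both regular expressions are deliberately naive stand-ins (one phrase-based injection pattern, one pattern for strings shaped like an AWS access key ID), and the names `run_rails`, `INBOUND`, and `OUTBOUND` are hypothetical.

```python
import re
from typing import Callable, List, Optional

# A check returns a violation message, or None if the text is clean.
Check = Callable[[str], Optional[str]]

# Hypothetical patterns for illustration only; production scanners are far more thorough.
INJECTION_RE = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def check_injection(text: str) -> Optional[str]:
    return "possible prompt injection" if INJECTION_RE.search(text) else None


def check_secrets(text: str) -> Optional[str]:
    return "possible leaked access key" if AWS_KEY_RE.search(text) else None


def run_rails(text: str, checks: List[Check]) -> List[str]:
    """Run every check and collect violations; a crashing check counts as a violation."""
    violations = []
    for check in checks:
        try:
            result = check(text)
        except Exception as exc:  # fail closed: an error never passes silently
            result = f"{check.__name__} failed: {exc}"
        if result is not None:
            violations.append(result)
    return violations


INBOUND = [check_injection]   # runs before the model
OUTBOUND = [check_secrets]    # runs after the model
```

A caller would block or redact whenever `run_rails` returns a non-empty list. Treating a crashed check as a violation keeps the pipeline failing closed instead of letting unscanned text through.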
Related
- LLM Guardrails Designer
  Design input and output guardrails for an LLM app: decide what to check (injection patterns, PII, secrets, policy, schema, leakage, toxicity), place them as input vs. output rails, implement with a library like NeMo Guardrails or LLM Guard, and fail closed. Use when adding a safety/validation layer around an LLM, not relying on the prompt alone.
- Defending Against Prompt Injection: A Practical Guide for LLM Apps
  Prompt injection can't be solved at the model layer, so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- NeMo Guardrails
  NVIDIA's open-source toolkit for adding programmable guardrails to LLM apps: input, dialog, retrieval, and output rails defined in the Colang language.
- LLM Guard
  An open-source security toolkit of input and output scanners for LLM apps (prompt injection, PII/anonymize, secrets, toxicity, and more), from Protect AI.
- Claude Code Hooks: Automate Formatting, Tests, and Guardrails
  How Claude Code hooks work: the major hook events, the settings.json configuration shape, exit codes and JSON output, plus three hooks worth copying.
- Structured Output
  Structured output makes an LLM return data in a guaranteed shape (JSON matching your schema) so code can consume model responses without parsing prose.
- Sandboxing AI-Generated Code: E2B vs Modal vs Daytona vs Vercel Sandbox
  Where should agent-written code run? The four sandbox platforms compared (isolation models, persistence, economics), plus the design rules that keep execution safe.
- Constitutional AI
  Constitutional AI trains models against written principles: the model critiques and revises its own outputs by them, reducing reliance on human labels.
- Jailbreak
  A jailbreak is a prompt crafted to bypass a model's safety training and policies, making it produce output it was trained to refuse.
- Prompt Injection
  Prompt injection is an attack where untrusted content carries instructions an LLM then follows, overriding its task, leaking data, or triggering tool calls.
- Red-Teaming (AI)
  AI red-teaming is adversarial testing: attacking your model or agent with jailbreaks, injections, and misuse scenarios to find failures before users do.