Guardrails

Guardrails are deterministic checks wrapped around a language model — code that validates what goes in and what comes out, enforcing the rules a prompt can only request.

The distinction that matters is ask versus enforce. Everything inside the model is probabilistic: instructions usually hold, until a prompt injection or an odd input bends them. Guardrails sit outside that uncertainty: an input scanner that strips PII before the model sees it, an output validator that rejects malformed JSON, a policy classifier that blocks disallowed content, a permission gate that stops a dangerous tool call. The model proposes; the rails dispose.
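As a minimal sketch of two of the rails named above, assuming a deliberately simple email regex stands in for real PII detection, the following shows an input scanner and an output validator. The names `scrub_input`, `validate_output`, and `EMAIL_RE` are hypothetical and chosen for illustration only.

```python
import json
import re

# Assumption: a simple email matcher, not production-grade PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_input(text: str) -> str:
    """Input rail: redact email addresses before the model sees the text."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def validate_output(raw: str, required_keys: set) -> dict:
    """Output rail: accept only a JSON object that has every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Neither function consults the model. The same input always yields the same verdict, which is the property that separates enforcing from asking.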

In practice they're layered at three chokepoints (input, output, and around tool/action execution) using a mix of plain validators (structured-output schemas), specialized scanners (LLM Guard), and rule engines (NeMo Guardrails). In agentic systems the third chokepoint becomes the critical one: deterministic action gates, which is exactly what Claude Code hooks implement. Designing the right set for an app without strangling it is the llm-guardrails-designer skill's job.
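A deterministic action gate can be sketched as a plain function that inspects a proposed tool call before anything executes. This is an illustration only, not the Claude Code hooks API or any library's interface: `ToolCall`, `ALLOWED_TOOLS`, `DENY_PATTERNS`, and `gate` are hypothetical names, and the rules are placeholder assumptions.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical shape of a tool call proposed by the model."""
    name: str
    argument: str


# Assumed policy: an allowlist of tools plus deny rules for shell commands.
ALLOWED_TOOLS = {"read_file", "run_shell"}
DENY_PATTERNS = [re.compile(r"\brm\s+-rf\b"), re.compile(r"\bsudo\b")]


def gate(call: ToolCall) -> Tuple[bool, str]:
    """Return (allowed, reason); runs before the tool is executed."""
    if call.name not in ALLOWED_TOOLS:
        return False, f"tool not allowed: {call.name}"
    if call.name == "run_shell":
        for pattern in DENY_PATTERNS:
            if pattern.search(call.argument):
                return False, f"blocked by rule: {pattern.pattern}"
    return True, "ok"
```

The executor would call `gate` on every proposed action and refuse to run anything that comes back `False`, regardless of how the model justified the call.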

Frequently asked questions

How are guardrails different from the system prompt?

A system prompt asks; a guardrail enforces. Instructions shape model behavior probabilistically and can be overridden or ignored. Guardrails run as code outside the model — schema validators, PII scanners, policy classifiers, permission gates — and deterministically block, redact, or rewrite what violates the rules, no matter what the model 'wants'.

What do guardrails typically check?

Inbound: prompt-injection patterns, PII and secrets, jailbreak attempts, off-topic abuse. Outbound: format and schema validity, toxicity and policy compliance, leaked secrets, claims the cited sources do not support, and unsafe tool calls. Each check sits at a chokepoint: before the model, after it, or around a tool invocation.
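One way to wire inbound and outbound checks around a model call is sketched below, under stated assumptions. The model is any callable from string to string, each check returns a reason string on violation or `None` otherwise, and both regexes (the injection phrase and the `sk-` key format) are hypothetical placeholders rather than real detection rules.

```python
import re
from typing import Callable, List, Optional

# A check returns a reason string on violation, or None if the text passes.
Check = Callable[[str], Optional[str]]


class GuardrailViolation(Exception):
    """Raised when any check at a chokepoint fails."""


def no_injection(text: str) -> Optional[str]:
    # Assumption: one illustrative phrase, not a real injection detector.
    if re.search(r"ignore (all )?previous instructions", text, re.IGNORECASE):
        return "possible prompt injection"
    return None


def no_secret(text: str) -> Optional[str]:
    # Assumption: a made-up key format used only for this sketch.
    if re.search(r"\bsk-[A-Za-z0-9]{16,}\b", text):
        return "leaked secret"
    return None


def run_checks(stage: str, text: str, checks: List[Check]) -> None:
    for check in checks:
        reason = check(text)
        if reason:
            raise GuardrailViolation(f"{stage}: {reason}")


def guarded_call(prompt: str, model: Callable[[str], str],
                 inbound: List[Check], outbound: List[Check]) -> str:
    run_checks("inbound", prompt, inbound)   # before the model
    reply = model(prompt)
    run_checks("outbound", reply, outbound)  # after the model
    return reply
```

Keeping each check as a small independent function makes it possible to add or remove one without touching the others, and to test each in isolation.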
