Production Tool & Function Calling: Feed Errors Back as Observations
How agents use tools — the call/observe/retry loop, why errors must return to the model, and the schemas, idempotency, and limits that keep it reliable.
Tool calling is a loop: the model proposes a call, your code runs it, and the result — success OR error — goes back to the model as an observation it reasons about. The reliability comes from the engineering around that loop: schemas the model can't misuse, errors returned (never swallowed), bounded retries, idempotent side effects, and human gates on irreversible actions.
Steps at a glance
- Define tools precisely. Give each tool a model-facing description, typed arguments, honest required fields, and enums. A well-specified schema prevents most malformed calls (see tool-definition-generator).
- Let the model propose a call. The model returns a structured tool call (name + arguments) when it decides a tool is needed, rather than free text.
- Validate, then execute. Validate the arguments against the schema before running. For consequential actions, pass through a human-approval gate first.
- Return the result as an observation. Feed the tool's output back into the conversation as an observation the model reasons about on the next step.
- Feed errors back too. On failure, return a clear, structured error (e.g. '404: invoice not found') as the observation so the agent can adapt — never swallow it or crash silently.
- Retry within limits. Retry transient failures (timeout, rate limit) with backoff and a hard cap; don't retry non-retryable errors (bad request, auth). Make side-effecting tools idempotent so retries are safe.
- Loop to completion. Repeat propose-execute-observe until the task is done or a step/budget cap is hit — always have a termination condition.
Key takeaways
- Tool calling is a loop: propose call → execute → return result as an observation → repeat until done.
- Feed errors back to the model as observations — a swallowed error makes the agent assume success and act wrongly.
- Tool definitions are prompt surface: precise schemas, enums, and descriptions prevent most bad calls.
- Bound retries, distinguish retryable from non-retryable failures, and make side-effecting tools idempotent.
- Gate irreversible actions behind human approval, enforced at the tool layer, not in the prompt.
An agent is a language model in a loop with tools. The model can't do anything in the world by itself — it can only emit text, including a structured request to call a tool. Everything an agent does — search, query a database, send an email, run code — happens because your code executed a tool call and handed the result back. Getting that loop right is most of what makes an agent reliable.
The loop
Tool calling is a cycle, not a one-shot:
- You give the model a set of tools with schemas.
- The model proposes a call — a tool name and arguments — when it decides one is needed.
- Your code validates and executes it.
- You return the result as an observation to the model.
- The model reads the observation and either calls another tool or answers.
Repeat until the task is done. The model is the planner; your tool layer is the hands — and the safety system.
The one rule: errors are observations, not exceptions
The single most important — and most violated — principle: when a tool fails, return the error to the model as an observation. Not a swallowed exception, not a crash, not nothing. An agent that receives "404: invoice not found" can adapt: fix the ID, try another tool, or tell the user. An agent that receives nothing assumes the call worked and proceeds on a result that doesn't exist — the classic "silent failure, then confidently wrong action."
WARNING
Swallowing tool errors is the most common and most damaging agent bug. A failed payment that the agent thinks succeeded, a missing record it hallucinates around — these come from errors that never made it back to the model.
What makes it production-grade
The loop is simple; the reliability is in the engineering around it:
- Schemas the model can't misuse. Tool definitions are prompt surface — precise types, enums, honest required fields, and model-facing descriptions prevent most bad calls before they happen (the tool-definition-generator skill builds these). See also Effective Tool Use on scoping the toolset.
- Bounded retries. Retry transient failures (timeouts, rate limits) with backoff and a hard cap; don't retry non-retryable ones (bad request, auth) — that just burns budget.
- Idempotent side effects. For tools that change state, use idempotency keys or pre-checks so a retry or re-run can't double-charge or duplicate.
- Human gates on irreversible actions. Payments, deletions, deploys, outbound messages — gate behind approval enforced at the tool layer, not requested in the prompt (human-in-the-loop-gate).
- Termination. Always cap steps and budget so the loop can't run forever.
- Safe parallelism. Run independent calls concurrently for latency, but keep dependent or state-mutating calls ordered.
Most agent frameworks (the comparison) implement the loop for you — but the schema quality, error handling, idempotency, and gates are still yours to get right. The agent-tool-integration-engineer builds this layer, and the agent-reliability-reviewer audits it before you ship.
Frequently asked questions
- How does function calling work in an LLM agent?
- You describe a set of tools (functions) with typed schemas. When the model decides a tool is needed, it returns a structured call — the tool name plus arguments — instead of plain text. Your code validates and executes that call, then returns the result back to the model as an observation. The model reads the observation and either calls another tool or produces a final answer. It's a loop of propose → execute → observe, repeated until the task is done.
- What should an agent do when a tool call fails?
- Return the error to the model as an observation — a clear, structured message like '404: invoice not found' or 'rate limited, retry later' — so the agent can reason about it and adapt (fix the arguments, try another tool, or report the problem). The worst thing you can do is swallow the error or crash: a silent failure makes the agent assume the call succeeded and act on a result that doesn't exist, producing a confidently wrong outcome.
- Why does my agent call the wrong tool or pass bad arguments?
- Almost always because the tool definitions are weak, not because the model is dumb. Vague descriptions, free-string arguments that should be enums, and required fields marked optional all invite mistakes. Tighten the schemas: write descriptions for the model (what it does and when to use it), constrain arguments with types and enums, and keep the tool set small so tools aren't confusable.
- How do I make tool calls safe in production?
- Validate arguments before executing; feed errors back as observations; bound retries and distinguish retryable from non-retryable failures; make side-effecting tools idempotent so a retry can't double-charge or duplicate; and gate irreversible actions (payments, deletes, deploys, sends) behind human approval enforced at the tool layer. Then add step/budget caps so the loop always terminates.
Related
- Which Agent Framework in 2026? LangGraph vs CrewAI vs AutoGen vs OpenAI Agents SDK vs Claude Agent SDKA decision guide to the major AI agent frameworks — control vs. abstraction, multi-agent models, state and durability, and which fits your project.
- Agent Tool Integration EngineerUse this agent to wire tools and function-calling into an agent loop reliably — clean tool schemas, errors fed back as observations, retries with limits, idempotency, and parallel calls. Examples — "connect our APIs as agent tools", "our agent calls tools wrong / ignores tool errors", "add function-calling with proper error recovery to our agent".
- Tool Definition GeneratorGenerate clean function/tool schemas for an LLM agent from existing code or a spec — accurate JSON Schema, model-facing descriptions, honest required fields, and enums that make invalid calls impossible. Use when wiring functions into an agent's tool-calling loop.
- Agent Reliability ReviewerUse this agent to make an AI agent production-ready — reviewing its loops, cost controls, error handling, tool use, human-in-the-loop gates, checkpointing, and observability, then reporting concrete failure modes and fixes. Examples — "is our agent safe to ship?", "our agent loops forever / burns tokens, harden it", "add guardrails and recovery before we put this agent in front of users".
- Effective Tool Use: Scoping an Agent's ToolsetHow to scope tools and permissions so an agent reaches for the right one and can't do damage.
- Human In The Loop GateAdd a human approval checkpoint to an agent so it pauses before a risky or irreversible action (spending money, deleting data, sending messages, merging code) and resumes only after a human approves. Use when an agent acts autonomously on consequential operations.
- Defending Against Prompt Injection: A Practical Guide for LLM AppsPrompt injection can't be solved at the model layer — so you defend in depth: trust boundaries, least privilege, human approval, guardrails, and red-teaming.
- Structured Output vs JSON Mode vs Function Calling: Which to Use in 2026The reliable ways to get typed data out of an LLM — what JSON mode, function calling, and native structured outputs each guarantee, and when to use which.
- Add Human Approval StepScaffold a human-in-the-loop approval gate into an agent so it pauses before a consequential action and resumes after approval.