Tracing (LLM)
LLM tracing records every step of a model-driven request — prompts, tool calls, retrievals, tokens, latency — so multi-step behavior is debuggable.
LLM tracing is the practice of recording the complete execution of a model-driven request (every prompt, response, tool call, retrieval, token count, and latency, structured as nested spans) so that systems whose behavior is probabilistic are at least inspectable.
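To make "structured as nested spans" concrete, here is a minimal sketch using only the Python standard library. It is not the API of any particular platform: the `Span` and `Tracer` names, the `kind` labels, and the attribute keys are all assumptions chosen for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Span:
    """One recorded step: a model call, tool call, retrieval, or wrapper."""
    name: str
    kind: str  # hypothetical labels: "chain", "llm", "tool", "retrieval"
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    latency_ms: float = 0.0
    error: Optional[str] = None


class Tracer:
    """Builds a span tree by keeping the currently open spans on a stack."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **attributes: Any) -> Iterator[Span]:
        current = Span(name=name, kind=kind, attributes=dict(attributes))
        if self._stack:
            # Nest under whichever span is open right now.
            self._stack[-1].children.append(current)
        else:
            self.root = current
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            current.error = repr(exc)
            raise
        finally:
            current.latency_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


# Usage: one request becomes one tree, mirroring the application's structure.
tracer = Tracer()
with tracer.span("answer_question", "chain", user_id="u-1"):
    with tracer.span("retrieve_docs", "retrieval", query="refund policy") as s:
        s.attributes["documents"] = ["doc-17"]
    with tracer.span("generate", "llm", prompt="Summarize the refund policy.",
                     input_tokens=412) as s:
        s.attributes["response"] = "Refunds are accepted within 30 days."
        s.attributes["output_tokens"] = 58
```

A stack of open spans is the simplest way to get nesting right in single-threaded code. Concurrent or async applications would need per-task context instead, which is one reason real instrumentation libraries are usually preferable to a hand-rolled tracer.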
It's distributed tracing adapted to a new failure surface. In LLM apps the bug is rarely an exception: it's a wrong retrieval at step 3, a malformed tool argument at step 7, a context that drifted. The trace is where those become visible, which makes reading it the first move in agent debugging. It also moonlights as the system's economic ledger (cost per request, per step, per user: the raw data of cost engineering) and as the quarry for eval datasets, since yesterday's traced failure is tomorrow's regression case.
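The "economic ledger" role can be sketched as a simple rollup over exported span records. Everything specific here is an assumption: the record fields (`trace_id`, `user_id`, `step`, token counts) are a hypothetical export shape, and the prices are placeholders rather than any provider's real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; real prices vary by model and provider.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

# Hypothetical flattened span records, one per model call.
spans = [
    {"trace_id": "t1", "user_id": "u1", "step": "plan",
     "input_tokens": 1000, "output_tokens": 200},
    {"trace_id": "t1", "user_id": "u1", "step": "answer",
     "input_tokens": 3000, "output_tokens": 400},
    {"trace_id": "t2", "user_id": "u2", "step": "answer",
     "input_tokens": 2000, "output_tokens": 100},
]


def span_cost(span: dict) -> float:
    """Cost of one model call in USD, from its token counts."""
    return (
        span["input_tokens"] * PRICE_PER_MTOK["input"]
        + span["output_tokens"] * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum span costs grouped by any recorded field."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += span_cost(record)
    return dict(totals)


per_request = cost_by(spans, "trace_id")  # cost per request
per_step = cost_by(spans, "step")         # cost per step
per_user = cost_by(spans, "user_id")      # cost per user
```

Because all three views come from the same records, cost per request, per step, and per user stay consistent with one another by construction.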
The tooling is mature: Langfuse and LangSmith lead the dedicated platforms (with Phoenix, Braintrust, and OpenTelemetry-native options around them), all converging on the same model — instrument once, then debug, monitor, and evaluate from the same captured truth. The production discipline this enables — tracing every step, scoring live traffic — is the llm-observability-engineer's whole brief.
Frequently asked questions
- What does an LLM trace actually contain?
  The full request tree: each model call with its exact prompt and response, every tool invocation with arguments and results, retrieval steps with what was fetched, token counts and cost per step, latency per span, and errors, all nested to mirror the application's structure (a trace contains spans; an agent run contains its tool-call spans).
- Why is tracing non-negotiable for agents?
  Because agent failures hide in the middle of multi-step runs: without the trace you see 'the answer was wrong'; with it you see step 7 retrieved the wrong document and everything after faithfully built on the mistake. Debugging, cost attribution, and eval-case mining all start from the same trace data.
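As a rough illustration of finding the step where a run went wrong, the sketch below walks a trace tree in execution order and returns the first span that fails a check. The nested-dict span shape, the document IDs, and the `retrieval_hit` check are all hypothetical; in practice the check would come from a reviewer's judgment or an eval.

```python
from typing import Callable, Iterator, Optional

# Hypothetical span shape: {"name", "kind", "input", "output", "children"}
Span = dict


def walk(span: Span, depth: int = 0) -> Iterator[tuple[int, Span]]:
    """Yield (depth, span) in execution order: parent first, children left to right."""
    yield depth, span
    for child in span.get("children", []):
        yield from walk(child, depth + 1)


def first_failure(trace: Span, is_ok: Callable[[Span], bool]) -> Optional[Span]:
    """Return the earliest span that fails the check, or None if every span passes."""
    for _, span in walk(trace):
        if not is_ok(span):
            return span
    return None


trace = {
    "name": "agent_run", "kind": "chain",
    "input": "What is our refund window?", "output": "14 days",
    "children": [
        {"name": "search_kb", "kind": "retrieval",
         "input": "refund window", "output": ["doc-shipping"]},
        {"name": "draft_answer", "kind": "llm",
         "input": "context: doc-shipping", "output": "14 days"},
    ],
}


def retrieval_hit(span: Span) -> bool:
    """Hypothetical check: retrieval steps must fetch the refund document."""
    if span["kind"] != "retrieval":
        return True
    return "doc-refunds" in span["output"]


culprit = first_failure(trace, retrieval_hit)

# The failing step doubles as a regression case for an eval dataset.
eval_case = {"input": culprit["input"], "expected_document": "doc-refunds"}
```

Here the final answer looks like a model error, but the walk points at `search_kb`: the model faithfully summarized the wrong document.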
Related
- Langfuse vs LangSmith: LLM Observability Compared (2026)
  Langfuse vs LangSmith: open-source self-hostable observability vs LangChain's first-party platform. Tracing, evals, prompt management, and which to adopt.
- LLM Observability Engineer
  Use this agent to make a production LLM app observable (tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency) so you can debug agent runs and catch regressions in production. Examples: "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".
- Why Your Agent Loops: Debugging AI Agents
  The recurring agent failure modes (loops, premature victory, tool misuse, context poisoning, scope creep), diagnosed by their signatures, with fixes.
- Write Evals for an LLM App: From Zero to a CI Gate
  How to evaluate an LLM feature: build a dataset, choose metrics, set a baseline, score offline, add an LLM judge, and gate CI so quality changes are measured.
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets
  A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.