Arize Phoenix
An open-source LLM observability and evaluation tool built on OpenTelemetry, runnable anywhere.
Arize Phoenix is an open-source observability and evaluation tool for LLM applications. Built on OpenTelemetry and the OpenInference tracing standard, it captures the full trace of a run and lets you evaluate outputs — and because it's open source and runs locally or self-hosted, your traces never have to leave your environment.
It is aimed at engineers who want vendor-neutral observability they can spin up in a notebook during development and self-host in production. Phoenix is the open-source companion to Arize's commercial platform, so you can start free and graduate to the managed product if you outgrow it.
Highlights
- OpenTelemetry-native tracing — instrument with open standards (OpenInference), avoiding lock-in to one vendor's SDK.
- Run anywhere — launch locally in a notebook for dev, or self-host for team/production use.
- Built-in evals — LLM-as-judge and other evaluators for relevance, hallucination, and RAG quality.
- RAG & agent debugging — inspect retrieval steps, tool calls, and the full span tree behind an answer.
- Framework-agnostic — works across common LLM and orchestration stacks via auto-instrumentation.
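To make the "span tree behind an answer" concrete, here is a minimal, standard-library sketch of how a trace can be modeled as parent-linked spans and printed as a tree. It is a toy model, not Phoenix's data model or API: the `Span` class, the `render_tree` helper, and the sample trace are hypothetical, and the span kinds are only meant to resemble the kinds of steps (chain, retriever, tool, LLM) a tracing UI would show.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One step in a run. Hypothetical toy type, not a Phoenix class."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    kind: str                 # illustrative kinds: CHAIN, RETRIEVER, TOOL, LLM
    name: str
    latency_ms: float


def render_tree(spans: List[Span]) -> List[str]:
    """Return one indented line per span, children nested under parents."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}[{span.kind}] {span.name} ({span.latency_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up trace for a single RAG-style question.
trace = [
    Span("1", None, "CHAIN", "answer_question", 1840.0),
    Span("2", "1", "RETRIEVER", "vector_search", 220.0),
    Span("3", "1", "TOOL", "lookup_order", 90.0),
    Span("4", "1", "LLM", "generate_answer", 1500.0),
]
tree_lines = render_tree(trace)
print("\n".join(tree_lines))
```

Reading a run this way is what makes debugging practical: a slow or wrong answer can be traced to the specific retrieval, tool, or model step responsible.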
In an AI-assisted workflow
import phoenix as px
px.launch_app()  # local UI for traces + evals
# auto-instrument your LLM/agent calls, then inspect spans and run evaluators

TIP
Because Phoenix speaks OpenTelemetry, the instrumentation you add is portable — you can ship the same traces to another OTel-compatible backend later without re-instrumenting.
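One way to picture that portability: with OTLP over HTTP, the destination is typically just a URL, so switching backends can be a configuration change rather than a code change. The sketch below uses only the standard library and follows the OpenTelemetry environment-variable convention (`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is used as given; `OTEL_EXPORTER_OTLP_ENDPOINT` is a base URL to which `/v1/traces` is appended). The helper name `resolve_traces_endpoint` is hypothetical, and the `http://localhost:6006` default is an assumption about a locally launched Phoenix instance; check your own deployment.

```python
import os
from typing import Mapping

# Assumption: a locally running Phoenix instance listening on port 6006.
PHOENIX_LOCAL_BASE = "http://localhost:6006"


def resolve_traces_endpoint(env: Mapping[str, str] = os.environ) -> str:
    """Pick the OTLP/HTTP traces URL from OTel-style environment variables."""
    # The signal-specific variable is used exactly as provided.
    explicit = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if explicit:
        return explicit
    # The generic variable is a base URL; the traces path is appended to it.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", PHOENIX_LOCAL_BASE)
    return base.rstrip("/") + "/v1/traces"


# Same instrumentation, different destinations (example.com hosts are placeholders).
print(resolve_traces_endpoint({}))
print(resolve_traces_endpoint({"OTEL_EXPORTER_OTLP_ENDPOINT": "https://collector.example.com"}))
```

The resolved URL would then be handed to whichever OTLP exporter your instrumentation uses; the instrumented application code itself does not change.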
Good to know
Phoenix is open source and free to self-host; you bring an LLM provider for judge-based evals. Arize also offers a managed platform for teams that want hosted scale and support. For a hosted-first open-source option, compare Langfuse; for the commercial LangChain-native option, LangSmith.
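"Bring your own LLM provider" for judge-based evals can be pictured as follows: the evaluator builds a grading prompt, sends it to whatever model you plug in, and maps the reply to a label and score. This is a generic, standard-library sketch of the LLM-as-judge pattern, not Phoenix's evaluator API. The template, the `judge_hallucination` function, the label names, and the `stub_judge` stand-in (a crude substring check used in place of a real model call) are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical grading prompt; a real evaluator would use a tuned template.
JUDGE_TEMPLATE = (
    "You are grading an answer against reference text.\n"
    "Question: {question}\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: factual or hallucinated."
)


def judge_hallucination(record: Dict[str, str],
                        call_llm: Callable[[str], str]) -> Dict[str, object]:
    """Ask the supplied judge model for a label and map it to a score."""
    prompt = JUDGE_TEMPLATE.format(**record)
    raw = call_llm(prompt).strip().lower()
    # Anything outside the allowed labels is flagged rather than guessed at.
    label = raw if raw in ("factual", "hallucinated") else "unparseable"
    score = {"factual": 1.0, "hallucinated": 0.0}.get(label)
    return {"label": label, "score": score}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real provider call: checks if the answer appears in the reference."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "factual" if fields["Answer"] in fields["Reference"] else "hallucinated"


good = {
    "question": "What is the refund window?",
    "reference": "Refunds are accepted within 30 days of purchase.",
    "answer": "within 30 days",
}
bad = dict(good, answer="within 90 days")

print(judge_hallucination(good, stub_judge))
print(judge_hallucination(bad, stub_judge))
```

Keeping the judge as a plain callable is the design point: swapping providers means swapping one function, while the prompt and the label-to-score mapping stay the same.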
Related
- Langfuse: An open-source LLM engineering platform for tracing, evals, prompt management, and metrics.
- LangSmith: LangChain's framework-agnostic platform for tracing, evaluating, and monitoring LLM apps.
- Best LLM & RAG Evaluation Tools in 2026: DeepEval vs RAGAS vs LangSmith vs Phoenix vs promptfoo. A decision guide to the LLM eval landscape: code-first frameworks vs. eval-and-observability platforms, open-source vs. hosted, and which fits your stack.
- LLM Observability Engineer: Use this agent to make a production LLM app observable by tracing every step, scoring live traffic with online evals, and monitoring quality, cost, and latency, so you can debug agent runs and catch regressions in production. Examples: "add tracing to our RAG/agent so we can debug bad answers", "set up online evals and cost/latency dashboards", "production quality is slipping and we're flying blind".
- AgentOps: Observability for AI agents, with session replay, cost and latency tracking, and debugging for multi-step runs.