Open-source AI observability platform that ships tracing, evaluation, dataset versioning, a prompt playground, and an MCP server for unified access to these capabilities.
https://github.com/Arize-ai/phoenix

Stop juggling multiple dashboards to debug your AI applications. Phoenix's MCP server puts comprehensive AI observability, evaluation, and debugging tools directly in your development workflow, where they belong.
Building production AI applications means dealing with invisible failures, inconsistent outputs, and performance bottlenecks that traditional debugging tools can't catch. You're probably familiar with this workflow: check logs, examine traces in one tool, run evaluations in another, manage datasets somewhere else, then piece together what actually went wrong.
Phoenix's MCP server consolidates all of this into a single, unified interface that integrates directly with your existing tools.
Complete AI Observability: OpenTelemetry-based tracing that automatically captures your LLM calls, embeddings, and retrieval operations across 20+ frameworks (LangChain, LlamaIndex, OpenAI, Anthropic, etc.). No manual instrumentation of your application code required; a minimal one-time setup sketch follows this list.
Real-time Evaluation: Built-in evaluation tools that score response quality, toxicity, hallucinations, and retrieval relevance while your application runs. Skip the separate evaluation pipeline.
Dataset Management: Version your training data, evaluation sets, and production logs in one place. Track drift and quality degradation automatically.
Interactive Debugging: A prompt playground that lets you replay traced calls, adjust parameters, and compare model outputs side-by-side. Debug production issues without recreating complex application state.
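Here is a minimal setup sketch of what the tracing side looks like in TypeScript. It assumes the OpenInference JS instrumentation package (@arizeai/openinference-instrumentation-openai), OpenTelemetry JS SDK 1.x, and Phoenix's default local OTLP/HTTP collector on port 6006; verify the exact package names and endpoint against the Phoenix docs.

// One-time setup, run at process startup before your app loads the OpenAI SDK.
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";

// Send spans to the local Phoenix collector (assumed default: http://localhost:6006).
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new OTLPTraceExporter({ url: "http://localhost:6006/v1/traces" })
  )
);
provider.register();

// Patch the OpenAI SDK so chat, completion, and embedding calls emit spans.
registerInstrumentations({
  instrumentations: [new OpenAIInstrumentation()],
});

This runs once at startup; your application code itself does not change.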
Production Debugging: When your RAG system starts returning irrelevant results, Phoenix traces show you exactly which retrieval step failed and why. One developer reported finding a vector similarity threshold bug in 5 minutes that would have taken hours to track down manually.
Model Comparison: A/B test different LLMs by routing traffic through Phoenix's tracing layer. Compare response quality, latency, and cost across providers without changing your application code.
Evaluation Automation: Set up continuous evaluation of your AI pipeline. Phoenix automatically scores new responses against your criteria and alerts you when quality drops below thresholds.
Team Collaboration: Share traced conversations with your team. Product managers can see exactly what the AI produced, engineers can debug the underlying calls, and everyone works from the same data.
Phoenix's MCP server integrates with your existing development environment rather than forcing you to adopt new workflows:
// Your existing code stays the same
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await openai.chat.completions.create({
  messages: [{ role: "user", content: "Explain quantum computing" }],
  model: "gpt-4",
});

// Phoenix captures the call automatically; the MCP server exposes the
// resulting trace to your tools
The server runs locally during development, deploys to containers in staging, and scales to production without code changes. Your instrumentation works everywhere.
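As a sketch of what "without code changes" can look like, the collector endpoint can come from configuration rather than code. PHOENIX_COLLECTOR_ENDPOINT is used here as an assumed environment variable name; check the Phoenix docs for the exact convention.

// Point the exporter at whichever Phoenix instance the environment provides:
// localhost in development, a container hostname in staging, the managed
// deployment in production. Only the environment variable changes.
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";

const collectorBase =
  process.env.PHOENIX_COLLECTOR_ENDPOINT ?? "http://localhost:6006";

export const phoenixExporter = new OTLPTraceExporter({
  url: `${collectorBase}/v1/traces`,
});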
Zero Vendor Lock-in: Built on OpenTelemetry standards, so your trace data exports to any OTEL-compatible system (see the sketch after this list).
Framework Agnostic: Works with Python, JavaScript, LangChain, LlamaIndex, raw OpenAI calls, Anthropic, and more. Instrument what you're already using.
Production Ready: Used by teams shipping AI applications to millions of users. Handles high-volume tracing without impacting application performance.
Open Source: 6000+ GitHub stars, active community, and transparent development. No surprise licensing changes or feature restrictions.
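A hedged sketch of what the OpenTelemetry claim means in practice: the same trace pipeline can fan out to Phoenix and to any other OTLP endpoint. The second URL below is a placeholder standing in for whatever OTEL-compatible backend you already run, and the code again assumes OpenTelemetry JS SDK 1.x.

// Fan the same spans out to Phoenix and to another OTLP backend.
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";

const provider = new NodeTracerProvider();

// Phoenix's local OTLP/HTTP collector
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: "http://localhost:6006/v1/traces" })
  )
);

// Any other OTEL-compatible system receives identical trace data
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: "https://otel.example.com/v1/traces" })
  )
);

provider.register();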
The MCP server gives you programmatic access to all Phoenix capabilities through a standardized interface. Query traces, run evaluations, and manage datasets directly from your development tools or CI/CD pipelines.
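As an illustration, registering the server with an MCP-capable client is typically a matter of adding an entry like the one below to the client's MCP configuration. The package name and flags reflect Phoenix's published @arizeai/phoenix-mcp package at the time of writing; verify them against the Phoenix docs and substitute your own base URL and API key, which are placeholders here.

{
  "mcpServers": {
    "phoenix": {
      "command": "npx",
      "args": [
        "-y",
        "@arizeai/phoenix-mcp@latest",
        "--baseUrl",
        "http://localhost:6006",
        "--apiKey",
        "your-phoenix-api-key"
      ]
    }
  }
}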
Phoenix turns AI application debugging from a guessing game into a systematic process. Install it once, instrument your code, and get visibility into every aspect of your AI pipeline.