What Is AI Observability?

AI observability explained: how it differs from LLM and ML observability, the signals each AI layer needs, and how OpenTelemetry ties it together.

"AI observability" gets used as a synonym for "LLM observability" almost everywhere you look, and that conflation will bite you the moment your stack contains anything other than a chatbot. A fraud-detection model drifting out of calibration, a recommendation pipeline trained on stale features, a retrieval step returning the wrong documents, an agent picking the wrong tool: these are all AI failures in production, and only some of them are LLM failures.

AI observability is the practice of collecting telemetry that explains how an AI system behaves end to end: the model, the data feeding it, the infrastructure running it, and the decisions it makes once real traffic hits. LLM observability is one slice of that. So is the older discipline of ML model monitoring.

This article maps what the full picture actually covers, where it overlaps with the observability you already run, and where AI introduces signals that your existing tooling was never built to capture.

AI observability vs LLM and ML observability

The terms get muddled because they describe overlapping concerns at different scopes. Three points on a spectrum are worth pulling apart.

Predictive ML observability is the original. If you run a classification, regression, ranking, or forecasting model, your failure modes are statistical: the input distribution shifts away from what the model trained on (data drift), the relationship between inputs and the target changes (concept drift), or the features computed at serving time differ from the ones computed during training (training-serving skew). None of these throw an error. The model keeps returning confident predictions that quietly get worse, and you often find out from a downstream business metric weeks later. The signals here are feature distributions, prediction distributions, and accuracy measured against delayed ground truth.
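To make data drift concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), which compares a training-time sample of a feature against a serving-time sample. The function name, the bin count, and the sample values are illustrative assumptions, not part of any particular monitoring product.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training sample and a serving sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Serving values outside the training range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: serving traffic has shifted toward the upper half.
training = [i / 100 for i in range(100)]
serving = [0.5 + i / 200 for i in range(100)]
drift_score = psi(training, serving)
```

A commonly cited rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, but the right threshold depends on the feature and the model, so treat it as a starting point rather than a standard.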

LLM observability is the GenAI-era slice. A large language model is stochastic, so the same prompt can produce different outputs, and a response can be fluent, fast, and completely wrong. The signals shift toward token usage and cost, prompt and response capture, latency broken down by reasoning step, and semantic quality measures like relevance, faithfulness, and hallucination rate. We cover this in depth in our article on what LLM observability is.

Agentic observability is the newest and the hardest. An agent plans, retrieves, calls tools, feeds results back to a model, and loops until it decides it's done. A single user request can fan out into dozens of model calls and tool executions across multiple services. When it goes wrong, you need the full execution tree: which tool ran with which arguments, which branch the agent chose, where the run derailed.
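As a rough illustration of what "the full execution tree" means in practice, the sketch below rebuilds an indented tree from a flat list of spans using parent IDs. The dictionary shape (span_id, parent_id, name, start) and the span names are hypothetical stand-ins for whatever your tracing backend exports.

```python
from collections import defaultdict


def render_tree(spans: list) -> list:
    """Return one indented line per span, ordered by start time within each parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth: int) -> None:
        for span in sorted(children[parent_id], key=lambda s: s["start"]):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Hypothetical agent run, deliberately out of order as exported spans often are.
example_spans = [
    {"span_id": "4", "parent_id": "3", "name": "HTTP GET /orders", "start": 3},
    {"span_id": "2", "parent_id": "1", "name": "chat", "start": 1},
    {"span_id": "1", "parent_id": None, "name": "invoke_agent support_bot", "start": 0},
    {"span_id": "3", "parent_id": "1", "name": "execute_tool search_orders", "start": 2},
]
tree_lines = render_tree(example_spans)
```

Printing the returned lines shows the agent invocation at the root, the model call and tool execution beneath it, and the HTTP request nested under the tool that issued it.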

This matters because real systems mix all three. A modern product might route a request through a classifier, hand it to an LLM, let an agent call three tools, and pull context from a vector database, all in one trace. Watch only the LLM layer and you're blind to the drift in the classifier and the stale embeddings in the retriever.

What each layer actually needs

The practical scope is the entire path a request takes through your AI system. Some of these layers are served well by the observability you already run; others need signals that don't exist in a traditional Application Performance Monitoring (APM) tool.

Infrastructure is the most familiar ground. AI workloads run on the same compute, network, and storage as everything else, plus GPUs. GPU utilization, memory pressure, inference queue depth, and cold-start latency on serverless model endpoints are the metrics to watch, and standard infrastructure monitoring handles most of it.

The model and serving layer is where the request hits a model. For a self-hosted model that means request latency, throughput, error rates, and saturation of the serving runtime. For a managed provider like OpenAI, Anthropic, or Bedrock, you're observing a black box you don't control, so the signals you can capture are latency, token counts, rate-limit responses, and provider-side errors. Instrumenting the call yourself matters here: the provider won't tell you when their model quietly changes behavior.

The data layer is the one teams most often skip, and it's where the expensive failures hide. For predictive models this is feature freshness and distribution drift. For Retrieval-Augmented Generation (RAG) systems it's embedding quality, retrieval relevance, and how many documents you're pulling into context. A retriever that starts returning subtly worse matches degrades answer quality with zero infrastructure symptoms. Nothing alerts. Users just get worse answers.
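One inexpensive way to catch a retriever that is quietly getting worse is to compare recent top-match similarity scores against a baseline window. The sketch below assumes you log a positive similarity score for the best match of each query; the function name and the 15% threshold are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def retrieval_degraded(
    baseline_scores: Sequence[float],
    recent_scores: Sequence[float],
    max_relative_drop: float = 0.15,
) -> bool:
    """True when the mean top-match score fell by more than the allowed fraction."""
    baseline = mean(baseline_scores)
    recent = mean(recent_scores)
    # Assumes positive similarity scores, so the baseline mean is non-zero.
    return (baseline - recent) / baseline > max_relative_drop
```

For example, a baseline averaging 0.8 against a recent window averaging 0.6 is a 25% drop and would be flagged. Similarity scores are only a proxy for relevance, so a check like this complements sampled relevance evaluation rather than replacing it.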

The orchestration and agentic layer is the control flow: prompt and response pairs, retries, tool-call timing, and the decision branches an agent takes. This is where distributed tracing becomes load-bearing in AI systems, because an agent run only makes sense when you can read it as a single trace with one span per step.

The application layer is where users actually experience the system, and it's the cheapest quality signal you have. A thumbs-up/thumbs-down ratio, correction rates, and abandonment give you a feedback loop you can alert on long before a formal evaluation pipeline is in place.
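A feedback signal like this needs very little machinery. The following sketch keeps a rolling window of votes and reports when the thumbs-down share crosses a threshold; the class name, window size, minimum vote count, and threshold are all hypothetical defaults to tune for your traffic.

```python
from collections import deque


class FeedbackMonitor:
    """Rolling thumbs-up/thumbs-down tracker with a simple alert condition."""

    def __init__(self, window: int = 200, min_votes: int = 20,
                 max_negative_ratio: float = 0.3) -> None:
        self.votes = deque(maxlen=window)  # oldest votes fall off automatically
        self.min_votes = min_votes
        self.max_negative_ratio = max_negative_ratio

    def record(self, thumbs_up: bool) -> bool:
        """Store one vote and return True if the alert condition is met."""
        self.votes.append(thumbs_up)
        if len(self.votes) < self.min_votes:
            return False  # too few votes to say anything meaningful
        negative_ratio = self.votes.count(False) / len(self.votes)
        return negative_ratio > self.max_negative_ratio
```

The minimum vote count is there to stop a single early thumbs-down from paging anyone. In a real system you would more likely emit the ratio as a metric and alert on it in your observability backend.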

How OpenTelemetry ties it together

The temptation is to buy a separate AI-specific tool that captures prompts and token counts in its own proprietary schema, sitting next to the observability stack you already run. That leaves you correlating across two systems by hand every time an agent run touches a database and a downstream service, which is most of the time.

OpenTelemetry is the cleaner path. The GenAI semantic conventions, which started in early 2024, standardize AI telemetry under the gen_ai.* attribute namespace: a shared vocabulary for model calls, agent operations, and tool executions. A model call emits a span with attributes like gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Agent and tool steps set gen_ai.operation.name to values like invoke_agent or execute_tool. Because these are ordinary OpenTelemetry spans, your AI telemetry lands in the same trace as the HTTP request, the database query, and the cache lookup that surround it, correlated by trace context and queryable with the same tooling you already use.
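A minimal sketch of what setting those attributes looks like, assuming a span object that exposes set_attribute the way OpenTelemetry spans do. The helper name record_model_call, the RecordingSpan stand-in, and the model name are hypothetical and exist only so the example runs without an SDK installed.

```python
from typing import Any, Dict


class RecordingSpan:
    """In-memory stand-in for a span; real code would use an OpenTelemetry span."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def record_model_call(span: Any, *, model: str, input_tokens: int,
                      output_tokens: int, operation: str = "chat") -> None:
    """Attach GenAI semantic-convention attributes to a span for one model call."""
    span.set_attribute("gen_ai.operation.name", operation)
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", output_tokens)


# Hypothetical values for one call.
span = RecordingSpan()
record_model_call(span, model="example-model", input_tokens=812, output_tokens=164)
```

With the opentelemetry-api package installed, the same helper should work on a span obtained from trace.get_tracer(__name__).start_as_current_span(...), since those spans also expose set_attribute. In many setups an auto-instrumentation library sets these attributes for you, so check what yours emits before adding them by hand.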

One caveat: as of mid-2026 the GenAI conventions are still marked as in development rather than stable. The attribute names are settled enough for production use and already supported by major backends, but parts of the spec covering agent-to-agent communication and multi-session tracking are still being finalized. The specific pitfalls around schema movement are covered in the LLM observability FAQ.

Things that go wrong

Two failure modes show up more than any others.

The first is standing up a beautiful prompt-and-token dashboard for the chatbot while the fraud model or recommender runs with no drift detection at all. The predictive model fails statistically and silently, the dashboards stay green, and the first signal is often a downstream business metric, such as a revenue dip or a drop in conversion, rather than anything in your observability tooling. If you have classical models in production, they need distribution monitoring regardless of what else you're watching.

The second is watching outputs but not the data layer. Training-serving skew and stale features produce a model that looks healthy on every operational metric while making worse decisions every day. The same is true of a RAG retriever silently returning weaker matches. Output quality is downstream of data quality, and that relationship doesn't show up in latency or error rates.

Worth keeping straight as well: there's a difference between monitoring and observability. A dashboard that alerts when latency crosses a threshold is monitoring. Observability is the ability to answer a question you didn't predict in advance, like why this specific agent run cost ten times the median, which requires capturing enough span-level context to reconstruct any individual execution after the fact. For probabilistic systems where you can't enumerate the failure modes ahead of time, that reconstruction ability is the point.
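The "ten times the median" question is a good example of something you can only answer if token counts were captured per span. The sketch below sums a cost per trace and returns the runs above a multiple of the median. The span dictionary shape and the prices are hypothetical; substitute your provider's actual rates.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"input": 0.005, "output": 0.015}


def cost_outliers(spans: List[dict], factor: float = 10.0) -> Dict[str, float]:
    """Return {trace_id: cost} for runs costing more than factor x the median run."""
    cost_by_trace = defaultdict(float)
    for span in spans:
        cost_by_trace[span["trace_id"]] += (
            span["input_tokens"] / 1000 * PRICE_PER_1K["input"]
            + span["output_tokens"] / 1000 * PRICE_PER_1K["output"]
        )
    if not cost_by_trace:
        return {}
    typical = median(cost_by_trace.values())
    return {t: c for t, c in cost_by_trace.items() if c > factor * typical}
```

Once an outlier trace ID is in hand, the span tree for that trace is what tells you whether the cost came from a retry loop, an oversized context, or an agent that kept calling tools.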

Final thoughts

The AI observability problem isn't a separate discipline to bolt on next to your existing stack. It's an extension of what you already do, applied to the parts of an AI system that traditional tooling never had to reason about: statistical drift in predictive models, semantic quality in generative ones, execution paths in agentic ones. Capturing all of it as OpenTelemetry keeps AI signals in the same pipeline as the rest of your traces, metrics, and logs.

Dash0 is an OpenTelemetry-native observability platform that ingests AI telemetry through the same pipeline as your infrastructure metrics, logs, and distributed traces. You can follow a request from a user's thumbs-down click through the agent's decision graph to the GPU it ran on, in one view. Agent0 can investigate what it finds there and act on it, with human review at each stage until you're confident enough to let it run further on its own. The agentic observability guide shows what that looks like in practice. Start a free trial to see your AI traces, token usage, and agent spans alongside the rest of your stack. No credit card required.