What Is LLM Observability?

LLM observability explained: why HTTP 200 isn't enough for AI apps, what signals to collect, how OTel GenAI conventions work, and common pitfalls to avoid.

Your LLM application returned a response in 800ms with a clean HTTP 200. Every infrastructure dashboard is green. The answer your user received is still wrong about something that matters.

Traditional Application Performance Monitoring (APM) tells you whether your service is alive and responsive. For conventional software, that's usually sufficient, because a function that runs without errors typically did the right thing. Large language models break that assumption. They're stochastic: the same input can produce a different output on each call, and a response can be factually incorrect, irrelevant, or harmful while looking operationally perfect.

LLM observability is the practice of collecting telemetry that covers what APM misses: what the model actually said, what it cost, how long each reasoning step took, and whether the output was useful. Those questions sit outside what traditional monitoring was built to answer.

Why traditional observability isn't enough

Standard observability is still necessary. Latency, error rates, and throughput all matter for LLM applications. But they're not sufficient.

With a conventional service, a slow p99 usually points to a slow database query or resource contention. You look at your traces, find the slow span, and fix it. With an LLM, the same elevated latency could mean the model is working through a long reasoning chain, a retrieval pipeline is fetching too many documents, or a tool call is timing out three steps into an agent workflow. Same symptom, completely different causes, completely different fixes.

Correctness is the harder problem. Status codes tell you nothing about it. A response that hallucinates a product feature, quotes an incorrect policy, or confidently answers the wrong question returns HTTP 200 every single time. You need a second observability layer, one that captures the semantic quality of outputs, not just their operational characteristics.

What LLM observability actually measures

LLM observability spans three layers. Most teams start by monitoring only one of them, and it's usually insufficient on its own.

The first layer is operational telemetry, which your existing observability already covers: latency per request, token throughput, error rates from provider APIs. Token usage deserves particular attention here because it maps directly to cost. An application that works correctly but sends 4× more input tokens than intended is burning money invisibly, and that burn won't show up anywhere except your provider bill.
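To make the cost arithmetic concrete, here is a minimal sketch of how token counts turn into spend. The per-million-token prices are hypothetical placeholders, not any provider's real rates, and request_cost is an illustrative helper rather than part of any library.

```python
# Hypothetical prices in USD per million tokens; real prices vary by provider and model.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


intended = request_cost(input_tokens=1_000, output_tokens=300)
bloated = request_cost(input_tokens=4_000, output_tokens=300)  # 4x the input tokens

print(f"intended: ${intended:.4f} per request")
print(f"bloated:  ${bloated:.4f} per request")
print(f"extra per 1M requests: ${(bloated - intended) * 1_000_000:,.0f}")
```

Both requests return the same answer with the same status code, so the difference only becomes visible once token counts are recorded per request.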

The second layer is semantic quality: measurements of what the model actually produced. Examples include relevance scores (did the response address the question?), faithfulness scores in Retrieval-Augmented Generation (RAG) workflows (did the model stick to the retrieved context or invent details?), hallucination detection, and user feedback. These can't come from infrastructure alone. They require evaluation frameworks that score outputs against some ground truth or rubric. This is the layer that tells you when your application is quietly degrading while operational metrics stay green, which is why it matters to get it in place before you need it.
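As a rough illustration of what a faithfulness score computes, the sketch below counts how many response sentences are mostly made of words found in the retrieved context. This is a deliberately crude word-overlap proxy, not how production evaluation frameworks work (those typically use a model or a rubric as the judge). The function name, the 0.6 threshold, and the sample texts are all assumptions for the example.

```python
import re


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounded_fraction(response: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of response sentences whose words mostly appear in the context."""
    context_words = _words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    grounded = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= threshold:
            grounded += 1
    return grounded / len(sentences)


context = "Refunds are available within 30 days of purchase with a receipt."
faithful = "Refunds are available within 30 days of purchase."
invented = "Refunds are available within 30 days. We also offer lifetime free repairs."

print(grounded_fraction(faithful, context))
print(grounded_fraction(invented, context))
```

The second response would return HTTP 200 like the first, but half of its sentences have no support in the retrieved context, and a score like this gives you a number to record and alert on.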

The third is agentic tracing: full execution trees for multi-step workflows. Modern AI products rarely issue a single LLM call. They plan, retrieve documents, call external tools, pass results back to the model, and repeat. When something breaks in that chain, you need span-level visibility into every step: which tool was called with which arguments, which reasoning branch went wrong, where token costs are accumulating. Without it, debugging agent failures is guesswork.
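One way to picture an execution tree is as nested steps that each carry their own token counts and error state. The sketch below uses a hypothetical Step dataclass (not an OpenTelemetry type) to show the two questions span-level data lets you answer: where tokens accumulate, and which step failed. The step names and numbers are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One node in an agent execution tree (illustrative, not an OTel type)."""
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    error: Optional[str] = None
    children: List["Step"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens used by this step plus everything beneath it."""
        own = self.input_tokens + self.output_tokens
        return own + sum(child.total_tokens() for child in self.children)

    def failures(self, path: str = "") -> List[str]:
        """Paths of all failed steps in this subtree, with their errors."""
        here = f"{path}/{self.name}"
        found = [f"{here}: {self.error}"] if self.error else []
        for child in self.children:
            found.extend(child.failures(here))
        return found


run = Step("invoke_agent support_bot", children=[
    Step("chat plan", input_tokens=900, output_tokens=120),
    Step("execute_tool search_docs", error="timeout after 10s"),
    Step("chat answer", input_tokens=2400, output_tokens=310),
])

print(run.total_tokens())
print(run.failures())
```

With only request-level metrics, this run is a single slow request. With the tree, the timed-out tool call and the expensive final answer are each visible as their own step.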

How OpenTelemetry standardizes LLM telemetry

For years, every LLM observability tool had its own proprietary schema. The OpenTelemetry GenAI semantic conventions, work on which started in early 2024, are standardizing this under the gen_ai.* attribute namespace: a shared vocabulary for LLM calls, agent steps, tool executions, and quality metrics that any OTel-compliant backend can understand.

When your application calls an LLM provider, the instrumentation emits a span with attributes like gen_ai.provider.name, gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Agent operations get their own span types: gen_ai.operation.name set to invoke_agent or execute_tool, with the agent identity in gen_ai.agent.name. Your LLM traces live in the same distributed trace as the rest of your application, correlated by trace context, routed through the same OTel Collector, queryable with the same tooling you already use.
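The attribute sets described above can be sketched as plain dictionaries. The helper names are hypothetical, the model and agent names are placeholders, and the "chat" operation value is an assumption not stated in this article. Because the conventions are still evolving, check the spec version you target before relying on any of these keys.

```python
def llm_call_attributes(provider: str, model: str,
                        input_tokens: int, output_tokens: int) -> dict:
    """Attributes for a span that wraps a single LLM call."""
    return {
        "gen_ai.operation.name": "chat",  # assumed value for a chat completion call
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def agent_attributes(agent_name: str) -> dict:
    """Attributes for a span that wraps a whole agent invocation."""
    return {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": agent_name,
    }


attrs = llm_call_attributes("openai", "example-model", 1200, 85)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

In practice, auto-instrumentation usually sets these for you. If you instrument by hand with the OpenTelemetry Python API, a dictionary like this can be passed to a span's set_attributes method.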

As of mid-2026, most GenAI semantic conventions are in development status. The attribute names are usable in production and already natively supported by major observability platforms, but conventions around agent-to-agent communication and multi-session tracking are still being finalized. Use the current gen_ai.* attributes, and expect some schema movement over the next few releases.

Libraries like OpenLLMetry and the native OTel SDKs for Python, Node.js, and Java handle auto-instrumentation for the most common LLM frameworks and providers, so you don't need to write instrumentation from scratch.

Common pitfalls

The failure mode that catches most teams off guard is a fully green operational dashboard while users are getting bad responses. If you're only watching latency and error rates, you have no signal until complaints come in. Add at least basic quality evaluation early. Even simple thumbs-up/thumbs-down feedback gives you something to alert on, and that's a much better feedback loop than waiting for user reports.
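A minimal sketch of alerting on thumbs-up/thumbs-down feedback might look like the following. FeedbackMonitor, its window size, and its thresholds are illustrative assumptions. In a real setup you would more likely emit the votes as a metric and define the alert in your observability backend.

```python
from collections import deque


class FeedbackMonitor:
    """Tracks recent thumbs-up/down votes and flags a high negative rate."""

    def __init__(self, window: int = 50, max_negative_rate: float = 0.2,
                 min_samples: int = 10):
        self.votes = deque(maxlen=window)  # True = thumbs-up, False = thumbs-down
        self.max_negative_rate = max_negative_rate
        self.min_samples = min_samples

    def record(self, thumbs_up: bool) -> None:
        self.votes.append(thumbs_up)

    def negative_rate(self) -> float:
        if not self.votes:
            return 0.0
        return self.votes.count(False) / len(self.votes)

    def should_alert(self) -> bool:
        # Require a minimum sample size so a single early downvote doesn't alert.
        return (len(self.votes) >= self.min_samples
                and self.negative_rate() > self.max_negative_rate)


monitor = FeedbackMonitor(window=10, max_negative_rate=0.2, min_samples=5)
for vote in [True] * 8 + [False] * 5:
    monitor.record(vote)

print(monitor.negative_rate(), monitor.should_alert())
```

The fixed-size window matters: it keeps the rate focused on recent responses, so a regression shows up quickly instead of being averaged away by weeks of good history.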

Token creep in long conversations is a real cost trap. When a chatbot keeps prior conversation turns in memory, input token counts grow with every message. A session that sends 1,200 input tokens at turn one might send 4,000 by turn four. That growth is invisible unless you instrument gen_ai.usage.input_tokens across the session, and it maps directly to your provider bill. It's worth checking early.
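To show what instrumenting across the session means in practice, the sketch below groups per-call input token counts by a session identifier and flags sessions whose last turn is much larger than their first. The record layout, the session IDs, the intermediate token values, and the 3x threshold are assumptions for illustration. The 1,200 and 4,000 figures come from the example above.

```python
from collections import defaultdict

# (session_id, turn, input tokens), as might be read from gen_ai.usage.input_tokens
# on each LLM span. The session ID is a hypothetical attribute you would add yourself.
records = [
    ("s1", 1, 1200), ("s1", 2, 2100), ("s1", 3, 3050), ("s1", 4, 4000),
    ("s2", 1, 1150), ("s2", 2, 1300),
]


def input_tokens_by_session(records):
    """Group input token counts by session, ordered by turn."""
    sessions = defaultdict(list)
    for session_id, _turn, tokens in sorted(records):
        sessions[session_id].append(tokens)
    return dict(sessions)


def creep_report(records, max_growth: float = 3.0):
    """Summarize total input tokens and first-to-last-turn growth per session."""
    report = {}
    for session_id, tokens in input_tokens_by_session(records).items():
        growth = tokens[-1] / tokens[0]
        report[session_id] = {
            "total_input_tokens": sum(tokens),
            "growth": round(growth, 2),
            "flagged": growth >= max_growth,
        }
    return report


report = creep_report(records)
for session_id, summary in report.items():
    print(session_id, summary)
```

Note that the billed total for a session is the sum over all turns, not the size of the last turn, because the full history is resent on every call.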

In multi-step agent systems, trace context breaks across async boundaries more often than you'd expect. A tool call that spawns a goroutine or async task without propagating W3C trace context shows up as an orphaned trace rather than a child span. What should look like one coherent agent execution looks like five disconnected requests. This is much easier to get right at the start than to diagnose after the fact.
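The mechanics of a lost context can be shown with Python's standard library alone. The sketch below uses a contextvars variable as a stand-in for the active trace context. It is not the OpenTelemetry API, although the OpenTelemetry Python SDK is understood to keep its active context in contextvars by default, so the same failure pattern applies. All names here are illustrative.

```python
import contextvars
import threading

# Stand-in for the active trace context (illustrative; not the OTel API).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

seen = {}


def tool_call(label: str) -> None:
    """Record which trace ID this unit of work believes it belongs to."""
    seen[label] = current_trace_id.get()


def run_agent_step() -> None:
    current_trace_id.set("trace-abc123")

    # On standard CPython builds a new thread starts with an empty context,
    # so this worker does not see the trace ID set above.
    orphan = threading.Thread(target=tool_call, args=("orphaned",))

    # Copying the current context and running the worker inside it
    # carries the trace ID across the thread boundary.
    ctx = contextvars.copy_context()
    child = threading.Thread(target=ctx.run, args=(tool_call, "propagated"))

    for thread in (orphan, child):
        thread.start()
    for thread in (orphan, child):
        thread.join()


run_agent_step()
print(seen)
```

The worker that ran without the copied context is the orphaned trace from the paragraph above: the work happened, but nothing ties it back to the agent execution that caused it. Across process or network boundaries the same idea applies, except the context travels in W3C traceparent headers rather than in memory.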

The GenAI semantic conventions have gone through several revisions since 2024, and many instrumentation libraries are still emitting older attribute names from earlier spec versions. Querying across mixed-schema data without normalization produces confusing gaps: missing token counts, broken attribute filters. Check what version your instrumentation library emits before trusting the data.
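One way to handle mixed-schema data is to normalize attribute names before querying. The sketch below maps a few older names to their current equivalents. The renames listed reflect a best understanding of the spec's history and should be verified against the versions your libraries actually emit. The function name and sample span are hypothetical.

```python
# Older attribute names and their current equivalents (verify against the spec
# versions in use; this list is an assumption and is not exhaustive).
LEGACY_TO_CURRENT = {
    "gen_ai.system": "gen_ai.provider.name",
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
}


def normalize_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes using current names only."""
    normalized = {
        key: value for key, value in attributes.items()
        if key not in LEGACY_TO_CURRENT
    }
    for old_key, new_key in LEGACY_TO_CURRENT.items():
        if old_key in attributes:
            # If both names are present, keep the value under the current name.
            normalized.setdefault(new_key, attributes[old_key])
    return normalized


old_span = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.prompt_tokens": 1200,
    "gen_ai.usage.completion_tokens": 80,
}

normalized_span = normalize_attributes(old_span)
print(normalized_span)
```

This only renames keys. Some attribute values may also have changed between spec versions, so a production pipeline would likely do this in the OTel Collector and cover values as well.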

Final thoughts

Adding an LLM to your stack doesn't mean rebuilding your observability from scratch. You still need traces, metrics, and logs. What changes is scope: you also need coverage of semantic quality and agent execution paths that conventional tooling never had to care about.

OTel's GenAI conventions are the right foundation. Instrument against them now and your telemetry stays portable across backends and tools.

Dash0 is an OpenTelemetry-native observability platform that treats GenAI signals as first-class telemetry, so LLM traces, token metrics, and agent spans flow through the same pipeline as the rest of your stack. The vLLM observability guide and the guide to agentic observability in Dash0 show what that looks like in a real production setup.