vLLM ships with OpenTelemetry instrumentation built in, but wiring it up for production requires more than passing a single flag. Standard APM tells you a request was slow. It won't tell you whether the latency came from KV cache preemptions, scheduler queue pressure, a long prefill phase, or a decode bottleneck. Those distinctions require inference-specific signals: cache utilization, time to first token, preemption rate, queue depth.
This post covers how to collect those signals using the OTel Collector and Dash0 as the observability backend, what each signal means in practice, and how to use them for capacity planning and latency debugging. The full working example (a Docker Compose stack with a FastAPI RAG app, vLLM server, and OTel Collector) is in the dash0-examples repository.
## Why LLM inference observability is its own problem
A slow HTTP service and a slow LLM inference server look the same from the outside. Both show elevated p99 latency, timeouts, and degraded UX. The causes are completely different, and so are the fixes.
With a standard service, high latency usually points to an upstream dependency, a slow query, or resource saturation. You look at your spans, find the slow component, and fix it.
For vLLM, latency has distinct phases. Scheduling, prefill, and decode behave differently under load and require different tuning strategies. KV cache pressure causes preemption events that degrade throughput without showing up as errors. Time to first token and time per output token are independent metrics that can diverge significantly under batched workloads. Queue depth tells you whether you're approaching capacity before users start noticing.
None of that is visible without LLM-specific instrumentation, and none of it maps cleanly to standard HTTP or RPC semantics. vLLM's OpenTelemetry integration addresses this by exposing distributed traces for per-request visibility and a Prometheus-compatible metrics endpoint for the inference-specific signals you need for dashboards and alerting.
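Two of these signals, time to first token and time per output token, compose into end-to-end latency in a predictable way, which is what makes them useful capacity-planning inputs. Here is a rough model as a Python sketch; the function name and the numbers are illustrative, and whether TTFT includes queue time depends on the server's accounting, so treat this as a planning approximation rather than vLLM's exact bookkeeping:

```python
def expected_e2e_latency(time_in_queue: float, ttft: float,
                         tpot: float, output_tokens: int) -> float:
    """Approximate end-to-end latency for a streamed response.

    Rough model: e2e ~= queue wait + TTFT (which covers prefill)
    + (output_tokens - 1) * time-per-output-token for decode.
    """
    return time_in_queue + ttft + max(output_tokens - 1, 0) * tpot

# With 200 output tokens at 30 ms/token, decode dominates:
# ~5.97 s of decode vs 150 ms TTFT and 50 ms of queueing.
print(expected_e2e_latency(time_in_queue=0.05, ttft=0.15,
                           tpot=0.03, output_tokens=200))
```

The point of the model: for long generations, shaving TPOT matters far more than shaving TTFT, while for short interactive turns the opposite holds. That is why the two need to be tracked as separate signals.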
## What vLLM emits
### Traces
vLLM exports OTel spans for each inference request when the --otlp-traces-endpoint flag is set. OTel support is an optional dependency. Install it with pip install vllm[otel], which pulls in opentelemetry-sdk, opentelemetry-api, opentelemetry-exporter-otlp, and opentelemetry-semantic-conventions-ai (all >=1.26.0). Without it, the flag is silently ignored.
The span attributes are defined in vLLM's own SpanAttributes class in vllm/tracing/utils.py. The code comment there says it directly: these are "copied from OTel semantic conventions to avoid version conflicts." vLLM deliberately pins its own attribute names rather than tracking the evolving OTel GenAI semconv spec. This is intentional stability. It also means vLLM does not emit OTel span events at all. The tracing is entirely attribute-based and does not capture prompt or completion content.
#### Span attributes
| Attribute | What it tells you |
|---|---|
| gen_ai.request.model / gen_ai.response.model | What model handled the request |
| gen_ai.usage.prompt_tokens | Input token count, for cost and capacity tracking |
| gen_ai.usage.completion_tokens | Output token count |
| gen_ai.latency.e2e | Full request duration |
| gen_ai.latency.time_to_first_token | TTFT, the most user-visible latency signal for streaming applications |
| gen_ai.latency.time_in_queue | Time waiting before execution started; rising values signal capacity pressure before latency visibly degrades |
vLLM also emits a full set of internal latency breakdowns (gen_ai.latency.time_in_model_prefill, gen_ai.latency.time_in_model_decode, gen_ai.latency.time_in_model_forward, and others). These are useful for debugging individual slow traces but not needed in dashboards.
One naming difference to be aware of: vLLM uses gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens, not the input_tokens / output_tokens names in the current OTel GenAI semconv spec. The gen_ai.latency.* namespace is also vLLM-specific. These are the exact names you will query in Dash0.
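If you want to align vLLM's names with the current semconv names in your own queries or post-processing, a small mapping is enough. This normalization helper is a sketch of the idea, not something vLLM or Dash0 ships:

```python
# vLLM attribute name -> current OTel GenAI semconv name.
# Only the token-usage names differ; the gen_ai.latency.*
# namespace is vLLM-specific and has no semconv equivalent.
VLLM_TO_SEMCONV = {
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
}

def normalize_attributes(attrs: dict) -> dict:
    """Return a copy of span attributes with vLLM names mapped to semconv names."""
    return {VLLM_TO_SEMCONV.get(k, k): v for k, v in attrs.items()}

print(normalize_attributes({
    "gen_ai.usage.prompt_tokens": 42,
    "gen_ai.latency.e2e": 1.8,
}))
```

In practice you would apply a rename like this in a Collector transform processor rather than in application code, but the mapping itself is the same either way.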
#### Resource attributes
Resource attributes travel with every span exported from the process. Some are worth setting explicitly; others vLLM adds automatically:
| Attribute | Source |
|---|---|
| service.name | Set via OTEL_SERVICE_NAME. The primary attribute for filtering in Dash0 |
| service.version | Set via OTEL_SERVICE_VERSION |
| deployment.environment.name | Set via the Collector's resource processor |
| vllm.instrumenting_module_name | Added by vLLM automatically |
| vllm.process_id | Added by vLLM automatically |
| vllm.process_kind / vllm.process_name | Added automatically on GPU worker subprocesses |
Set OTEL_SERVICE_NAME, OTEL_SERVICE_VERSION, and deployment.environment.name explicitly. These are what you will filter by in Dash0. Note that deployment.environment.name is the current attribute name as of OTel semantic conventions 1.27; the older deployment.environment still works in most tools but is deprecated.
#### Trace propagation
vLLM reads W3C traceparent context from both HTTP request headers and environment variables. The environment variable path is how it propagates context to its own GPU worker subprocesses: when a request arrives, the main process injects traceparent into the environment before spawning workers, so all worker spans link back to the same trace. If your application code creates a span and injects into the outbound request headers, the vLLM span becomes a child of yours and the GPU worker spans become children of that, giving you a single trace covering the full request path.
### Metrics

| Metric | Type | What it tells you |
|---|---|---|
| vllm:e2e_request_latency_seconds | Histogram | Full request latency. Watch p95/p99; averages hide long-tail issues |
| vllm:time_to_first_token_seconds | Histogram | Time until the first token streams. Most important latency signal for interactive apps |
| vllm:time_per_output_token_seconds | Histogram | Per-token decode latency, useful for detecting decode-phase bottlenecks |
| vllm:inter_token_latency_seconds | Histogram | Time between consecutive tokens, directly user-visible in streaming UIs |
| vllm:prompt_tokens_total | Counter | Cumulative input tokens. Use rate() for tokens/second |
| vllm:generation_tokens_total | Counter | Cumulative output tokens. Primary capacity metric for GPU sizing |
| vllm:gpu_cache_usage_perc | Gauge | KV cache fill percentage. Approaching 1.0 means preemptions are imminent |
| vllm:num_requests_waiting | Gauge | Queue depth. Rising before latency degrades is your earliest capacity warning |
| vllm:num_preemptions_total | Counter | Requests evicted from memory. A growing rate correlated with high cache usage means your configuration needs adjustment |
| vllm:prefix_cache_hit_rate | Gauge | Prefix cache efficiency. A low rate is a tuning opportunity if you are serving repeated system prompts or RAG with shared context |
When the Collector scrapes the /metrics endpoint, Prometheus automatically adds job and instance labels. Additional context like environment, cluster, or namespace requires explicit configuration of the Collector's resource processor or the Kubernetes attributes processor.
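A quick way to sanity-check the scrape target before wiring up the Collector is to fetch `/metrics` yourself and filter the `vllm:*` samples. The stdlib-only sketch below parses a snippet of Prometheus text exposition format; the sample payload is illustrative, and this is deliberately not a full exposition-format parser:

```python
def parse_vllm_samples(exposition: str) -> dict[str, float]:
    """Extract vllm:* metric samples from Prometheus text exposition format.

    Good enough for eyeballing values; ignores HELP/TYPE comments and
    assumes no spaces inside label values.
    """
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line.startswith("vllm:"):
            continue  # skips comments and non-vLLM metrics
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

payload = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="facebook/opt-125m"} 3.0
vllm:gpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.87
"""
print(parse_vllm_samples(payload))
```

Pointing the same loop at `http://localhost:8000/metrics` (via `urllib.request`) confirms the endpoint is live and the metric names match what you expect before the Collector ever enters the picture.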
## The pipeline
The collection architecture keeps things OTel-native end to end. vLLM pushes traces to the OTel Collector via OTLP/gRPC. The Collector scrapes vLLM's /metrics endpoint using the Prometheus receiver. Both signals flow through the same pipeline and are exported to Dash0 over OTLP.
Why route through the Collector instead of exporting directly to Dash0?
Direct export works, but you lose the ability to enrich telemetry with resource attributes, filter or sample before sending, route signals to multiple backends, and handle retries independently of the application. The Collector decouples instrumentation from export policy. For development this matters less, but for any production deployment the Collector belongs in the pipeline.
Why does the Collector use both OTLP push and Prometheus pull?
vLLM sends traces via OTLP push: each request generates a span that is sent immediately. Metrics come from the Prometheus scrape endpoint, which requires periodic pull. The OTel Collector handles both natively. The prometheus receiver converts scraped metrics into OTel data points and forwards them through the same otlp/dash0 exporter as traces, so Dash0 receives everything over a single OTLP connection.
## Setup
The full working example is at dash0-examples/vllm. Here's a walkthrough of the key configuration pieces.
### vLLM
The vLLM Docker image (vllm/vllm-openai) requires an NVIDIA GPU. For testing without production hardware, a g4dn.xlarge EC2 instance (one T4 GPU) is sufficient to run facebook/opt-125m and validate the full telemetry pipeline. Without a GPU, you can still test the pipeline on CPU by removing the deploy.resources block from docker-compose.yml. Inference will be slow, but traces and metrics flow correctly.
Enable tracing by passing the OTLP endpoint:
```bash
vllm serve facebook/opt-125m \
  --otlp-traces-endpoint=http://otel-collector:4317
```
Set these environment variables alongside it:
```bash
OTEL_SERVICE_NAME=vllm-server
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
```
### OTel Collector configuration
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      global:
        scrape_interval: 15s
      scrape_configs:
        - job_name: vllm
          static_configs:
            - targets: ["vllm:8000"]
        - job_name: rag-app
          static_configs:
            - targets: ["rag-app:8001"]

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200
  otlp/dash0:
    endpoint: ${env:DASH0_ENDPOINT_OTLP_GRPC_HOSTNAME}:${env:DASH0_ENDPOINT_OTLP_GRPC_PORT}
    headers:
      Authorization: Bearer ${env:DASH0_AUTH_TOKEN}
      Dash0-Dataset: ${env:DASH0_DATASET}

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp/dash0]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [debug, otlp/dash0]
```
A few things worth noting here:
The batch processor reduces overhead and is fine for development. For production, the OpenTelemetry community recommends moving away from it: the batch processor acknowledges data before durably storing it, which means a Collector restart can silently drop spans. The alternative is exporter-level batching with persistent storage. See Why the OpenTelemetry Batch Processor Is Going Away Eventually for the details.
The debug exporter logs telemetry to stdout. The sampling configuration above prints the first 5 items then 1 in every 200, which is enough to confirm data is flowing during setup without flooding logs. Remove it for production.
The Prometheus scrape interval of 15 seconds is a reasonable default. vLLM's metrics endpoint updates on the scheduler cycle, so scraping more frequently than every 5 seconds adds Collector CPU overhead without giving you meaningfully fresher data.
### Instrumenting the application layer
If you are calling vLLM from application code (a RAG pipeline, an agent, or anything else), you need to propagate trace context so the vLLM span connects to your application span.
Use the OTel API directly in your application code. Configure the SDK via environment variables and let opentelemetry.trace.get_tracer() pick up the provider automatically. This keeps application code decoupled from SDK initialization, which is the same separation vLLM uses internally in vllm/tracing/__init__.py where it exposes a clean instrument decorator rather than leaking TracerProvider and BatchSpanProcessor to callers.
```python
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests

# SDK auto-configures from environment variables:
# OTEL_SERVICE_NAME=rag-app
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# OTEL_EXPORTER_OTLP_INSECURE=true
tracer = trace.get_tracer("rag-app")


def call_vllm(prompt: str) -> str:
    with tracer.start_as_current_span("rag.generate") as span:
        span.set_attribute("gen_ai.request.model", "facebook/opt-125m")
        span.set_attribute("gen_ai.request.max_tokens", 100)

        # Inject trace context into outgoing request headers
        headers = {}
        TraceContextTextMapPropagator().inject(headers)

        response = requests.post(
            "http://vllm:8000/v1/completions",
            headers=headers,
            json={"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 100},
        )
        result = response.json()

        usage = result.get("usage", {})
        span.set_attribute("gen_ai.usage.prompt_tokens", usage.get("prompt_tokens", 0))
        span.set_attribute("gen_ai.usage.completion_tokens", usage.get("completion_tokens", 0))

        return result["choices"][0]["text"]
```
TraceContextTextMapPropagator().inject(headers) serializes the current span context into the traceparent header. vLLM reads that header when it starts processing the request and creates its span as a child of yours. Without this step, you get two disconnected traces instead of one continuous waterfall.
The attribute names in the example above (gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens) match what vLLM emits. Using consistent names means both your application span and the vLLM span appear correctly in the same Dash0 GenAI views.
## What you see in Dash0
### Traces
A fully propagated trace from the RAG app through to vLLM looks like this in the waterfall view:
This single trace gives you answers that would otherwise require instrumenting each service separately. You can see whether your latency problem is in document retrieval or in model generation. A growing gap between rag.generate starting and llm_request starting indicates queue pressure before the request has even begun executing.
Clicking into the llm_request span reveals the full set of gen_ai.latency.* attributes vLLM populates on that span: gen_ai.latency.time_in_queue, gen_ai.latency.time_in_model_prefill, gen_ai.latency.time_in_model_decode, and others. These are span attributes, not child spans, so they do not appear as rows in the waterfall. They are the per-phase latency breakdown for that single inference request. A long time_in_queue means scheduler pressure, not a slow model. A long time_in_model_prefill means your prompt is large relative to your hardware.
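That per-phase breakdown lends itself to a simple triage rule: whichever phase dominates end-to-end time names the bottleneck. A hedged sketch follows; the attribute names match what vLLM emits, but the diagnosis labels and the function itself are my own, not part of vLLM or Dash0:

```python
def diagnose_slow_request(attrs: dict) -> str:
    """Name the dominant latency phase from vLLM's gen_ai.latency.* span attributes."""
    phases = {
        "scheduler queue pressure": attrs.get("gen_ai.latency.time_in_queue", 0.0),
        "prefill (large prompt)": attrs.get("gen_ai.latency.time_in_model_prefill", 0.0),
        "decode (long generation)": attrs.get("gen_ai.latency.time_in_model_decode", 0.0),
    }
    culprit, duration = max(phases.items(), key=lambda kv: kv[1])
    e2e = attrs.get("gen_ai.latency.e2e", sum(phases.values())) or 1.0
    return f"{culprit}: {duration:.2f}s ({duration / e2e:.0%} of e2e)"

# A request that spent most of its 4 s end-to-end waiting to be scheduled:
print(diagnose_slow_request({
    "gen_ai.latency.e2e": 4.0,
    "gen_ai.latency.time_in_queue": 2.9,
    "gen_ai.latency.time_in_model_prefill": 0.4,
    "gen_ai.latency.time_in_model_decode": 0.7,
}))
```

Running this over the attributes of your slowest traces (exported from Dash0 or pulled via its API) turns a pile of anonymous p99 outliers into a histogram of root causes.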
### Metrics in Dash0
Once the Prometheus receiver is scraping and the pipeline is running, vLLM metrics appear in Dash0 as standard OTel metrics, queryable alongside trace data from the same services.
The full vllm:* metric list is available as soon as the Collector starts scraping. Selecting any metric surfaces its description, available attributes, and a pre-built query for common aggregations.
For capacity planning: watch vllm:gpu_cache_usage_perc and vllm:num_requests_waiting together. Cache utilization above 90% combined with a non-zero wait queue is a reliable signal that you need more GPU memory or additional replicas.
For latency SLOs: use vllm:time_to_first_token_seconds at p95 for streaming applications and vllm:e2e_request_latency_seconds at p99 for non-streaming. Set alerts on both before they affect users. Dash0 supports Prometheus-format alert rules natively — see Configure Alert Checks.
For debugging latency spikes: check vllm:num_preemptions_total as a rate. Preemptions do not show up as errors. They show up as requests that suddenly take much longer because the scheduler had to evict in-flight KV cache state. If you are seeing unexplained p99 spikes and gpu_cache_usage_perc is high, the preemption rate is where to look.
For throughput monitoring: rate(vllm:generation_tokens_total[5m]) gives you tokens per second, the most direct measure of whether your serving configuration is efficiently utilizing available compute.
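These heuristics fold into a single alerting predicate. The thresholds below are the rules of thumb stated above (cache above 90%, non-empty queue, non-zero preemption rate); everything else, names included, is an illustrative sketch rather than a Dash0 alert rule:

```python
def capacity_warnings(gpu_cache_usage: float, requests_waiting: int,
                      preemption_rate: float) -> list[str]:
    """Apply the capacity rules of thumb to one scrape's worth of vLLM metrics.

    gpu_cache_usage: vllm:gpu_cache_usage_perc (0.0-1.0)
    requests_waiting: vllm:num_requests_waiting
    preemption_rate: rate of vllm:num_preemptions_total (events/sec)
    """
    warnings = []
    if gpu_cache_usage > 0.9 and requests_waiting > 0:
        warnings.append("KV cache nearly full with a backlog: add GPU memory or replicas")
    if gpu_cache_usage > 0.9 and preemption_rate > 0:
        warnings.append("active preemptions under cache pressure: expect p99 spikes")
    return warnings

print(capacity_warnings(gpu_cache_usage=0.93, requests_waiting=4, preemption_rate=0.2))
```

In production you would express the same conditions as Prometheus-format alert rules in Dash0 rather than polling in code, but the logic, and crucially the pairing of cache usage with queue depth and preemption rate, is the same.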
### Agent0
Agent0 can help you investigate and act on your vLLM telemetry directly. For example, you might notice that vLLM spans appear in the traces view but are grouped differently from your HTTP spans. Asking Agent0 "why does dash0.operation.name show Unknown operation for my vLLM spans?" gives you an immediate explanation: GenAI spans use a different attribute vocabulary than HTTP spans (gen_ai.* attributes rather than http.*), and you can add a custom operation naming rule in Dash0's settings to match them. Agent0 identifies the affected spans, explains why it happened, and tells you exactly what to configure.
You can also ask Agent0 to create a monitoring dashboard from your metric names directly. Giving it the list of vllm:* metrics you care about produces a working dashboard in seconds.
## Extending to agent pipelines
The same pipeline works when vLLM sits inside a multi-agent system. Each turn of a conversation with an agent may involve multiple tool calls and LLM calls. If each of those calls propagates trace context, they all become spans in the same trace, so a single user message produces one trace that shows everything the agent did to respond, with vLLM's inference spans as leaves in the tree.
The GenAI semantic conventions define the attributes you would want on those agent spans: gen_ai.operation.name for the type of operation, gen_ai.agent.name for the agent identity, gen_ai.tool.name for tool invocations. When your agent framework uses these attributes and vLLM emits its own gen_ai.* span attributes, both sets of spans live in the same namespace and are queryable together in Dash0.
The inference observability in this post is the foundation. Agent observability applies the same pattern one layer up. Dash0's Practical Guide to Agentic Observability covers how to extend this to full agent pipelines. For AI observability using higher-level SDKs rather than raw OTel, see the OpenLIT and OpenLLMetry integrations as complementary approaches.
## Running the example
Clone the repository and navigate to the vllm directory:
```bash
git clone https://github.com/dash0hq/dash0-examples.git
cd dash0-examples/vllm
```
Set your Dash0 credentials in the root .env file:
```bash
DASH0_AUTH_TOKEN=your_auth_token
DASH0_DATASET=default
DASH0_ENDPOINT_OTLP_GRPC_HOSTNAME=ingress.eu-west-1.aws.dash0.com
DASH0_ENDPOINT_OTLP_GRPC_PORT=4317
```
Start the stack (requires an NVIDIA GPU; see the vLLM section above for how to run on CPU):
```bash
docker compose up --build
```
Wait for vLLM to finish loading the model (~2–5 minutes on a T4), then send test requests:
```bash
python scripts/send-request.py
```
The README in the example directory covers prerequisites, the full expected output, and what to look for in Dash0 once data is flowing.
vLLM's built-in OTel support means the instrumentation cost is low. The configuration needed to connect it to a production-grade pipeline is small. The signals you get in return (trace-level visibility into inference phases and metric-level visibility into GPU utilization, cache pressure, and queue depth) are the ones that actually let you operate an LLM serving layer with confidence.
The full working example is at dash0-examples/vllm. Clone it, point it at your model, and see your inference layer become observable in minutes. Start your free Dash0 trial if you don't have an account yet.