Optimizing Metric Query Performance

Fast, accurate metric queries are essential for effective observability. This guide explains how Dash0's metrics system works, why certain queries can be slow, and practical strategies to dramatically improve query performance—ordered from most to least impactful. Whether you're building dashboards, setting up alerts, or investigating incidents, these techniques will help you get answers faster.


Understanding how Dash0 calculates metrics

Dash0 provides two fundamentally different types of metrics, each with distinct performance characteristics that directly affect your query times.

Metrics pre-computed at the source: the fast path

When you send metrics directly to Dash0 using OpenTelemetry SDKs, the Prometheus receiver, or other metric-producing integrations, these values are pre-computed at the source. Querying pre-computed metrics is fast because Dash0 simply retrieves already-calculated values from optimized storage. These metrics enjoy 13-month retention and are ideal for long-term trend analysis, capacity planning, and SLO tracking.
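
For example, a query over a pre-computed counter, such as the HTTP server duration count used later in this guide, only reads stored values:

# Pre-computed metric: the query reads stored values instead of scanning raw telemetry
sum by(service_name) (rate(http_server_request_duration_seconds_count[5m]))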

Synthetic metrics: flexibility with a performance tradeoff

Dash0's synthetic metrics — dash0.spans, dash0.logs, dash0.spans.duration, and dash0.span.events — work differently. Rather than pre-computing aggregations, Dash0 calculates these metrics on-the-fly at query time by scanning raw span and log data stored in our database. This approach provides remarkable flexibility: you can filter, group, and aggregate by any attribute without defining metrics upfront, and you pay only for raw telemetry ingestion.
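
As a sketch of that flexibility (assuming your spans carry an http.route attribute), the following query aggregates raw spans by a span attribute that was never declared as a metric label:

# Computed at query time by scanning raw spans; no metric had to be defined upfront
sum by(http_route) (rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}[5m]))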

The tradeoff is performance. Every synthetic metric query must scan the underlying raw data, meaning longer time ranges require scanning more data, which takes more time. Queries over recent data (last 24 hours) typically perform well since this data resides in fast local storage. Older data lives in S3-backed storage, which adds retrieval latency even with query caching. Additionally, synthetic span and log metrics have a 30-day retention limit, compared to 13 months for regular metrics.

PromQL query types and when to use each

PromQL supports two execution modes that affect both what data you get and how quickly you get it. Understanding this distinction helps you choose the right approach for each visualization.

Instant queries return a single point in time

An instant query evaluates your PromQL expression at one specific timestamp and returns a single result per matching time series. Crucially, you can still use range selectors within instant queries—the "instant" refers to when the expression is evaluated, not the data window it considers.

# Instant query with a range selector - evaluates once, returns one value per series
rate(http_requests_total[5m])

This query calculates the per-second rate over the last 5 minutes, but only at one point in time. Instant queries are fast because they perform one evaluation regardless of how much historical data you're viewing in your dashboard.

Range queries return a series of points over time

A range query evaluates your expression at multiple timestamps across a time range, essentially running many instant queries at regular intervals (the "step"). This produces the time series data needed for graphs.

Performance scales with (end_time - start_time) / step: more evaluation points mean more work. A 7-day query with 15-second steps requires 40,320 evaluations, while the same query with 5-minute steps requires only 2,016.

# Request rate over time (range query for graphing)
sum by(service_name) (rate({otel_metric_name="dash0.spans"}[$__rate_interval]))

The "last value" alternative

When you need a single recent value but want explicit control over the lookback window, use last_over_time():

last_over_time(http_requests_total[5m])

This function retrieves the most recent sample within the specified window, giving you precise control over staleness tolerance—useful for metrics with irregular scrape intervals.

Performance optimization strategies

These techniques are ordered from highest to lowest impact on query performance.

Filter by resource attributes first

This is the single most effective optimization. Dash0's storage is optimized for queries filtered by resource attributes. When you filter by attributes like service.name, the query engine can leverage internal indexing to scan far less data.

# Good: filters before any computation
sum(rate({otel_metric_name="dash0.spans", service_name="checkout-service"}[$__rate_interval]))

The most effective resource attributes for filtering include:

  • service.name: Almost always your first filter
  • k8s.namespace.name: Kubernetes namespace isolation
  • k8s.deployment.name: Workload-level filtering
  • deployment.environment.name: Separate production from staging
  • dash0.resource.name: Name filter that works for all resource types

Adding a service.name filter to a query scanning millions of spans can reduce execution time from seconds to milliseconds.

Note that in PromQL, dots in attribute names are replaced with underscores (e.g., k8s.deployment.name becomes k8s_deployment_name).
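
For example, a deployment-scoped variant of the earlier query would be written like this (the namespace value is a placeholder):

# k8s.namespace.name and k8s.deployment.name become k8s_namespace_name and k8s_deployment_name
sum by(k8s_deployment_name) (rate({otel_metric_name="dash0.spans", k8s_namespace_name="production"}[$__rate_interval]))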

Use OpenTelemetry attribute filters

Beyond resource attributes, filtering on span and log attributes reduces the volume of data processed. For log queries, otel.log.severity_range is particularly powerful:

# Only scan ERROR-level logs
sum by(k8s_deployment_name) (increase({otel_metric_name="dash0.logs", otel_log_severity_range="ERROR"}[$__rate_interval])) > 0

For span queries, otel.span.status_code quickly isolates errors:

# Error spans only
sum by(service_name) (rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}[$__rate_interval]))

Align range selectors with step intervals

When range selectors and chart step sizes are misaligned, you either miss data (if range < step) or perform redundant computation (if range > step). Proper alignment improves both accuracy and performance.

Use $__rate_interval for all rate() and increase() queries. This variable automatically calculates the optimal range based on your scrape interval and chart step size:

# Correct - adapts automatically to zoom level
rate(http_requests_total[$__rate_interval])

# Risky - may return "No data" when zoomed in
rate(http_requests_total[30s])

For avg_over_time() and similar aggregation functions, use $__interval directly since these functions don't have the same minimum-samples requirement as rate().
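
As a sketch, assuming a gauge such as system_memory_utilization is available:

# Averages the samples within each step; $__interval is sufficient here
avg_over_time(system_memory_utilization[$__interval])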

Manage metric cardinality

Cardinality—the total number of unique time series—grows multiplicatively with each label. A metric with 5 services × 10 endpoints × 3 methods × 12 histogram buckets creates 1,800 time series from just four labels. High cardinality strains memory, slows queries, and increases costs.

Avoid high-cardinality labels entirely:

Problematic label         Better alternative
user.id                   user.tier (free/premium)
request.id                Remove entirely
k8s.pod.uid               k8s.deployment.name
net.sock.peer.addr        cloud.region

Check your cardinality in Dash0:

Use the Metric Explorer in Dash0 to view cardinality information at a glance. For each metric, the explorer displays the number of unique time series (cardinality), total data points, and the number of resources contributing to that metric—helping you quickly identify problematic metrics before they impact query performance.
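
Alongside the Metric Explorer, a quick PromQL check can approximate the same numbers; this sketch uses the example counter from earlier:

# Number of active time series for this metric at the evaluation timestamp
count(http_requests_total)

# Series count per label value, to spot the label that drives cardinality
count by(service_name) (http_requests_total)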

Materialize frequently-used calculations

If you repeatedly query the same expensive aggregation, consider emitting it as a pre-aggregated metric from your application. This shifts computation from query-time to ingest-time:

  • OpenTelemetry SDK metrics: Emit counters and histograms for key business metrics directly
  • Prometheus client libraries: Create pre-aggregated metrics at the source

For latency percentiles you check constantly, emitting a histogram metric from your service will always outperform calculating histogram_quantile() over synthetic span duration data.
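
As a rough sketch of the difference (assuming your service emits the OpenTelemetry http.server.request.duration histogram; the exact bucket labels on the synthetic duration metric may differ):

# Fast: p95 from a histogram pre-aggregated at the source
histogram_quantile(0.95, sum by(le) (rate(http_server_request_duration_seconds_bucket{service_name="checkout"}[$__rate_interval])))

# Slower over long ranges: p95 computed from raw span durations at query time (bucket labels assumed)
histogram_quantile(0.95, sum by(le) (rate({otel_metric_name="dash0.spans.duration", service_name="checkout"}[$__rate_interval])))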

Choose appropriate time ranges

Since synthetic metrics scan raw data proportional to the time range, narrowing your query window directly improves performance. Consider:

  • Dashboards: Use relative time ranges ("Last 6 hours") rather than fixed ranges spanning weeks
  • Alerts: Evaluate over short windows (1-5 minutes) where possible
  • Investigations: Start narrow, then widen only if needed

Data within the last 24 hours resides in fast local storage; older data requires object storage retrieval. Queries spanning object storage take longer even with caching.

Expectations for historical span and log queries

Engineers often expect to query weeks of span or log data as quickly as they'd query pre-aggregated metrics. Understanding why this isn't possible helps set realistic expectations.

Why large historical queries are slow

When you query dash0.spans or dash0.logs over a 7-day window, Dash0 must:

  1. Retrieve raw span/log records from object storage (S3)
  2. Scan potentially billions of individual records
  3. Filter by your label selectors
  4. Compute aggregations on-the-fly
  5. Return results

This is fundamentally different from querying pre-aggregated metrics, where the heavy lifting happened at ingest time. For a high-volume environment producing hundreds of millions of spans per day, a 7-day synthetic metric query can end up scanning billions of records.

What you can do about it

For real-time monitoring and alerting, synthetic metrics perform very well over short time ranges. A 5-minute rate() query filtered by service_name executes in milliseconds.
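
For example, a short-window error-ratio expression suitable for an alert might look like this (checkout-service is a placeholder):

# Fraction of spans with ERROR status over the last 5 minutes, for one service
  sum(rate({otel_metric_name="dash0.spans", service_name="checkout-service", otel_span_status_code="ERROR"}[5m]))
/
  sum(rate({otel_metric_name="dash0.spans", service_name="checkout-service"}[5m]))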

For historical analysis and trend visualization, emit pre-aggregated metrics:

# Fast: pre-aggregated metric, 13-month retention
rate(http_server_request_duration_seconds_count{service_name="checkout"}[5m])

# Slow over long ranges: synthetic metric scanning raw spans
rate({otel_metric_name="dash0.spans", service_name="checkout"}[5m])

For ad-hoc investigation, start with narrow time windows and specific filters. If you need to find when errors started spiking, query the last hour first with tight service and status filters, then widen gradually.
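
A sketch of such a starting point, counting error spans for one service over the last hour (the service name is a placeholder):

# Narrow first: one service, errors only, one hour of data
sum(increase({otel_metric_name="dash0.spans", service_name="checkout-service", otel_span_status_code="ERROR"}[1h]))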

Retention considerations

  • Spans, logs, span events: 30-day retention
  • Pre-aggregated metrics: 13-month retention

If you need year-over-year comparisons or long-term SLO tracking, pre-aggregated metrics are your only option.

Conclusion

Query performance in Dash0 depends primarily on what type of data you're querying and how much of it you're scanning. Synthetic metrics provide unmatched flexibility for real-time analysis but require scanning raw telemetry at query time. Pre-aggregated metrics sacrifice flexibility for speed and retention.

The most impactful optimizations are filtering by resource attributes (especially service_name), using proper interval variables ($__rate_interval), and choosing pre-aggregated metrics for historical analysis. Managing cardinality and keeping query time ranges appropriate for your use case further improve the experience.

For dashboards that need both real-time detail and historical context, consider a hybrid approach: synthetic metrics for recent data with tight filters, and pre-aggregated metrics for trend lines and long-term views.


Last updated: December 15, 2025