Optimizing Metric Query Performance
Fast, accurate metric queries are essential for effective observability. This guide explains how Dash0's metrics system works, why certain queries can be slow, and practical strategies to dramatically improve query performance—ordered from most to least impactful. Whether you're building dashboards, setting up alerts, or investigating incidents, these techniques will help you get answers faster.
Understanding how Dash0 calculates metrics
Dash0 provides two fundamentally different types of metrics, each with distinct performance characteristics that directly affect your query times.
Metrics pre-computed at the source: the fast path
When you send metrics directly to Dash0 using OpenTelemetry SDKs, the Prometheus receiver, or other metric-producing integrations, these values are pre-computed at the source. Querying pre-computed metrics is fast because Dash0 simply retrieves already-calculated values from optimized storage. These metrics enjoy 13-month retention and are ideal for long-term trend analysis, capacity planning, and SLO tracking.
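For example, a capacity-planning query over a long range stays cheap against a pre-computed counter, since the aggregation already happened at ingest. A minimal sketch (the metric name is illustrative):

```
# 30-day growth in request volume from a pre-computed counter;
# fast because the values were aggregated at ingest time
sum by (service_name) (increase(http_requests_total[30d]))
```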
Synthetic metrics: flexibility with a performance tradeoff
Dash0's synthetic metrics — dash0.spans, dash0.logs, dash0.spans.duration, and dash0.span.events — work differently. Rather than pre-computing aggregations, Dash0 calculates these metrics on-the-fly at query time by scanning raw span and log data stored in our database. This approach provides remarkable flexibility: you can filter, group, and aggregate by any attribute without defining metrics upfront, and you pay only for raw telemetry ingestion.
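As a sketch of that flexibility, the query below breaks span throughput down by an arbitrary attribute without any metric having been defined upfront (http_response_status_code is a hypothetical attribute label):

```
# Ad-hoc breakdown by an arbitrary span attribute, no metric defined in advance
sum by (http_response_status_code) (
  rate({otel_metric_name="dash0.spans", service_name="checkout-service"}[5m])
)
```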
The tradeoff is performance. Every synthetic metric query must scan the underlying raw data, meaning longer time ranges require scanning more data, which takes more time. Queries over recent data (last 24 hours) typically perform well since this data resides in fast local storage. Older data lives in S3-backed storage, which adds retrieval latency even with query caching. Additionally, synthetic span and log metrics have a 30-day retention limit, compared to 13 months for regular metrics.
PromQL query types and when to use each
PromQL supports two execution modes that affect both what data you get and how quickly you get it. Understanding this distinction helps you choose the right approach for each visualization.
Instant queries return a single point in time
An instant query evaluates your PromQL expression at one specific timestamp and returns a single result per matching time series. Crucially, you can still use range selectors within instant queries—the "instant" refers to when the expression is evaluated, not the data window it considers.
```
# Instant query with a range selector - evaluates once, returns one value per series
rate(http_requests_total[5m])
```
This query calculates the per-second rate over the last 5 minutes, but only at one point in time. Instant queries are fast because they perform one evaluation regardless of how much historical data you're viewing in your dashboard.
Range queries show trends over time
A range query evaluates your expression at multiple timestamps across a time range, essentially running many instant queries at regular intervals (the "step"). This produces the time series data needed for graphs.
Performance scales with (end_time - start_time) / step—more evaluation points means more work. A 7-day query with 15-second steps requires 40,320 evaluations, while the same query with 5-minute steps requires only 2,016.
```
# Request rate over time (range query for graphing)
sum by (service_name) (rate({otel_metric_name="dash0.spans"}[$__rate_interval]))
```
The "last value" alternative
When you need a single recent value but want explicit control over the lookback window, use last_over_time():
```
last_over_time(http_requests_total[5m])
```
This function retrieves the most recent sample within the specified window, giving you precise control over staleness tolerance—useful for metrics with irregular scrape intervals.
Performance optimization strategies
These techniques are ordered from highest to lowest impact on query performance.
Filter by resource attributes first
This is the single most effective optimization. Dash0's storage is optimized for queries filtered by resource attributes. When you filter by attributes like service.name, the query engine can leverage internal indexing to scan far less data.
```
# Good: filters before any computation
sum(rate({otel_metric_name="dash0.spans", service_name="checkout-service"}[$__rate_interval]))
```
The most effective resource attributes for filtering include:
- service.name: Almost always your first filter
- k8s.namespace.name: Kubernetes namespace isolation
- k8s.deployment.name: Workload-level filtering
- deployment.environment.name: Separate production from staging
- dash0.resource.name: Name filter that works for all resource types
Adding a service.name filter to a query scanning millions of spans can reduce execution time from seconds to milliseconds.
Note that in PromQL, dots in attribute names are replaced with underscores (e.g., k8s.deployment.name becomes k8s_deployment_name).
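A sketch combining several of these filters, with the dot-to-underscore translation applied (the namespace and environment values are illustrative):

```
# Resource-attribute filters; dots in attribute names become underscores in PromQL
sum by (k8s_deployment_name) (
  rate({otel_metric_name="dash0.spans",
        k8s_namespace_name="checkout",
        deployment_environment_name="production"}[$__rate_interval])
)
```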
Use OpenTelemetry attribute filters
Beyond resource attributes, filtering on span and log attributes reduces the volume of data processed. For log queries, otel.log.severity_range is particularly powerful:
```
# Only scan ERROR-level logs
sum by (k8s_deployment_name) (increase({otel_metric_name="dash0.logs", otel_log_severity_range="ERROR"}[$__interval])) > 0
```
For span queries, otel.span.status_code quickly isolates errors:
```
# Error spans only
sum by (service_name) (rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}[$__rate_interval]))
```
Align range selectors with step intervals
When range selectors and chart step sizes are misaligned, you either miss data (if range < step) or perform redundant computation (if range > step). Proper alignment improves both accuracy and performance.
Use $__rate_interval for all rate() and increase() queries. This variable automatically calculates the optimal range based on your scrape interval and chart step size:
```
# Correct - adapts automatically to zoom level
rate(http_requests_total[$__rate_interval])

# Risky - may return "No data" when zoomed in
rate(http_requests_total[30s])
```
For avg_over_time() and similar aggregation functions, use $__interval directly since these functions don't have the same minimum-samples requirement as rate().
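A minimal sketch, assuming an illustrative gauge metric:

```
# Aggregation over each chart step; $__interval matches the step width
# (process_memory_usage_bytes is an illustrative gauge name)
avg_over_time(process_memory_usage_bytes[$__interval])
```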
Manage metric cardinality
Cardinality—the total number of unique time series—grows multiplicatively with each label. A metric with 5 services × 10 endpoints × 3 methods × 12 histogram buckets creates 1,800 time series from just four labels. High cardinality strains memory, slows queries, and increases costs.
Avoid high-cardinality labels entirely:
| Problematic label | Better alternative |
|---|---|
| user.id | user.tier (free/premium) |
| request.id | Remove entirely |
| k8s.pod.uid | k8s.deployment.name |
| net.sock.peer.addr | cloud.region |
Check your cardinality in Dash0:
Use the Metric Explorer in Dash0 to view cardinality information at a glance. For each metric, the explorer displays the number of unique time series (cardinality), total data points, and the number of resources contributing to that metric—helping you quickly identify problematic metrics before they impact query performance.
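If you prefer to check from the query side, counting the series behind an instant vector gives a rough cardinality estimate; a sketch using the illustrative http_requests_total metric:

```
# Number of active series behind a metric right now (run as an instant query)
count(http_requests_total)

# Break the count down by a label to see which values drive cardinality
count by (service_name) (http_requests_total)
```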
Materialize frequently-used calculations
If you repeatedly query the same expensive aggregation, consider emitting it as a pre-aggregated metric from your application. This shifts computation from query-time to ingest-time:
- OpenTelemetry SDK metrics: Emit counters and histograms for key business metrics directly
- Prometheus client libraries: Create pre-aggregated metrics at the source
For latency percentiles you check constantly, emitting a histogram metric from your service will always outperform calculating histogram_quantile() over synthetic span duration data.
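As a sketch, assuming the service also exports the corresponding _bucket series for the histogram referenced later in this guide, the percentile query stays entirely on pre-aggregated data:

```
# p95 latency from a pre-aggregated histogram emitted by the service
histogram_quantile(
  0.95,
  sum by (le) (
    rate(http_server_request_duration_seconds_bucket{service_name="checkout"}[$__rate_interval])
  )
)
```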
Choose appropriate time ranges
Since synthetic metrics scan raw data proportional to the time range, narrowing your query window directly improves performance. Consider:
- Dashboards: Use relative time ranges ("Last 6 hours") rather than fixed ranges spanning weeks
- Alerts: Evaluate over short windows (1-5 minutes) where possible; see the example at the end of this section
- Investigations: Start narrow, then widen only if needed
Data within the last 24 hours resides in fast local storage; older data requires object storage retrieval. Queries spanning object storage take longer even with caching.
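For the alerting case mentioned above, a short-window expression keeps the scan small; the 5% threshold in this sketch is illustrative:

```
# Alert-style expression over a short window: error ratio per service above 5%
  sum by (service_name) (rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}[5m]))
/
  sum by (service_name) (rate({otel_metric_name="dash0.spans"}[5m]))
> 0.05
```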
Expectations for historical span and log queries
Engineers often expect to query weeks of span or log data as quickly as they'd query pre-aggregated metrics. Understanding why this isn't possible helps set realistic expectations.
Why large historical queries are slow
When you query dash0.spans or dash0.logs over a 7-day window, Dash0 must:
- Retrieve raw span/log records from object storage (S3)
- Scan potentially billions of individual records
- Filter by your label selectors
- Compute aggregations on-the-fly
- Return results
This is fundamentally different from querying pre-aggregated metrics, where the heavy lifting happened at ingest time. For a high-volume service producing millions of spans per day, a 7-day synthetic metric query might scan billions of records.
What you can do about it
For real-time monitoring and alerting, synthetic metrics perform very well over short time ranges. A 5-minute rate() query filtered by service_name executes in milliseconds.
For historical analysis and trend visualization, emit pre-aggregated metrics:
```
# Fast: pre-aggregated metric, 13-month retention
rate(http_server_request_duration_seconds_count{service_name="checkout"}[5m])

# Slow over long ranges: synthetic metric scanning raw spans
rate({otel_metric_name="dash0.spans", service_name="checkout"}[5m])
```
For ad-hoc investigation, start with narrow time windows and specific filters. If you need to find when errors started spiking, query the last hour first with tight service and status filters, then widen gradually.
Retention considerations
- Spans, logs, span events: 30-day retention
- Pre-aggregated metrics: 13-month retention
If you need year-over-year comparisons or long-term SLO tracking, pre-aggregated metrics are your only option.
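A sketch of a year-over-year comparison using offset, which only works when both evaluation points fall within the 13-month retention of pre-aggregated metrics (the metric name is illustrative):

```
# Current request rate compared to the same hour one year ago
  sum(rate(http_requests_total[1h]))
/
  sum(rate(http_requests_total[1h] offset 365d))
```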
Conclusion
Query performance in Dash0 depends primarily on what type of data you're querying and how much of it you're scanning. Synthetic metrics provide unmatched flexibility for real-time analysis but require scanning raw telemetry at query time. Pre-aggregated metrics sacrifice flexibility for speed and retention.
The most impactful optimizations are filtering by resource attributes (especially service_name), using proper interval variables ($__rate_interval), and choosing pre-aggregated metrics for historical analysis. Managing cardinality and keeping query time ranges appropriate for your use case further improve the experience.
For dashboards that need both real-time detail and historical context, consider a hybrid approach: synthetic metrics for recent data with tight filters, and pre-aggregated metrics for trend lines and long-term views.
Last updated: December 15, 2025