
Apr 2, 2026

Improved Failed Checks Drilldown

When a check fails, the first thing you want to do is understand why. Until now, drilling down from a failed check into the underlying telemetry meant losing context: metric filters weren't carried over, and the charts you landed on showed largely unfiltered data, often irrelevant to your investigation. Dash0 now provides a seamless drilldown from failed checks directly into the relevant spans, logs, web events, and metrics, with filters automatically extracted from the check's PromQL expression.

How It Works

When you open a failed check's details view, Dash0 analyzes the PromQL expression behind the check and extracts the metric names and label filters. The drilldown then uses these filters to show you only the telemetry that matters:

  • Spans, Logs, and Web Events cards appear only when the check uses a related synthetic metric (e.g., dash0.spans or dash0.spans.duration), with extracted filters pre-applied for a precise view.
  • Metrics cards list all other metrics used in the check, each filtered to the relevant label matchers.
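To make the idea concrete, here is a deliberately simplified, hypothetical sketch of pulling equality matchers out of a single vector selector. (The names and approach here are illustrative only; the real implementation works on the parsed PromQL AST, not on regexes.)

```python
import re

# Hypothetical, simplified extractor: handles only one vector selector
# with equality matchers, e.g. {otel_metric_name="dash0.spans", k="v"}.
SELECTOR = re.compile(r'\{([^}]*)\}')
MATCHER = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def extract_filters(selector: str) -> dict:
    """Return the equality label matchers of one vector selector as a dict."""
    body = SELECTOR.search(selector)
    if body is None:
        return {}
    return dict(MATCHER.findall(body.group(1)))

filters = extract_filters('{otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}')
# filters == {"otel_metric_name": "dash0.spans", "otel_span_status_code": "ERROR"}
```

The extracted `otel_metric_name` decides which card to show (spans, logs, web events, or a generic metrics card), and the remaining matchers become the pre-applied filters.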

Not every filter in a PromQL expression can be mapped to a telemetry query. Dash0 applies what it can and skips what it can't, so you always get the most relevant view possible. In practice, filters only need to be skipped for the most complex PromQL queries. (If you want to know more, hit Michele up and prepare for an entire discussion about symbolic model-checking on the PromQL AST. It is great fun!)

OK, but can you be more practical about it?

The following PromQL query computes the error span processing time per error log in the checkout service:

(
  sum(increase({otel_metric_name="dash0.spans", otel_span_status_code="ERROR"}[5m]))
  * histogram_quantile(0.95, sum by(le) (rate({otel_metric_name="dash0.spans.duration", otel_span_status_code="ERROR"}[5m])))
)
/
(
  sum(increase({otel_metric_name="dash0.logs", otel_log_severity_range="ERROR", service_name="checkoutservice"}[5m]))
  + 1
)

The metric shows how much error span processing time (at P95) occurs for each error log generated by the checkout service. This is useful for:

  • Identifying if checkout errors generate disproportionate tracing overhead
  • Correlating checkout service error logging patterns with distributed trace complexity
  • Detecting when checkout errors trigger cascading failures across services (high values = many error spans per checkout log)
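With made-up numbers, the shape of the formula is easy to see (all values below are purely illustrative, not taken from any real system):

```python
# Illustrative numbers for:
# (error span count * P95 error span duration) / (error log count + 1)
error_spans = 120      # sum(increase(dash0.spans{ERROR}[5m]))
p95_duration_s = 0.5   # histogram_quantile(0.95, ...) in seconds
error_logs = 9         # sum(increase(dash0.logs{ERROR, checkoutservice}[5m]))

value = (error_spans * p95_duration_s) / (error_logs + 1)
print(value)  # → 6.0, i.e. 6 seconds of P95-weighted error span time per error log
```

The `+ 1` in the denominator keeps the expression well-defined when the checkout service produced no error logs in the window.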

When looking at the Failed Check Details page for an alert using this query:

  • The spans shown are only those with status code ERROR, across all services.
  • The logs shown are only those with severity range ERROR from the checkoutservice.

Now, we could have picked a simpler PromQL expression, but where's the fun in that? You would have looked at it, assumed it somehow just extracted the filters from the time series, and phoned it in. In reality, the algorithm does a lot of work to handle arithmetic binary operators (+, -, *, /), logical operators (AND, OR, UNLESS), and more.
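To sketch why binary operators matter (again, hypothetical code, not Dash0's actual implementation): once the expression is parsed into an AST, every vector selector reachable under the binary and logical operators contributes its own set of matchers, and thus its own drilldown card:

```python
from dataclasses import dataclass

@dataclass
class Selector:   # leaf: a vector selector with its label matchers
    matchers: dict

@dataclass
class BinaryOp:   # inner node: +, -, *, /, AND, OR, UNLESS, ...
    op: str
    lhs: object
    rhs: object

def collect_selectors(node):
    """Gather the matchers of all vector selectors in an expression tree."""
    if isinstance(node, Selector):
        return [node.matchers]
    return collect_selectors(node.lhs) + collect_selectors(node.rhs)

# The example query's shape: (A * B) / C
tree = BinaryOp("/",
    BinaryOp("*",
        Selector({"otel_metric_name": "dash0.spans",
                  "otel_span_status_code": "ERROR"}),
        Selector({"otel_metric_name": "dash0.spans.duration",
                  "otel_span_status_code": "ERROR"})),
    Selector({"otel_metric_name": "dash0.logs",
              "otel_log_severity_range": "ERROR",
              "service_name": "checkoutservice"}))

for m in collect_selectors(tree):
    print(m["otel_metric_name"])
# → dash0.spans
# → dash0.spans.duration
# → dash0.logs
```

The hard part glossed over here is deciding which filters remain sound after aggregations and matching modifiers rewrite the label sets, which is where the AST analysis mentioned above comes in.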

Want to Try It?

Open the Failed Checks page, select a failed check, and explore the drilldown cards. Brush a time range in the timeline chart to jump directly into the relevant telemetry.

Kudos where kudos are due

The lion's share of the research that led to this breakthrough was done in collaboration with Julius Volz (who else would you rely on to know all the dark corners of PromQL?!).

Julius' "Understanding PromQL" course is a rite of passage for technical staff at Dash0, and we cannot endorse it enough!