Reference guide for Dash0's alerting model, an extension of the Prometheus alerting model.

Understanding the underlying mechanics helps when writing advanced check rules, debugging unexpected firing behavior, or importing existing Prometheus alerts.

Prometheus Foundation

Dash0 builds on the standard Prometheus alerting model and extends it with severity levels while remaining fully compatible. In the standard Prometheus model, an alerting rule contains a PromQL expression. The rule fires if that expression returns any results during evaluation; there is no built-in concept of severity. Each distinct result represents a separate firing instance.

Dash0 check rules are fully compatible with this model. Any valid Prometheus alerting rule can be used in Dash0 without modification.

Failed Check Severity

In Prometheus, an alert is created when an alerting rule's expression returns a result. The rule itself has no built-in notion of severity; severity is usually expressed with labels on the rule and handled through label-based routing in Alertmanager.

In Dash0, failed checks that are still ongoing can have one of two severities:

DEGRADED
CRITICAL

By default, a failed check has severity CRITICAL, but you can change that by using thresholds.

The `$__threshold` Extension

Dash0 extends the Prometheus model with an optional $__threshold symbol that enables dual-severity alerting. When used in a check rule expression, it lets you specify two named severity levels — one for degraded and one for critical — with separate numeric thresholds. You can configure either one or both.

promql

sum(rate({otel_metric_name="http.server.errors", service_name="checkout"}[5m])) > $__threshold

If a check rule does not use $__threshold, it behaves exactly like a Prometheus alert: any non-empty result set produces a failed check. If both thresholds are configured, the higher value maps to critical and the lower to degraded. Expressing the same behavior in Prometheus requires two separate alerting rules.
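
The dual-threshold behavior can be illustrated with a small Python sketch. It is not Dash0's implementation: the function name classify and its parameters are hypothetical, and it assumes a > comparison like the one in the expression above.

```python
from typing import Optional


def classify(value: float,
             degraded: Optional[float] = None,
             critical: Optional[float] = None) -> Optional[str]:
    """Return the severity for one result value, or None if nothing fails.

    Assumption: the rule compares with `>`, as in `... > $__threshold`.
    Either threshold may be left unset (None).
    """
    # Check the critical threshold first so the worse severity wins.
    if critical is not None and value > critical:
        return "CRITICAL"
    if degraded is not None and value > degraded:
        return "DEGRADED"
    return None


print(classify(12.0, degraded=5.0, critical=10.0))  # CRITICAL
print(classify(7.0, degraded=5.0, critical=10.0))   # DEGRADED
print(classify(3.0, degraded=5.0, critical=10.0))   # None
```

In plain Prometheus, the same outcome would need one rule comparing against 5 and a second rule comparing against 10.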

Health Status on the Service Map

Failed checks directly affect service health visualization in the Dash0 service map, providing at-a-glance operational status. Each service monitored by Dash0 has one of three health states, determined by its associated failed checks:

Gray — healthy, no active check failures
Yellow — degraded threshold exceeded
Red — critical threshold exceeded

A failed check colors a service when the query result includes that service's service_name label. Aggregating away service_name (for example, with a sum that has no by clause) produces results that are not associated with any service, so no service is colored.
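
As a rough illustration of how failed checks could map to service colors, consider the following Python sketch. The function service_colors and its data shapes are hypothetical, and the rule that the worst severity wins when a service has several failed checks is an assumption, not something stated above.

```python
from typing import Dict, Iterable, List, Tuple

# Rank severities so the worst one can be picked with max().
SEVERITY_RANK = {"DEGRADED": 1, "CRITICAL": 2}
COLORS = {0: "gray", 1: "yellow", 2: "red"}


def service_colors(services: List[str],
                   failed_checks: Iterable[Tuple[Dict[str, str], str]]) -> Dict[str, str]:
    """Map each service to a color based on its failed checks.

    Each failed check is (result labels, severity). Checks whose labels
    lack service_name do not color any service.
    Assumption: the worst severity wins when several checks fail.
    """
    worst = {name: 0 for name in services}
    for labels, severity in failed_checks:
        name = labels.get("service_name")
        if name in worst:
            worst[name] = max(worst[name], SEVERITY_RANK[severity])
    return {name: COLORS[rank] for name, rank in worst.items()}


checks = [
    ({"service_name": "checkout"}, "DEGRADED"),
    ({"service_name": "checkout"}, "CRITICAL"),
    ({}, "CRITICAL"),  # service_name aggregated away: colors nothing
]
print(service_colors(["checkout", "cart"], checks))
# {'checkout': 'red', 'cart': 'gray'}
```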

Affected Resource

Every failed check has an Affected Resource — the entity whose health the failed check reflects. Dash0 derives the Affected Resource entirely from the labels present on the PromQL result time series; there is no separate field to configure on the check rule. The Affected Resource is displayed in the Failed Checks View and the Failed Check Details View.

Scope Types

There are three scope types, determined by which labels are present on the result series:

Service scope — the failed check rolls up into the target service on the service map and service catalog. This scope requires service_name to be present on the result series (along with service_namespace when your services use namespaces to disambiguate). Add these labels via an aggregation by clause or preserve them with an equality matcher in the metric selector.
Resource scope — the failed check rolls up into a specific resource in the resource inventory. This scope requires dash0_resource_id to be present on the result series. When service_name and service_namespace are also present, Dash0 ties the resource back to the owning service so resource health rolls up to service health as well.
Free-standing — neither service_name nor dash0_resource_id is present on the result series. The failed check appears under Unknown in the Failed Checks View and does not influence any service or resource health indicator. This is appropriate for org-wide or cross-service checks where no single resource is the natural owner.
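
The three scope types can be summarized as a decision on the result labels. The Python sketch below is illustrative only: derive_scope is a hypothetical helper, and the precedence of dash0_resource_id over service_name follows the description of the resource scope above.

```python
from typing import Dict


def derive_scope(labels: Dict[str, str]) -> str:
    """Return the scope type implied by the labels on a result series."""
    # Resource scope wins when dash0_resource_id is present; service labels,
    # if any, only tie the resource back to its owning service.
    if "dash0_resource_id" in labels:
        return "resource"
    if "service_name" in labels:
        return "service"
    # Neither label present: the failed check is free-standing.
    return "free-standing"


print(derive_scope({"service_name": "checkoutservice"}))                    # service
print(derive_scope({"dash0_resource_id": "abc", "service_name": "api"}))    # resource
print(derive_scope({}))                                                     # free-standing
```

The value "abc" is a placeholder; real resource identifiers will look different.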

How PromQL Expressions Determine the Scope

The Affected Resource is set entirely by the labels on the emitted time series. Two mechanisms preserve labels in PromQL results:

Equality matcher in the metric selector — a label filter of the form label="value" in a stream selector preserves that label on the result. This matters most for absent() and absent_over_time(), which inherit labels only from equality matchers in the selector, not from a by clause.
Aggregation by (...) clause — a sum by (service_name, service_namespace) (...) keeps those labels and drops all others. A bare sum(...) with no by clause drops every label and produces a free-standing failed check, even when the selector was scoped to a single service.
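
To make the label-dropping behavior of aggregation concrete, the following Python sketch mimics how a sum with and without a by clause treats labels. It is a simplified model rather than a PromQL engine, and the sum_by function and the pod label are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Series = Tuple[Dict[str, str], float]


def sum_by(series: Iterable[Series], by: Tuple[str, ...] = ()) -> List[Series]:
    """Sum series values, keeping only the labels listed in `by`."""
    totals: Dict[Tuple[Tuple[str, str], ...], float] = defaultdict(float)
    for labels, value in series:
        # The group key holds only the requested labels that exist on the series.
        key = tuple((name, labels[name]) for name in by if name in labels)
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]


series = [
    ({"service_name": "checkoutservice", "pod": "a"}, 2.0),
    ({"service_name": "checkoutservice", "pod": "b"}, 3.0),
]

# Bare sum: every label is dropped, so the result is free-standing.
print(sum_by(series))
# [({}, 5.0)]

# sum by (service_name, service_namespace): service_name survives.
print(sum_by(series, by=("service_name", "service_namespace")))
# [({'service_name': 'checkoutservice'}, 5.0)]
```

Both calls return the same value, but only the second result carries a label that Dash0 can use to determine the Affected Resource.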

PromQL Examples

The examples below use Prometheus/Dash0 label names (underscores, not dots).

Service-scoped error rate (using dash0.spans):

promql

sum by (service_name, service_namespace) (
  rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR", service_name="checkoutservice"}[5m])
) > $__threshold

Service-scoped P99 latency (using dash0.spans.duration histogram):

promql

histogram_quantile(0.99,
  sum by (service_name, service_namespace, le) (
    rate({otel_metric_name="dash0.spans.duration", service_name="checkoutservice"}[5m])
  )
) > $__threshold

Resource-scoped custom gauge (active HTTP requests per pod):

promql

sum by (dash0_resource_id, service_name, service_namespace) (
  {otel_metric_name="http.server.active.requests", service_name="api"}
) > $__threshold

Missing-data detection with absent_over_time (no by needed — selector labels are inherited automatically):

promql

absent_over_time({otel_metric_name="dash0.spans", service_name="checkoutservice"}[10m])

Free-standing check (intentional — org-wide ingest monitoring with no specific owner):

promql

sum(rate({otel_metric_name="dash0.spans"}[5m])) < $__threshold

Common Mistake: Bare `sum(...)` Drops the Service Label

A frequent issue is writing a service-scoped check like this:

promql

# Broken: bare sum() drops service_name from the result
sum(rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR", service_name="checkoutservice"}[5m])) > $__threshold

Even though the selector is filtered to checkoutservice, the bare sum(...) with no by clause removes service_name from the output. Dash0 cannot determine the Affected Resource, so the failed check appears under Unknown and does not color the service on the service map.

The fix is to include the correlation labels in the by clause:

promql

# Correct: service_name preserved in the result
sum by (service_name, service_namespace) (
  rate({otel_metric_name="dash0.spans", otel_span_status_code="ERROR", service_name="checkoutservice"}[5m])
) > $__threshold

One Rule, Multiple Failed Checks

Understanding how check rules produce failed checks helps you design queries that generate the appropriate number of alerts for your use case. A single check rule can produce any number of simultaneous failed checks, one per distinct result series returned by the expression. A rule whose aggregation has no by clause produces a single aggregated result and therefore at most one failed check.

A rule grouped by service_name or operation_name produces one failed check per unique value of that label, each independently tracked and colored.

Finite State Machine

Grace periods control when checks fire and resolve, preventing alert noise from transient spikes and flapping metrics. Each failed check transitions through states based on the configured grace periods:

Trigger grace period — how many consecutive evaluation intervals the expression must exceed the threshold before the check fires and notifications are sent. Specified as a multiplier of the evaluation interval (e.g., 2× a 1-minute interval = 2 minutes). Prevents noise from transient spikes.
Keep-firing grace period — how many evaluation intervals the check remains in a degraded or critical state after the expression drops below the threshold. Also specified as a multiplier. Prevents flapping.
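
The two grace periods can be modeled as a small state machine. The Python sketch below is one possible reading of the description above, not Dash0's implementation: the state names "ok", "pending", and "firing" are illustrative, grace periods are counted in whole evaluations, and the exact boundary behavior in Dash0 may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckState:
    state: str = "ok"   # "ok", "pending", or "firing" (illustrative names)
    breaches: int = 0   # consecutive evaluations above the threshold
    clear: int = 0      # consecutive evaluations below it while firing


def step(s: CheckState, exceeded: bool, trigger: int, keep_firing: int) -> CheckState:
    """Advance one evaluation interval."""
    if exceeded:
        breaches = s.breaches + 1
        # Fire once the trigger grace period is satisfied, or stay firing.
        if s.state == "firing" or breaches >= trigger:
            return CheckState("firing", breaches, 0)
        return CheckState("pending", breaches, 0)
    if s.state == "firing":
        clear = s.clear + 1
        # Keep firing until the keep-firing grace period is used up.
        if clear <= keep_firing:
            return CheckState("firing", 0, clear)
    return CheckState("ok", 0, 0)


def run(samples: List[bool], trigger: int, keep_firing: int) -> List[str]:
    """Return the state after each evaluation in `samples`."""
    s, states = CheckState(), []
    for exceeded in samples:
        s = step(s, exceeded, trigger, keep_firing)
        states.append(s.state)
    return states


# Trigger grace period of 2 evaluations, keep-firing grace period of 1.
print(run([True, True, False, False, True], trigger=2, keep_firing=1))
# ['pending', 'firing', 'firing', 'ok', 'pending']
```

In this model a single transient spike (one evaluation above the threshold followed by one below) never reaches "firing", and a brief dip below the threshold does not resolve a firing check.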

Enablement Conditions

Enablement conditions provide a way to gate failed checks based on additional criteria evaluated alongside the main query. If the enablement condition is not met, the failed check does not fire even if the main query exceeds its thresholds.

This is useful for failed checks that should only be active when the system has relevant traffic or context.
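
The gating logic can be sketched in Python as follows. This is a conceptual model only: how an enablement condition is written in Dash0 is not covered here, and the function check_fires, the minimum request rate, and the choice of an error-ratio check are all hypothetical.

```python
def check_fires(error_ratio: float, threshold: float,
                request_rate: float, min_request_rate: float) -> bool:
    """Return True only if the main query exceeds its threshold AND the
    enablement condition (here: enough traffic) is met."""
    # Hypothetical enablement condition: require a minimum request rate.
    enabled = request_rate >= min_request_rate
    return enabled and error_ratio > threshold


# 50% errors, but almost no traffic: the check stays quiet.
print(check_fires(0.5, threshold=0.1, request_rate=0.2, min_request_rate=1.0))   # False
# Same error ratio with meaningful traffic: the check fires.
print(check_fires(0.5, threshold=0.1, request_rate=50.0, min_request_rate=1.0))  # True
```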
