
Last updated: May 29, 2026

Troubleshoot Check Rule Issues

This guide helps you diagnose and resolve common issues with check rules, from alerts that do not fire when expected to alert fatigue caused by too many notifications.

Check Rule Not Firing

If your check rule isn't triggering alerts when expected, work through these diagnostic steps to identify whether the issue is with the query, thresholds, grace periods, enablement conditions, or notification configuration.

Verify Query Returns Data

The most common reason for alerts not firing is that the query doesn't match any telemetry.

  1. Open the check rule
  2. Review the preview chart below the Query Builder
  3. Verify that data points appear in the chart
  4. Check the time range covers recent data

If no data appears:

  • Verify the service name, metric name, or label filters are correct
  • Check that telemetry is flowing into Dash0 for the queried service
  • Verify that the range selector (e.g., [5m]) matches your data freshness; a window shorter than the reporting interval can return no data
  • Test with a simpler query to isolate the issue

Example diagnostic query:

promql
# Start simple
{service_name="frontend"}
# Add complexity gradually
rate({service_name="frontend"}[5m])
# Add filters one at a time
rate({service_name="frontend", http_status_code="500"}[5m])
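
The same narrowing-down idea can be sketched outside the query editor. The snippet below is only an illustration, not a Dash0 API: `narrow_down` and the sample series are hypothetical, and it handles equality matchers only. It applies one matcher at a time and reports how many series survive, which shows where the data disappears.

```python
from typing import Dict, List, Tuple

Series = Dict[str, str]  # one time series, described by its labels


def narrow_down(series: List[Series], matchers: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Apply equality matchers one at a time and record how many series survive each step."""
    remaining = series
    report = []
    for label, value in matchers:
        remaining = [s for s in remaining if s.get(label) == value]
        report.append((f'{label}="{value}"', len(remaining)))
    return report


# Hypothetical sample data: the status code label holds "503", never "500".
sample = [
    {"service_name": "frontend", "http_status_code": "200"},
    {"service_name": "frontend", "http_status_code": "503"},
    {"service_name": "checkout", "http_status_code": "200"},
]

steps = narrow_down(sample, [("service_name", "frontend"), ("http_status_code", "500")])
for matcher, count in steps:
    print(f"{matcher}: {count} series")
```

In this made-up data set the service filter keeps two series and the status code filter keeps none, so the second filter is the one to revisit (here, a regex matcher such as =~"5.." would be the likely fix).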

Check Thresholds

Ensure your threshold values are appropriate for the metric scale.

  1. Review the preview chart to see typical metric values
  2. Compare the threshold to the current metric value
  3. Verify the comparison operator is correct (>, <, >=, <=, ==, !=)

Common threshold mistakes:

  • Milliseconds vs seconds: Threshold is 500 but metric is in seconds (0.5)
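  • Percentage vs ratio: Threshold is 0.05 (a ratio) but the metric is reported as a percentage (5), or the reverse
  • Wrong operator: Using > when the condition needs <, for example when alerting on a value that drops below a target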
  • Too aggressive: Threshold is too sensitive, catching normal variation

Fix:

promql
# Wrong: comparing milliseconds to seconds
histogram_quantile(0.99, ...) > 500 # metric is in seconds
# Right: normalize units
histogram_quantile(0.99, ...) * 1000 > 500 # convert to milliseconds
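
To make the unit mismatch concrete, here is a small sketch with a hypothetical p99 latency of 750 ms stored in seconds. The function name and the values are assumptions chosen for illustration.

```python
def exceeds(value_seconds: float, threshold_ms: float) -> bool:
    """Compare a latency given in seconds against a threshold given in milliseconds."""
    return value_seconds * 1000 > threshold_ms


p99_seconds = 0.75  # hypothetical p99 latency: 750 ms, expressed in seconds

naive = p99_seconds > 500               # compares 0.75 to 500, so it can never be true
normalized = exceeds(p99_seconds, 500)  # compares 750 to 500

print(naive, normalized)
```

The naive comparison stays false for any realistic latency, which is why the check appears never to fire.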

Review Grace Periods

Long grace periods delay alerts by requiring sustained threshold violations.

  1. Check the For (trigger) grace period
  2. Check the Keep firing for grace period
  3. Temporarily set both to "None (0m)" for testing

Grace period behavior:

  • 2× (2m) with 1m interval = Alert fires after 2 consecutive threshold violations
  • If the metric drops below threshold once, the grace period resets
  • Grace periods prevent noisy alerts but delay detection

Testing tip: Set grace periods to 0× temporarily to verify the query and threshold work, then restore appropriate grace periods.
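
The grace period behavior described above can be sketched as a small simulation. This is an assumed model for illustration, not Dash0's evaluator: `firing_states` is a hypothetical helper, it treats a 0× grace period as "fire on the first violation", and it ignores the Keep firing for setting.

```python
from typing import List


def firing_states(samples: List[float], threshold: float, for_count: int) -> List[bool]:
    """Return, for each evaluation, whether the alert would be firing.

    The alert fires once `for_count` consecutive samples exceed `threshold`.
    A single sample at or below the threshold resets the streak.
    """
    required = max(for_count, 1)  # assumption: 0x behaves like "fire immediately"
    streak = 0
    states = []
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        states.append(streak >= required)
    return states


# One sample per 1m evaluation; the dip in the second sample resets the streak.
samples = [12, 8, 12, 12, 12]

print(firing_states(samples, threshold=10, for_count=2))
print(firing_states(samples, threshold=10, for_count=0))
```

With a 2× grace period the alert in this sketch only starts firing at the fourth evaluation, while with 0× it fires on the very first violation. That difference is what makes the 0× setting useful for testing.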

Verify Enablement Conditions

If configured, enablement conditions must evaluate to true for the check to fire.

  1. Scroll to the Enablement Conditions section
  2. Check if any conditions are configured
  3. Verify the condition evaluates to true for your current data

Common enablement condition issues:

  • Condition requires minimum traffic, but service has low traffic
  • Time-based condition (business hours only) doesn't match current time
  • Service label doesn't match what the condition expects

Fix: Temporarily remove the enablement condition to test the main query, then refine the condition logic.

Check Notification Channels

Ensure notification channels are properly configured and not muted.

  1. Navigate to the Notification Channels settings
  2. Verify the channels assigned to the check rule are enabled
  3. Test the notification channel to ensure delivery works
  4. Check for muted channels or routing rules that filter the alert

Channel issues:

  • Channel is disabled
  • Slack webhook URL expired
  • PagerDuty integration key changed
  • Email SMTP credentials invalid
  • Label-based routing doesn't match the check's labels

Too Many Alerts

If your check rule is too noisy, use these techniques to reduce alert fatigue by requiring sustained threshold violations, aggregating results, adjusting thresholds, or filtering out irrelevant conditions.

Increase Grace Periods

Require sustained threshold violations before alerting.

  1. Open the check rule editor
  2. Increase the For (trigger) grace period to 2× or 3×
  3. Save and monitor the alert frequency

Recommended grace periods:

  • Low noise services: 1-2× grace period
  • Noisy services: 3-4× grace period
  • Flapping metrics: 5× or higher grace period

Impact: With a 3× grace period and 1m interval, the metric must exceed the threshold for 3 consecutive minutes before alerting.

Add Aggregation

Use sum by or avg by to group results and reduce alert count.

Problem: Without aggregation, separate alerts fire for each unique label combination.

promql
# Without aggregation: separate alert per pod
rate(d.logs{severity="ERROR", service_name="frontend"}[5m]) > 10
# Result: 50 alerts (one per pod)

Solution: Add aggregation to create a single alert.

promql
# With aggregation: single alert per service
sum(rate(d.logs{severity="ERROR", service_name="frontend"}[5m])) > 10
# Result: 1 alert (aggregated across all pods)

Aggregation strategies:

  • Service-level: sum by (service_name) — One alert per service
  • Environment-level: sum by (deployment_environment_name) — One alert per environment
  • No grouping: sum(...) — Single alert total
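
The effect of these strategies on alert count can be sketched as follows. The helper `firing_groups` and the 50-pod data set are hypothetical, and the grouping is a simplified stand-in for sum by, not the query engine itself.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, current value)


def firing_groups(samples: List[Sample], threshold: float,
                  group_by: Optional[Sequence[str]] = None) -> Dict[tuple, float]:
    """Return the groups whose summed value exceeds the threshold.

    group_by=None  -> no aggregation, every series is evaluated on its own
    group_by=[]    -> like sum(...), everything collapses into one group
    group_by=[...] -> like sum by (...), one group per label combination
    """
    totals: Dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        if group_by is None:
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        totals[key] += value
    return {key: total for key, total in totals.items() if total > threshold}


# Hypothetical data: 50 frontend pods, each logging 12 errors per second.
pods = [({"service_name": "frontend", "k8s_pod_name": f"frontend-{i}"}, 12.0)
        for i in range(50)]

print(len(firing_groups(pods, 10)))                    # one alert per pod
print(len(firing_groups(pods, 10, ["service_name"])))  # one alert per service
print(firing_groups(pods, 10, []))                     # one alert in total
```

Note that the aggregated value is the sum across all pods (600 in this made-up case), so a threshold that was tuned per pod usually needs to be revisited after adding sum.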

Adjust Thresholds

Increase threshold values to reduce sensitivity.

  1. Review the preview chart to see typical metric values
  2. Identify the baseline and normal variation range
  3. Set thresholds above the normal variation

Threshold tuning:

  • Baseline: Typical value during normal operation
  • Variation: Expected fluctuation range
  • Threshold: Set above baseline + variation to catch true anomalies

Example:

  • Baseline: 100ms latency
  • Normal variation: ±20ms
  • Degraded threshold: 150ms (baseline + 50%)
  • Critical threshold: 200ms (baseline + 100%)
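
The arithmetic in this example can be written down as a short sketch. The function name and the sanity check against the normal variation range are assumptions added for illustration.

```python
from typing import Tuple


def thresholds(baseline: float, variation: float,
               degraded_pct: float, critical_pct: float) -> Tuple[float, float]:
    """Derive degraded and critical thresholds as percentages above the baseline."""
    degraded = baseline * (1 + degraded_pct / 100)
    critical = baseline * (1 + critical_pct / 100)
    noise_ceiling = baseline + variation
    if degraded <= noise_ceiling:
        # A threshold inside the normal range would fire on ordinary fluctuation.
        raise ValueError(
            f"degraded threshold {degraded} is within normal variation (<= {noise_ceiling})"
        )
    return degraded, critical


# Values from the example above: 100ms baseline, ±20ms variation.
print(thresholds(baseline=100, variation=20, degraded_pct=50, critical_pct=100))
```

Both thresholds (150ms and 200ms) sit above the 120ms upper edge of normal variation, so ordinary fluctuation would not trigger either one.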

Add Enablement Conditions

Filter out low-traffic or maintenance periods.

Problem: Alerts fire during scheduled maintenance or when traffic is too low to be meaningful.

promql
# Main query
rate(http_server_errors[5m]) / rate(http_server_requests[5m]) > 0.05

Solution: Add enablement condition requiring minimum traffic.

promql
# Enablement condition: only alert when traffic > 10 req/s
# (rate() returns a per-second value; multiply by 60 to compare against req/min)
sum(rate(http_server_requests[5m])) > 10

Common enablement patterns:

  • Minimum traffic: sum(rate(requests[5m])) > threshold
  • Business hours: hour() >= 9 AND hour() < 17 (hour() is evaluated in UTC)
  • Specific environments: deployment_environment_name = "production"
  • Service health: up{service="..."} == 1
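
As a rough illustration of the minimum-traffic pattern, the sketch below gates an error-ratio check on request volume. The function, its defaults, and the sample numbers are hypothetical; the rates are taken to be per second, which is what rate() returns.

```python
def should_alert(errors_per_s: float, requests_per_s: float,
                 max_error_ratio: float = 0.05,
                 min_requests_per_s: float = 10.0) -> bool:
    """Evaluate the error ratio only when the enablement condition (enough traffic) holds."""
    if requests_per_s <= min_requests_per_s:
        return False  # enablement condition is false: the check cannot fire
    return errors_per_s / requests_per_s > max_error_ratio


print(should_alert(errors_per_s=1.0, requests_per_s=2.0))   # 50% errors, but too little traffic
print(should_alert(errors_per_s=2.0, requests_per_s=20.0))  # 10% errors with enough traffic
print(should_alert(errors_per_s=0.5, requests_per_s=20.0))  # 2.5% errors, below the threshold
```

The first case shows what the condition is for: a single failed request on a nearly idle service produces a high error ratio that is not worth an alert.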

Query Performance Issues

If check evaluation is slow or timing out, optimize the query by adjusting evaluation frequency, reducing cardinality, adding filters, or shortening time windows.

Decrease Evaluation Frequency

Use longer evaluation intervals to reduce load.

  1. Change Evaluate every from 1m to 5m or 10m
  2. Adjust grace period multipliers accordingly
  3. Monitor evaluation duration

Trade-offs:

  • Shorter intervals (1m): Faster detection, higher cost
  • Longer intervals (10m): Slower detection, lower cost

Recommendation: Use 1m for critical alerts, 5-10m for non-critical alerts.
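
When the interval changes, the grace period multiplier usually has to change with it. The sketch below assumes that the detection delay is roughly the multiplier times the interval. `adjusted_multiplier` is a hypothetical helper that picks the smallest multiplier that does not shorten the original delay.

```python
import math


def adjusted_multiplier(old_interval_min: int, old_multiplier: int, new_interval_min: int) -> int:
    """Return the smallest multiplier that keeps at least the original grace duration."""
    if old_multiplier == 0:
        return 0  # no grace period before, none afterwards
    old_delay_min = old_interval_min * old_multiplier
    return max(1, math.ceil(old_delay_min / new_interval_min))


print(adjusted_multiplier(1, 5, 5))   # 5x at 1m  -> 1x at 5m, still 5 minutes
print(adjusted_multiplier(1, 10, 5))  # 10x at 1m -> 2x at 5m, still 10 minutes
print(adjusted_multiplier(1, 3, 5))   # 3x at 1m  -> 1x at 5m, delay grows to 5 minutes
```

The last case shows the cost of a longer interval: the delay cannot be shorter than one interval unless the grace period is removed entirely.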

Simplify Query

Reduce the number of labels in by clauses.

Problem: Too many grouping labels create high cardinality.

promql
# High cardinality: groups by pod, node, and container
sum by (k8s_pod_name, k8s_node_name, k8s_container_name) (rate(metric[5m]))

Solution: Group by fewer labels.

promql
# Lower cardinality: groups by service only
sum by (service_name) (rate(metric[5m]))

Impact: Fewer unique label combinations = faster query execution.

Add Filters

Narrow the query scope with specific label filters.

Problem: Query scans all services and metrics.

promql
# Scans all services
rate(d.span_durations[5m])

Solution: Add service filter.

promql
# Only scans frontend service
rate(d.span_durations{service_name="frontend"}[5m])

Use Shorter Time Ranges

Reduce the lookback window in range selectors.

Problem: Long lookback windows process more data.

promql
# 5-minute window
rate(metric[5m])

Solution: Use shorter window if appropriate.

promql
# 1-minute window (less data to process)
rate(metric[1m])

Trade-offs:

  • Longer windows: Smoother data, less noise, slower queries
  • Shorter windows: Faster queries, more noise, less smoothing
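
The smoothing side of this trade-off can be illustrated with a trailing mean, used here as a simplified stand-in for the averaging that rate() performs over its window. The helper and the per-minute values are made up for the example.

```python
from typing import List


def windowed_mean(values: List[float], window: int) -> List[float]:
    """Trailing mean over the last `window` points (fewer at the start of the series)."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Hypothetical per-minute error counts with a single one-minute spike.
per_minute_errors = [0, 0, 0, 0, 10, 0, 0, 0, 0, 0]

short = windowed_mean(per_minute_errors, 1)
long = windowed_mean(per_minute_errors, 5)

print(max(short), max(long))
print(sum(v > 1 for v in short), sum(v > 1 for v in long))
```

The short window shows the full spike for one minute, while the long window shows a fifth of the peak spread over five minutes. A threshold of 5 would fire with the short window and stay silent with the long one, so changing the window for performance reasons can also change alerting behavior.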

Debug PromQL Expressions

Use these techniques to troubleshoot PromQL syntax and logic errors by validating syntax, testing in the query builder, and building queries incrementally.

Validate Syntax

Use the PromQL Preview section to check for syntax errors.

  1. Scroll to the PromQL Preview section
  2. Review any error messages displayed
  3. Fix syntax issues highlighted in red

Common syntax errors:

  • Missing closing parenthesis
  • Incorrect label matcher syntax (use =~ for regex, not ~)
  • Invalid function arguments
  • Mismatched brackets in label selectors

Test in Query Builder

Use the Query Builder to validate your query returns data.

  1. Navigate to Query Data > Query Builder
  2. Enter your PromQL expression
  3. Run the query and review results
  4. Debug any issues before adding to check rule

Start Simple, Add Complexity

Build queries incrementally to isolate issues.

promql
# Step 1: Basic selector
{service_name="frontend"}
# Step 2: Add rate
rate({service_name="frontend"}[5m])
# Step 3: Add aggregation
sum(rate({service_name="frontend"}[5m]))
# Step 4: Add grouping
sum by (http_status_code) (rate({service_name="frontend"}[5m]))
# Step 5: Add filters
sum by (http_status_code) (rate({service_name="frontend", http_status_code=~"5.."}[5m]))

At each step, verify the query works before adding the next piece.

Get Help

If you've tried these troubleshooting steps and still have issues, consult the documentation, contact support, or reach out to the community for assistance.

  1. Check Documentation: Review About Alert Monitoring for alerting concepts
  2. Contact Support: Reach out to Dash0 support with:
    • Check rule ID or name
    • Expected behavior vs actual behavior
    • Screenshots of the configuration
    • Preview chart showing data (or lack thereof)
  3. Community: Join the Dash0 community Slack for peer support

Further Reading