This guide helps you diagnose and resolve common issues with check rules, from alerts that do not fire when expected to alert fatigue caused by too many notifications.

Tip

Agent0 can help diagnose check rule issues through natural language investigation. Describe the problem ("why isn't my frontend latency alert firing?"), and Agent0 analyzes your telemetry, queries, and thresholds to identify potential issues. See Investigation and Analysis for details.

Check Rule Not Firing

If your check rule isn't triggering alerts when expected, work through these diagnostic steps to identify whether the issue is with the query, thresholds, grace periods, enablement conditions, or notification configuration.

Verify Query Returns Data

The most common reason for alerts not firing is that the query doesn't match any telemetry.

Open the check rule
Review the preview chart below the Query Builder
Verify that data points appear in the chart
Check that the time range covers recent data

If no data appears:

Verify the service name, metric name, or label filters are correct
Check that telemetry is flowing into Dash0 for the queried service
Verify that the range selector (e.g., [5m]) matches your data freshness
Test with a simpler query to isolate the issue

Example diagnostic query:

promql

# Start simple
{service_name="frontend"}

# Add complexity gradually
rate({service_name="frontend"}[5m])

# Add filters one at a time
rate({service_name="frontend", http_status_code="500"}[5m])

Check Thresholds

Ensure your threshold values are appropriate for the metric scale.

Review the preview chart to see typical metric values
Compare the threshold to the current metric value
Verify the comparison operator is correct (>, <, >=, <=, ==, !=)

Common threshold mistakes:

Milliseconds vs seconds: Threshold is 500 but metric is in seconds (0.5)
Percentage vs. ratio: Threshold is 0.05 but the metric is expressed in percent (5), or the reverse
Wrong operator: Using > when you need < for error detection
Too aggressive: Threshold is too sensitive, catching normal variation

Fix:

promql

# Wrong: comparing milliseconds to seconds
histogram_quantile(0.99, ...) > 500  # metric is in seconds

# Right: normalize units
histogram_quantile(0.99, ...) * 1000 > 500  # convert to milliseconds
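
The unit mismatch can also be checked outside the query. The following Python sketch is illustrative only: the function name and the sample latency are assumptions, not part of Dash0.

```python
def exceeds_threshold_ms(value_seconds: float, threshold_ms: float) -> bool:
    """Compare a latency reported in seconds against a threshold in milliseconds."""
    value_ms = value_seconds * 1000  # normalize to milliseconds first
    return value_ms > threshold_ms


p99_seconds = 0.75  # hypothetical p99 latency of 750 ms, reported in seconds

# Unit mismatch: 0.75 > 500 is never true, so the check stays silent.
print(p99_seconds > 500)  # False

# Normalized: 750 > 500, so the check would fire.
print(exceeds_threshold_ms(p99_seconds, 500))  # True
```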

Review Grace Periods

Long grace periods delay alerts by requiring sustained threshold violations.

Check the For (trigger) grace period
Check the Keep firing for grace period
Temporarily set both to "None (0m)" for testing

Grace period behavior:

2× (2m) with 1m interval = Alert fires after 2 consecutive threshold violations
If the condition stops being met for a single evaluation, the grace period resets
Grace periods prevent noisy alerts but delay detection

Testing tip: Set grace periods to 0× temporarily to verify the query and threshold work, then restore appropriate grace periods.
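
To reason about when a check would fire, it can help to model the grace period as a streak counter. This is a minimal sketch of the behavior described above, not Dash0's implementation. The function name is hypothetical, and treating a 0× multiplier as "fire on the first violation" is an assumption.

```python
from typing import Optional, Sequence


def first_firing_index(violations: Sequence[bool], multiplier: int) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire.

    `violations` holds one boolean per evaluation (True = threshold violated).
    `multiplier` is the grace period multiplier (e.g., 2 for a 2x grace period).
    Assumption: a multiplier of 0 fires on the first violation.
    """
    needed = max(multiplier, 1)
    streak = 0
    for index, violated in enumerate(violations):
        # A single evaluation without a violation resets the streak.
        streak = streak + 1 if violated else 0
        if streak >= needed:
            return index
    return None


# The violation at index 0 is wiped out by the clean evaluation at index 1.
print(first_firing_index([True, False, True, True], 2))   # 3
# A flapping metric never reaches two consecutive violations.
print(first_firing_index([True, False, True, False], 2))  # None
# With no grace period, the first violation fires.
print(first_firing_index([True, False], 0))               # 0
```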

Verify Enablement Conditions

If configured, enablement conditions must evaluate to true for the check to fire.

Scroll to the Enablement Conditions section
Check if any conditions are configured
Verify the condition evaluates to true for your current data

Common enablement condition issues:

Condition requires minimum traffic, but service has low traffic
Time-based condition (business hours only) doesn't match current time
Service label doesn't match what the condition expects

Fix: Temporarily remove the enablement condition to test the main query, then refine the condition logic.

Check Notification Channels

Ensure notification channels are properly configured and not muted.

Navigate to the Notification Channels settings
Verify the channels assigned to the check rule are enabled
Test the notification channel to ensure delivery works
Check for muted channels or routing rules that filter the alert

Channel issues:

Channel is disabled
Slack webhook URL expired
PagerDuty integration key changed
Email SMTP credentials invalid
Label-based routing doesn't match the check's labels

Too Many Alerts

If your check rule is too noisy, use these techniques to reduce alert fatigue by requiring sustained threshold violations, aggregating results, adjusting thresholds, or filtering out irrelevant conditions.

Increase Grace Periods

Require sustained threshold violations before alerting.

Open the check rule editor
Increase the For (trigger) grace period to 2× or 3×
Save and monitor the alert frequency

Recommended grace periods:

Low noise services: 1-2× grace period
Noisy services: 3-4× grace period
Flapping metrics: 5× or higher grace period

Impact: With a 3× grace period and 1m interval, the metric must exceed the threshold for 3 consecutive minutes before alerting.

Add Aggregation

Use sum by or avg by to group results and reduce alert count.

Problem: Without aggregation, separate alerts fire for each unique label combination.

promql

# Without aggregation: separate alert per pod
rate(d.logs{severity="ERROR", service_name="frontend"}[5m]) > 10
# Result: 50 alerts (one per pod)

Solution: Add aggregation to create a single alert.

promql

# With aggregation: single alert per service
sum by (service_name) (rate(d.logs{severity="ERROR", service_name="frontend"}[5m])) > 10
# Result: 1 alert (aggregated across all pods)

Aggregation strategies:

Service-level: sum by (service_name) — One alert per service
Environment-level: sum by (deployment_environment_name) — One alert per environment
No grouping: sum(...) — Single alert total
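
The effect of grouping on alert count can be sketched with a small simulation. The sample series, label values, and the count_alerts helper below are hypothetical and only mimic how sum by collapses series. They do not call any Dash0 API.

```python
from collections import defaultdict

# Hypothetical per-pod error rates for one service (50 pods, 11 errors/s each).
series = [
    {"labels": {"service_name": "frontend", "k8s_pod_name": f"frontend-{i}"}, "value": 11.0}
    for i in range(50)
]


def count_alerts(series, group_by, threshold):
    """Sum values per group (like `sum by (...)`) and count groups above the threshold.

    Passing group_by=None keeps every series separate (no aggregation).
    """
    groups = defaultdict(float)
    for s in series:
        if group_by is None:
            key = tuple(sorted(s["labels"].items()))
        else:
            key = tuple(s["labels"].get(label) for label in group_by)
        groups[key] += s["value"]
    return sum(1 for total in groups.values() if total > threshold)


print(count_alerts(series, None, 10))              # 50: one alert per pod
print(count_alerts(series, ["service_name"], 10))  # 1: one alert for the service
print(count_alerts(series, [], 10))                # 1: single total, no labels kept
```

Summing also changes the scale of the value being compared (here 550 instead of 11), so the threshold may need revisiting after aggregation is added.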

Note

The labels you keep in the by clause also determine the Affected Resource for each failed check. A bare sum(...) with no by clause drops service_name from the result, so the failed check appears under Unknown and does not color the service on the service map — even when the selector was filtered to a single service. See Affected Resource for details.

Adjust Thresholds

Increase threshold values to reduce sensitivity.

Review the preview chart to see typical metric values
Identify the baseline and normal variation range
Set thresholds above the normal variation

Threshold tuning:

Baseline: Typical value during normal operation
Variation: Expected fluctuation range
Threshold: Set above baseline + variation to catch true anomalies

Example:

Baseline: 100ms latency
Normal variation: ±20ms
Degraded threshold: 150ms (baseline + 50%)
Critical threshold: 200ms (baseline + 100%)
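
The arithmetic in this example can be written down as a small helper. This is a sketch under the assumption that thresholds are expressed as a percentage above the baseline. The function name and the validation rule are illustrative.

```python
def thresholds_from_baseline(baseline, variation, degraded_pct, critical_pct):
    """Derive degraded and critical thresholds as percentages above the baseline.

    Raises ValueError if the degraded threshold would sit inside normal variation.
    """
    degraded = baseline * (1 + degraded_pct / 100)
    critical = baseline * (1 + critical_pct / 100)
    if degraded <= baseline + variation:
        raise ValueError("degraded threshold falls within normal variation")
    return degraded, critical


# Values from the example: 100 ms baseline, 20 ms normal variation.
degraded_ms, critical_ms = thresholds_from_baseline(
    baseline=100, variation=20, degraded_pct=50, critical_pct=100
)
print(degraded_ms, critical_ms)  # 150.0 200.0
```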

Add Enablement Conditions

Filter out low-traffic or maintenance periods.

Problem: Alerts fire during scheduled maintenance or when traffic is too low to be meaningful.

promql

# Main query
rate(http_server_errors[5m]) / rate(http_server_requests[5m]) > 0.05

Solution: Add enablement condition requiring minimum traffic.

promql

# Enablement condition: only alert when traffic > 10 req/s (rate() returns a per-second value)
sum(rate(http_server_requests[5m])) > 10

Common enablement patterns:

Minimum traffic: sum(rate(requests[5m])) > threshold
Business hours: hour() >= 9 and hour() < 17 (hour() evaluates in UTC)
Specific environments: deployment_environment_name = "production"
Service health: up{service="..."} == 1
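
The interaction between the main query and a minimum-traffic enablement condition can be sketched as follows. The function name, defaults, and sample numbers are assumptions for illustration. Rates are per second, matching what rate() returns.

```python
def should_alert(errors_per_s, requests_per_s, max_error_ratio=0.05, min_requests_per_s=10):
    """Return True only if the enablement condition and the main condition both hold."""
    enabled = requests_per_s > min_requests_per_s  # enablement condition
    if not enabled or requests_per_s == 0:
        return False
    return errors_per_s / requests_per_s > max_error_ratio  # main query


# Low traffic: 1 error in 2 requests is a 50% error ratio, but the check is not enabled.
print(should_alert(errors_per_s=1, requests_per_s=2))    # False
# Meaningful traffic with an 8% error ratio: the check fires.
print(should_alert(errors_per_s=8, requests_per_s=100))  # True
# Meaningful traffic with a 1% error ratio: enabled, but below the threshold.
print(should_alert(errors_per_s=1, requests_per_s=100))  # False
```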

Query Performance Issues

If check evaluation is slow or timing out, optimize the query by adjusting evaluation frequency, reducing cardinality, adding filters, or shortening time windows.

Reduce Evaluation Frequency

Use longer evaluation intervals to reduce load.

Change Evaluate every from 1m to 5m or 10m
Adjust grace period multipliers accordingly
Monitor evaluation duration

Trade-offs:

Shorter intervals (1m): Faster detection, higher cost
Longer intervals (10m): Slower detection, lower cost

Recommendation: Use 1m for critical alerts, 5-10m for non-critical alerts.
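
Because a grace period is a multiple of the evaluation interval, changing the interval also changes how long a violation must last before the check fires. The sketch below shows that arithmetic. The helper names are hypothetical.

```python
import math


def grace_period_minutes(interval_minutes, multiplier):
    """Grace period duration: multiplier x evaluation interval."""
    return interval_minutes * multiplier


def multiplier_for_delay(interval_minutes, target_delay_minutes):
    """Smallest whole multiplier whose grace period covers the target delay."""
    return math.ceil(target_delay_minutes / interval_minutes)


print(grace_period_minutes(1, 3))   # 3: a 3x grace period at a 1m interval
print(grace_period_minutes(5, 3))   # 15: the same 3x at a 5m interval
print(multiplier_for_delay(5, 15))  # 3
print(multiplier_for_delay(5, 3))   # 1: a 5m interval cannot express a 3m grace period
```

When moving from a 1m to a 5m interval, lowering the multiplier keeps the total delay from growing fivefold.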

Simplify Query

Reduce the number of labels in by clauses.

Problem: Too many grouping labels create high cardinality.

promql

# High cardinality: groups by pod, node, and container
sum by (k8s_pod_name, k8s_node_name, k8s_container_name) (rate(metric[5m]))

Solution: Group by fewer labels.

promql

# Lower cardinality: groups by service only
sum by (service_name) (rate(metric[5m]))

Impact: Fewer unique label combinations = faster query execution.
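
To estimate how many result series a grouping produces, count the distinct label combinations. The label values below are invented for illustration, and result_cardinality is a hypothetical helper.

```python
# Hypothetical series labels: one service running 3 pods x 2 containers on 2 nodes.
series_labels = [
    {
        "service_name": "frontend",
        "k8s_pod_name": f"pod-{p}",
        "k8s_node_name": f"node-{p % 2}",
        "k8s_container_name": c,
    }
    for p in range(3)
    for c in ("app", "sidecar")
]


def result_cardinality(series_labels, group_by):
    """Number of distinct groups that `sum by (group_by)` would return."""
    return len({tuple(labels[name] for name in group_by) for labels in series_labels})


print(result_cardinality(
    series_labels, ["k8s_pod_name", "k8s_node_name", "k8s_container_name"]
))  # 6
print(result_cardinality(series_labels, ["service_name"]))  # 1
```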

Add Filters

Narrow the query scope with specific label filters.

Problem: Query scans every service that reports the metric.

promql

# Scans all services
rate(d.span_durations[5m])

Solution: Add service filter.

promql

# Only scans frontend service
rate(d.span_durations{service_name="frontend"}[5m])

Use Shorter Time Ranges

Reduce the lookback window in range selectors.

Problem: Long lookback windows process more data.

promql

# 5-minute window
rate(metric[5m])

Solution: Use shorter window if appropriate.

promql

# 1-minute window (less data to process)
rate(metric[1m])

Trade-offs:

Longer windows: Smoother data, less noise, slower queries
Shorter windows: Faster queries, more noise, less smoothing
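
The smoothing trade-off can be illustrated with a plain moving average. This is a simplification: rate() is not a moving average, but it does spread a spike across the whole window in a similar way. The sample values are made up.

```python
def window_average(samples, window):
    """Average of the most recent `window` samples."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


samples = [10, 10, 10, 10, 60]  # hypothetical per-minute values with a spike at the end

print(window_average(samples, 1))  # 60.0: the short window reacts fully to the spike
print(window_average(samples, 5))  # 20.0: the long window dampens it
```

With a threshold of 50, only the short window would cross it here, which is faster detection at the cost of more noise.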

Debug PromQL Expressions

Use these techniques to troubleshoot PromQL syntax and logic errors by validating syntax, testing in the query builder, and building queries incrementally.

Validate Syntax

Use the PromQL Preview section to check for syntax errors.

Scroll to the PromQL Preview section
Review any error messages displayed
Fix syntax issues highlighted in red

Common syntax errors:

Missing closing parenthesis
Incorrect label matcher syntax (use =~ for regex, not ~)
Invalid function arguments
Mismatched brackets in label selectors

Test in Query Builder

Use the Query Builder to validate your query returns data.

Navigate to Query Data → Query Builder
Enter your PromQL expression
Run the query and review results
Debug any issues before adding to check rule

Start Simple, Add Complexity

Build queries incrementally to isolate issues.

promql

# Step 1: Basic selector
{service_name="frontend"}

# Step 2: Add rate
rate({service_name="frontend"}[5m])

# Step 3: Add aggregation
sum(rate({service_name="frontend"}[5m]))

# Step 4: Add grouping
sum by (http_status_code) (rate({service_name="frontend"}[5m]))

# Step 5: Add filters
sum by (http_status_code) (rate({service_name="frontend", http_status_code=~"5.."}[5m]))

At each step, verify the query works before adding the next piece.

Get Help

If you've tried these troubleshooting steps and still have issues, consult the documentation, contact support, or reach out to the community for assistance.

Check Documentation: Review About Alert Monitoring for alerting concepts
Contact Support: Reach out to Dash0 support with:
- Check rule ID or name
- Expected behavior vs actual behavior
- Screenshots of the configuration
- Preview chart showing data (or lack thereof)
Community: Join the Dash0 community Slack for peer support
