Last updated: May 29, 2026
Troubleshoot Check Rule Issues
This guide helps you diagnose and resolve common issues with check rules, from alerts not firing when expected to dealing with alert fatigue from too many notifications.
Check Rule Not Firing
If your check rule isn't triggering alerts when expected, work through these diagnostic steps to identify whether the issue is with the query, thresholds, grace periods, enablement conditions, or notification configuration.
Verify Query Returns Data
The most common reason for alerts not firing is that the query doesn't match any telemetry.
- Open the check rule
- Review the preview chart below the Query Builder
- Verify that data points appear in the chart
- Check the time range covers recent data
If no data appears:
- Verify the service name, metric name, or label filters are correct
- Check that telemetry is flowing into Dash0 for the queried service
- Check that the range selector (e.g., [5m]) matches your data freshness
- Test with a simpler query to isolate the issue
Example diagnostic query:
    # Start simple
    {service_name="frontend"}

    # Add complexity gradually
    rate({service_name="frontend"}[5m])

    # Add filters one at a time
    rate({service_name="frontend", http_status_code="500"}[5m])
Check Thresholds
Ensure your threshold values are appropriate for the metric scale.
- Review the preview chart to see typical metric values
- Compare the threshold to the current metric value
- Verify the comparison operator is correct (>, <, >=, <=, ==, !=)
Common threshold mistakes:
- Milliseconds vs seconds: Threshold is 500 but metric is in seconds (0.5)
- Percentage as decimal: Threshold is 0.05 but expecting 5%
- Wrong operator: Using > when you need < for error detection
- Too aggressive: Threshold is too sensitive, catching normal variation
Fix:
    # Wrong: comparing milliseconds to seconds
    histogram_quantile(0.99, ...) > 500  # metric is in seconds

    # Right: normalize units
    histogram_quantile(0.99, ...) * 1000 > 500  # convert to milliseconds
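The effect of the unit mismatch can be sketched outside of PromQL. The following Python snippet is only an illustration; the function names and the sample p99 value are hypothetical and not part of Dash0.

```python
# Illustrative sketch: why a millisecond threshold never fires against a
# metric reported in seconds. The names and sample value are hypothetical.

def fires_raw(p99_seconds: float, threshold: float) -> bool:
    """Compare the metric to the threshold without unit conversion."""
    return p99_seconds > threshold


def fires_normalized(p99_seconds: float, threshold_ms: float) -> bool:
    """Convert the metric to milliseconds before comparing."""
    return p99_seconds * 1000 > threshold_ms


p99_seconds = 0.75  # a p99 latency of 750 ms, reported in seconds

print(fires_raw(p99_seconds, 500))         # False: 0.75 is never above 500
print(fires_normalized(p99_seconds, 500))  # True: 750 ms is above 500 ms
```

If the preview chart shows values around 0.5 to 1 while the threshold is in the hundreds, a mismatch like this is a likely cause.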
Review Grace Periods
Long grace periods delay alerts by requiring sustained threshold violations.
- Check the For (trigger) grace period
- Check the Keep firing for grace period
- Temporarily set both to "None (0m)" for testing
Grace period behavior:
- With a 2× (2m) grace period and a 1m interval, the alert fires after 2 consecutive threshold violations
- If the metric drops below threshold once, the grace period resets
- Grace periods prevent noisy alerts but delay detection
Testing tip: Set grace periods to 0× temporarily to verify the query and threshold work, then restore appropriate grace periods.
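The grace period behavior described above can be modeled with a short simulation. This is a simplified sketch, assuming one sample per evaluation and that a 0× grace period fires on the first violation; the function name is hypothetical, and the actual evaluation logic in Dash0 may differ in detail.

```python
from typing import Optional, Sequence


def first_firing_index(
    values: Sequence[float], threshold: float, multiplier: int
) -> Optional[int]:
    """Return the index of the evaluation at which the check would fire.

    Simplified model: the check fires once `multiplier` consecutive
    evaluations exceed the threshold. A multiplier of 0 is treated as
    "fire on the first violation". Returns None if it never fires.
    """
    needed = max(1, multiplier)
    streak = 0
    for index, value in enumerate(values):
        if value > threshold:
            streak += 1
            if streak >= needed:
                return index
        else:
            streak = 0  # a single dip below the threshold resets the count
    return None


samples = [12, 8, 12, 12, 7]  # one value per evaluation, threshold is 10

print(first_firing_index(samples, 10, 0))  # 0: fires immediately
print(first_firing_index(samples, 10, 2))  # 3: the dip at index 1 resets the streak
print(first_firing_index(samples, 10, 3))  # None: never 3 violations in a row
```

The last case shows how a long grace period combined with a flapping metric can keep a check from ever firing.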
Verify Enablement Conditions
If configured, enablement conditions must evaluate to true for the check to fire.
- Scroll to the Enablement Conditions section
- Check if any conditions are configured
- Verify the condition evaluates to true for your current data
Common enablement condition issues:
- Condition requires minimum traffic, but service has low traffic
- Time-based condition (business hours only) doesn't match current time
- Service label doesn't match what the condition expects
Fix: Temporarily remove the enablement condition to test the main query, then refine the condition logic.
Check Notification Channels
Ensure notification channels are properly configured and not muted.
- Navigate to the Notification Channels settings
- Verify the channels assigned to the check rule are enabled
- Test the notification channel to ensure delivery works
- Check for muted channels or routing rules that filter the alert
Channel issues:
- Channel is disabled
- Slack webhook URL expired
- PagerDuty integration key changed
- Email SMTP credentials invalid
- Label-based routing doesn't match the check's labels
Too Many Alerts
If your check rule is too noisy, use these techniques to reduce alert fatigue by requiring sustained threshold violations, aggregating results, adjusting thresholds, or filtering out irrelevant conditions.
Increase Grace Periods
Require sustained threshold violations before alerting.
- Open the check rule editor
- Increase the For (trigger) grace period to 2× or 3×
- Save and monitor the alert frequency
Recommended grace periods:
- Low noise services: 1-2× grace period
- Noisy services: 3-4× grace period
- Flapping metrics: 5× or higher grace period
Impact: With a 3× grace period and 1m interval, the metric must exceed the threshold for 3 consecutive minutes before alerting.
Add Aggregation
Use sum by or avg by to group results and reduce alert count.
Problem: Without aggregation, separate alerts fire for each unique label combination.
    # Without aggregation: separate alert per pod
    rate(d.logs{severity="ERROR", service_name="frontend"}[5m]) > 10
    # Result: 50 alerts (one per pod)
Solution: Add aggregation to create a single alert.
    # With aggregation: single alert per service
    sum(rate(d.logs{severity="ERROR", service_name="frontend"}[5m])) > 10
    # Result: 1 alert (aggregated across all pods)
Aggregation strategies:
- Service-level: sum by (service_name) gives one alert per service
- Environment-level: sum by (deployment_environment_name) gives one alert per environment
- No grouping: sum(...) gives a single alert in total
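To see how grouping changes the number of series that can raise an alert, here is a small Python model of sum by. It is only a sketch: the `sum_by` helper, the pod names, and the per-pod values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Series = Tuple[Dict[str, str], float]  # (labels, value)


def sum_by(series: Sequence[Series], keys: List[str]) -> Dict[tuple, float]:
    """Sum values per unique combination of the given label keys."""
    grouped: Dict[tuple, float] = defaultdict(float)
    for labels, value in series:
        group = tuple((key, labels.get(key)) for key in keys)
        grouped[group] += value
    return dict(grouped)


# 50 hypothetical pods of one service, each with an error rate of 0.5
series = [
    ({"service_name": "frontend", "k8s_pod_name": f"frontend-{i}"}, 0.5)
    for i in range(50)
]

per_pod = sum_by(series, ["service_name", "k8s_pod_name"])
per_service = sum_by(series, ["service_name"])
total = sum_by(series, [])

print(len(per_pod))      # 50 series, so up to 50 separate alerts
print(len(per_service))  # 1 series for the whole service
print(per_service)       # {(('service_name', 'frontend'),): 25.0}
print(len(total))        # 1 series with no labels
```

Note that the aggregated value (25.0 here) is much larger than any per-pod value, so the threshold usually needs to be revisited after adding aggregation.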
Adjust Thresholds
Increase threshold values to reduce sensitivity.
- Review the preview chart to see typical metric values
- Identify the baseline and normal variation range
- Set thresholds above the normal variation
Threshold tuning:
- Baseline: Typical value during normal operation
- Variation: Expected fluctuation range
- Threshold: Set above baseline + variation to catch true anomalies
Example:
- Baseline: 100ms latency
- Normal variation: ±20ms
- Degraded threshold: 150ms (baseline + 50%)
- Critical threshold: 200ms (baseline + 100%)
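The example above can be reproduced from raw samples. The sketch below assumes the median as the baseline and the largest deviation from it as the variation; these choices, the function name, and the sample values are illustrative and not a Dash0 recommendation.

```python
from statistics import median
from typing import Dict, Sequence


def suggest_thresholds(
    samples_ms: Sequence[float],
    degraded_factor: float = 1.5,  # baseline + 50%
    critical_factor: float = 2.0,  # baseline + 100%
) -> Dict[str, float]:
    """Derive baseline, variation, and thresholds from latency samples."""
    baseline = median(samples_ms)
    variation = max(abs(sample - baseline) for sample in samples_ms)
    return {
        "baseline": baseline,
        "variation": variation,
        "degraded": baseline * degraded_factor,
        "critical": baseline * critical_factor,
    }


# Hypothetical latency samples taken during normal operation
result = suggest_thresholds([80, 90, 100, 110, 120])
print(result)
# {'baseline': 100, 'variation': 20, 'degraded': 150.0, 'critical': 200.0}

# Both thresholds sit above baseline + variation (120 ms)
print(result["degraded"] > result["baseline"] + result["variation"])  # True
```

If the degraded threshold ends up inside the normal variation range, raise the factor rather than relying only on grace periods to suppress the noise.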
Add Enablement Conditions
Filter out low-traffic or maintenance periods.
Problem: Alerts fire during scheduled maintenance or when traffic is too low to be meaningful.
    # Main query
    rate(http_server_errors[5m]) / rate(http_server_requests[5m]) > 0.05
Solution: Add enablement condition requiring minimum traffic.
    # Enablement condition: only alert when traffic > 10 req/s (rate() returns a per-second value)
    sum(rate(http_server_requests[5m])) > 10
Common enablement patterns:
- Minimum traffic: sum(rate(requests[5m])) > threshold
- Business hours: hour() >= 9 and hour() < 17 (hour() returns the hour in UTC)
- Specific environments: deployment_environment_name = "production"
- Service health: up{service="..."} == 1
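How a minimum-traffic enablement condition gates the main query can be sketched as follows. This is a simplified model with hypothetical names and default values; it works on per-second rates, which is what rate() produces.

```python
def should_alert(
    errors_per_s: float,
    requests_per_s: float,
    ratio_threshold: float = 0.05,
    min_requests_per_s: float = 10.0,
) -> bool:
    """Evaluate the error ratio only when the enablement condition holds."""
    if requests_per_s <= min_requests_per_s:
        return False  # enablement condition not met: the check stays quiet
    return errors_per_s / requests_per_s > ratio_threshold


# Low traffic: 1 error out of 2 requests is a 50% ratio, but it is suppressed
print(should_alert(errors_per_s=1.0, requests_per_s=2.0))   # False

# Enough traffic and a 10% error ratio: the check fires
print(should_alert(errors_per_s=2.0, requests_per_s=20.0))  # True

# Enough traffic but only a 2.5% error ratio: below the threshold
print(should_alert(errors_per_s=0.5, requests_per_s=20.0))  # False
```

The first case is also why an enablement condition can hide real problems on low-traffic services, so choose the minimum with the quietest service in mind.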
Query Performance Issues
If check evaluation is slow or timing out, optimize the query by adjusting evaluation frequency, reducing cardinality, adding filters, or shortening time windows.
Reduce Evaluation Frequency
Use longer evaluation intervals to reduce load.
- Change Evaluate every from 1m to 5m or 10m
- Adjust grace period multipliers accordingly
- Monitor evaluation duration
Trade-offs:
- Shorter intervals (1m): Faster detection, higher cost
- Longer intervals (10m): Slower detection, lower cost
Recommendation: Use 1m for critical alerts, 5-10m for non-critical alerts.
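When changing the interval, the grace period multiplier usually needs to change with it, because the detection delay is roughly the multiplier times the interval. The following sketch shows the arithmetic; the helper names are hypothetical.

```python
import math


def detection_delay(multiplier: int, interval_minutes: int) -> int:
    """Approximate minutes of sustained violation before the check fires."""
    return multiplier * interval_minutes


def multiplier_for_delay(target_delay_minutes: int, interval_minutes: int) -> int:
    """Smallest multiplier that covers the target delay at the given interval."""
    return math.ceil(target_delay_minutes / interval_minutes)


# A 3x grace period at a 1m interval waits about 3 minutes
print(detection_delay(3, 1))  # 3

# Keeping 3x after switching to a 5m interval waits about 15 minutes
print(detection_delay(3, 5))  # 15

# To stay close to the original 3 minutes at a 5m interval, use 1x
print(multiplier_for_delay(3, 5))  # 1
print(detection_delay(1, 5))       # 5
```

With longer intervals the delay can only be a multiple of the interval, so an exact match with the previous delay is not always possible.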
Simplify Query
Reduce the number of labels in by clauses.
Problem: Too many grouping labels create high cardinality.
    # High cardinality: groups by pod, node, and container
    sum by (k8s_pod_name, k8s_node_name, k8s_container_name) (rate(metric[5m]))
Solution: Group by fewer labels.
    # Lower cardinality: groups by service only
    sum by (service_name) (rate(metric[5m]))
Impact: Fewer unique label combinations = faster query execution.
Add Filters
Narrow the query scope with specific label filters.
Problem: Query scans all services and metrics.
    # Scans all services
    rate(d.span_durations[5m])
Solution: Add service filter.
    # Only scans frontend service
    rate(d.span_durations{service_name="frontend"}[5m])
Use Shorter Time Ranges
Reduce the lookback window in range selectors.
Problem: Long lookback windows process more data.
    # 5-minute window
    rate(metric[5m])
Solution: Use shorter window if appropriate.
    # 1-minute window (less data to process)
    rate(metric[1m])
Trade-offs:
- Longer windows: Smoother data, less noise, slower queries
- Shorter windows: Faster queries, more noise, less smoothing
Debug PromQL Expressions
Use these techniques to troubleshoot PromQL syntax and logic errors by validating syntax, testing in the query builder, and building queries incrementally.
Validate Syntax
Use the PromQL Preview section to check for syntax errors.
- Scroll to the PromQL Preview section
- Review any error messages displayed
- Fix syntax issues highlighted in red
Common syntax errors:
- Missing closing parenthesis
- Incorrect label matcher syntax (use =~ for regex, not ~)
- Invalid function arguments
- Mismatched brackets in label selectors
Test in Query Builder
Use the Query Builder to validate your query returns data.
- Navigate to Query Data → Query Builder
- Enter your PromQL expression
- Run the query and review results
- Debug any issues before adding to check rule
Start Simple, Add Complexity
Build queries incrementally to isolate issues.
    # Step 1: Basic selector
    {service_name="frontend"}

    # Step 2: Add rate
    rate({service_name="frontend"}[5m])

    # Step 3: Add aggregation
    sum(rate({service_name="frontend"}[5m]))

    # Step 4: Add grouping
    sum by (http_status_code) (rate({service_name="frontend"}[5m]))

    # Step 5: Add filters
    sum by (http_status_code) (rate({service_name="frontend", http_status_code=~"5.."}[5m]))
At each step, verify the query works before adding the next piece.
Get Help
If you've tried these troubleshooting steps and still have issues, consult the documentation, contact support, or reach out to the community for assistance.
- Check Documentation: Review About Alert Monitoring for alerting concepts
- Contact Support: Reach out to Dash0 support with:
  - Check rule ID or name
  - Expected behavior vs actual behavior
  - Screenshots of the configuration
  - Preview chart showing data (or lack thereof)
- Community: Join the Dash0 community Slack for peer support
Further Reading
- Create Check Rules — Detailed guide to creating and configuring check rules.
- Investigate Failed Checks — Explore failed checks to understand alert triggers.
- Route Check Rule Notifications — Configure label-based notification routing.
- How the Alerting Model Works — Technical details of Dash0's alerting system.