When you record a metric like http.server.request.duration, you're not tracking one number over time. You're tracking one number per unique combination of labels. Add http.request.method, http.response.status_code, and http.route, and Prometheus stores a separate time series for every distinct combination of those values. Three HTTP methods, five status codes, and fifty routes: that's up to 750 time series for a single metric. Add a region label with ten values and you're at 7,500. This multiplicative growth, one series per unique label set, is what people mean when they talk about a cardinality explosion.
This article explains what cardinality is, how to measure it, what causes it to blow up, and how to tame it.
How cardinality works
Cardinality is why a metric that looks innocent at design time can bring Prometheus to its knees. It's a number, the count of unique time series a metric produces, and it grows fast when you're not watching. Think of a metric name as a template. Every unique combination of label values fills in that template to produce a separate time series. The OpenTelemetry metrics data model treats the total count of such combinations as the metric's cardinality. In the worst case, where every combination actually occurs, that count is the product of unique values across all labels:
    cardinality = |label_1_values| × |label_2_values| × ... × |label_N_values|
So a metric with http.request.method (GET, POST, PUT, DELETE), http.response.status_code (200, 201, 400, 404, 500), and http.route across 20 routes produces up to 4 × 5 × 20 = 400 unique time series. That's still relatively tame. Add a pod name label in Kubernetes with 50 replicas, and you're at 20,000.
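The arithmetic above is easy to script when you want to sanity-check a label design before shipping it. Here is a minimal sketch; the function name `max_cardinality` and the label name `k8s.pod.name` are illustrative assumptions, not part of any library.

```python
from math import prod


def max_cardinality(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(label_value_counts.values())


labels = {
    "http.request.method": 4,        # GET, POST, PUT, DELETE
    "http.response.status_code": 5,  # 200, 201, 400, 404, 500
    "http.route": 20,
}

base = max_cardinality(labels)
# "k8s.pod.name" is a hypothetical label name standing in for the pod label.
with_pods = max_cardinality({**labels, "k8s.pod.name": 50})

print(base, with_pods)  # 400 20000
```

The result is an upper bound: real workloads rarely hit every combination, but the bound is what you have to budget for.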
Each of those time series costs memory. Prometheus keeps active series in RAM, running at roughly 25 MB per 1,000 series. So 100k series is about 2.5 GB, and 1 million series is 25 GB. Push past 2 million and OOMKills become common.
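To turn that rule of thumb into a quick capacity estimate, something like the following works. It is a back-of-the-envelope sketch that assumes the rough 25 MB per 1,000 series figure quoted above and decimal units (1 GB = 1,000 MB); your actual usage depends on label sizes, scrape interval, and Prometheus version.

```python
MB_PER_1000_SERIES = 25  # rough figure from the text, not a measured constant


def estimated_ram_gb(series_count: int) -> float:
    """Estimate RAM for active series in GB (1 GB = 1000 MB)."""
    return series_count / 1000 * MB_PER_1000_SERIES / 1000


for n in (100_000, 1_000_000, 2_000_000):
    print(f"{n:>9,} series ~ {estimated_ram_gb(n):.1f} GB")
```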
Diagnosing cardinality problems
If Prometheus is slow or running out of memory, start with the TSDB status page. Navigate to http://your-prometheus:9090/tsdb-status in the Prometheus UI. It shows series counts by metric name, label names ranked by value count, the label names using the most memory, and the label-value pairs with the most series.
You can also pull the same data from the API:
    curl http://your-prometheus:9090/api/v1/status/tsdb
The response looks like this (truncated for readability):
    {
      "status": "success",
      "data": {
        "headStats": {
          "numSeries": 847293,
          "chunkCount": 1694586
        },
        "seriesCountByMetricName": [
          { "name": "http_server_request_duration_seconds_bucket", "value": 312400 },
          { "name": "http_server_active_requests", "value": 98200 }
        ],
        "labelValueCountByLabelName": [
          { "name": "http_route", "value": 4120 },
          { "name": "pod", "value": 380 }
        ]
      }
    }
seriesCountByMetricName ranks your highest-cardinality metrics. labelValueCountByLabelName shows which label names carry the most distinct values. When http_route shows 4,120 distinct values, you've found your problem.
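If you check this endpoint regularly, a few lines of scripting can rank the offenders for you. The sketch below parses a payload shaped like the truncated response above, embedded as a string so it runs without a live server; the helper name `top_entries` is an assumption, and in practice you would feed it the body returned by the curl call.

```python
import json

# Sample payload mirroring the truncated API response in the text.
SAMPLE = """
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 847293, "chunkCount": 1694586},
    "seriesCountByMetricName": [
      {"name": "http_server_request_duration_seconds_bucket", "value": 312400},
      {"name": "http_server_active_requests", "value": 98200}
    ],
    "labelValueCountByLabelName": [
      {"name": "http_route", "value": 4120},
      {"name": "pod", "value": 380}
    ]
  }
}
"""


def top_entries(status: dict, section: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n largest (name, value) pairs from one section of the response."""
    entries = status["data"].get(section, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


status = json.loads(SAMPLE)
total = status["data"]["headStats"]["numSeries"]
top_metrics = top_entries(status, "seriesCountByMetricName")
top_labels = top_entries(status, "labelValueCountByLabelName")

# Share of all head series owned by the single largest metric.
share = top_metrics[0][1] / total
print(f"{top_metrics[0][0]} holds {share:.0%} of {total:,} series")
print("label with most values:", top_labels[0])
```

Tracking the top metric's share over time is a cheap way to notice a label that has started to grow.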
To check total series count directly in PromQL:
    prometheus_tsdb_head_series
A mid-sized production environment typically sits between 100k and 2 million series. Above 5 million, you have a problem. Above 10 million, it's urgent.
To rank metrics by series count:
    topk(20,
      count by (__name__) ({__name__!=""})
    )
The top entries are where to focus.
What causes cardinality to explode
Most cardinality problems trace back to a handful of label anti-patterns.
Unbounded label values are the leading cause. User IDs, request IDs, session tokens, IP addresses, and full URL paths all fall into this category. Each unique value creates a new series, and if the value space is unbounded, the series count grows without limit. The Prometheus documentation is explicit: do not use labels for high-cardinality dimensions like email addresses or user IDs. Those belong in traces or logs.
Kubernetes pod names are a subtler version of the same problem. If you label metrics by pod name and your deployment scales up and down, you get a fresh batch of series every time. The old series don't disappear immediately; they sit in TSDB until the retention window expires. A deployment that scales from 10 to 50 pods and back to 10 will accumulate series from all three states until they age out.
Dynamic routing paths are common in frameworks that don't normalize routes. If http.route gets set to the raw request path, /api/users/12345 instead of /api/users/{id}, every user ID becomes part of the label value and your route label explodes. Always normalize routes before they reach your metrics.
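Most frameworks can hand you the matched route template directly, and that is the better source. Where one isn't available, a fallback normalizer can be sketched as below. It assumes that identifiers are purely numeric path segments and uses a single `{id}` placeholder for all of them; the function name `normalize_route` is hypothetical.

```python
import re

# A "/" followed by digits only, ending at the next "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_route(path: str) -> str:
    """Collapse numeric path segments into a placeholder and drop the query string."""
    path = path.split("?", 1)[0]
    return _NUMERIC_SEGMENT.sub("/{id}", path)


raw_paths = [f"/api/users/{uid}" for uid in range(1000)]
distinct_routes = {normalize_route(p) for p in raw_paths}

print(len(set(raw_paths)), "raw label values ->", len(distinct_routes), "normalized")
print(normalize_route("/api/users/12345/orders/9?expand=items"))
```

Non-numeric identifiers such as UUIDs or slugs would slip through this pattern, so treat it as a starting point rather than a complete rule set.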
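Histograms multiply cardinality by bucket count. A classic histogram with 10 buckets produces at least 10x the series of a counter with the same labels, because every bucket is its own series and the _sum and _count series come on top. If you have a high-cardinality histogram, the problem compounds quickly. Use native histograms where your Prometheus version supports them (they were introduced as an experimental feature in Prometheus 2.40, so check whether your version still needs a feature flag); they're more efficient and don't require pre-defined bucket boundaries.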
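To see how quickly this adds up, here is a small sketch of the series count for a classic Prometheus histogram. It assumes the standard exposition layout: one `_bucket` series per explicit boundary, one more for `le="+Inf"`, plus `_sum` and `_count`, all repeated for every label combination. The function name is illustrative.

```python
def classic_histogram_series(label_combinations: int, explicit_bounds: int) -> int:
    """Series produced by a classic histogram across all label combinations."""
    bucket_series = explicit_bounds + 1  # explicit boundaries plus le="+Inf"
    per_combination = bucket_series + 2  # plus _sum and _count
    return label_combinations * per_combination


counter_series = 400  # e.g. 4 methods x 5 status codes x 20 routes
histogram_series = classic_histogram_series(400, explicit_bounds=10)

print(counter_series, histogram_series)  # 400 5200
```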
Controlling cardinality
The most effective intervention is prevention: design labels with bounded value sets before your metrics hit production.
A good rule of thumb is that if you can't enumerate all possible values of a label on a whiteboard, it probably shouldn't be a label. Status codes, HTTP methods, service names, environment names, regions: all fine. User IDs, request paths, container IDs: use traces instead.
When you're already in production and a label is causing problems, the OpenTelemetry Collector's transform processor lets you rewrite or drop label values before they reach your backend. Here's a config that normalizes raw URL paths and removes a high-cardinality identifier. Add this under the processors: key in your existing Collector config and reference it in your pipeline:
    processors:
      transform/reduce_cardinality:
        metric_statements:
          - context: datapoint
            statements:
              # Normalize raw paths to parameterized routes
              - replace_pattern(attributes["http.route"], "^/api/users/\\d+", "/api/users/{id}")
              # Drop a high-cardinality label entirely
              - delete_key(attributes, "request.id")
For Prometheus specifically, metric_relabel_configs can drop or replace label values at scrape time:
    metric_relabel_configs:
      - source_labels: [http_route]
        regex: "^/api/users/[0-9]+"
        target_label: http_route
        replacement: "/api/users/{id}"
The OTel View API goes one step further: it drops label dimensions at the SDK level, before data leaves your application. This Java example keeps only three attributes on the request duration metric and discards everything else:
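    import io.opentelemetry.sdk.metrics.InstrumentSelector;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;
    import io.opentelemetry.sdk.metrics.View;
    import java.util.Set;

    SdkMeterProvider meterProvider = SdkMeterProvider.builder()
        .registerView(
            InstrumentSelector.builder().setName("http.server.request.duration").build(),
            // Keep only these three attribute keys; all others are dropped
            View.builder().setAttributeFilter(Set.of("http.request.method", "http.response.status_code", "http.route")).build())
        .build();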
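Conceptually, an attribute filter collapses every data point that differs only in a dropped attribute into a single series. The following stdlib-only sketch illustrates that effect on counter-like values. It is not the OpenTelemetry SDK, and `user.id` is a hypothetical high-cardinality attribute used purely for demonstration.

```python
from collections import defaultdict

KEEP = frozenset({"http.request.method", "http.response.status_code", "http.route"})


def filter_and_merge(points: list[tuple[dict, int]]) -> dict[tuple, int]:
    """Drop attributes outside KEEP and sum values that land on the same series."""
    merged: dict[tuple, int] = defaultdict(int)
    for attrs, value in points:
        key = tuple(sorted((k, v) for k, v in attrs.items() if k in KEEP))
        merged[key] += value
    return dict(merged)


base = {"http.request.method": "GET", "http.response.status_code": 200,
        "http.route": "/api/users/{id}"}
# Three points that differ only in the hypothetical "user.id" attribute.
points = [({**base, "user.id": f"u{i}"}, 1) for i in range(3)]

merged = filter_and_merge(points)
print(len(points), "input series ->", len(merged), "after filtering")
```

The per-user detail is gone after the merge, which is the point: that granularity belongs in traces or logs, not in the metric.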
If you want to remove entire metrics rather than trim their attributes, the OTel Collector filter processor is the right tool. It excludes metrics by name or attribute value before they reach your backend at all.
If you'd rather not touch YAML, Dash0's spam filters give you a point-and-click interface to drop high-cardinality series at the backend level. You define rules in the UI, and Dash0 stops storing the matching series with no Collector config changes required.
The difference between cardinality and churn
These two terms get conflated, but they describe different problems. Cardinality is a count: how many unique time series exist right now. Churn is a rate: how fast new series are created while old ones disappear.
High cardinality stresses memory, index size, and query fanout. Prometheus has to hold all active series in RAM and scan across them on every query. High churn stresses write paths and compaction; constant creation and deletion of series forces frequent compaction cycles and can overwhelm the write-ahead log.
You can have high cardinality with low churn (a stable set of millions of series), or low cardinality with high churn (a small set of series that rolls over rapidly). Both hurt performance, but they call for different fixes.
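One way to make the distinction concrete is to compare two snapshots of the active series set. In this sketch, cardinality is the size of a snapshot and churn is the rate at which new series appear between snapshots. The metric name, pod names, and helper functions are made up for illustration.

```python
def cardinality(snapshot: set[str]) -> int:
    """Number of unique series active in a snapshot."""
    return len(snapshot)


def churn_rate(before: set[str], after: set[str], interval_seconds: float) -> float:
    """New series created per second between two snapshots."""
    return len(after - before) / interval_seconds


# A rollout replaces all ten pods: same series count, entirely new series.
before = {f'requests_total{{pod="api-{i}"}}' for i in range(10)}
after = {f'requests_total{{pod="api-{i}"}}' for i in range(10, 20)}

print(cardinality(before), cardinality(after))        # 10 10
print(churn_rate(before, after, interval_seconds=60))  # new series per second
```

Cardinality is identical before and after the rollout, yet every series was replaced, which is exactly the low-cardinality, high-churn case described above.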
When high cardinality is actually fine
Not all high-cardinality data is a metrics problem. It's often a sign that you're using the wrong telemetry type. Traces handle high-cardinality data well because each trace is a single event with arbitrary attributes attached, not a time series tracked continuously. A user_id attribute on a span is cheap; the same value as a metric label is expensive.
Logs work similarly. A structured log entry can carry dozens of high-cardinality fields without affecting any series count.
The general heuristic: aggregate in metrics, investigate in traces, store per-event detail in logs. If you find yourself fighting cardinality to preserve granularity, that's a sign the data belongs in a different signal type. The Dash0 knowledge base has a good primer on how the three signal types divide responsibilities.
Final thoughts
Managing cardinality well is the difference between a metrics stack that stays responsive at scale and one that starts timing out dashboards just when you need them most.
Dash0's Metric Explorer gives you a Tree Map view of your metrics so you can spot which ones are driving series count up before they cascade into memory pressure or slow queries. Since Dash0 is OpenTelemetry-native, you can use the OTel View API and Collector transforms to control cardinality at the source, and spam filters to handle the rest without touching config files.
Start a free trial to see your metrics, traces, and logs in one place. No credit card required.