What Is Cloud Monitoring?

Cloud monitoring explained: what it collects, why native provider consoles lag and fragment, and how to unify metrics, logs, and traces across clouds.

Cloud monitoring is the practice of collecting metrics, logs, and traces from cloud-hosted infrastructure and applications, then turning that data into alerts and dashboards you can act on. The definition is the easy part. The hard part is that the data lives in at least three different places, and every cloud hands you a separate console to look at its own slice of it.

This article covers what cloud monitoring actually collects, how the data flows from a running instance to an alert, and where the built-in provider tooling quietly leaves you blind.

Where the data comes from

There is no single source of truth for "how your cloud is doing." The signals come from three distinct layers, and understanding them is what makes the rest of cloud monitoring make sense.

The first layer is the provider's own metrics API. AWS exposes CloudWatch, Azure has Azure Monitor, and Google Cloud has Cloud Monitoring. These emit infrastructure-level metrics for the resources you provision: CPU and network for an EC2 instance, request counts for a load balancer, replication lag for a managed database. You get this data for free (or nearly free) without installing anything, because the provider is already measuring its own hardware.

The second layer is agent or collector telemetry running on your hosts. The provider knows the hypervisor sees 60% CPU, but it can't see memory pressure inside the guest OS, disk usage on a mounted volume, or per-process resource consumption. For that you run an agent on the machine, and increasingly that agent is the OpenTelemetry Collector scraping host metrics locally.

The third layer is the application itself. Provider metrics tell you an instance is busy; they don't tell you that a specific checkout request spent 800ms waiting on a downstream payment API. That requires instrumenting your code to emit traces and application metrics. This is where most real incidents are actually diagnosed, and it's the layer provider-native tooling covers worst.

How cloud monitoring works, mechanically

Every cloud monitoring setup runs the same pipeline, regardless of vendor. Data is collected from the three sources above, aggregated into fixed time windows (a raw stream of CPU samples becomes a one-minute average), stored in a time-series backend, and then evaluated against alert rules while being rendered on dashboards. The aggregation step is where a lot of subtle behavior hides. When CloudWatch stores a metric at a five-minute period, it isn't keeping every sample; it's collapsing everything in that window into one number using the statistic you asked for. A five-minute average of 30% CPU can contain a 100% spike that lasted 40 seconds. The spike happened, but the aggregation erased it before you ever saw a data point.
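To make the aggregation effect concrete, here is a minimal sketch using synthetic data. The sample values and the `aggregate` helper are assumptions made for illustration; it mimics the idea of collapsing a window into one statistic and is not how any provider implements it.

```python
# One synthetic CPU sample per second over a five-minute (300 s) window.
# Assumed shape: a 19% baseline with a 40-second spike pinned at 100%.
samples = [19.0] * 130 + [100.0] * 40 + [19.0] * 130


def aggregate(window, statistic):
    """Collapse a window of raw samples into a single data point."""
    if statistic == "Average":
        return sum(window) / len(window)
    if statistic == "Maximum":
        return max(window)
    raise ValueError(f"unsupported statistic: {statistic}")


average = aggregate(samples, "Average")
maximum = aggregate(samples, "Maximum")

print(f"Average: {average:.1f}%")  # just under 30%, looks unremarkable
print(f"Maximum: {maximum:.1f}%")  # the spike only survives in this statistic
```

The same window produces two very different stories depending on the statistic requested, which is why it can be worth storing or querying a maximum alongside the average for anything spiky.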

Where provider-native monitoring falls short

The native tools are convenient and the right starting point. But teams consistently hit the same walls, and knowing about them ahead of time saves a lot of confused debugging.

The most immediate problem is that metrics arrive late and coarse. By default, EC2 publishes at five-minute intervals under basic monitoring; one-minute detailed monitoring costs extra. On top of the collection interval there's real reporting lag before a data point is queryable. If you pull EC2 CPU at a 300-second period, you get exactly what basic monitoring gives you:

bash

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-07-02T09:00:00Z \
  --end-time 2026-07-02T09:15:00Z \
  --period 300 \
  --statistics Average

The output comes back in five-minute buckets, which is fine for capacity trends but useless for catching a 90-second spike:

json

{
  "Label": "CPUUtilization",
  "Datapoints": [
    { "Timestamp": "2026-07-02T09:00:00Z", "Average": 12.4, "Unit": "Percent" },
    { "Timestamp": "2026-07-02T09:05:00Z", "Average": 13.1, "Unit": "Percent" },
    { "Timestamp": "2026-07-02T09:10:00Z", "Average": 47.8, "Unit": "Percent" }
  ]
}

Combined with reporting lag, alerting on five-minute metrics means you routinely learn about a problem 15 or more minutes after it started.
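The arithmetic behind that delay can be sketched roughly as follows. The function name, the number of data points required to alarm, and the three-minute reporting lag are all assumptions chosen for illustration; real values depend on how your alarm is configured and on the service being measured.

```python
def detection_delay_seconds(period_s, datapoints_to_alarm, reporting_lag_s):
    """Return (best_case, upper_bound) seconds from problem start to alert.

    Best case: the problem starts on a period boundary, so every bucket
    it touches is fully affected and breaches.
    Upper bound: the bucket in which the problem starts is diluted by
    healthy samples and does not breach, so up to one extra period passes.
    """
    best = datapoints_to_alarm * period_s + reporting_lag_s
    upper = best + period_s
    return best, upper


# Assumed setup: five-minute period, two breaching data points required,
# and three minutes of reporting lag (a placeholder, not a measured value).
best, upper = detection_delay_seconds(
    period_s=300, datapoints_to_alarm=2, reporting_lag_s=180
)
print(f"best case: {best / 60:.0f} min, upper bound: {upper / 60:.0f} min")
```

Under these assumptions the alert lands somewhere between 13 and 18 minutes after the problem began. Shortening the period is the only term in the sum that you fully control.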

Cost is the next trap. CloudWatch custom metrics are priced per metric per month, and every unique combination of dimensions counts as a separate metric. Add an InstanceId dimension across a fleet, then split by Region and Endpoint, and one logical metric quietly becomes thousands of billable ones. The bill scales along exactly the same axis that makes metrics useful, so the teams who instrument most thoughtfully are the ones who get surprised by the invoice.
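A quick way to see the multiplication is to count the distinct dimension combinations a fleet would emit for one logical metric. The fleet shape, the endpoint names, and the instance ID format below are hypothetical; the point is only that each unique tuple counts as its own metric.

```python
# Hypothetical fleet: 4 regions, 50 instances per region, 25 endpoints each.
regions = ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"]
instances_per_region = 50
endpoints = [f"/api/v1/route-{n}" for n in range(25)]


def dimension_sets():
    """Yield every (Region, InstanceId, Endpoint) combination the fleet emits."""
    for region in regions:
        for i in range(instances_per_region):
            instance_id = f"i-{region}-{i:04d}"  # made-up ID format
            for endpoint in endpoints:
                yield (region, instance_id, endpoint)


# Each unique combination is a separate billable metric.
billable_metrics = len(set(dimension_sets()))
print(f"1 logical metric -> {billable_metrics} billable metrics")
```

One logical metric becomes 5,000 billable ones here, and adding a single extra dimension with ten values would make it 50,000. Multiply the count by your provider's current per-metric price to estimate the monthly cost.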

Then there's fragmentation. CloudWatch is region-scoped by default. A cross-region or multi-account view requires manual aggregation through additional services. Run workloads on more than one provider and the problem compounds: Azure Monitor and Cloud Monitoring each have their own query language, their own dashboards, and their own conventions, so "how is my system doing" turns into three browser tabs with no correlated view.

The deepest problem, though, is that you're watching infrastructure rather than requests. Provider consoles give you metrics and logs out of the box, but distributed tracing is a separate product bolted on the side (AWS X-Ray, for example, sits behind CloudWatch). You can see an instance is healthy and still have no idea why a user's request is slow, because the slow part is a call path across four services that no single infrastructure dashboard represents.

What this looks like in practice

Picture a checkout service that starts timing out during a sale. The CloudWatch dashboard is entirely green: CPU is at 35%, memory is fine, the load balancer shows healthy targets. Every infrastructure signal says the system is healthy, and yet customers are stuck on a spinner of death at payment.

The answer isn't in any infrastructure metric. A trace of a single slow checkout shows the request spending three seconds inside a call to an inventory service, which is itself blocked on a database connection pool that maxed out. The instance running checkout was never busy because it was sitting idle waiting on a downstream dependency. Infrastructure monitoring showed you the hosts. It couldn't show you the request, and the request was the whole problem.
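As a rough illustration of how a trace localizes the delay, the sketch below computes each span's self time, meaning its duration minus the time spent in its children. The span names, IDs, and durations are invented for this scenario, and the calculation assumes child spans run sequentially rather than in parallel.

```python
from collections import defaultdict

# Hypothetical spans from one slow checkout trace (durations in milliseconds).
spans = [
    {"id": "a", "parent": None, "name": "POST /checkout", "duration_ms": 3150},
    {"id": "b", "parent": "a", "name": "GET inventory-service/reserve", "duration_ms": 3000},
    {"id": "c", "parent": "b", "name": "acquire db connection", "duration_ms": 2900},
    {"id": "d", "parent": "b", "name": "SELECT stock", "duration_ms": 40},
]


def self_times(spans):
    """Map span name -> time spent in the span itself, excluding children."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {s["name"]: s["duration_ms"] - child_total[s["id"]] for s in spans}


times = self_times(spans)
slowest = max(times, key=times.get)
print(f"most time spent in: {slowest} ({times[slowest]} ms)")
```

The checkout span itself accounts for only 150 ms of work, which matches an idle host. Nearly all of the wall-clock time sits in the connection pool wait two services downstream.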

This is the core reason cloud monitoring built purely on provider-native metrics leaves teams stuck. It answers "are my machines okay" and stays silent on "is my application okay."

Unifying the layers with OpenTelemetry

The way out of console-hopping is to collect all three layers through one vendor-neutral pipeline. OpenTelemetry gives you a standard format for metrics, logs, and traces, and the Collector can gather host metrics, scrape provider metrics, and receive application telemetry in a single process, then ship everything to one backend.

A minimal Collector configuration that pulls host metrics locally and accepts application telemetry over OTLP (OpenTelemetry Protocol) looks like this:

yaml

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      network:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: Bearer ${env:DASH0_AUTH_TOKEN}

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, otlp]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]

The hostmetrics receiver covers what the provider can't see inside the guest, sampling every ten seconds rather than every five minutes. The disk scraper reports I/O; add the filesystem scraper if you also want usage on mounted volumes. If you're running the Collector in Kubernetes, you'll need additional configuration (host filesystem mounts and security context settings) for complete visibility. The otlp receiver takes in traces, metrics, and logs from your instrumented services. All three signals land in the same place with the same resource attributes, so a slow trace links directly to the host and container it ran on.

For a deeper walkthrough of the hostmetrics receiver, including dashboards and alerting patterns, see the Infrastructure Monitoring with OpenTelemetry Host Metrics guide.

Common pitfalls

A few failure modes catch experienced engineers off guard because they look correct right up until an incident.

Watch out for alerting on averages instead of percentiles. A five-minute average latency of 200ms feels safe, but if your P99 is 4 seconds, one in a hundred users is having a terrible time and your dashboard is hiding it. Alert on high percentiles for anything user-facing.
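The gap between the two views is easy to reproduce with synthetic numbers. The latency values below are made up, and `percentile` is a small nearest-rank helper written for this sketch rather than a library call.

```python
import math

# Synthetic latencies for 100 requests: 98 fast ones and 2 very slow ones.
latencies_ms = [122] * 98 + [4000] * 2


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.0f} ms, p50: {p50} ms, p99: {p99} ms")
```

The mean lands at roughly 200 ms and the median at 122 ms, while the P99 is a full four seconds. An alert threshold of 500 ms on the average would never fire on this data.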

Also don't assume high-resolution metrics stay high-resolution. CloudWatch keeps sub-minute custom metrics for only three hours and one-minute data for 15 days before rolling them up to coarser periods. If you investigate an incident from last month, the second-by-second detail you paid extra to collect has already been aggregated away, so postmortems on older incidents are coarser than you expect.
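A small lookup makes it easy to check what resolution is still available before you start a postmortem. The tiers below reflect CloudWatch's documented retention schedule as best understood at the time of writing, so verify them against the current documentation. The function name is an assumption, and the one-second tier applies only to high-resolution custom metrics.

```python
from datetime import timedelta

# (maximum data age, finest period still available, in seconds).
# Assumed from CloudWatch's published retention schedule.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # high-resolution custom metrics only
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age):
    """Return the finest period available for data of the given age,
    or None if the data has expired entirely."""
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None


for label, age in [("1 hour", timedelta(hours=1)),
                   ("1 week", timedelta(days=7)),
                   ("30 days", timedelta(days=30))]:
    print(f"incident {label} ago -> finest period {finest_period_seconds(age)} s")
```

An incident from 30 days ago can only be examined in five-minute buckets, no matter what resolution was collected at the time.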

Finally, pulling metrics straight from a provider API needs nothing installed, which is exactly why it can't see memory pressure, thread pool exhaustion, or garbage collection pauses happening inside your process. Those in-process problems are the ones that page you at 3am, and they only appear when something is actually running on the host.

Final thoughts

Cloud monitoring is easy to start and hard to do well, because the signals that explain your incidents live in the application layer that provider consoles cover least. The teams that stay ahead of problems collect infrastructure metrics, host metrics, logs, and traces through one pipeline and correlate them in a single view, rather than reconstructing the story across a CloudWatch tab, an Azure Monitor tab, and a separate tracing tool.

Dash0 is OpenTelemetry-native, so the same Collector you'd run anyway feeds infrastructure monitoring, real-time log management, and distributed tracing into one place, across every cloud and region, with no per-provider console to stitch together. When a checkout starts timing out, you can jump from the slow trace to the host metrics to the logs without changing tools. Start a free trial to see your metrics, logs, and traces from every cloud in a single view. No credit card required.