Last updated: May 4, 2026
What is Log Management?
It's 2am. Your checkout service is returning 5xx errors.
You open your logging tool and find 800 nearly identical
PaymentTimeout lines, over and over, no trace context,
no service identifiers, three services involved and no
clear way to tell which one is actually broken. You scroll.
You grep. You ask colleagues in Slack if they touched
anything. Thirty minutes later you're still not sure
where to look.
That scenario is a log management problem, not a code problem. In this article we'll walk through the principles behind effective modern log management and how to put them into practice.
In monolithic or single-server deployments, log management mostly meant gathering log files from a few servers and making them searchable in one place. That definition is too narrow for modern distributed systems. Today, effective log management also means reducing noise, standardizing structure, preserving context, and keeping the system affordable as log volume grows across containers, services, cloud platforms, and serverless infrastructure.
At a minimum, effective log management includes collecting logs from different sources, sending them to a central system, organizing and storing them, and searching and analyzing them when something goes wrong.
In modern environments, it also extends to filtering noisy or low-value logs, deduplicating repeated events, standardizing fields and formats, and managing retention and pricing.
Why log management matters
Logs are still the most widely used record of what happened inside a system. They capture request failures, retries, startup problems, authentication events, background job activity, deployment side effects, and infrastructure issues.
When something breaks, logs are often the first place engineers look because they contain the event-level detail that tells you what happened. The value of log management goes beyond centralization: it's about making those records reliable enough to support debugging, operations, security work, and incident response while you're under pressure and the clock is running.
Types of logs
Most teams aren't managing a single kind of log. They're managing several at once.
Application logs describe what the application is doing: handling requests, validating data, processing business events, and reporting warnings or exceptions.
System and host logs come from the operating system and runtime environment. They help explain lower-level failures like process crashes, disk pressure, or network issues.
Infrastructure and platform logs come from components like proxies, orchestrators, load balancers, databases, and managed cloud services.
Security and audit logs record authentication, authorization, administrative changes, and policy activity.
In distributed systems these categories constantly overlap, so log management needs to work across layers, connecting signals in context rather than stitching together isolated logs after the fact.
Log management starts at the source
You can have the best log management process in the world, but if your logs are bad it won't matter. OpenTelemetry is an observability framework and toolkit that gives you the APIs, SDKs, and Collector you need for accurate, scalable logging.
Adopting OpenTelemetry may take some adjustment in your tech stack, but it doesn't mean rebuilding everything. In many cases, applications already emit logs in JSON format, which is ideal. But even when logs are JSON there's a distinction that matters: structured versus unstructured output.
Unstructured logs are plain text messages that are easy for humans to read but difficult for tools to reliably parse. Structured logs are emitted as well-defined fields with embedded context that systems can process directly.
That difference matters more as volume grows. Plain text may be familiar, but it's much harder to query, enrich, group, and standardize across services. Structured output is what makes logs manageable in high-volume distributed environments.
Below is a simple comparison.
Unstructured:
```
[2025-04-13 10:32:01] ERROR Failed to process payment for user 8821 — timeout after 3000ms
```
Structured:
```json
{
  "timestamp": "2025-04-13T10:32:01Z",
  "level": "ERROR",
  "message": "Failed to process payment",
  "user": {
    "id": 8821
  },
  "error": {
    "type": "PaymentTimeout",
    "timeout_ms": 3000
  }
}
```
At a glance, both logs contain similar information. The difference is how that information can be used.
In the structured example, each piece of information lives in a dedicated field. You can filter, group, and analyze logs using those fields directly, without relying on fragile text parsing.
To find all payment timeout errors for a specific user, you'd query:
```
service.name = "payment-service"
AND error.type = "PaymentTimeout"
AND user.id = 8821
AND timestamp >= now() - 1h
```
With unstructured logs, the same query would require a regex against the message body, and would break the moment someone changed the wording. With structured fields, the query is stable and composable.
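A small plain-Python sketch makes that difference concrete (the strings mirror the examples above; the variable names are illustrative):

```python
import json
import re

unstructured = "[2025-04-13 10:32:01] ERROR Failed to process payment for user 8821 - timeout after 3000ms"
structured = (
    '{"timestamp": "2025-04-13T10:32:01Z", "level": "ERROR", '
    '"user": {"id": 8821}, "error": {"type": "PaymentTimeout", "timeout_ms": 3000}}'
)

# Unstructured: a regex tied to the exact wording of the message.
# Changing "timeout after" to "timed out after" silently breaks it.
match = re.search(r"user (\d+).*timeout after (\d+)ms", unstructured)
user_id, timeout_ms = int(match.group(1)), int(match.group(2))

# Structured: stable field access, independent of message wording.
record = json.loads(structured)
assert record["user"]["id"] == user_id == 8821
assert record["error"]["timeout_ms"] == timeout_ms == 3000
```

Both snippets extract the same values, but only the structured path survives a copy edit to the log message.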
The OpenTelemetry logs data model
Structure defines how logs are formatted, but consistency is what makes them usable across systems. To get consistency, logs need a shared data model. OpenTelemetry, a vendor-neutral framework for collecting, processing, and transmitting telemetry data, provides exactly that.
OpenTelemetry defines the logs data model, which standardizes how logs are represented. A LogRecord (an individual entry within this model) includes a timestamp, a severity level, a message body, structured attributes, and trace context, including trace and span identifiers that link the log to the request that produced it.
Fields like trace_id and span_id let logs be connected
across services. Logs aren't just strings anymore. They're
structured records that can be queried, correlated, and
linked to traces and metrics.
Beyond the structure of individual logs, OpenTelemetry also defines semantic conventions: standardized field names and meanings that ensure logs from different services describe the same concepts the same way.
Instead of one service using userId, another using uid,
and a third embedding the value inside a message, semantic
conventions define a consistent field like user.id. The
same applies to attributes like service.name,
http.request.method (formerly http.method, now
deprecated), or error.type.
Those fields can then be used directly for filtering, grouping, and correlation without any guesswork.
What it means to be OpenTelemetry-native
Adopting OpenTelemetry isn't just about sending logs through a Collector. It requires producing logs that align with the data model, semantic conventions, and context requirements from the start.
In a non-native setup, logs might still be emitted as plain text or loosely structured JSON. They might be missing consistent field naming, lacking trace context, or difficult to correlate across systems.
In an OpenTelemetry-native approach, logs are structured as
JSON, aligned with the logs data model, using semantic
conventions for field names (service.name, user.id,
error.type), and enriched with trace and span context at
the time they're created.
This reduces the need for downstream fixes. When logs already follow a consistent structure and naming scheme, collectors and backends don't need to spend as much effort parsing, transforming, or normalizing them later.
Being OpenTelemetry-native isn't binary, and nobody expects perfection on day one. It means designing toward a system where logs are emitted consistently and with enough context to be correlated, queried, and operated at scale.
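As a sketch of what native output can look like in application code (plain Python; the helper function and the trace/span IDs are illustrative, not an OpenTelemetry API):

```python
import time

def make_log_record(body, severity_text, trace_id, span_id, **attributes):
    """Build a dict shaped along the lines of the OTel logs data model."""
    return {
        "timestamp": time.time_ns(),   # nanosecond precision, as in the data model
        "severity_text": severity_text,
        "body": body,
        "attributes": attributes,      # semantic-convention keys like user.id
        "trace_id": trace_id,          # 32 hex chars, ties the log to a trace
        "span_id": span_id,            # 16 hex chars, ties it to a span
    }

# In a real service, trace_id and span_id come from the active span context
# rather than being passed in as literals.
record = make_log_record(
    "Failed to process payment",
    "ERROR",
    trace_id="0af7651916cd43dd8448eb211c80319c",
    span_id="b7ad6b7169203331",
    **{"service.name": "payment-service", "user.id": 8821, "error.type": "PaymentTimeout"},
)
```

The point is the shape: every field a query might need already lives in a dedicated slot at emission time.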
Log ingestion and standardization in the Collector
Once logs leave your applications, the next step is getting them into a system where they can be processed, standardized, and analyzed.
Logs originate from many places: written to files, printed
to stdout, emitted by application libraries, or produced
by infrastructure services. In containerized environments,
logs are often written to stdout and automatically
captured by the platform.
The OpenTelemetry Collector handles this well. It provides a unified way to ingest logs from multiple sources using receivers.
Depending on how logs are produced, different receivers apply. The filelog receiver reads logs from files. The OTLP receiver accepts logs sent over HTTP or gRPC from instrumented applications. Other receivers can ingest logs from system services or streaming platforms.
If you control the application, the ideal approach is to emit structured, single-line JSON logs directly.
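With Python's standard logging module, for example, a small custom formatter is enough to get one JSON object per line (the field names here are illustrative, not a standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single line of JSON so collectors can parse it."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Carry structured attributes passed via `extra=` through to the output.
        if hasattr(record, "attributes"):
            payload.update(record.attributes)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Failed to process payment",
             extra={"attributes": {"user.id": 8821, "error.type": "PaymentTimeout"}})
```

This is deliberately minimal; a production setup would also handle exceptions, timezone-correct timestamps, and trace context.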
But real systems are messier. Some services emit structured JSON, others plain text, and others something in between. The Collector's second role covers this: standardization and enrichment.
Unlike application-level logging (which defines how logs are created), the Collector operates on logs after they've been produced. It can normalize and reshape incoming data so downstream systems receive something more consistent.
The Collector can parse unstructured or semi-structured logs
into structured fields (when patterns are predictable),
rename inconsistent fields (userId to
user.id), map attributes to
OpenTelemetry semantic conventions, add missing metadata
like service.name, environment, or region, and enrich
logs with trace context if that context exists elsewhere
in the pipeline.
Even if applications aren't perfectly aligned, whether they're missing structure or using inconsistent field names, the Collector can bring logs closer to a unified format.
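As a toy model of that renaming step (plain Python, not Collector code; the rename map is our own assumption):

```python
# Map of legacy field names to semantic-convention names (illustrative).
RENAMES = {"userId": "user.id", "uid": "user.id", "svc": "service.name"}

def normalize(log: dict) -> dict:
    """Rename known legacy attribute keys and leave everything else intact."""
    return {RENAMES.get(key, key): value for key, value in log.items()}

raw = {"userId": 8821, "svc": "checkout", "message": "payment failed"}
clean = normalize(raw)
assert clean == {"user.id": 8821, "service.name": "checkout", "message": "payment failed"}
```

In a real pipeline this logic lives in Collector processors rather than application code, but the principle is the same: map inconsistent names onto one convention before anything downstream depends on them.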
At the source, you emit structured JSON logs with intentional context. In the Collector, you normalize, enrich, and align logs across services. Think of the Collector as a force multiplier, not a replacement for good logging practices. It helps enforce consistency at scale, but the quality and usefulness of your logs are still determined upstream.
The telemetry pipeline
With the Collector's role established, here's how it fits into the full telemetry pipeline. After ingestion, logs move through a processing pipeline before reaching storage and analysis. Buffering, parsing, enrichment, filtering, routing, and export all happen at this stage.
The pipeline is typically implemented with a collector as the central processing layer. Instead of every service sending raw records directly to a backend in its own format, a collector receives them, applies logic, and forwards them onward.
The OpenTelemetry Collector is built around receivers, processors, and exporters. It gives you a control point for collecting, enriching, redacting, dropping, batching, and routing telemetry before it reaches storage.
A production-ready pipeline includes a few components beyond receivers and exporters. Here's a realistic starting point:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    send_batch_size: 1000
    timeout: 5s

exporters:
  otlp:
    endpoint: https://your-backend:4317

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```
Even in this basic setup, the separation of concerns is
clear: logs are received, stabilized by the memory limiter,
batched for efficient transport, and forwarded to storage,
without each service needing to manage that complexity
itself. The memory_limiter processor should always appear
first in the processor chain because it protects the
Collector from being overwhelmed by sudden volume spikes.
This is also where the earlier concepts (structured
logging, semantic conventions, context) get enforced
consistently at scale.
Reducing signal noise and lowering costs
It seems logical to assume that more logs mean better insight. In practice, uncontrolled volume produces the opposite: more noise, higher cost, and slower investigation.
Log levels as the first line of control
Log levels define the importance of events, and they're one of the primary ways systems control how much signal is produced. They determine what stands out, what gets ignored, and how easy it is to work with logs at scale.
Log levels should be used intentionally. Each level represents a different type of event, and in many systems these levels map to numeric ranges that determine how logs are filtered and processed:
- 1-4 TRACE: highly detailed, step-by-step diagnostics
- 5-8 DEBUG: diagnostic information useful during development or troubleshooting
- 9-12 INFO: normal, meaningful application events and state changes
- 13-16 WARN: unexpected or undesirable conditions that don't stop execution
- 17-20 ERROR: failures that prevent a specific operation from completing
- 21-24 FATAL: failures that may cause the system or service to stop
Each level actually contains four sub-variants (such as FATAL through FATAL4), allowing for finer-grained severity within the same category.
These numeric ranges let systems apply thresholds (for example, "log everything at INFO and above"), making it possible to control how much data is emitted or collected without changing the meaning of individual log entries.
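As an illustration (plain Python, not any particular library), a threshold check over those numeric ranges might look like this:

```python
# Minimum severity number for each level, per the OpenTelemetry ranges above.
SEVERITY = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}

def passes_threshold(level, minimum="INFO"):
    """Return True if a record at `level` clears the configured threshold."""
    return SEVERITY[level] >= SEVERITY[minimum]

# "Log everything at INFO and above": WARN passes, DEBUG does not.
assert passes_threshold("WARN")
assert not passes_threshold("DEBUG")
```

The record itself is unchanged either way; only the decision about whether to emit or collect it depends on the threshold.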
In modern systems, log levels don't have to be static. Instead of fixing verbosity at deploy time, they can be adjusted dynamically at runtime. You can temporarily increase detail for a specific service, request, or user when investigating an issue, then return to normal levels once the problem is understood.
This keeps baseline log volume low while still making detailed context available during incidents, without requiring redeployments. If everything is logged at the same level, the system loses its ability to distinguish signal from noise.
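In Python's standard logging module, for instance, verbosity can be changed at runtime without redeploying (the trigger mechanism, such as an admin endpoint or config watcher, is left out of this sketch):

```python
import logging

logger = logging.getLogger("checkout-service")
logger.setLevel(logging.INFO)

# Baseline: DEBUG records are suppressed, keeping volume low.
assert not logger.isEnabledFor(logging.DEBUG)

# During an investigation, raise verbosity for this one service at runtime.
logger.setLevel(logging.DEBUG)
assert logger.isEnabledFor(logging.DEBUG)

# Once the problem is understood, return to the baseline level.
logger.setLevel(logging.INFO)
```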
Filtering low-value logs
Not every log message is equally useful. High-frequency entries like debug statements, trace messages, health checks, and routine success confirmations add noise without helping engineers understand failures or investigate incidents.
When these low-value logs are forwarded unchanged, they increase ingestion and storage costs, consume indexing resources, slow down searches, and bury the records that actually matter: warnings, errors, failed requests, and dependency issues.
Filtering should happen as early as possible in the telemetry pipeline. The OpenTelemetry Collector can evaluate logs before export and drop records that match predefined conditions, so you reduce noise at the processing layer rather than paying to ship and store everything.
The following example uses the Collector's filter
processor to exclude low-value log records:
```yaml
processors:
  filter:
    logs:
      log_record:
        - severity_number <= SEVERITY_NUMBER_DEBUG4
        - 'IsMatch(body, "(?i)(debug|trace)")'
```
This configuration defines filtering rules at the
log_record level. Each rule is evaluated against incoming
logs, and matching records are dropped.
The first rule:
severity_number <= SEVERITY_NUMBER_DEBUG4
filters out logs whose severity is DEBUG or lower, removing records at the most verbose end of the severity scale. These logs are often useful during active development or deep troubleshooting, but in production they tend to generate large amounts of noise with limited long-term value.
The second rule:
'IsMatch(body, "(?i)(debug|trace)")'
filters logs based on their message content. It uses a
regular expression to match log bodies containing the words
"debug" or "trace." The (?i) flag makes the match case
insensitive, so DEBUG, Debug, and trace are all
caught. This helps catch logs that don't have a proper
severity level assigned but still clearly contain low-value
diagnostic output in the message text.
Together, these two rules provide layered filtering. The first removes logs based on structured severity metadata; the second catches noisy messages based on content. This matters because not all applications emit logs consistently. Some services set severity fields correctly, while others place diagnostic clues only in the body text. Using both approaches makes filtering more reliable across mixed workloads.
Applying this kind of filtering in the Collector can significantly reduce unnecessary log volume and improve the usefulness of the data you retain. The result is a logging pipeline that costs less, is easier to search, and stays focused on operationally meaningful events.
You can learn more about filters in the Dash0 OpenTelemetry Filter Processor Guide.
Log deduplication
A single failure rarely produces a single log entry. Retries, cascading errors, and repeated emissions from multiple components can generate dozens or thousands of nearly identical messages. Each entry reflects the same underlying issue, but the volume makes it harder to see patterns and drives up ingestion, storage, and query costs.
Log deduplication groups identical or near-identical entries over a defined time window and represents them as a single event with a count. You preserve the signal that an issue is happening repeatedly without flooding downstream systems with redundant data.
Deduplication depends on defining what "identical" means, typically based on a subset of fields like the message body, error type, or other stable attributes. Choosing the right fields matters: too strict, and you miss duplicates; too loose, and unrelated events get grouped together. Time windows also matter. Short intervals preserve more granularity, while longer ones provide stronger compression but may obscure short-lived spikes.
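The mechanics can be sketched in a few lines of plain Python (illustrative only; the Collector's processor is more sophisticated):

```python
from collections import Counter

# One window of events, keyed only on stable fields (body, error type).
# Volatile fields like request IDs or timestamps are excluded from identity;
# otherwise no two records would ever count as duplicates.
window = [
    ("Connection refused", "ConnError"),
    ("Connection refused", "ConnError"),
    ("Connection refused", "ConnError"),
    ("Failed to process payment", "PaymentTimeout"),
]

def deduplicate(events):
    """Collapse identical records in one window into single records with counts."""
    counts = Counter(events)
    return [
        {"body": body, "error.type": etype, "dedup_count": n}
        for (body, etype), n in counts.items()
    ]

deduped = deduplicate(window)
assert len(deduped) == 2                            # 4 raw records -> 2
assert sum(d["dedup_count"] for d in deduped) == 4  # repetition count preserved
```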
Where deduplication happens also matters. When applied in the processing pipeline (in a collector, for example), it reduces volume before logs reach storage or indexing. This makes it effective for cost control, since fewer records need to be stored and queried.
With the OpenTelemetry Collector, duplicate logs are handled by the log deduplication processor. Below is an example configuration that groups identical records over a one-second window and records how many times each occurred, while ignoring fields such as request IDs and timestamps that vary between otherwise identical logs:

```yaml
processors:
  logdedup:
    interval: 1s
    log_count_attribute: dedup_count
    exclude_fields:
      - attributes.request_id
      - attributes.timestamp
```
The processor groups identical logs over a defined interval and emits a single record with a count, making patterns easier to spot without overwhelming the system.
Context is what makes logs useful
Logs without context are just data. A log line becomes far
more valuable when it tells you where it came from and what
it relates to. Fields like service.name,
deployment.environment.name, and other resource attributes
describe the system that produced the log. You can group
logs by service, filter by environment, and understand
exactly which part of your infrastructure an event belongs
to.
Not all context belongs directly on the log itself, though. Some attributes describe the entity producing telemetry rather than the individual event. In OpenTelemetry, these are resource attributes. They define things like the service, host, environment, or cloud region that generated the log.
Keeping this information consistent across services
matters. If one service reports checkout-service and
another reports checkout, or if environments are labeled
differently, grouping and filtering become unreliable fast.
The resource processor in the OpenTelemetry Collector handles this well.
It operates at the level of the service or infrastructure
rather than individual log records. It lets you define,
normalize, and enforce resource attributes across all
telemetry flowing through the pipeline: ensuring
service.name is consistent across deployments, adding
environment metadata like
deployment.environment.name=production, attaching
infrastructure context like region or host, and overriding
incorrectly set attributes.
Here's an example resource processor configuration that normalizes service identity and attaches environment metadata:
```yaml
processors:
  resource:
    attributes:
      - key: service.name
        value: checkout-service
        action: upsert
      - key: deployment.environment.name
        value: production
        action: insert
      - key: cloud.region
        value: us-east-1
        action: insert
```
The upsert action sets the attribute whether or not it
already exists, which is useful for enforcing a canonical
value. The insert action only sets the attribute if it
isn't already present, which is safer for metadata that
services may already provide correctly. A full reference of
supported actions is available in the
Dash0 OpenTelemetry Resource Processor Guide.
By handling this at the resource level, you avoid repeating the same metadata in every log while keeping identity consistent across the service.
Context also includes how a log relates to other events.
Identifiers like trace_id and span_id let logs be
connected across services, so instead of seeing isolated
messages, you can follow a single request as it moves
through different components.
Resource attributes tell you where something happened. Trace context tells you how it connects to other events. When both are present and consistent, logs become part of a larger system of signals that can be grouped, filtered, and explored together.
That's the shift: instead of asking "what does this log mean?", you can ask "what happened across the system?" and use logs, traces, and other signals to find the answer.
Exporting logs
Once logs have been collected, structured, and refined, the final step is exporting them to a system where they can be stored, queried, and analyzed. All the earlier decisions (what to log, how to structure it, how much to keep) have a direct impact here.
Exporting isn't simply about sending data somewhere. It's about deciding where different logs should go and in what form. The OpenTelemetry Collector supports multiple exporters simultaneously, so logs can be fanned out to different destinations in a single pipeline. Security logs might go to a SIEM, high-severity errors to a dedicated alerting backend, and everything else to long-term storage.
The Collector provides exporters for a wide range of destinations: OTLP-compatible backends, object storage, logging platforms, and more. Each exporter can be attached to a pipeline independently, giving teams control over what reaches each system without duplicating instrumentation.
Here's an example that routes logs to two backends simultaneously:
```yaml
exporters:
  otlp/primary:
    endpoint: https://primary-backend:4317
  otlp/siem:
    endpoint: https://siem-backend:4317

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/primary, otlp/siem]
```
Both exporters receive the same processed log stream. Your service instrumentation stays unchanged; only the Collector config determines where data lands. This is particularly useful during backend migrations: you can send logs to two platforms in parallel, compare how each surfaces and queries your data, and switch over when you're ready without touching application code.
The earlier pipeline steps matter here too. Log levels reduce unnecessary data at the source. Filtering removes low-value events before they move downstream. Deduplication compresses repeated signals into a single record. By the time logs reach the export stage, they should already be a smaller, higher-quality stream routed to the right destination rather than broadcast everywhere by default.
When logs include trace_id and span_id, exporting them
with that context intact preserves their relationship to
traces and the broader view of system behavior, regardless
of which backend they land in.
Exporting is where log management decisions become concrete. It's where data leaves your control and becomes part of a system you query, monitor, and pay for. Getting this step right means that what reaches your backend is better data, not just more of it.
Putting it together: correlation, context, and triage
When an incident starts, correlation is what separates a
quick resolution from a long one. Every log emitted during
a request carries the same trace_id, so you can follow a
single request across every service it touches, moving from
a log entry in one service to the full trace, and from
there to any other log or span sharing that identifier.
Resource attributes do the complementary work. Fields like
service.name, deployment.environment.name, and
host.name let you filter down to a specific service,
environment, or host without writing complex queries. A
deduplicated log showing a connection error 800 times in a
one-second window tells you far more, far faster, than 800
identical lines to scroll through.
What this looks like in practice
Say your Product Catalogue service starts returning 5xx errors at 2am. Here's how structured, contextualized logs speed up triage:
- Filter by severity and service. Set otel.log.severity.range = ERROR and service.name = productcatalogservice. The log list immediately narrows to error-level entries from that service alone.
- Open a log record. You can see the log body Product Id Lookup Failed: OLJCESPC7Z, its attributes (ProductCatalogService Fail Feature Flag Enabled), and a Span Context section with a mini trace waterfall already visible.
- Read the span context. Because logs and traces share context identifiers, you can pull up every span from the same request. The span context shows the log was emitted during the ProductCatalogService GetProduct span, which took 4ms and accounted for 34% of the total 11ms trace.
- Check related logs. Because logs carry resource context, you can view every other log the same service emitted in that time window, threaded chronologically around the error. The active log is pinned at its exact timestamp, with older entries above and newer ones below, each annotated with a relative time offset. In this case the picture is immediate. Milliseconds before the Product Id Lookup Failed error, the service was successfully resolving product lookups for multiple items. The failure was isolated, not a cascade. That rules out a broader service outage and points to something specific about this particular product ID.
- Jump to the full trace. The trace_id from the error log leads to the full request path. The waterfall shows every service the request touched: it entered through a load generator, passed through a frontend proxy, through the frontend, and finally reached the product catalogue service where the failure occurred. Six spans, three errors, 7ms total. The failed span tells you exactly what broke. Status ERROR, rpc.grpc.status_code = 13 INTERNAL, on a specific product ID. Not a vague 500. A named operation, in a named service, on a named resource.
The entire sequence takes minutes rather than hours, not because the logs are smarter, but because they're structured, carry context, and have been deduplicated before they reach you.
The export stage pays off here too. Because the Collector can fan logs out to multiple destinations simultaneously, you aren't locked into a single observability platform. In our example we've used Dash0, but you could send the same log stream to two or more backends in parallel, compare how each surfaces and alerts on your data, and switch over when you're ready. The instrumentation stays the same regardless of where the data lands.
Final thoughts
Log management still starts with collecting logs from many
places and making them searchable in one. But modern
systems ask more of the pipeline. Getting it right means
designing the full path from source to storage so logs
carry enough structure, context, and signal that your team
can figure out what happened without spending thirty
minutes scrolling through PaymentTimeout lines at 2am.








