Last updated: September 15, 2025

Formatting Logs: A Field Guide for Production Observability

You probably already know the logging basics. Use structured logs. Add timestamps. Don’t jam variables into strings. These are all essential practices, but they’re not the whole story.

Even after you’ve checked those boxes, you might still be drowning in JSON from a fleet of microservices. The volume is crushing, the noise overwhelming, and the signal you need is buried deep. You’ve built logging, but you still can’t answer the only question that matters: why did this request fail?

This isn’t a beginner’s checklist. This is for engineers who’ve been dragged out of bed by a pager, who’ve stared into an endless stream of logs, frantically grepping for a lifeline.

We’re going to dig into the real challenges of operating at scale: taming high-cardinality data, connecting logs with metrics and traces, and finally finding order in the chaos through emerging standards like OpenTelemetry.

Let's get started!

The Foundation: Structured Logging Done Right

Before we can run, we must walk. The principle of structured logging is non-negotiable. Humans read stories; machines read data. Your logs must be data. An unstructured log is a dead end. A structured log is the beginning of an investigation.

Consider the difference:

Unstructured (The Villain):

INFO: User 12345 successfully updated their profile from IP 192.168.1.100. Request took 52ms.

To get the user_id or duration_ms from this, you need a brittle, slow, and infuriating regex.

Structured (The Hero):

{
  "timestamp": "2025-09-15T08:10:38.543Z",
  "level": "info",
  "message": "User profile updated",
  "service": "user-api",
  "version": "1.4.2-a8b4c1f",
  "source": { "file": "user_controller.go", "line": 142 },
  "context": {
    "user_id": "12345",
    "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "http": {
      "method": "PUT",
      "client_ip": "192.168.1.100"
    },
    "duration_ms": 52
  }
}

This is queryable. This is analyzable. This is the baseline.
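
You should never be assembling this JSON by hand; any modern logging library will produce the shape for you. As a rough illustration (not a prescription), here is how a record like the one above could be emitted with Go's standard log/slog package. The exact output keys depend on the handler configuration, so treat this as a sketch:

package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler writes one structured record per line to stdout.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Static fields such as service and version are bound once, so every
	// subsequent record carries them automatically (context binding).
	logger = logger.With(
		slog.String("service", "user-api"),
		slog.String("version", "1.4.2-a8b4c1f"),
	)

	// Event-specific context goes on the individual call.
	logger.Info("User profile updated",
		slog.String("user_id", "12345"),
		slog.String("request_id", "a1b2c3d4-e5f6-7890-1234-567890abcdef"),
		slog.Group("http",
			slog.String("method", "PUT"),
			slog.String("client_ip", "192.168.1.100"),
		),
		slog.Int("duration_ms", 52),
	)
}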

Beyond the JSON Monoculture

The world of structured logging is not flat. While JSON is the lingua franca, it's not the only language. Choosing the right format depends on your context.

JSON: The Ubiquitous Default

It's the format every developer, API, and logging tool understands.

  • Pros: Natively supported everywhere, handles complex nested structures perfectly.
  • Cons: Verbose. The punctuation ({, }, ", :) adds significant overhead to the payload size. Can be difficult for a human to scan quickly on a command line without tools.

logfmt: The Command-Line Hero

Developed by Heroku, logfmt is designed for human readability and simple machine parsing.

level=info msg="User profile updated" service=user-api user_id=12345 duration_ms=52 http_method=PUT

  • Pros: Far more compact and readable than JSON for simple key-value data. Great for development environments or kubectl logs.
  • Cons: Doesn't have a formal spec for handling nested objects or arrays, which can be a limitation for complex contexts.

Protobuf: The High-Performance Champion

When you're processing hundreds of thousands of log messages per second, every CPU cycle and byte on the wire counts. Protocol Buffers (Protobuf) is a binary serialization format from Google.

  • Pros: Extremely fast to serialize/deserialize and produces a very small payload. Strongly typed schema ensures data consistency across services.
  • Cons: It's a binary format. You can't just cat a file and read it. It requires a .proto schema definition and tooling (like the OpenTelemetry Collector) to decode it.

Recommendation: Start with JSON. It's the safest bet. Output logfmt in development for readability if your library supports it. Only reach for Protobuf when you have a clear performance bottleneck in your telemetry pipeline.

Taming High Cardinality

So, you followed the advice. You're logging everything. Your logs are beautifully structured JSON, rich with context like user_id, trace_id, session_id, and order_id. You're a logging champion.

Then the bill arrives. And it's five figures.

You've just been mauled by the beast of high cardinality.

Cardinality refers to the number of unique values a field can have.

  • http_status_code has low cardinality. It can only be 200, 404, 500, etc.
  • user_id has high cardinality. It could have millions or billions of unique values.

Modern log management platforms make their money by indexing your data. They build massive, searchable indexes on the fields you send them so you can quickly filter and aggregate—for example, "show me a graph of all http_status_codes over time." This is fast and cheap for low-cardinality fields.

But when you ask them to index a field with millions of unique values, the index size explodes. This explosion in storage and compute translates directly into a higher bill. High cardinality is the silent killer of observability budgets.

Strategies for Management

You can't just stop logging important identifiers. The solution is to be strategic.

  1. Know Your Fields and Index Selectively: Not all fields are created equal. Your primary goal is to tell your logging backend which fields to build expensive indexes (often called "facets" or "tags") on and which to treat as plain text.

    • Index This (Low Cardinality): level, service_name, http_status_code, environment, error_code, customer_tier. These are perfect for dashboards and high-level aggregation.
    • Don't Index This (High Cardinality): user_id, trace_id, request_id, session_id, raw error messages with UUIDs. These fields are critical for finding specific events, but they don't need to be aggregated. They can be searched via full-text search, which is slower but dramatically cheaper.
  2. Use the Right Tool for the Job: A high-cardinality identifier like a trace_id is a key, not a metric. Its purpose is to unlock a different view. Instead of trying to analyze traces in your logging tool, use the log to find the trace_id of a failed request, then pivot to your dedicated tracing tool (like Jaeger or Honeycomb) to see the entire distributed trace. The log is the breadcrumb that leads you to the treasure map (the trace).

  3. Embrace Sampling: Does your service handle 10,000 requests per second? You probably don't need to store the detailed DEBUG-level logs for every single one.

    • Head-based sampling: Make a decision at the beginning of a request to capture telemetry for, say, 1% of all traffic.
    • Error-based sampling: Capture 100% of requests that result in an error.
    • Dynamic sampling: Build your system so you can change log levels and sampling rates at runtime without a redeploy. This allows you to turn up the verbosity for a specific customer or service when you're actively debugging an issue. A minimal sketch of how these policies can combine follows this list.
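
To make the interplay concrete, here is a minimal sketch in Go of how these three policies can fit together. Everything here (the package name, SampleRequest, ShouldEmit, SetSampleRate) is hypothetical glue code, not an API from any particular logging library:

package sampling

import (
	"log/slog"
	"math/rand"
	"sync/atomic"
)

// sampleRate holds the per-ten-thousand fraction of requests whose verbose
// logs we keep. It is atomic so it can be adjusted at runtime (dynamic
// sampling) without a redeploy.
var sampleRate atomic.Int64

func init() { sampleRate.Store(100) } // default: 1% of traffic

// SetSampleRate changes the rate at runtime, e.g. from an admin endpoint.
func SetSampleRate(perTenThousand int64) { sampleRate.Store(perTenThousand) }

// SampleRequest makes the head-based decision once, when a request enters
// the service. The result travels with the request (e.g. in its context).
func SampleRequest() bool {
	return rand.Int63n(10_000) < sampleRate.Load()
}

// ShouldEmit applies error-based sampling on top of the head-based decision:
// error records are always written; lower-severity records are written only
// for sampled requests.
func ShouldEmit(level slog.Level, requestSampled bool) bool {
	return level >= slog.LevelError || requestSampled
}

In practice the head-based decision would be stored on the request context, and SetSampleRate would be wired to an admin endpoint or a configuration watcher so verbosity can be raised without a deploy.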

Unifying Logs, Metrics, and Traces

For too long, we've treated logs, metrics, and traces as separate, siloed disciplines. This is a recipe for operational blindness. True observability comes from weaving them together into a single, cohesive narrative.

  • Metrics (The "What"): Aggregated, numerical data. They tell you what is happening at a macro level. "The API error rate is 5%."
  • Traces (The "Where"): A detailed, causal story of a single request as it travels through your distributed system. They tell you where the problem is. "The 5% error rate is coming from the billing-service, which is timing out when calling the credit-card-processor."
  • Logs (The "Why"): A rich, detailed event with arbitrary context. They tell you why the problem happened. "The call to the credit-card-processor timed out because the database connection pool was exhausted. Pool size: 10, active connections: 10."

How do you connect these three pillars? With a single, magical identifier: the Trace ID.

When a request first enters your system (e.g., at your API gateway or load balancer), you generate a unique trace_id. This ID is then passed along in the headers of every subsequent network call that is part of that original request. This is called context propagation.

The modern standard for this is the W3C Trace Context specification, which defines the traceparent HTTP header.

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             ^  ^                                ^                ^
             |  |                                |                |
       Version  Trace ID                         Parent ID        Flags (e.g., sampled)

Now, every log, every metric, and every span in a trace associated with this request will be tagged with trace_id: 0af7651916cd43dd8448eb211c80319c. This is the key that unlocks true observability.
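
To see what this looks like at the edge of a service, here is a hedged sketch of Go HTTP middleware that pulls the trace ID out of an incoming traceparent header and binds it to the request's logger. In a real service an OpenTelemetry SDK or auto-instrumentation would handle propagation for you; the hand-rolled parsing below exists only to mirror the header layout shown above:

package middleware

import (
	"log/slog"
	"net/http"
	"strings"
)

// traceIDFromHeader pulls the trace-id field out of a W3C traceparent
// header, which has the shape version-traceid-parentid-flags.
func traceIDFromHeader(traceparent string) (string, bool) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

// WithTraceLogging decorates every log line written during the request
// with the trace_id, so logs can be pivoted to traces.
func WithTraceLogging(base *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logger := base
		if traceID, ok := traceIDFromHeader(r.Header.Get("traceparent")); ok {
			logger = base.With(slog.String("trace_id", traceID))
		}
		logger.Info("request received",
			slog.String("http.method", r.Method),
			slog.String("http.target", r.URL.Path),
		)
		next.ServeHTTP(w, r)
	})
}

A real implementation would also store the decorated logger (and the full span context) on the request's context.Context so downstream code and outgoing calls keep propagating it.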

Your workflow transforms from frantic searching to a seamless investigation:

  1. Alert: A metric-based alert fires: "API error rate > 5%."
  2. Dashboard: You look at a dashboard and see that the errors are all HTTP 500s from the checkout-service.
  3. Logs: You query the logs for service:checkout-service level:error. You find an error log with a stack trace. Critically, it also contains trace_id: "0af7...".
  4. Trace: You click the trace_id. This pivots you directly into your tracing tool, showing the exact path of that failed request across all microservices, with precise timing for every operation.

This interconnectedness is not a "nice-to-have"; it is the fundamental requirement for debugging distributed systems effectively.

OpenTelemetry and a Unified Schema

For years, observability was a Tower of Babel. Every vendor had its own proprietary agent, its own data format, and its own APIs. If you wanted to switch from Datadog to New Relic, you had to re-instrument your entire codebase. This was vendor lock-in at its worst.

OpenTelemetry (OTel) is the solution.

OTel is a vendor-neutral, open-source standard for generating, collecting, and exporting telemetry data (traces, metrics, and logs). It is a single set of APIs, SDKs, and tools backed by the entire industry, including Google, Microsoft, Amazon, and every major observability vendor. By instrumenting your code with OpenTelemetry, you can send your data to any OTel-compliant backend without changing a single line of application code.

The OpenTelemetry Log Data Model

OTel doesn't just standardize the transport; it standardizes the structure of the data itself. The OTel Log Data Model provides a rich, prescriptive "envelope" for your log records.

A log record in OTel consists of:

  • Timestamp: When the event occurred.
  • ObservedTimestamp: When the event was received by the collector.
  • TraceId & SpanId: The golden links to your distributed traces.
  • SeverityText & SeverityNumber: e.g., "ERROR" and 17.
  • Body: The payload of the log. This can be a simple string or a complex, structured object.
  • Attributes: A key-value map for your event-specific context (e.g., http.method, user.id). This is where you put your structured data.
  • Resource: A key-value map describing the entity that produced the log (e.g., service.name, k8s.pod.name, host.arch).

Here's what a real OTel log record looks like:

{
  "Timestamp": "2025-09-15T07:10:38.123456789Z",
  "ObservedTimestamp": "2025-09-15T07:10:38.124Z",
  "TraceId": "0x5b8aa5a2d2c872e8321cf37308d69df2",
  "SpanId": "0x142279d6c63b4501",
  "SeverityText": "error",
  "SeverityNumber": 17,
  "Body": { "string": "User authentication failed due to invalid credentials" },
  "Attributes": {
    "http.method": "POST",
    "http.status_code": 401,
    "net.peer.ip": "192.168.1.100",
    "user.id": "u-12345",
    "app.feature_flag.new_login_flow": true
  },
  "Resource": {
    "service.name": "auth-service",
    "service.version": "v1.2.3-9b0695e",
    "deployment.environment": "production",
    "cloud.region": "us-east-1"
  }
}

Notice how much richer this is than a simple JSON blob. It has first-class support for trace context, a clear separation between the event payload (Attributes) and the source of the event (Resource), and uses a standardized naming convention (service.name, http.method) defined by the OpenTelemetry Semantic Conventions.

Standardizing Your Attributes

While OTel provides conventions for common fields, your organization will have its own specific data. The temptation is for every team to invent their own field names: userId, user_ID, user_identifier, subject. This chaos makes cross-team analysis impossible.

This is where a logging schema comes in. A schema is a documented convention for your field names. The most comprehensive and battle-tested open-source option is the Elastic Common Schema (ECS).

Recommendation: Adopt the OpenTelemetry Log Model as your data structure and use the OTel Semantic Conventions as your baseline. For your custom business-level attributes, adopt or adapt a schema like ECS. This gives you a robust, consistent, and future-proof foundation for all your telemetry.
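
One lightweight way to enforce such a schema is to codify the agreed field names in a small shared package, so userId, user_ID, and user_identifier can never drift apart. The package and constant names below are purely illustrative; the values follow the OTel/ECS-style dot notation used elsewhere in this post:

// Package logschema centralizes the attribute names used across services.
// Loggers reference these constants instead of hand-typing strings, so the
// same concept always lands in the same field.
package logschema

const (
	// OTel semantic-convention names reused as-is.
	ServiceName    = "service.name"
	ServiceVersion = "service.version"
	HTTPMethod     = "http.method"
	HTTPStatusCode = "http.status_code"

	// Organization-specific business attributes.
	UserID       = "user.id"
	OrderID      = "order.id"
	CustomerTier = "customer.tier"
)

Logger calls then reference logschema.UserID instead of a hand-typed string, and code review or a simple linter can flag any literal field name that bypasses the schema.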

Your Production Logging Checklist

Let's distill this down into an actionable checklist.

✅ In Your Application Code:

  • Choose a Modern Library: Use a logging library that supports structured JSON/logfmt output, context binding (so you don't have to repeat fields), and ideally has direct OpenTelemetry integration.
  • Inject Trace Context: This is your highest priority. Use OTel auto-instrumentation or manual instrumentation to ensure every log line is decorated with a trace_id and span_id.
  • Build a Shared Library: Create a small, internal library or module that standardizes the creation of loggers. It should automatically attach resource attributes (service.name, commit_hash) so that application developers don't have to think about it. A sketch of such a module, covering this item and the next two, follows the checklist.
  • Scrub Sensitive Data: Implement a mechanism to prevent Personally Identifiable Information (PII) from ever being written to a log file. Do this via toString() overrides, custom serializers, or transformers in your logging pipeline.
  • Support Dynamic Log Levels: Expose an administrative endpoint (e.g., /loglevel) that allows you to change the log level of your application at runtime. This is invaluable for targeted debugging in production without requiring a new deployment.
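
The application-side items above compose naturally into one small internal module. The following sketch, again using Go's log/slog as a stand-in for whatever library you use, binds resource attributes once, masks a hypothetical password field, and exposes a runtime-adjustable level; the package name, the redaction list, and the /loglevel wiring are all assumptions for illustration:

package obslog

import (
	"log/slog"
	"net/http"
	"os"
)

// Level is shared by every logger the factory hands out, so changing it at
// runtime (via LogLevelHandler below) affects the whole process.
var Level = new(slog.LevelVar)

// New builds the standard service logger: JSON output, runtime-adjustable
// level, PII scrubbing, and resource attributes attached once.
func New(service, version, environment string) *slog.Logger {
	handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: Level,
		// ReplaceAttr lets us mask sensitive fields before they are ever
		// serialized. "password" is a placeholder for your real PII list.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == "password" {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	})
	return slog.New(handler).With(
		slog.String("service.name", service),
		slog.String("service.version", version),
		slog.String("deployment.environment", environment),
	)
}

// LogLevelHandler implements a minimal /loglevel endpoint, e.g.
// PUT /loglevel?level=debug, for targeted debugging in production.
func LogLevelHandler(w http.ResponseWriter, r *http.Request) {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(r.URL.Query().Get("level"))); err != nil {
		http.Error(w, "unknown level", http.StatusBadRequest)
		return
	}
	Level.Set(lvl)
	w.WriteHeader(http.StatusNoContent)
}

Mount LogLevelHandler on an internal, authenticated admin port only: being able to flip a production service to debug is powerful, and so is the blast radius if it's exposed publicly.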

✅ In Your Infrastructure:

  • Deploy a Collector: Do not send logs directly from your application to a third-party vendor. Place a collector agent like the OpenTelemetry Collector, Fluentd, or Vector on your nodes or as a sidecar.
    • Why? The collector acts as a buffer, enriches logs with infrastructure metadata (like pod names and instance IDs), can filter/route data, and allows you to switch backends with a simple configuration change, preventing vendor lock-in.
  • Configure Your Backend Intelligently: In your log management platform (Splunk, Datadog, Better Stack, etc.), explicitly configure which fields are indexed as low-cardinality facets and which are not. This is the most important step for controlling costs.
  • Set Up Deep Links: Configure your logging platform so that any trace_id is a hyperlink that takes you directly to that trace in your tracing platform. This one simple integration will revolutionize your debugging workflow.

Final thoughts

Effective logging is not an afterthought; it is a core competency of a high-performing engineering organization. It's an investment in your future self—the one who will be debugging a critical outage under pressure.

By moving beyond the basics and embracing a holistic strategy that includes managing cardinality, unifying logs with metrics and traces, and standardizing on OpenTelemetry, you can transform your logs from a noisy, expensive data dump into a high-signal, cost-effective, and indispensable observability tool.

Stop fighting with your logs. Make them work for you.

Authors
Ayooluwa Isaiah