Last updated: September 15, 2025

Formatting Logs: A Field Guide for Production Observability

You probably already know the logging basics. Use structured logs. Add timestamps. Don’t jam variables into strings. These are all essential practices, but they’re not the whole story.

Even after you’ve checked those boxes, you might still be drowning in JSON from a fleet of microservices. The volume is crushing, the noise overwhelming, and the signal you need is buried deep. You’ve built logging, but you still can’t answer the only question that matters: why did this request fail?

This isn’t a beginner’s checklist. This is for engineers who’ve been dragged out of bed by a pager, who’ve stared into an endless stream of logs, frantically grepping for a lifeline.

We’re going to dig into the real challenges of operating at scale: taming high-cardinality data, connecting logs with metrics and traces, and finally finding order in the chaos through emerging standards like OpenTelemetry.

Let's get started!

The Foundation: Structured Logging Done Right

Before we can run, we must walk. The principle of structured logging is non-negotiable. Humans read stories; machines read data. Your logs must be data. An unstructured log is a dead end. A structured log is the beginning of an investigation.

Consider the difference:

Unstructured (The Villain):

INFO: User 12345 successfully updated their profile from IP 192.168.1.100. Request took 52ms.

To get the user_id or duration_ms from this, you need a brittle, slow, and infuriating regex.

Structured (The Hero):

{
  "timestamp": "2025-09-15T08:10:38.543Z",
  "level": "info",
  "message": "User profile updated",
  "service": "user-api",
  "version": "1.4.2-a8b4c1f",
  "source": { "file": "user_controller.go", "line": 142 },
  "context": {
    "user_id": "12345",
    "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "http": {
      "method": "PUT",
      "client_ip": "192.168.1.100"
    },
    "duration_ms": 52
  }
}

This is queryable. This is analyzable. This is the baseline.
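
You should never be assembling this JSON by hand; any modern logging library will produce the shape for you. As a rough illustration (not a prescription), here is how a record like the one above could be emitted with Go's standard log/slog package. The exact output keys depend on the handler configuration, so treat this as a sketch:

package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler writes one structured record per line to stdout.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Static fields such as service and version are bound once, so every
	// subsequent record carries them automatically (context binding).
	logger = logger.With(
		slog.String("service", "user-api"),
		slog.String("version", "1.4.2-a8b4c1f"),
	)

	// Event-specific context goes on the individual call.
	logger.Info("User profile updated",
		slog.String("user_id", "12345"),
		slog.String("request_id", "a1b2c3d4-e5f6-7890-1234-567890abcdef"),
		slog.Group("http",
			slog.String("method", "PUT"),
			slog.String("client_ip", "192.168.1.100"),
		),
		slog.Int("duration_ms", 52),
	)
}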

Beyond the JSON Monoculture

The world of structured logging is not flat. While JSON is the lingua franca, it's not the only language. Choosing the right format depends on your context.

JSON: The Ubiquitous Default

It's the format every developer, API, and logging tool understands.

  • Pros: Natively supported everywhere, handles complex nested structures perfectly.
  • Cons: Verbose. The punctuation ({, }, ", :) adds significant overhead to the payload size. Can be difficult for a human to scan quickly on a command line without tools.

logfmt: The Command-Line Hero

Developed by Heroku, logfmt is designed for human readability and simple machine parsing.

level=info msg="User profile updated" service=user-api user_id=12345 duration_ms=52 http_method=PUT

  • Pros: Far more compact and readable than JSON for simple key-value data. Great for development environments or kubectl logs.
  • Cons: Doesn't have a formal spec for handling nested objects or arrays, which can be a limitation for complex contexts.

Protobuf: The High-Performance Champion

When you're processing hundreds of thousands of log messages per second, every CPU cycle and byte on the wire counts. Protocol Buffers (Protobuf) is a binary serialization format from Google.

  • Pros: Extremely fast to serialize/deserialize and produces a very small payload. Strongly typed schema ensures data consistency across services.
  • Cons: It's a binary format. You can't just cat a file and read it. It requires a .proto schema definition and tooling (like the OpenTelemetry Collector) to decode it.

Recommendation: Start with JSON. It's the safest bet. Output logfmt in development for readability if your library supports it. Only reach for Protobuf when you have a clear performance bottleneck in your telemetry pipeline.

Taming High Cardinality

So, you followed the advice. You're logging everything. Your logs are beautifully structured JSON, rich with context like user_id, trace_id, session_id, and order_id. You're a logging champion.

Then the bill arrives. And it's five figures.

You've just been mauled by the beast of high cardinality.

Cardinality refers to the number of unique values a field can have.

  • http_status_code has low cardinality. It can only be 200, 404, 500, etc.
  • user_id has high cardinality. It could have millions or billions of unique values.

Modern log management platforms make their money by indexing your data. They build massive, searchable indexes on the fields you send them so you can quickly filter and aggregate—for example, "show me a graph of all http_status_codes over time." This is fast and cheap for low-cardinality fields.

But when you ask them to index a field with millions of unique values, the index size explodes. This explosion in storage and compute translates directly into a higher bill. High cardinality is the silent killer of observability budgets.

Strategies for Management

You can't just stop logging important identifiers. The solution is to be strategic.

  1. Know Your Fields and Index Selectively: Not all fields are created equal. Your primary goal is to tell your logging backend which fields to build expensive indexes (often called "facets" or "tags") on and which to treat as plain text.

    • Index This (Low Cardinality): level, service_name, http_status_code, environment, error_code, customer_tier. These are perfect for dashboards and high-level aggregation.
    • Don't Index This (High Cardinality): user_id, trace_id, request_id, session_id, raw error messages with UUIDs. These fields are critical for finding specific events, but they don't need to be aggregated. They can be searched via full-text search, which is slower but dramatically cheaper.
  2. Use the Right Tool for the Job: A high-cardinality identifier like a trace_id is a key, not a metric. Its purpose is to unlock a different view. Instead of trying to analyze traces in your logging tool, use the log to find the trace_id of a failed request, then pivot to your dedicated tracing tool (like Jaeger or Honeycomb) to see the entire distributed trace. The log is the breadcrumb that leads you to the treasure map (the trace).

  3. Embrace Sampling: Does your service handle 10,000 requests per second? You probably don't need to store the detailed DEBUG-level logs for every single one.

    • Head-based sampling: Make a decision at the beginning of a request to capture telemetry for, say, 1% of all traffic.
    • Error-based sampling: Capture 100% of requests that result in an error.
    • Dynamic sampling: Build your system so you can change log levels and sampling rates at runtime without a redeploy. This allows you to turn up the verbosity for a specific customer or service when you're actively debugging an issue. A minimal sketch of how these policies can combine follows this list.
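
To make the interplay concrete, here is a minimal sketch in Go of how these three policies can fit together. Everything here (the package name, SampleRequest, ShouldEmit, SetSampleRate) is hypothetical glue code, not an API from any particular logging library:

package sampling

import (
	"log/slog"
	"math/rand"
	"sync/atomic"
)

// sampleRate holds the per-ten-thousand fraction of requests whose verbose
// logs we keep. It is atomic so it can be adjusted at runtime (dynamic
// sampling) without a redeploy.
var sampleRate atomic.Int64

func init() { sampleRate.Store(100) } // default: 1% of traffic

// SetSampleRate changes the rate at runtime, e.g. from an admin endpoint.
func SetSampleRate(perTenThousand int64) { sampleRate.Store(perTenThousand) }

// SampleRequest makes the head-based decision once, when a request enters
// the service. The result travels with the request (e.g. in its context).
func SampleRequest() bool {
	return rand.Int63n(10_000) < sampleRate.Load()
}

// ShouldEmit applies error-based sampling on top of the head-based decision:
// error records are always written; lower-severity records are written only
// for sampled requests.
func ShouldEmit(level slog.Level, requestSampled bool) bool {
	return level >= slog.LevelError || requestSampled
}

In practice the head-based decision would be stored on the request context, and SetSampleRate would be wired to an admin endpoint or a configuration watcher so verbosity can be raised without a deploy.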

Unifying Logs, Metrics, and Traces

For too long, we've treated logs, metrics, and traces as separate, siloed disciplines. This is a recipe for operational blindness. True observability comes from weaving them together into a single, cohesive narrative.

  • Metrics (The "What"): Aggregated, numerical data. They tell you what is happening at a macro level. "The API error rate is 5%."
  • Traces (The "Where"): A detailed, causal story of a single request as it travels through your distributed system. They tell you where the problem is. "The 5% error rate is coming from the billing-service, which is timing out when calling the credit-card-processor."
  • Logs (The "Why"): A rich, detailed event with arbitrary context. They tell you why the problem happened. "The call to the credit-card-processor timed out because the database connection pool was exhausted. Pool size: 10, active connections: 10."

How do you connect these three pillars? With a single, magical identifier: the Trace ID.

When a request first enters your system (e.g., at your API gateway or load balancer), you generate a unique trace_id. This ID is then passed along in the headers of every subsequent network call that is part of that original request. This is called context propagation.

The modern standard for this is the W3C Trace Context specification, which defines the traceparent HTTP header.

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             ^  ^                                ^                ^
             |  |                                |                |
       Version  Trace ID                         Parent ID        Flags (e.g., sampled)

Now, every log, every metric, and every span in a trace associated with this request will be tagged with trace_id: 0af7651916cd43dd8448eb211c80319c. This is the key that unlocks true observability.
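
To see what this looks like at the edge of a service, here is a hedged sketch of Go HTTP middleware that pulls the trace ID out of an incoming traceparent header and binds it to the request's logger. In a real service an OpenTelemetry SDK or auto-instrumentation would handle propagation for you; the hand-rolled parsing below exists only to mirror the header layout shown above:

package middleware

import (
	"log/slog"
	"net/http"
	"strings"
)

// traceIDFromHeader pulls the trace-id field out of a W3C traceparent
// header, which has the shape version-traceid-parentid-flags.
func traceIDFromHeader(traceparent string) (string, bool) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

// WithTraceLogging decorates every log line written during the request
// with the trace_id, so logs can be pivoted to traces.
func WithTraceLogging(base *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logger := base
		if traceID, ok := traceIDFromHeader(r.Header.Get("traceparent")); ok {
			logger = base.With(slog.String("trace_id", traceID))
		}
		logger.Info("request received",
			slog.String("http.method", r.Method),
			slog.String("http.target", r.URL.Path),
		)
		next.ServeHTTP(w, r)
	})
}

A real implementation would also store the decorated logger (and the full span context) on the request's context.Context so downstream code and outgoing calls keep propagating it.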

Your workflow transforms from frantic searching to a seamless investigation:

  1. Alert: A metric-based alert fires: "API error rate > 5%."
  2. Dashboard: You look at a dashboard and see that the errors are all HTTP 500s from the checkout-service.
  3. Logs: You query the logs for service:checkout-service level:error. You find an error log with a stack trace. Critically, it also contains trace_id: "0af7...".
  4. Trace: You click the trace_id. This pivots you directly into your tracing tool, showing the exact path of that failed request across all microservices, with precise timing for every operation.

This interconnectedness is not a "nice-to-have"; it is the fundamental requirement for debugging distributed systems effectively.

OpenTelemetry and a Unified Schema

For years, observability was a Tower of Babel. Every vendor had its own proprietary agent, its own data format, and its own APIs. If you wanted to switch from Datadog to New Relic, you had to re-instrument your entire codebase. This was vendor lock-in at its worst.

OpenTelemetry (OTel) is the solution.

OTel is a vendor-neutral, open-source standard for generating, collecting, and exporting telemetry data (traces, metrics, and logs). It is a single set of APIs, SDKs, and tools backed by the entire industry, including Google, Microsoft, Amazon, and every major observability vendor. By instrumenting your code with OpenTelemetry, you can send your data to any OTel-compliant backend without changing a single line of application code.

The OpenTelemetry Log Data Model

OTel doesn't just standardize the transport; it standardizes the structure of the data itself. The OTel Log Data Model provides a rich, prescriptive "envelope" for your log records.

A log record in OTel consists of:

  • Timestamp: When the event occurred.
  • ObservedTimestamp: When the event was received by the collector.
  • TraceId & SpanId: The golden links to your distributed traces.
  • SeverityText & SeverityNumber: e.g., "ERROR" and 17.
  • Body: The payload of the log. This can be a simple string or a complex, structured object.
  • Attributes: A key-value map for your event-specific context (e.g., http.method, user.id). This is where you put your structured data.
  • Resource: A key-value map describing the entity that produced the log (e.g., service.name, k8s.pod.name, host.arch).

Here's what a real OTel log record looks like:

{
  "Timestamp": "2025-09-15T07:10:38.123456789Z",
  "ObservedTimestamp": "2025-09-15T07:10:38.124Z",
  "TraceId": "0x5b8aa5a2d2c872e8321cf37308d69df2",
  "SpanId": "0x142279d6c63b4501",
  "SeverityText": "error",
  "SeverityNumber": 17,
  "Body": { "string": "User authentication failed due to invalid credentials" },
  "Attributes": {
    "http.method": "POST",
    "http.status_code": 401,
    "net.peer.ip": "192.168.1.100",
    "user.id": "u-12345",
    "app.feature_flag.new_login_flow": true
  },
  "Resource": {
    "service.name": "auth-service",
    "service.version": "v1.2.3-9b0695e",
    "deployment.environment": "production",
    "cloud.region": "us-east-1"
  }
}

Notice how much richer this is than a simple JSON blob. It has first-class support for trace context, a clear separation between the event payload (Attributes) and the source of the event (Resource), and uses a standardized naming convention (service.name, http.method) defined by the OpenTelemetry Semantic Conventions.

Standardizing Your Attributes

While OTel provides conventions for common fields, your organization will have its own specific data. The temptation is for every team to invent their own field names: userId, user_ID, user_identifier, subject. This chaos makes cross-team analysis impossible.

This is where a logging schema comes in. A schema is a documented convention for your field names. The most comprehensive and battle-tested open-source option is the Elastic Common Schema (ECS).

Recommendation: Adopt the OpenTelemetry Log Model as your data structure and use the OTel Semantic Conventions as your baseline. For your custom business-level attributes, adopt or adapt a schema like ECS. This gives you a robust, consistent, and future-proof foundation for all your telemetry.
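
One lightweight way to enforce such a schema is to codify the agreed field names in a small shared package, so userId, user_ID, and user_identifier can never drift apart. The package and constant names below are purely illustrative; the values follow the OTel/ECS-style dot notation used elsewhere in this post:

// Package logschema centralizes the attribute names used across services.
// Loggers reference these constants instead of hand-typing strings, so the
// same concept always lands in the same field.
package logschema

const (
	// OTel semantic-convention names reused as-is.
	ServiceName    = "service.name"
	ServiceVersion = "service.version"
	HTTPMethod     = "http.method"
	HTTPStatusCode = "http.status_code"

	// Organization-specific business attributes.
	UserID       = "user.id"
	OrderID      = "order.id"
	CustomerTier = "customer.tier"
)

Logger calls then reference logschema.UserID instead of a hand-typed string, and code review or a simple linter can flag any literal field name that bypasses the schema.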

Your Production Logging Checklist

Let's distill this down into an actionable checklist.

✅ In Your Application Code:

  • Choose a Modern Library: Use a logging library that supports structured JSON/logfmt output, context binding (so you don't have to repeat fields), and ideally has direct OpenTelemetry integration.
  • Inject Trace Context: This is your highest priority. Use OTel auto-instrumentation or manual instrumentation to ensure every log line is decorated with a trace_id and span_id.
  • Build a Shared Library: Create a small, internal library or module that standardizes the creation of loggers. It should automatically attach resource attributes (service.name, commit_hash) so that application developers don't have to think about it. A sketch of such a module, covering this item and the next two, follows the checklist.
  • Scrub Sensitive Data: Implement a mechanism to prevent Personally Identifiable Information (PII) from ever being written to a log file. Do this via toString() overrides, custom serializers, or transformers in your logging pipeline.
  • Support Dynamic Log Levels: Expose an administrative endpoint (e.g., /loglevel) that allows you to change the log level of your application at runtime. This is invaluable for targeted debugging in production without requiring a new deployment.
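
The application-side items above compose naturally into one small internal module. The following sketch, again using Go's log/slog as a stand-in for whatever library you use, binds resource attributes once, masks a hypothetical password field, and exposes a runtime-adjustable level; the package name, the redaction list, and the /loglevel wiring are all assumptions for illustration:

package obslog

import (
	"log/slog"
	"net/http"
	"os"
)

// Level is shared by every logger the factory hands out, so changing it at
// runtime (via LogLevelHandler below) affects the whole process.
var Level = new(slog.LevelVar)

// New builds the standard service logger: JSON output, runtime-adjustable
// level, PII scrubbing, and resource attributes attached once.
func New(service, version, environment string) *slog.Logger {
	handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: Level,
		// ReplaceAttr lets us mask sensitive fields before they are ever
		// serialized. "password" is a placeholder for your real PII list.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == "password" {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	})
	return slog.New(handler).With(
		slog.String("service.name", service),
		slog.String("service.version", version),
		slog.String("deployment.environment", environment),
	)
}

// LogLevelHandler implements a minimal /loglevel endpoint, e.g.
// PUT /loglevel?level=debug, for targeted debugging in production.
func LogLevelHandler(w http.ResponseWriter, r *http.Request) {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(r.URL.Query().Get("level"))); err != nil {
		http.Error(w, "unknown level", http.StatusBadRequest)
		return
	}
	Level.Set(lvl)
	w.WriteHeader(http.StatusNoContent)
}

Mount LogLevelHandler on an internal, authenticated admin port only: being able to flip a production service to debug is powerful, and so is the blast radius if it's exposed publicly.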

✅ In Your Infrastructure:

  • Deploy a Collector: Do not send logs directly from your application to a third-party vendor. Place a collector agent like the OpenTelemetry Collector, Fluentd, or Vector on your nodes or as a sidecar.
    • Why? The collector acts as a buffer, enriches logs with infrastructure metadata (like pod names and instance IDs), can filter/route data, and allows you to switch backends with a simple configuration change, preventing vendor lock-in.
  • Configure Your Backend Intelligently: In your log management platform (Splunk, Datadog, Better Stack, etc.), explicitly configure which fields are indexed as low-cardinality facets and which are not. This is the most important step for controlling costs.
  • Set Up Deep Links: Configure your logging platform so that any trace_id is a hyperlink that takes you directly to that trace in your tracing platform. This one simple integration will revolutionize your debugging workflow.

Final thoughts

Effective logging is not an afterthought; it is a core competency of a high-performing engineering organization. It's an investment in your future self—the one who will be debugging a critical outage under pressure.

By moving beyond the basics and embracing a holistic strategy that includes managing cardinality, unifying logs with metrics and traces, and standardizing on OpenTelemetry, you can transform your logs from a noisy, expensive data dump into a high-signal, cost-effective, and indispensable observability tool.

Stop fighting with your logs. Make them work for you.

Authors
Ayooluwa Isaiah