Last updated: August 4, 2025

Mastering the OpenTelemetry Batch Processor

The batch processor is one of the most fundamental and critical components in the OpenTelemetry Collector. It sits in nearly every production pipeline, working quietly to make your telemetry processing more efficient, reliable, and cost-effective. Its job is simple: it collects individual spans, metrics, and logs and groups them into batches before passing them to the next component.

While the concept is straightforward, mastering its configuration is key to building a high-performance observability pipeline. In this article, we'll explore the nuances of its settings, common patterns, and best practices that will help you tune your Collector for any workload.

Let's get started!

Why you absolutely need the batch processor

Before diving into the configuration, it's crucial to understand why batching is so important, and why skipping it is almost always a mistake that leads to inefficient and expensive pipelines.

Here's what batching does for you:

  • Reduces network overhead: Sending thousands of individual log lines or trace spans over the network is incredibly inefficient. Batching combines many data points into a single request, drastically reducing the number of outgoing connections and the associated overhead.

  • Improves compression: Compressing a single 200-byte JSON payload doesn't yield much benefit. Compressing a 2MB batch of 10,000 similar JSON payloads, however, can result in a significant size reduction. Batching gives compression algorithms more data to work with, leading to better compression ratios.

  • Lowers egress costs: For cloud-based workloads, network egress can be a major cost driver. By reducing the number of requests and improving compression, batching directly translates to lower cloud bills.

  • Reduces load on backends: Sending a high volume of small requests can overwhelm your observability backend or any intermediate Collectors. Batching smooths out traffic, sending larger, less frequent requests that are often easier for backends to ingest efficiently.

In short, the batch processor is your primary tool for controlling the flow, cost, and performance of your telemetry data.

Quick start: adding the batch processor

For most use cases, the default settings are a great starting point. To enable the batch processor, simply add it to the processors section of your configuration and include it in your pipelines.

yaml
processors:
  batch:
    # Default settings: sends a batch when it reaches 8192 items
    # or after 200ms, whichever comes first.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]

This simple configuration already provides significant efficiency gains over a pipeline without batching.

Understanding the available batching mechanisms

You can think of the batch processor as a ferry which has two rules for when it departs:

  1. It leaves when it's full.
  2. It leaves after waiting a certain amount of time at the dock, even if it's not full.

The batch processor operates on the exact same logic, controlled by three key parameters: timeout, send_batch_size, and send_batch_max_size.

1. timeout

This setting controls the maximum latency introduced by the batch processor. If you set timeout: 10s, your telemetry could be delayed by up to 10 seconds before being sent to the next stage.

A shorter timeout is good for near-real-time data, while a longer timeout is better for maximizing batch size and reducing costs, especially for less time-sensitive data like logs.

The default setting is 200ms.
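
For instance, a near-real-time debugging pipeline and a log archival pipeline might set this very differently. The batch/realtime and batch/archive names and values below are purely illustrative:

yaml
processors:
  # Flush quickly so data reaches the backend almost immediately.
  batch/realtime:
    timeout: 100ms
  # Wait longer to build bigger, cheaper batches for archived logs.
  batch/archive:
    timeout: 30s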

2. send_batch_size

This setting controls the target size of a batch. It is the number of items (spans, log records, or metric data points) that will trigger the batch to be sent, regardless of the timeout.

In a high-traffic environment, you'll likely hit this limit long before the timeout is reached. Increasing this value allows for larger, more efficient batches, but also increases the Collector's memory footprint, as it needs to hold more data in memory.

The default is 8192 items.
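
As an illustrative sketch (not a sizing recommendation), a high-throughput gateway Collector with memory to spare might raise the target:

yaml
processors:
  batch/high_throughput:
    # Send as soon as 16384 items have accumulated...
    send_batch_size: 16384
    # ...or after 2s, whichever comes first.
    timeout: 2s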

3. send_batch_max_size

This is a critical but often misunderstood setting. The send_batch_size is a trigger, but it doesn't cap how large a batch can grow. For example, if a single incoming request carries a sudden flood of data, the buffer can grow well beyond send_batch_size before the processor flushes it.

send_batch_max_size is the safety valve. It ensures that no matter what, the batch sent to the next component is no larger than this value. This is essential when your backend has a strict request size limit (e.g., "no requests over 4MB").

How they interact:

  • A batch is sent when either timeout is reached or the number of items in the buffer reaches send_batch_size.
  • If send_batch_max_size is set, it acts as a final check. Before sending the batch, the processor asks: "Is this batch bigger than send_batch_max_size?" If yes, it splits the batch into smaller chunks that respect the limit.

Note that when set, send_batch_max_size must be greater than or equal to send_batch_size to be considered valid. It defaults to 0 (disabled), in which case batches are never split.
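
Putting it together, a sketch like the following (the values are illustrative) shows how the three settings divide the work:

yaml
processors:
  batch:
    # Flush whatever has accumulated after 5s, even if the batch is small.
    timeout: 5s
    # Flush early once 8192 items have accumulated.
    send_batch_size: 8192
    # Never forward more than 10000 items in one batch; anything larger
    # is split into multiple batches of at most this size.
    send_batch_max_size: 10000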

Configuration recipes

Let's look at how to tune these settings for different goals.

Minimizing latency for real-time debugging

When you need data to appear in your backend as quickly as possible, you want small, frequent batches. The default timeout value of 200ms is already a good starting point, so you probably don't need to change anything.

yaml
processors:
  batch:

Maximizing cost savings

When you're sending high-volume data to cold storage, latency is less important than cost efficiency, but keep the additional memory requirements in mind.

yaml
processors:
  batch/cost_optimized:
    timeout: 60s
    send_batch_size: 16384

Handling strict backend limits

Many observability backends and APIs impose strict limits on request body size. For example, a backend might reject any request larger than 1MB. If your processor sends batches that exceed this limit, your exporter will receive errors, and you'll risk losing data.

The send_batch_max_size setting is the solution. It acts as a hard ceiling on the size of any batch sent to the next component in the pipeline.

Since the setting is based on the number of items (spans, logs, etc.), you'll need to estimate a safe value from your average data point size. If your average log record is about 1KB, a send_batch_max_size of 1000 caps batches at roughly 1MB of uncompressed data; pick a somewhat lower value if you need headroom below a hard 1MB limit.

yaml
processors:
  batch/strict_backend:
    timeout: 5s
    # Trigger a send when the batch reaches 1000 items.
    send_batch_size: 1000
    # Enforce a hard limit of 1000 items per batch. This prevents a
    # sudden burst of data from creating a massive batch that would
    # be rejected.
    send_batch_max_size: 1000

Setting send_batch_size and send_batch_max_size to the same value, as shown above, is a common and effective pattern for ensuring consistent batch sizes that respect your backend's limits.

Multi-tenant batching with metadata_keys

In a multi-tenant architecture, you might receive data from different customers (or tenants) at a single OTLP endpoint. You may need to process or route this data differently based on which tenant it belongs to. For example, each tenant might have a unique API key or a different backend destination.

The batch processor can handle this by creating separate, independent batchers for each unique combination of metadata values.

Let's say you're running a SaaS platform where customers send telemetry to your Collector, and each request includes an X-Tenant-ID HTTP header. You may need to batch data on a per-tenant basis.

The configuration could look like this:

  1. First, configure your receivers to include request metadata by setting include_metadata: true.
  2. Then, configure the batch processor to use the relevant metadata as a grouping key.
yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        # This is ESSENTIAL for metadata batching.
        include_metadata: true

processors:
  batch/multitenant:
    # Create a new, independent batcher for each unique value of the
    # X-Tenant-ID header.
    metadata_keys:
      - X-Tenant-ID
    # Set a safety limit on the number of active batchers.
    metadata_cardinality_limit: 2000

exporters:
  # ... your exporters

service:
  pipelines:
    traces:
      receivers: [otlp]
      # include_metadata on the receiver makes the request metadata
      # available to the batch processor here.
      processors: [batch/multitenant]
      exporters: [otlphttp] # or a routing component

This is a powerful feature, but it comes with a significant resource cost. Each unique metadata combination spawns a new batcher in memory, each with its own buffer (send_batch_size) and timer (timeout).

If you have 1000 tenants, you will have 1000 independent batchers running inside the Collector, each with its own buffer, so memory usage scales roughly linearly with the number of tenants.

The metadata_cardinality_limit is a crucial safeguard. It prevents a malicious or runaway client from exhausting the Collector's memory by sending thousands of unique metadata values.

Always monitor the otelcol_processor_batch_metadata_cardinality metric to track how many batchers are active.
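
If you already scrape the Collector's internal metrics with Prometheus, a simple alert rule along these lines can warn you before the cardinality limit is reached. The alert name and threshold below are illustrative, and this assumes your Prometheus setup is already collecting the Collector's internal metrics:

yaml
groups:
  - name: otel-collector-batch
    rules:
      - alert: BatchMetadataCardinalityHigh
        # Warn when the number of active batchers approaches the
        # metadata_cardinality_limit of 2000 configured above.
        expr: otelcol_processor_batch_metadata_cardinality > 1500
        for: 10m
        labels:
          severity: warning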

Batch processor tips and best practices

  • The batch processor should almost always be placed after any sampling or filtering processors (filter, probabilistic_sampler) but before any processors that perform external lookups or heavy computation, such as k8sattributes.

    You certainly don't want to waste time batching data you're about to drop, and you also want to provide well-formed batches to processors that enrich data via external lookups, as they often operate more efficiently on batches (see the sketch after this list).

  • If users complain that their data is taking too long to appear, the first place to check is the timeout setting in your batch processor. A long timeout is a common cause of perceived data loss, when in fact the data is just waiting in the Collector's buffer.

  • If your exporter logs show 4xx errors indicating that requests are too large (typically HTTP 413), the solution is almost always to configure send_batch_max_size.

  • It is almost never a good idea to have a network-based exporter (otlp, otlphttp, prometheusremotewrite) without a batch processor immediately before it in the pipeline. The efficiency gains are too significant to ignore.
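
To make the ordering tip above concrete, here's a minimal sketch of a traces pipeline that samples first, batches next, and only then enriches the data. The probabilistic_sampler percentage and the exact processor set are illustrative; the ordering simply follows the recommendation described in the first tip:

yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10
  batch:
  k8sattributes:

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Drop unsampled spans first, then batch, then enrich in bulk.
      processors: [probabilistic_sampler, batch, k8sattributes]
      exporters: [otlphttp]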

Final thoughts

The batch processor is a foundational component for building performant, scalable, and cost-effective OpenTelemetry pipelines. By understanding the interplay between its time and size-based triggers, you can fine-tune its behavior to meet the specific needs of your workloads.

Treat it not as an optional add-on, but as a mandatory first step in any production-grade Collector configuration.

Authors
Ayooluwa Isaiah