Why the OpenTelemetry Batch Processor is Going Away (Eventually)

The batch processor has long been one of the most widely adopted components in OpenTelemetry Collector pipelines and is common in production deployments. Over time, however, operational experience has highlighted limitations in how it behaves during Collector restarts and other commonplace failure scenarios.

As a result, the OpenTelemetry community no longer recommends the batch processor for production use. Instead, current guidance favors batching inside exporters backed by persistent storage, which offers stronger durability under failure.

Data loss during Collector restarts

Consider a common production scenario:

An application sends traces to an OpenTelemetry Collector. The Collector appears healthy, dashboards show incoming data, and no alerts fire. During a deployment, a Kubernetes node rotation, or a period of memory pressure, the Collector restarts or is terminated.

Afterward, gaps appear in the observability backend. Traces from the period when the Collector was unavailable are missing. The application received success responses and assumed the data was delivered, but the data never reached its destination.

This reflects a fundamental limitation of the batch processor architecture. Telemetry buffered in memory can be lost during crashes without generating errors or alerts, and the loss is often only discovered after the fact.

Understanding the two approaches

Batch processor: the old way

In a traditional Collector pipeline, batching is handled by the dedicated batch processor:

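A minimal sketch of such a pipeline, assuming an OTLP receiver and an OTLP backend (the endpoint and batch settings are illustrative, not recommendations):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    # Flush whichever threshold is reached first; batches live only in Collector memory.
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp:
    endpoint: backend.example.com:4317   # illustrative backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```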


In this configuration, telemetry flows through multiple in-memory queues. Once data is received by the Collector and accepted into the pipeline, the receiver returns a successful response to the client, even though the data has not yet been exported. The batch processor then buffers the telemetry in memory until a batch is flushed based on the configured batch size and timeout. To draw a parallel with HTTP status codes, this acknowledgment is closer to a 202 Accepted, which means that “a request has been accepted for processing, but processing has not been completed or may not have started” (MDN).

At this point, the data exists only in the Collector’s memory: it has not been exported to a downstream, (hopefully) durable system. If the Collector process were to die at this moment, the telemetry may be lost forever, because the OpenTelemetry SDK that generated it has typically already discarded it from its own memory. (If the source of the telemetry is durable, for example a log file read by the filelogreceiver, the data may be recoverable after the Collector restarts, but the combinations, including the various options for retrying on export failures, are too many to cover here.) In practice, this corresponds to an ‘at-most-once’ delivery model, where acknowledged data may not be delivered in the presence of failures.

In addition, placing batching in a separate processor limits how downstream exporter state is propagated through the pipeline. Backpressure is applied primarily at component boundaries, and context about exporter saturation or transient failures is not reflected upstream before acknowledgments are returned. Together, these characteristics are the primary reason the batch processor is no longer recommended for production deployments where durability across restarts and coordinated backpressure are important.

Exporter helper: the new way

The exporter helper integrates batching and queueing directly into the exporter and adds support for persistent storage:

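A sketch of the equivalent pipeline using the exporter helper instead of the batch processor, assuming the file_storage extension backs the persistent queue (the endpoint, directory, and sizes are illustrative, and the exporter-side batching keys have changed across Collector versions, so verify them against the release you run):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # queued telemetry survives Collector restarts

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: backend.example.com:4317   # illustrative backend endpoint
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      storage: file_storage   # back the queue with the file_storage extension
      queue_size: 5000
      # Exporter-side batching; exact key names vary by Collector version.
      batch:
        flush_timeout: 5s
        min_size: 1000
        max_size: 2000

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]   # no batch processor in the pipeline
```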

With this approach, telemetry is durably enqueued in the exporter’s persistent sending queue before an acknowledgment is sent back to the sender; therefore, it cannot be lost due to process termination. Because batching and queueing are consolidated within the exporter, data is written to disk early in the pipeline and can be recovered and sent after a Collector restart. This provides stronger delivery guarantees than the in-memory batch processor and moves the pipeline closer to an at-least-once delivery model.

This design simplifies the Collector’s internal architecture by eliminating the in-memory batch processor as an independent buffering stage. Instead of telemetry passing through multiple unsynchronized buffers, data follows a direct path to a durable exporter queue. This consolidation also enables more coordinated backpressure: if persistent storage fills up or the backend becomes saturated, that state can propagate back through the pipeline, preventing the Collector from accepting more data than it can safely handle. The tradeoff shifts from tuning processor behavior to managing disk capacity and I/O performance.

Differences under failure

To better understand the practical impact of these approaches, we built a crash-testing demo that simulates Collector restarts. The test sends a fixed number of traces, terminates the Collector, and measures how much data is available after restart.

| Configuration | Data Sent | Collector Restarted | Data Recovered | Data Loss |
| --- | --- | --- | --- | --- |
| **Batch Processor** | 100 traces | Yes | 0 traces | **100%** |
| **Exporter Helper** | 100 traces | Yes | 100 traces | **0%** |

Under these conditions, exporter-level batching preserved queued data across restarts, while batch processor buffering did not. This behavior is reproducible using default configurations and reflects differences in buffering and durability rather than implementation defects.

Performance considerations

Exporter-level batching influences performance characteristics by consolidating batching logic within the exporter. In some deployments, this reduces CPU overhead and memory pressure by avoiding intermediate buffers and queue management. Latency behavior may also differ, as batching decisions are coordinated directly with export operations. Actual performance outcomes depend on configuration, workload, and environment.

Final thoughts

The batch processor and exporter-level batching represent different tradeoffs in how telemetry is buffered and delivered. Based on operational experience, the OpenTelemetry community no longer recommends the batch processor for production use cases where durability across restarts is required.

Exporter-level batching is not a universal solution, but it reflects the current direction of the ecosystem and offers behavior that may better align with the reliability expectations of many production environments. Teams evaluating their Collector configurations may want to consider these differences in the context of their own operational requirements.