Every exporter in the OpenTelemetry Collector that sends data over a network relies on the exporter helper framework under the hood. This shared layer handles the parts that would otherwise be copy-pasted across exporters: batching, queuing, retries, and timeouts. Whether you're using the OTLP gRPC exporter, the OTLP HTTP exporter, or something like the Kafka exporter, the same configuration blocks apply.

Understanding the exporter helper as a standalone concept saves you from re-learning the same knobs for each exporter you configure. It also explains behavior that can seem mysterious if you only read the exporter-specific docs, like why data gets dropped during restarts or why failed batches sometimes succeed on a second attempt without any intervention.

How data flows through a Collector exporter

When telemetry reaches an exporter, it doesn't go straight to the network. The exporter helper sits in between and handles buffering and reliability before the actual export call happens.

Incoming data first enters the sending queue, which buffers it in-memory so the pipeline isn't blocked even if the backend is slow or temporarily unavailable.

If batching is enabled, a batcher then assembles queued items into larger payloads to reduce the number of network calls. Finally, a consumer picks up the data and makes the export call.

If the call fails with a retryable error, the built-in retry sender re-attempts it with exponential backoff.

Not every stage is always active. The queue and retries are enabled by default, but batching is off unless you explicitly enable it. The order is fixed though: data enters the queue first, gets batched (if enabled), then is handed to a consumer that makes the export call and retries on failure.
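As a rough map of those stages, the sketch below annotates a single exporter config in the order data passes through it. It uses only settings covered in this guide; the endpoint is a placeholder and the values are illustrative rather than recommendations.

```yaml
exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317 # placeholder endpoint
    sending_queue:        # stage 1: buffer incoming data (enabled by default)
      enabled: true
      queue_size: 1000
      num_consumers: 10
      batch:              # stage 2: optional, only active when configured
        sizer: items
        min_size: 1000
        flush_timeout: 200ms
    timeout: 5s           # stage 3: limit for each individual export call
    retry_on_failure:     # stage 4: retry retryable failures with backoff
      enabled: true
```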

Note that if your pipeline fans out to more than one exporter (or more than one instance of the same exporter), each instance gets its own queue, batcher, and retry sender, and they don't share state. If one backend goes down and its queue fills up, the other exporters continue draining and exporting normally.
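For illustration, a fan-out pipeline might look like the following sketch. The exporter names (otlp_grpc/primary and otlp_grpc/secondary), the endpoints, and the queue sizes are hypothetical; the point is only that each instance carries its own sending_queue settings.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp_grpc/primary:
    endpoint: primary-backend.example.com:4317   # hypothetical endpoint
    sending_queue:
      queue_size: 1000   # fills up independently of the secondary exporter
  otlp_grpc/secondary:
    endpoint: secondary-backend.example.com:4317 # hypothetical endpoint
    sending_queue:
      queue_size: 5000   # separate buffer, separate consumers, separate retries

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp_grpc/primary, otlp_grpc/secondary]
```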

Configuring the sending queue

The sending_queue is a buffer between your pipeline and the network call to the backend. Without it, a slow or unavailable backend puts backpressure directly on the receivers, which can cause data loss upstream. With the queue enabled, incoming data is buffered and drained by a pool of parallel consumers.

Here are the default settings:

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      queue_size: 1000
      num_consumers: 10
      sizer: requests
      block_on_overflow: false

The queue is enabled by default, and you'll almost always want it that way as disabling it means your telemetry data gets dropped the moment the exporter can't keep up.

The queue_size controls how much data the buffer can hold, and num_consumers sets how many workers drain it in parallel. Raising the consumer count increases throughput if the backend can keep up, but if it's already saturated, extra consumers just pile on more connections without clearing the queue any faster.

How the queue capacity is measured

The sizer field determines the unit that queue_size is measured in. It defaults to requests, which counts the incoming requests to the exporter.

What counts as a "request" depends on what's upstream of the exporter. If you're running the batch processor, each batch it produces enters the sending_queue as a single request.

Without any batching, each call from the receiver becomes one request. For an OTLP receiver, that's typically one request per incoming OTLP payload from a client. Since SDK-level batching is usually enabled, a single flush might contain hundreds of spans or logs, but it arrives at the exporter as a single request.

If you're using exporter-level batching, the queue counts entries as they arrive, before the batch block reassembles them into larger payloads.

With the default sizer: requests, a queue_size of 1,000 means 1,000 queue entries regardless of how many signals each one contains. This makes it the cheapest option to compute, but the tradeoff is that you can't predict memory usage from the queue size alone, since one entry might hold 10 spans and another 1,000.

The two alternatives give you more control. items counts individual signals (spans, data points, log records), so a queue_size of 10,000 means the queue holds at most 10,000 spans across all entries. This makes memory usage more predictable, at the cost of tracking per-item counts on every enqueue. bytes counts serialized payload size, which is the most precise measure of memory impact but also the most expensive to compute.

For most deployments, the default requests sizer is fine. Switch to items or bytes if your incoming payloads vary widely in size and you need tighter control over memory.
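As a minimal sketch, switching the queue to item-based sizing could look like this. The endpoint is a placeholder and the capacity is illustrative.

```yaml
exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      sizer: items
      queue_size: 10000 # at most 10,000 spans, data points, or log records
```

With sizer: bytes, the same queue_size field would instead be read as a byte budget, so the number needs to change along with the unit.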

The memory_limiter processor, which sits upstream in most production pipelines, can refuse data before it ever reaches the exporter. If the Collector's memory usage crosses the limiter's threshold, it starts rejecting data at the receiver, and nothing makes it to the queue at all. A queue_size of 100,000 items won't help if the memory limiter is already shedding load, so you'll want to size them together.
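One way to size the two together is to give the queue a byte budget that sits well below the limiter's threshold. The sketch below assumes a Collector with roughly 2 GiB of memory available; all numbers are illustrative, and the serialized size tracked by the bytes sizer is only an approximation of actual heap usage.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500      # start refusing data above this (illustrative)
    spike_limit_mib: 300 # headroom for short spikes (illustrative)

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      sizer: bytes
      queue_size: 268435456 # 256 MiB, kept well under limit_mib

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [otlp_grpc]
```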

What happens when the queue is full

If data cannot be enqueued, it's dropped by default and reported by the otelcol_exporter_enqueue_failed_* metrics. This happens when the queue has reached its configured queue_size or, for persistent queues, when the underlying storage can't accept more data.

If you'd rather block the pipeline than drop incoming data, you can enable block_on_overflow:

yaml

sending_queue:
  block_on_overflow: true

With this setting, the pipeline goroutine that feeds data into the queue blocks instead of dropping it. It stays blocked until space opens up in the queue, which means the receiver that called into the pipeline can't accept new data while it waits.

This protects against data loss during short bursts, but if the queue stays full for an extended period, that backpressure propagates upstream through the receivers and eventually to your applications.

Once the SDK's own export queue fills up, your application starts dropping data at the source. At that point you've moved the data loss from the Collector (where you can monitor it via otelcol_exporter_enqueue_failed_*) to the application (where you often can't), which makes the problem harder to detect.

Note that if an exporter with blocking enabled stalls long enough to backpressure the receiver, the receiver stops accepting data for the entire pipeline. That starves every exporter fed by that receiver, not just the one that's struggling.

Surviving restarts with persistent queues

An in-memory queue works well for short hiccups, but it disappears the moment the Collector restarts or crashes. To survive restarts, you can back the queue with a persistent storage extension, typically file_storage:

yaml

extensions:
  file_storage/exporter_queue:
    directory: /var/lib/otelcol/storage
    timeout: 1s # maximum time to wait for a file lock

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage/exporter_queue
      queue_size: 10000

service:
  extensions: [file_storage/exporter_queue]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp_grpc]

With this configuration, queued data is written to disk instead of being held in memory. If the Collector shuts down while data is in the queue, it picks up where it left off on restart. The queue_size parameter still applies, but now it limits how much data can be stored on disk (measured in whatever unit sizer is set to).

You only need to make sure the disk has enough space for the volume you configured, and keep an eye on I/O latency since every enqueue and dequeue now hits the filesystem.

For details on tuning file_storage behavior, see its documentation on GitHub.

Auth extensions don't survive the queue

Client metadata (like tenant IDs set via include_metadata) and span context are preserved through persistent queues as of Collector v0.131.0. However, context set by auth extensions (like oauth2client, bearertokenauth, or headers_setter) is still not persisted. When the Collector comes back up, it reads the queued data and tries to export it without the original auth tokens, which results in 401s or 403s from the backend.

If you need both persistent queues and authentication, use static credentials in the exporter's headers config instead of an auth extension:

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    headers:
      Authorization: "Bearer ${env:API_TOKEN}"
    sending_queue:
      storage: file_storage/exporter_queue

Static headers are included on every export call regardless of whether the data came from disk or memory, so they survive the queue. The tradeoff is that you lose the auth extension's token lifecycle management (automatic refresh, rotation), but for long-lived API keys or tokens injected from a secrets manager, this works reliably.

Batching at the exporter level

Sending many small export requests is inefficient, so batching is often recommended: it groups queued items into larger payloads so that each network call carries more data, compresses better, and incurs less overhead.

Historically, most Collector deployments handled this with the standalone batch processor in the pipeline, but the long-term direction is to retire this processor entirely in favor of batching within the exporter itself.

The exporter helper now supports this natively through a batch block inside sending_queue, which moves the assembly step closer to the actual export call.

Batching is disabled by default. When you enable it, the following defaults apply:

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      sizer: requests
      batch:
        min_size: 8192
        max_size: 0
        flush_timeout: 200ms

The min_size threshold controls when a batch is considered ready to send. If the threshold isn't reached, flush_timeout ensures the data is flushed anyway rather than sitting in the buffer indefinitely.

Setting max_size enables splitting of oversized batches, which is helpful when your backend has a strict request size limit. Leave it at 0 for no upper limit.

How the batch sizer relates to the queue sizer

The batch has its own sizer field that determines the unit for min_size and max_size. Unlike the queue, the batch only supports items and bytes (not requests, since request count isn't meaningful for controlling payload size).

If you don't set batch.sizer explicitly and your batch block contains at least one field (like min_size), it inherits the queue's sizer. If the queue sizer is requests or the batch block is empty, the batch defaults to items.

yaml

sending_queue:
  sizer: bytes
  queue_size: 10000
  batch:
    # Inherits bytes from the parent because min_size is set
    min_size: 1048576
    max_size: 5242880

Be aware that the inheritance logic has known edge cases where the batch sizer doesn't inherit as expected, so the safest approach is to always set batch.sizer explicitly so there's no ambiguity:

yaml

sending_queue:
  sizer: requests
  queue_size: 1000
  batch:
    sizer: items
    min_size: 1000
    max_size: 2000

There's also ongoing discussion about supporting compound sizers (such as triggering a flush at 8,192 items or 5 MiB, whichever comes first), which would replace the current single-sizer model. Check the exporter helper documentation for the latest configuration format.

Partitioning batches by metadata

If you're using the Collector's include_metadata feature on your receivers, different requests may carry different client metadata (like tenant IDs or auth tokens). By default, the batcher mixes all incoming data together regardless of metadata.

You can partition batches by specific metadata keys so that data from different clients never gets combined:

yaml

sending_queue:
  batch:
    partition:
      metadata_keys: [tenant_id]

This creates a separate batcher instance per unique value of tenant_id. Each partition adds overhead, so only partition on keys with a bounded set of values.
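Putting the two halves together, an end-to-end sketch might look like the following. It assumes clients send a tenant_id header (a hypothetical key name) and that the OTLP gRPC receiver is the one propagating metadata; the batch thresholds are illustrative.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        include_metadata: true # carry client metadata along with the data

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      batch:
        sizer: items
        min_size: 1000
        flush_timeout: 200ms
        partition:
          metadata_keys: [tenant_id] # hypothetical key; one batcher per distinct value
```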

Migrating from the batch processor

If you're running the batch processor today, the migration is straightforward. The main thing to remember is that you should always set batch.sizer explicitly rather than relying on inheritance from the queue.

Here's a typical batch processor config and its exporter-level equivalent:

Before (batch processor in the pipeline):

yaml

processors:
  batch:
    send_batch_size: 1000
    send_batch_max_size: 5000
    timeout: 200ms

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_grpc]

After (batching inside the exporter):

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      batch:
        sizer: items
        min_size: 1000
        max_size: 5000
        flush_timeout: 200ms

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp_grpc]

The mapping is:

send_batch_size becomes min_size,
send_batch_max_size becomes max_size, and
timeout becomes flush_timeout.

How the queue and batcher work together

The queue accepts whatever the pipeline hands it and holds each unit as a single queue entry. One entry is one call to the exporter's consume function, which typically maps to one incoming OTLP export from an SDK or one batch from the batch processor (if configured). An entry is opaque to the queue: it could contain 5 spans or 5,000, and the queue doesn't unpack it.

This is also what the sizer setting requests actually counts: queue entries, one per pipeline call. The name is misleading because "request" sounds like an outbound network call, but it refers to the inbound unit of data.

Without exporter-level batching, a consumer grabs one entry from the queue and exports it directly regardless of how big or small the payload is. With num_consumers: 10, up to 10 entries are drained from the sending queue in parallel.

But when batching is enabled, the batcher sits between the queue and the consumers. Instead of consumers pulling entries directly, the batcher drains them and assembles them into larger payloads. Once a payload hits min_size (or flush_timeout fires), the batcher hands it to a consumer as a single export call.

To see how this plays out, say ten OTLP calls arrive from SDKs, each containing 200 spans. The queue now holds 10 entries (10 requests in sizer terms, or 2,000 items).

Without batching, each of the 10 consumers grabs one "request" and fires off an export call containing 200 spans in parallel. The queue drains quickly, but the backend handles 10 separate requests where one or two larger ones would have been more efficient.

With batching (assuming sizer: items, min_size: 1000), the batcher pulls entries from the queue and assembles them. After 5 entries (5 × 200 = 1,000 spans), it hits min_size and hands a single 1,000-span payload to a consumer for export. The remaining five entries go into the next batch so the backend now handles 2 requests instead of 10, each carrying more data and compressing better.

If only 3 requests arrive, the batcher has just 600 items, so it waits for flush_timeout to elapse and then sends the 600-span payload as-is rather than holding it indefinitely. In practice, this timeout-triggered flush is more common than hitting min_size, especially during off-peak hours or for lower-volume signal types like traces.

This behavior is also why the queue and batch can use different sizer units. The queue measures what's sitting in the buffer, while the batch measures what gets packaged into a single export call.
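The walkthrough above corresponds to a config along these lines. The comments restate the arithmetic from the example (ten calls of 200 spans each); queue_size and flush_timeout are simply the defaults quoted earlier.

```yaml
sending_queue:
  sizer: requests        # queue capacity counts entries: 10 OTLP calls = 10 requests
  queue_size: 1000
  num_consumers: 10
  batch:
    sizer: items         # batch thresholds count spans: 10 x 200 = 2,000 items
    min_size: 1000       # reached after 5 entries (5 x 200 spans), so 2 exports in total
    flush_timeout: 200ms # sends a partial batch, e.g. 3 entries = 600 spans
```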

Controlling the per-export timeout

Each individual export attempt has a timeout, configured via the timeout field:

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    timeout: 5s # the default

If an export call takes longer than this, it's cancelled and treated as a failure, which then feeds into the retry logic covered in the next section. If you're seeing frequent DeadlineExceeded errors, this is the first thing to check, especially if your payloads are large or the network path to the backend has high latency.

Retrying failed exports

The retry mechanism is the exporter helper's first line of defense when an export request fails. It's enabled by default and handles transient errors like brief network failures or a rolling restart of the backend.

Not every error triggers a retry. For gRPC-based exporters, only status codes that indicate a temporary, recoverable problem are retried:

UNAVAILABLE: the service is temporarily unreachable (the most common case).
RESOURCE_EXHAUSTED: a downstream service is overloaded.
ABORTED: the operation was aborted, often due to a concurrency issue.

For HTTP-based exporters, the equivalent signals are 503, 429, and similar status codes.

Tuning retry behavior

Retries are controlled by the retry_on_failure block. The defaults are shown below:

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      multiplier: 1.5
      max_interval: 30s
      max_elapsed_time: 300s

initial_interval sets the wait before the first retry, and each subsequent attempt multiplies the previous wait by multiplier, which creates the exponential backoff curve.

The max_interval caps how long any single wait can grow, and max_elapsed_time caps the total retry window for one batch. After that window expires, the batch is dropped permanently. Setting max_elapsed_time to 0 means retries continue indefinitely.

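With the default config above, the idealized waits are 5s, 7.5s, 11.25s, about 16.9s, and about 25.3s. The next step would be roughly 38s, so it is capped at the 30s max_interval, and every later retry waits 30s until the 300s max_elapsed_time window closes.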

Note that these intervals are idealized. In practice, the Collector adds random jitter to each backoff to prevent multiple Collectors from retrying in lockstep and overwhelming the backend the moment it recovers (a thundering herd problem). The actual wait times will vary around these values, not match them exactly.
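To see how the knobs reshape that curve, here is a hypothetical tuning for a backend whose outages are usually short. The values are illustrative, not recommendations, and the waits in the comments are idealized (before jitter).

```yaml
retry_on_failure:
  enabled: true
  initial_interval: 1s  # wait before the first retry
  multiplier: 2         # idealized waits: 1s, 2s, 4s, 8s, ...
  max_interval: 10s     # the next doubling would be 16s, so waits are capped at 10s
  max_elapsed_time: 60s # give up and drop the batch after about a minute
```

Compared with the defaults, this retries sooner and more often but gives up much earlier, which frees consumers faster at the cost of dropping data during longer outages.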

The relationship between retries and the queue

Understanding how retries and the queue interact is important for capacity planning. When a consumer picks up a batch and the export fails, that consumer is busy retrying and unavailable to process other items. Meanwhile, new data keeps entering the queue.

If your backend goes down for several minutes and you only have 10 consumers, all 10 can end up stuck in retry loops. At that point no one is draining the queue, and it fills up. Once it's full, new data gets dropped (or blocked, if block_on_overflow is on).

Increasing num_consumers helps here, but only up to a point. If the backend is completely unreachable, more consumers just means more concurrent connections timing out.

A better approach is usually a combination of a larger queue_size to absorb the backlog, a reasonable max_elapsed_time so consumers don't stay stuck forever, and persistent storage so the data can survive a full restart of the Collector process.

Preventing data loss when retries are exhausted

Data that fails to export is not returned to the queue. The consumer holds onto it for the entire retry window, and if max_elapsed_time is reached, the data is dropped permanently.

The Collector logs this as "Exporting failed. Dropping data." with a specific "no more retries left" error message along with a dropped_items count:

text

2026-05-26T12:10:15.899Z        error   internal/queue_sender.go:50     Exporting failed. Dropping data.        {"resource": {"service.instance.id": "85b28da9-9258-4c3e-88fa-b937a384216d", "service.name": "otelcol-contrib", "service.version": "0.153.0"}, "otelcol.component.id": "otlp_grpc/blackhole", "otelcol.component.kind": "exporter", "otelcol.signal": "traces", "error": "no more retries left: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:55555: connect: connection refused\"", "dropped_items": 2}

For non-retryable errors (like gRPC INVALID_ARGUMENT or HTTP 400), the data is dropped immediately without any retry attempt, logged as "Exporting failed. Dropping data." with "not retryable error".

This means a backend outage can cause data loss in two places:

Consumers stuck in retry loops eventually give up and drop the data they're holding.
While those consumers are busy retrying, they aren't draining the queue. If the queue fills up before the backend recovers, incoming data gets dropped at the queue entrance (counted by otelcol_exporter_enqueue_failed_*).

Increasing num_consumers helps absorb short outages by giving more workers a chance to retry independently, but if the backend is completely unreachable, more consumers just means more connections timing out in parallel.

The real mitigation is a combination of a persistent queue (so data survives restarts), a queue_size large enough to absorb the expected backlog, and a max_elapsed_time that matches the acceptable downtime for the endpoint.

With a persistent queue configured, setting max_elapsed_time: 0 (retry indefinitely) is worth considering. The data is already on disk and will survive restarts, so there's no reason to throw it away after an arbitrary timeout. The consumer stays busy, but the data eventually gets through once the endpoint recovers.

yaml

exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
    sending_queue:
      enabled: true
      storage: file_storage/exporter_queue
      queue_size: 10000

The obvious effect during an extended outage is that nothing leaves the system while the backend is down, so the queue fills up and new incoming data gets dropped at the entrance. A larger queue_size and adequate disk space help, but the Collector's built-in queue has limits.

For pipelines where even that level of data loss is unacceptable, you can place a message queue like Kafka between the Collector and the backend so that data accumulates in Kafka's own durable storage while the backend is unavailable.
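A rough sketch of that topology uses two Collectors, one on each side of Kafka. The broker address is hypothetical, only the brokers setting is shown, and topic and encoding options are left at their defaults; check the Kafka exporter and receiver documentation for the options your version supports.

```yaml
# Collector A (edge): receives OTLP and writes to Kafka
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  kafka:
    brokers: [kafka-1.example.com:9092] # hypothetical broker address
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [kafka]
---
# Collector B (drain): reads from Kafka and forwards to the backend
receivers:
  kafka:
    brokers: [kafka-1.example.com:9092] # same hypothetical broker
exporters:
  otlp_grpc:
    endpoint: my-backend.example.com:4317
service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [otlp_grpc]
```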

Monitoring exporter health

The Collector exposes internal metrics that let you see how the exporter helper is performing in real time. A few metrics are worth watching:

otelcol_exporter_queue_size tells you how much data is sitting in the sending queue right now, measured in whatever unit sizer is set to. If the sizer is requests (the default), this counts queue entries. If it's items, it counts individual spans, data points, or log records. If it's bytes, it's the serialized payload size. Compare it to otelcol_exporter_queue_capacity (which uses the same unit) to see how full the buffer is. If the current occupancy keeps climbing toward capacity, it could mean the pipeline is under backpressure and you're at risk of dropping data.
otelcol_exporter_sent_<spans|metric_points|log_records> shows how much telemetry has been delivered successfully. In a healthy pipeline this counter would steadily increase.
otelcol_exporter_in_flight_requests counts how many export calls are currently active, including consumers stuck in retry backoff. If this number matches num_consumers, every consumer is busy and nobody is pulling new data from the queue. This could be an early warning that the backend is struggling before the queue itself starts filling up.
otelcol_exporter_send_failed_<spans|metric_points|log_records> counts failed export attempts. A rising count doesn't necessarily mean data loss (retries may still succeed), but it does signal a problem with the network or backend.
otelcol_exporter_enqueue_failed_<spans|metric_points|log_records> counts items that never made it into the queue at all. If this one starts climbing, the queue is either full or misconfigured, and you're losing data.

Watching these signals gives you an early warning system for exporter health. They tell you whether data is flowing, stuck in the queue, or being dropped outright.
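If the Collector's internal metrics are scraped by Prometheus, two of these signals could be turned into alerts along the following lines. This is a sketch in Prometheus rule-file format, not part of the Collector config: the thresholds and durations are illustrative, and it assumes counters are exposed with a _total suffix, which depends on your Collector version and telemetry settings.

```yaml
groups:
  - name: otelcol-exporter-health
    rules:
      - alert: ExporterQueueFillingUp
        # Occupancy and capacity use the same sizer unit, so the ratio is unit-free
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Sending queue is more than 80% full
      - alert: ExporterDroppingSpans
        # Any sustained enqueue failure means data is being lost at the queue entrance
        expr: rate(otelcol_exporter_enqueue_failed_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Spans are being rejected at the queue entrance
```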

Final thoughts

The exporter helper's defaults work fine for getting started, but they're not optimized for production resilience at scale. An in-memory queue with a 5-minute retry cap and no batching will lose data the first time your backend has a rough morning. Switching to a persistent queue, setting max_elapsed_time: 0, and enabling exporter-level batching are small config changes that make a real difference in how much telemetry survives an outage.

If you're sending data to an OpenTelemetry-native backend like Dash0, getting this layer right means the telemetry data you've carefully instrumented upstream doesn't get silently lost before it reaches the place where it's actually useful.

Batching, Queuing, and Retries in the OpenTelemetry Collector