Last updated: October 3, 2025

Tuning the OTLP gRPC Exporter for Resilient OpenTelemetry Pipelines

The OpenTelemetry Collector is the engine of modern observability pipelines, and the OTLP gRPC exporter is the driveshaft that keeps it moving. It is the standard and most efficient way to send telemetry from one Collector to another or out to a compatible backend. With its mix of performance, security, and reliability, it forms the backbone of most serious OpenTelemetry deployments.

This guide walks through the exporter's configuration in depth, starting with the basics before moving into advanced tuning for security, resilience, and high-throughput scenarios.

By the end, you'll be able to confidently configure the OTLP exporter for any environment, whether it's a lightweight agent on a single host or a complex, multi-stage, load-balanced pipeline.

Let's begin!

Quick start: sending traces to Jaeger

The OTLP exporter has two fundamental requirements: knowing where to send data (endpoint) and how to secure the connection (tls).

To see it in action, let's create a complete pipeline using Docker Compose. This setup will:

  1. Use telemetrygen to create and send trace data.
  2. Use the OpenTelemetry Collector to receive those traces through the OTLP receiver.
  3. Forward the traces from the Collector to a Jaeger instance using the otlp exporter.

Here's the Collector configuration you'll need. Save it as otelcol.yaml in your working directory:

yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

Next, create a docker-compose.yml file in the same directory that defines the three services and links them.

In this setup, telemetrygen sends traces to the Collector on port 4317, which then forwards them to Jaeger. We set tls.insecure: true to keep this local demo simple, but TLS should always remain enabled in production.

yaml
services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.136.0
    container_name: otelcol
    volumes:
      - ./otelcol.yaml:/etc/otelcol-contrib/config.yaml
    restart: unless-stopped
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/jaeger:2.10.0
    container_name: jaeger
    ports:
      - 16686:16686

  telemetrygen:
    image: ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:v0.136.0
    container_name: telemetrygen
    restart: unless-stopped
    command: ["traces", "--rate", "10", "--duration", "1h", "--otlp-endpoint", "otelcol:4317", "--otlp-insecure"]
    depends_on:
      - otelcol

With both files in place, start the services with:

bash
docker compose up -d

Once all three services are up and running, you can verify the pipeline by navigating to http://localhost:16686 in your browser. Select the telemetrygen service from the search panel to see the incoming traces.

Incoming telemetrygen traces in Jaeger

Setting up the OTLP Exporter

The OTLP exporter comes with quite a few knobs you can turn, but in most setups you'll only need to worry about a handful of core settings.

endpoint

This is where you point the exporter. It is simply the host:port of the gRPC server you want to send data to. DNS names work as well, which is helpful if you are using load balancing:

yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317

See here for the full list of valid values.
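
The endpoint follows standard gRPC target-name syntax, so forms beyond a plain host:port should also work. A quick sketch (the hostnames and socket path are placeholders):

yaml
exporters:
  # Plain host and port
  otlp:
    endpoint: my-backend.example.com:4317
  # Explicit DNS scheme, which pairs well with client-side load balancing
  otlp/dns:
    endpoint: dns:///my-backend.example.com:4317
  # Unix domain socket on the same host
  otlp/socket:
    endpoint: unix:///var/run/otelcol/otlp.sock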

headers

If you need to attach extra metadata with your requests, such as an API key for authentication, you can set custom gRPC headers. These headers are included with every outgoing request:

yaml
exporters:
  otlp:
    endpoint: ingress.eu-west-1.aws.dash0.com:4317
    headers:
      # The header key is case-insensitive
      Authorization: "Bearer <your-secret-api-key>"
      Dash0-Dataset: "<dash0-demo>"
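
To avoid hard-coding the API key in the file, you can rely on the Collector's environment variable substitution. A sketch, assuming a variable named DASH0_AUTH_TOKEN (the name is arbitrary) is set for the Collector process:

yaml
exporters:
  otlp:
    endpoint: ingress.eu-west-1.aws.dash0.com:4317
    headers:
      # DASH0_AUTH_TOKEN is a placeholder; inject it from your secret store
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"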

compression

Sending telemetry over the network can use a lot of bandwidth, so the exporter supports compressing data before it is sent. You can choose from several options depending on your needs:

  • gzip (default): A great all-around choice that offers good compression ratios with reasonable CPU usage. Start with this unless you have a specific reason not to.
  • snappy: A faster compression algorithm that uses less CPU than gzip but results in larger data sizes. Consider this if your Collector is CPU-bound and you have ample network bandwidth.
  • zstd: Often provides better compression ratios than gzip at similar or even faster speeds. It's an excellent choice if it's supported by the receiving endpoint.
  • none: No compression. Works fine on fast, low-cost networks (for example within the same VPC) where you want to minimize CPU usage.

yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    compression: zstd

Securing the connection with TLS

Although telemetry data should not include sensitive information, it still provides insights into how your systems behave. The tls block lets you configure exactly how the connection should be secured to prevent exposure or tampering.

In the most common setup, you will be sending data to a secure public endpoint that uses a valid certificate from a trusted Certificate Authority (CA). In this case, the default settings are often sufficient:

yaml
exporters:
  otlp:
    endpoint: secure-endpoint.com:4317
    # tls is implicitly enabled by default

If you need stronger security, such as when one Collector is sending data to another, you can enable mutual TLS (mTLS). With mTLS, both the client (the exporter) and the server (the receiver) check and verify each other's certificates.

To set this up, you need to provide a client certificate and private key, as well as the CA certificate that signed the server's certificate:

yaml
exporters:
  otlp:
    endpoint: internal-gateway.my-corp:4317
    tls:
      # CA certificate to verify the server's identity
      ca_file: /etc/ssl/certs/ca.pem
      # Client certificate for the server to verify our identity
      cert_file: /etc/ssl/certs/client.pem
      key_file: /etc/ssl/private/client.key

On the receiver side, you must configure it to trust the CA that signed the client certificate.
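
As a sketch of what that looks like on the receiving Collector (the certificate paths are placeholders):

yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          # Server certificate and key presented to connecting exporters
          cert_file: /etc/ssl/certs/server.pem
          key_file: /etc/ssl/private/server.key
          # CA bundle used to verify client certificates (this enables mTLS)
          client_ca_file: /etc/ssl/certs/ca.pem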

For other TLS configuration settings, be sure to read this document.

The critical role of the batch processor

Before getting into queuing and retries, it's important to clarify what the OTLP exporter actually sends. By default, the exporter forwards whatever comes in: that could be individual spans, metrics, or logs.

In practice, almost every production deployment enables the batch processor, which groups signals together before they reach the exporter. With batching enabled, the exporter works with larger, pre-packaged groups of data instead of single items to improve network efficiency and compression.

yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
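
The snippet above only wires the processor into the pipeline; the processor itself must also be defined. A minimal sketch, with explicit values close to the documented defaults (exact defaults can vary between Collector versions):

yaml
processors:
  batch:
    # Flush a batch once it holds this many items...
    send_batch_size: 8192
    # ...or once this much time has passed since the first item arrived
    timeout: 200ms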

The key takeaway is that the exporter does not operate on individual signals unless you skip batching. In most cases, it holds batches created by the processor. The size and timing of those batches are major factors in exporter performance, memory use, and network efficiency.

Batch processor tuning is a deep topic, so we cover it separately. For the rest of this article, just remember that the exporter is usually working with batches, and the queuing and retry logic applies to those batches.

Building resilient pipelines with queuing and retries

What happens if the backend is down or the network is flaky? Without the right configuration, you will lose valuable data.

To protect against this, the OTLP exporter uses the exporterhelper framework, which adds queuing and retry logic so your telemetry has a much better chance of making it through.

You can think of it as a two-stage defense system:

  1. Retry mechanism: If a batch fails to send, the exporter will automatically retry with an exponential backoff delay.
  2. Sending queue: If new data arrives while the exporter is busy retrying, it gets placed in a queue instead of being dropped.

Let's take a closer look at the specific settings you can use to control how retries behave and ensure your data has the best chance of getting through.

retry_on_failure

The retry mechanism is your first safeguard when something goes wrong, and it is turned on by default. It's designed to handle transient errors like a brief network failure, a rolling restart of the backend, or a momentary spike in load.

Note that the exporter doesn't retry every error. It will only retry on gRPC status codes that indicate a temporary, recoverable problem. These include:

  • UNAVAILABLE: The service is temporarily unavailable (most common).
  • RESOURCE_EXHAUSTED: A downstream service is temporarily overloaded.
  • ABORTED: The operation was aborted, often due to a concurrency issue.

You can fine-tune how this works by adjusting a few key settings:

  • enabled: Controls whether retries are active. It is set to true by default, and you'll almost always want to keep it that way.
  • initial_interval: How long the exporter waits before making the very first retry. The default is 5s, but you can shorten it if you want a faster recovery at the risk of overloading a struggling backend with requests.
  • multiplier: Each retry interval is multiplied by this factor (1.5 by default), which is what creates the exponential backoff pattern.
  • max_interval: This is the longest delay allowed between retries. By default it's set to 30s to prevent the backoff from growing too large.
  • max_elapsed_time: The maximum total time the exporter will spend retrying a single batch. The default is 300s (5 minutes), after which the batch is dropped. Setting it to 0 means retries will continue indefinitely.

With these settings in place, it helps to picture how retries actually play out. Suppose you configure:

yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    retry_on_failure:
      enabled: true
      initial_interval: 1s # Start retrying faster
      multiplier: 2.0 # Double the wait time on each attempt
      max_interval: 10s
      max_elapsed_time: 300s # Give up after 5 minutes

The timing would look like this:

  • 1s after the first failure
  • 2s after the second failure
  • 4s after the third
  • 8s after the fourth
  • 10s after the fifth (capped by max_interval)
  • 10s for every attempt after that, until the 300s total is reached

sending_queue

The sending queue acts as a shock absorber between receivers and exporters. Instead of dropping data when the exporter is busy retrying or the backend slows down, the queue temporarily holds batches in memory. This makes pipelines much more resilient during spikes or outages.

This queue supports many options, but here are some of the key ones you can adjust:

  • enabled: Almost always left on. Disabling it means data will be dropped as soon as the exporter is back-pressured.
  • queue_size: Controls how many batches can sit in the queue. A larger buffer is useful if you expect traffic spikes or short backend slowdowns, but it also means higher memory usage.
  • num_consumers: Defines how many workers are draining the queue in parallel (default: 10). Raising this increases throughput, but if the backend is already saturated it may just increase load without benefit.

Imagine a sudden burst of telemetry that produces 3,000 batches in a short window. With the default queue_size of 1,000, two-thirds of that data would be dropped immediately. Bumping the queue to 5,000 lets you ride out the spike without data loss, as long as the backlog drains before the next surge.

yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      queue_size: 5000 # Allow a larger buffer for spiky traffic
      num_consumers: 20 # Increase parallelism for higher throughput

Surviving restarts with a persistent queue

An in-memory queue works well for short hiccups, but it disappears the moment the Collector restarts or crashes. That means any data still in the buffer is lost. For more reliable delivery, switch to a persistent queue that writes data to disk.

This requires configuring a storage extension, typically file_storage, which persists state to the filesystem so it can be recovered later:

yaml
# 1. Define a storage extension
extensions:
  file_storage:
    # Directory on disk where queued data is persisted
    directory: /var/lib/otelcol/storage
    timeout: 1s

exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      # 2. Tell the queue to use the storage extension
      storage: file_storage
      # The queue_size now refers to the number of batches on disk
      queue_size: 10000

service:
  # 3. Enable the extension in the service
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

With this configuration, if the Collector is shut down while data is in the persistent queue, it will be preserved on disk and exported automatically once the Collector restarts.
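
If you run the Collector in a container, as in the quick start above, the storage directory itself must also survive container restarts. A sketch of the relevant Compose service, assuming a named volume (the volume name is arbitrary):

yaml
services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.136.0
    volumes:
      - ./otelcol.yaml:/etc/otelcol-contrib/config.yaml
      # Persist the file_storage directory across container restarts
      - otelcol-queue:/var/lib/otelcol/storage

volumes:
  otelcol-queue: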

Performance tuning & load balancing

In high-throughput environments, small configuration choices can make a big difference. Two important areas to pay attention to are gRPC settings and client-side load balancing.

balancer_name

When your endpoint DNS name resolves to multiple IP addresses, the balancer_name setting controls how the exporter connects to them.

  • pick_first (legacy default): The exporter connects to the first IP address it receives from DNS and uses that single connection exclusively. There is no load balancing.
  • round_robin (current default): The exporter connects to all resolved IP addresses and distributes gRPC calls across them in a round-robin fashion. This is ideal for sending data to a fleet of stateless Collectors.

To make round_robin effective, your endpoint should resolve to multiple addresses. In Kubernetes, this usually means using a headless service. In other environments, you can achieve the same with a DNS record containing multiple A or AAAA records.

yaml
exporters:
  otlp:
    endpoint: dns:///my-collectors-headless.default.svc.cluster.local:4317
    balancer_name: round_robin

With this setup, you'll spread the load evenly across a fleet of Collectors, improving throughput and resilience without needing an external load balancer.
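
On Kubernetes, the headless service backing the endpoint above might look like this sketch (the name and namespace match the example endpoint; the selector label is a placeholder for your own Collector deployment):

yaml
apiVersion: v1
kind: Service
metadata:
  name: my-collectors-headless
  namespace: default
spec:
  # clusterIP: None makes the service headless, so DNS returns the
  # individual Collector pod IPs rather than a single virtual IP
  clusterIP: None
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317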

Advanced gRPC settings

Most environments work fine with the defaults, but in high-throughput or tricky network setups, a few advanced gRPC options can help improve reliability and performance.

  • keepalive: These settings define how often the client checks in with the server to confirm the connection is still alive. They are especially useful if you are going through firewalls or load balancers that silently drop idle connections. By sending regular pings, you can detect and recover from dead connections more quickly than with standard TCP keepalives.
  • write_buffer_size: Controls the size of the TCP write buffer. For most deployments, you do not need to change this. Only adjust it if you have done network performance testing and see evidence that the buffer is a bottleneck.

yaml
exporters:
  otlp:
    endpoint: ...
    keepalive:
      # Ping the server every 15 seconds if no other traffic
      time: 15s
      # Wait 5 seconds for the ping ack before closing the connection
      timeout: 5s
      # Send pings even when there are no active streams
      permit_without_stream: true

These options are best thought of as tuning levers: helpful if you are troubleshooting flaky connections in production, but unnecessary for most standard deployments.

Monitoring exporter health

After tuning your pipeline for security, resilience, and performance, how do you verify that it's working as expected in the real world? The answer is to monitor the exporter itself.

The OpenTelemetry Collector exposes internal metrics that let you see the health and performance of your pipeline in real time. A few metrics in particular are worth watching:

  • otelcol_exporter_queue_size: The number of batches currently in the sending_queue. Compare this to otelcol_exporter_queue_capacity (the queue's total capacity) to understand how full the buffer is. A steadily growing queue size means the pipeline is under backpressure and at risk of hitting limits.

  • otelcol_exporter_sent_<spans|metric_points|log_records>: Shows how much telemetry has been delivered successfully. In a healthy pipeline, this number should steadily increase.

  • otelcol_exporter_send_failed_<spans|metric_points|log_records>: Counts failed export attempts. These metrics do not inherently imply data loss, since failed batches may still be retried, but a rising count points to a problem with your data, network, or backend.

  • otelcol_exporter_enqueue_failed_<spans|metric_points|log_records>: Counts telemetry items that never made it into the queue. If this increases, it usually indicates that the sending queue is full or misconfigured, leading to actual data loss.

Watching these signals gives you an early warning system for exporter health. They tell you whether data is flowing smoothly, stuck in the queue, or being dropped outright, allowing you to react quickly and prevent further loss.
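
To collect these metrics, the Collector's self-monitoring endpoint needs to be reachable by your metrics pipeline. The configuration schema for internal telemetry has changed across releases; on recent versions, a Prometheus endpoint on port 8888 can be exposed roughly like this (treat the exact keys as version-dependent):

yaml
service:
  telemetry:
    metrics:
      # Controls how much internal telemetry the Collector emits
      level: normal
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888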

Final thoughts

Getting telemetry out of the Collector is not just about flipping a switch. It is about making sure the data is delivered consistently, securely, and at scale. The OTLP gRPC exporter provides the tools to achieve this with support for retries, queuing, encryption, and load balancing. When tuned properly, it can handle production traffic without dropping critical signals.

Once the pipeline is reliable, the real work begins: turning raw telemetry into insights. By sending data to an OpenTelemetry-native backend like Dash0, you can take full advantage of the context preserved along the way. That means faster detection, clearer understanding, and quicker fixes when things go wrong.

Thanks for reading!
