Last updated: October 3, 2025
Tuning the OTLP gRPC Exporter for Resilient OpenTelemetry Pipelines
The OpenTelemetry Collector is the engine of modern observability pipelines, and the OTLP gRPC exporter is the driveshaft that keeps it moving. It is the standard and most efficient way to send telemetry from one Collector to another or out to a compatible backend. With its mix of performance, security, and reliability, it forms the backbone of most serious OpenTelemetry deployments.
This guide walks through the exporter's configuration in depth, starting with the basics before moving into advanced tuning for security, resilience, and high-throughput scenarios.
By the end, you'll be able to confidently configure the OTLP exporter for any environment, whether it's a lightweight agent on a single host or a complex, multi-stage, load-balanced pipeline.
Let's begin!
Quick start: sending traces to Jaeger
The OTLP exporter has two fundamental requirements: knowing where to send data (`endpoint`) and how to secure the connection (`tls`).
To see it in action, let's create a complete pipeline using Docker Compose. This setup will:
- Use telemetrygen to create and send trace data.
- Use the OpenTelemetry Collector to receive those traces through the OTLP receiver.
- Forward the traces from the Collector to a Jaeger instance using the `otlp` exporter.
Here's the Collector configuration you'll need:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```
Next, create a `docker-compose.yml` file in the same directory that defines the three services and links them.
In this setup, the instrumented application sends traces to the Collector on port 4317, which then forwards them to Jaeger. We set `tls.insecure: true` to keep this local development setup simple, but TLS should always be enabled in production.
```yaml
services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.136.0
    container_name: otelcol
    volumes:
      - ./otelcol.yaml:/etc/otelcol-contrib/config.yaml
    restart: unless-stopped
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/jaeger:2.10.0
    container_name: jaeger
    ports:
      - 16686:16686

  telemetrygen:
    image: ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:v0.136.0
    container_name: telemetrygen
    restart: unless-stopped
    command: ["traces", "--rate", "10", "--duration", "1h", "--otlp-endpoint", "otelcol:4317", "--otlp-insecure"]
    depends_on:
      - otelcol
```
With both files in place, start the services with:
```bash
docker compose up -d
```
Once all three services are up and running, you can verify the pipeline by navigating to http://localhost:16686 in your browser. Select the `telemetrygen` service from the search panel to see the incoming traces.
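If you prefer the command line, you can also confirm that Jaeger has received data by querying its HTTP API (the `/api/services` path belongs to Jaeger's query service; adjust if your version differs):

```bash
# List the services Jaeger knows about; "telemetrygen" should appear
# once traces are flowing through the Collector
curl -s http://localhost:16686/api/services
```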
Setting up the OTLP exporter
The OTLP exporter comes with quite a few knobs you can turn, but in most setups you'll only need to worry about a handful of core settings.
endpoint
This is where you point the exporter. It is simply the `host:port` of the gRPC server you want to send data to. DNS names work as well, which is helpful if you are using load balancing:
```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
```
See here for the full list of valid values.
headers
If you need to attach extra metadata with your requests, such as an API key for authentication, you can set custom gRPC headers. These headers are included with every outgoing request:
```yaml
exporters:
  otlp:
    endpoint: ingress.eu-west-1.aws.dash0.com:4317
    headers:
      # The header key is case-insensitive
      Authorization: "Bearer <your-secret-api-key>"
      Dash0-Dataset: "<dash0-demo>"
```
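Rather than hard-coding secrets in the configuration file, you can lean on the Collector's environment variable substitution. A minimal sketch, assuming a `DASH0_AUTH_TOKEN` variable is set in the Collector's environment:

```yaml
exporters:
  otlp:
    endpoint: ingress.eu-west-1.aws.dash0.com:4317
    headers:
      # DASH0_AUTH_TOKEN is an assumed variable name; the Collector
      # resolves ${env:VAR} references when it loads the configuration
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"
```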
compression
Sending telemetry over the network can use a lot of bandwidth, so the exporter supports compressing data before it is sent. You can choose from several options depending on your needs:
- `gzip` (default): A great all-around choice that offers good compression ratios with reasonable CPU usage. Start with this unless you have a specific reason not to.
- `snappy`: A faster compression algorithm that uses less CPU than `gzip` but produces larger payloads. Consider this if your Collector is CPU-bound and you have ample network bandwidth.
- `zstd`: Often provides better compression ratios than `gzip` at similar or even faster speeds. It's an excellent choice if it's supported at the endpoint.
- `none`: No compression. Works fine on fast, low-cost networks (for example, within the same VPC) where you want to minimize CPU usage.
```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    compression: zstd
```
Securing the connection with TLS
Although telemetry data should not include sensitive information, it still provides insights into how your systems behave. The `tls` block lets you configure exactly how the connection should be secured to prevent exposure or tampering.
In the most common setup, you will be sending data to a secure public endpoint that uses a valid certificate from a trusted Certificate Authority (CA). In this case, the default settings are often sufficient:
```yaml
exporters:
  otlp:
    endpoint: secure-endpoint.com:4317
    # tls is implicitly enabled by default
```
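If your backend presents a certificate issued by a private or internal CA, you typically only need to point the exporter at that CA bundle. A minimal sketch, assuming the CA certificate is available on disk at the path shown:

```yaml
exporters:
  otlp:
    endpoint: internal-backend.my-corp:4317
    tls:
      # Trust the internal CA that signed the backend's certificate
      ca_file: /etc/ssl/certs/internal-ca.pem
```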
If you need stronger security, such as when one Collector is sending data to another, you can enable mutual TLS (mTLS). With mTLS, both the client (the exporter) and the server (the receiver) check and verify each other's certificates.
To set this up, you need to provide a client certificate and private key, as well as the CA certificate that signed the server's certificate:
```yaml
exporters:
  otlp:
    endpoint: internal-gateway.my-corp:4317
    tls:
      # CA certificate to verify the server's identity
      ca_file: /etc/ssl/certs/ca.pem
      # Client certificate for the server to verify our identity
      cert_file: /etc/ssl/certs/client.pem
      key_file: /etc/ssl/private/client.key
```
On the receiver side, you must configure it to trust the CA that signed the client certificate.
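For reference, the matching receiver-side setup might look like the following sketch, which uses the OTLP receiver's TLS settings (file paths are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          # Certificate and key the receiver presents to clients
          cert_file: /etc/ssl/certs/server.pem
          key_file: /etc/ssl/private/server.key
          # CA used to verify client certificates, enabling mTLS
          client_ca_file: /etc/ssl/certs/ca.pem
```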
For other TLS configuration settings, be sure to read this document.
The critical role of the batch processor
Before getting into queuing and retries, it's important to clarify what the OTLP exporter actually sends. By default, the exporter forwards whatever comes in: that could be individual spans, metrics, or logs.
In practice, almost every production deployment enables the batch processor, which groups signals together before they reach the exporter. With batching enabled, the exporter works with larger, pre-packaged groups of data instead of single items to improve network efficiency and compression.
```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```
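For completeness, the batch processor referenced in the pipeline also needs to be declared in the processors section. A minimal sketch with two commonly tuned settings (the values shown are illustrative, not recommendations):

```yaml
processors:
  batch:
    # Flush a batch once it accumulates this many items...
    send_batch_size: 8192
    # ...or once this much time has passed, whichever comes first
    timeout: 5s
```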
The key takeaway is that the exporter does not operate on individual signals unless you skip batching. In most cases, it holds batches created by the processor. The size and timing of those batches are major factors in exporter performance, memory use, and network efficiency.
Batch processor tuning is a deep topic, so we cover it separately. For the rest of this article, just remember that the exporter is usually working with batches, and the queuing and retry logic applies to those batches.
Building resilient pipelines with queuing and retries
What happens if the backend is down or the network is flaky? Without the right configuration, you will lose valuable data.
To protect against this, the OTLP exporter uses the exporterhelper framework, which adds queuing and retry logic so your telemetry has a much better chance of making it through.
You can think of it as a two-stage defense system:
- Retry mechanism: If a batch fails to send, the exporter will automatically retry with an exponential backoff delay.
- Sending queue: If new data arrives while the exporter is busy retrying, it gets placed in a queue instead of being dropped.
Let's take a closer look at the specific settings you can use to control how retries behave and ensure your data has the best chance of getting through.
retry_on_failure
The retry mechanism is your first safeguard when something goes wrong, and it is turned on by default. It's designed to handle transient errors like a brief network failure, a rolling restart of the backend, or a momentary spike in load.
Note that the exporter doesn't retry every error. It will only retry on gRPC status codes that indicate a temporary, recoverable problem. These include:
- `UNAVAILABLE`: The service is temporarily unavailable (most common).
- `RESOURCE_EXHAUSTED`: A downstream service is temporarily overloaded.
- `ABORTED`: The operation was aborted, often due to a concurrency issue.
You can fine-tune how this works by adjusting a few key settings:
- `enabled`: Controls whether retries are active. It is set to `true` by default, and you'll almost always want to keep it that way.
- `initial_interval`: How long the exporter waits before making the very first retry. The default is `5s`, but you can shorten it if you want faster recovery, at the risk of overloading a struggling backend with requests.
- `multiplier`: Each retry interval is multiplied by this factor (`1.5` by default), which is what creates the exponential backoff pattern.
- `max_interval`: The longest delay allowed between retries. By default it's set to `30s` to prevent the backoff from growing too large.
- `max_elapsed_time`: The maximum total time the exporter will spend retrying a single batch. The default is `300s` (5 minutes), after which the batch is dropped. Setting it to `0` means retries will continue indefinitely.
With these settings in place, it helps to picture how retries actually play out. Suppose you configure:
```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    retry_on_failure:
      enabled: true
      initial_interval: 1s  # Start retrying faster
      multiplier: 2.0       # Double the wait time on each attempt
      max_interval: 10s
      max_elapsed_time: 300s  # Give up after 5 minutes
```
The timing would look like this:
- 1s after the first failure
- 2s after the second failure
- 4s after the third
- 8s after the fourth
- 10s after the fifth (capped by `max_interval`)
- Further attempts also wait 10s, until the 300s total is reached.
sending_queue
The sending queue acts as a shock absorber between receivers and exporters. Instead of dropping data when the exporter is busy retrying or the backend slows down, the queue temporarily holds batches in memory. This makes pipelines much more resilient during spikes or outages.
This queue supports many options, but here are some of the key ones you can adjust:
- `enabled`: Almost always left on. Disabling it means data will be dropped as soon as the exporter is back-pressured.
- `queue_size`: Controls how many batches can sit in the queue. A larger buffer is useful if you expect traffic spikes or short backend slowdowns, but it also means higher memory usage.
- `num_consumers`: Defines how many workers drain the queue in parallel (default: `10`). Raising this increases throughput, but if the backend is already saturated it may just increase load without benefit.
Imagine a sudden burst of telemetry that produces 3,000 batches in a short window. With the default `queue_size` of 1,000, up to two-thirds of that data could be dropped if the exporter cannot drain the backlog fast enough. Bumping the queue to 5,000 lets you ride out the spike without data loss, as long as the backlog drains before the next surge.
```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      queue_size: 5000   # Allow a larger buffer for spiky traffic
      num_consumers: 20  # Increase parallelism for higher throughput
```
Surviving restarts with a persistent queue
An in-memory queue works well for short hiccups, but it disappears the moment the Collector restarts or crashes. That means any data still in the buffer is lost. To ensure more reliable data delivery, you must switch to a persistent queue that writes data to disk.
This requires configuring a `storage` extension, typically `file_storage`, which persists state to the filesystem so it can be recovered later:
```yaml
# 1. Define a storage extension
extensions:
  file_storage:
    # default settings
    directory: /var/lib/otelcol/storage
    timeout: 1s

exporters:
  otlp:
    endpoint: my-backend.example.com:4317
    sending_queue:
      enabled: true
      # 2. Tell the queue to use the storage extension
      storage: file_storage
      # The queue_size now refers to the number of batches on disk
      queue_size: 10000

service:
  # 3. Enable the extension in the service
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```
With this configuration, if the Collector is shut down while data is in the persistent queue, it will be preserved on disk and exported automatically once the Collector restarts.
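If the Collector runs in a container, the storage directory itself also needs to survive container restarts. One way to achieve that, sketched here with an assumed named volume in Docker Compose:

```yaml
services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.136.0
    volumes:
      - ./otelcol.yaml:/etc/otelcol-contrib/config.yaml
      # Persist the file_storage directory across container restarts
      - otelcol-queue:/var/lib/otelcol/storage

volumes:
  otelcol-queue:
```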
Performance tuning & load balancing
In high-throughput environments, small configuration choices can make a big difference. Two important areas to pay attention to are gRPC settings and client-side load balancing.
balancer_name
When your `endpoint` DNS name resolves to multiple IP addresses, the `balancer_name` setting controls how the exporter connects to them.
- `pick_first` (legacy default): The exporter connects to the first IP address it receives from DNS and uses that single connection exclusively. There is no load balancing.
- `round_robin` (current default): The exporter connects to all resolved IP addresses and distributes gRPC calls across them in a round-robin fashion. This is ideal for sending data to a fleet of stateless Collectors.
To make `round_robin` effective, your endpoint should resolve to multiple addresses. In Kubernetes, this usually means using a headless service. In other environments, you can achieve the same with a DNS record containing multiple `A` or `AAAA` records.
```yaml
exporters:
  otlp:
    endpoint: dns:///my-collectors-headless.default.svc.cluster.local:4317
    balancer_name: round_robin
```
With this setup, you'll spread the load evenly across a fleet of Collectors, improving throughput and resilience without needing an external load balancer.
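If you are running on Kubernetes, the headless service mentioned above is simply a Service with `clusterIP` set to `None`, so DNS returns the individual pod IPs instead of a single virtual IP. A minimal sketch (the name, namespace, and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-collectors-headless
  namespace: default
spec:
  # Headless: DNS resolves to the pod IPs behind the selector
  clusterIP: None
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```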
Advanced gRPC settings
Most environments work fine with the defaults, but in high-throughput or tricky network setups, a few advanced gRPC options can help improve reliability and performance.
- `keepalive`: These settings define how often the client checks in with the server to confirm the connection is still alive. They are especially useful if you are going through firewalls or load balancers that silently drop idle connections. By sending regular pings, you can detect and recover from dead connections more quickly than with standard TCP keepalives.
- `write_buffer_size`: Controls the size of the TCP write buffer. For most deployments, you do not need to change this. Only adjust it if you have done network performance testing and see evidence that the buffer is a bottleneck.
```yaml
exporters:
  otlp:
    endpoint: ...
    keepalive:
      # Ping the server every 15 seconds if no other traffic
      time: 15s
      # Wait 5 seconds for the ping ack before closing the connection
      timeout: 5s
      # Send pings even when there are no active streams
      permit_without_stream: true
```
These options are best thought of as tuning levers: helpful if you are troubleshooting flaky connections in production, but unnecessary for most standard deployments.
Monitoring exporter health
After tuning your pipeline for security, resilience, and performance, how do you verify that it's working as expected in the real world? The answer is to monitor the exporter itself.
The OpenTelemetry Collector exposes internal metrics that let you see the health and performance of your pipeline in real time. Four metrics in particular are worth watching:

- `otelcol_exporter_queue_size`: The number of batches currently in the `sending_queue`. Compare this to `otelcol_exporter_queue_capacity` (the queue's total capacity) to understand how full the buffer is. A steadily growing queue size means the pipeline is under backpressure and at risk of hitting its limits.
- `otelcol_exporter_sent_<spans|metric_points|log_records>`: Shows how much telemetry has been delivered successfully. In a healthy pipeline, this number should steadily increase.
- `otelcol_exporter_send_failed_<spans|metric_points|log_records>`: Counts failed export attempts. These do not necessarily imply data loss, since failed batches may still be retried, but a rising number points to a problem with your data, network, or backend.
- `otelcol_exporter_enqueue_failed_<spans|metric_points|log_records>`: Counts telemetry items that never made it into the queue. If this increases, it usually means the sending queue is full or misconfigured, which does result in actual data loss.
Watching these signals gives you an early warning system for exporter health. They tell you whether data is flowing smoothly, stuck in the queue, or being dropped outright, allowing you to react quickly to prevent further loss.
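As a starting point for alerting, here is a hedged Prometheus rule sketch built on the queue metrics above; the exact metric names and labels depend on your Collector version and how you scrape its internal telemetry:

```yaml
groups:
  - name: otelcol-exporter
    rules:
      - alert: ExporterQueueNearlyFull
        # Fires when the sending queue has been over 80% full for 5 minutes
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTLP exporter sending_queue is over 80% full"
```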
Final thoughts
Getting telemetry out of the Collector is not just about flipping a switch. It is about making sure the data is delivered consistently, securely, and at scale. The OTLP gRPC exporter provides the tools to achieve this with support for retries, queuing, encryption, and load balancing. When tuned properly, it can handle production traffic without dropping critical signals.
Once the pipeline is reliable, the real work begins: turning raw telemetry into insights. By sending data to an OpenTelemetry-native backend like Dash0, you can take full advantage of the context preserved along the way. That means faster detection, clearer understanding, and quicker fixes when things go wrong.
Thanks for reading!