In my previous blog post, I showed how Dapr and OpenTelemetry work together to create a unified picture of your microservices system. We looked at how to stitch signals from your application code together with the signals coming from the Dapr sidecar, using the OpenTelemetry Collector and Dash0 to bring everything into one coherent view.
That post focused on request-response interactions between services. This one goes a level deeper. Over the past few months, Mauricio Salatino and I have been experimenting with one of the most challenging parts of Dapr’s architecture from an observability perspective. We wanted to understand how Dapr Workflows could be observed with OpenTelemetry, and more specifically, whether it was possible to achieve end-to-end tracing without putting the burden on application developers.
This blog documents the result of that work.
Why tracing asynchronous workflows is harder than it looks
Asynchronous workflows are powerful. They help you coordinate long-running work, orchestrate services, handle retries and compensation, and create repeatable, fault-tolerant business processes. Dapr Workflows make this easy. You write orchestrators and activities in your language of choice while the Dapr sidecar takes responsibility for state, durability, and execution.
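If you have not seen Dapr Workflows before, the following sketch shows roughly what an orchestrator and an activity look like with the Dapr Java SDK. Treat it as illustrative rather than copy-paste ready: the exact class and method names can differ between SDK versions, and OrderWorkflow and NotifyActivity are made-up names for this post.
```java
import io.dapr.workflows.Workflow;
import io.dapr.workflows.WorkflowStub;
import io.dapr.workflows.runtime.WorkflowActivity;
import io.dapr.workflows.runtime.WorkflowActivityContext;

// Orchestrator: describes the control flow; the sidecar persists and replays it.
public class OrderWorkflow extends Workflow {
  @Override
  public WorkflowStub create() {
    return ctx -> {
      String order = ctx.getInput(String.class);
      // Schedule an activity and wait for its result; durability is the sidecar's job.
      String result = ctx.callActivity(NotifyActivity.class.getName(), order, String.class).await();
      ctx.complete(result);
    };
  }
}

// Activity: plain application code, free to make outbound HTTP calls.
class NotifyActivity implements WorkflowActivity {
  @Override
  public Object run(WorkflowActivityContext ctx) {
    String order = ctx.getInput(String.class);
    return "notified for " + order;
  }
}
```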
But as soon as you put part of your control flow inside a workflow engine, you shift part of the request’s lifecycle outside your application code. When something goes wrong, you want answers. Why is this workflow stuck? Which activity took too long? What happened after the request entered the workflow? And at that moment you realize that your traces only cover half of the story.
By default, Dapr emits spans for workflow orchestration. You can see when a workflow instance is created, when an activity was scheduled, and when it completed. But when the activity code inside the application makes an HTTP call, that call appears as a completely separate trace. There is no parent, no shared trace ID, and no relationship to the workflow span. From an SRE or platform engineering perspective, this is almost worse than having no tracing at all. You know something happened, but the connection between cause and effect is missing.
Mauricio and I started asking ourselves a simple question: if a workflow activity makes an HTTP call, why shouldn’t that call be part of the same trace that started with the initial HTTP request? Why can’t Dapr workflows be first-class citizens in distributed tracing, just like any other service invocation?
It turned out to be a surprisingly deep problem.
Understanding how Dapr workflows actually run
To understand why tracing was broken, we need to look at how Dapr Workflows execute under the hood. This part often surprises people.
First, all workflow orchestration runs inside the Dapr sidecar. The workflow engine is implemented using the durabletask-go library and is embedded directly inside the daprd process. There is no external workflow backend to deploy or monitor. Each sidecar contains everything it needs to run workflows, respond to events, and manage durability through Dapr actors and the configured state store.
Second, the workflow code – the orchestrator function and the activity functions – lives inside your application process. If you are using Java, the durabletask-java library is part of your application’s Dapr SDK. It connects to the sidecar at startup, registers workflows and activities, and then waits.
Finally, the communication between your app’s SDK and the sidecar’s embedded workflow engine happens over a single long-lived gRPC stream. When the application starts, the Java SDK opens this stream. The sidecar then pushes workflow work items over it: start this workflow, run this activity, resume this step, and so on. The SDK receives these instructions, executes your workflow or activity code, and streams the result back to the sidecar. There are no individual gRPC calls per activity. Everything happens on the same stream.
This architecture is elegant, efficient, and makes the workflow engine portable and language-agnostic. But it is also exactly why distributed tracing does not work automatically. For a deeper look at how Dapr Workflows are built and executed, see the architecture documentation.
The goal: a single trace per workflow execution
From the beginning, our target experience was clear. If the frontend triggers a workflow, everything that happens inside that workflow should be part of one trace. The orchestration inside the sidecar, the activity execution inside the app, and the HTTP calls made by that activity should all be stitched together.
To get there, we had to push trace context through three layers:
- The durabletask-go workflow engine inside the sidecar
- The Dapr runtime’s own workflow spans
- The durabletask-java client inside the user application
Each layer was losing a piece of the puzzle. Fixing them required changes in all three. The sequence diagram in the previous section gives a precise visual of where each layer can lose track of the parent context. We used this goal - a single trace per workflow execution - as the reference point for identifying exactly where the sidecar, the runtime, and the SDK needed to be updated to preserve the trace.
Where tracing breaks: the gRPC streaming gap
In synchronous systems, trace context propagates through well-understood channels and is automatically picked up by instrumentation. HTTP calls carry traceparent headers, gRPC unary calls carry metadata, and messaging systems carry tracing headers alongside the message body.
Automatic instrumentation - such as the OpenTelemetry Java agent - extracts this context, makes it current on the executing thread, and uses it to correctly parent new spans. As long as execution stays within these well-defined request boundaries, tracing works with little to no manual effort. Even then, this only works because the execution model aligns with the assumptions of the instrumentation: a request arrives, context is activated, work happens on the same thread (or a managed handoff), and outbound calls occur while that context is still active.
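To make that concrete, here is a small, self-contained Java sketch that injects the current span into a header map using the standard W3C propagator. It produces the same traceparent header that automatic instrumentation attaches to outbound HTTP calls; the tracer name and map carrier are illustrative, and an OpenTelemetry SDK or the Java agent must be active for the context to be valid.
```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.HashMap;
import java.util.Map;

public class TraceparentInjection {
  public static void main(String[] args) {
    Tracer tracer = GlobalOpenTelemetry.getTracer("demo");
    Span span = tracer.spanBuilder("outbound-call").startSpan();

    Map<String, String> headers = new HashMap<>();
    try (Scope ignored = span.makeCurrent()) {
      // Writes "traceparent: 00-<trace-id>-<span-id>-<flags>" into the carrier,
      // provided the current span context is valid (i.e. a real SDK is configured).
      W3CTraceContextPropagator.getInstance().inject(Context.current(), headers, Map::put);
    } finally {
      span.end();
    }
    System.out.println(headers.get("traceparent"));
  }
}
```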
Dapr already supports these patterns. To make the contrast clearer, the next two diagrams show how W3C trace context normally flows cleanly between services and how that flow breaks once execution moves into the long-lived gRPC stream. The first diagram shows the ideal propagation path, while the second highlights the loss of the parent–child relationship inside a workflow activity, despite sidecars correctly propagating W3C trace context across HTTP, gRPC, and pub/sub calls.
Workflows are different. Because the application and the sidecar communicate through a single server-streaming RPC, there is no chance to attach gRPC metadata on a per-activity basis. That metadata is only sent when the stream is created, long before any specific activity begins. There is nowhere to put a traceparent for a particular workflow step because no new RPC call is made, and even if we could deliver the context across the stream, it would still have to cross a thread boundary inside the Java application. The Java SDK receives work items on a gRPC receiver thread but executes the activity code on a worker thread. The OpenTelemetry Java agent largely relies on thread-local context. Without manually restoring the trace context into the worker thread, the agent will not pick it up. It will happily start a new trace for every outgoing HTTP call. The detailed workflow timeline diagram that follows captures this problem in depth: the sidecar creates a valid span, but once the activity request crosses the gRPC stream and lands on a different thread, the parent context is lost entirely, causing the activity’s outbound calls to appear in separate traces.
The combined effect is simple but painful. Even though the Dapr sidecar has a valid span for the workflow activity, the Java app never sees it. The activity code runs without a parent span, and the trace chain breaks.
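The thread-hop half of the problem is easy to reproduce in plain Java, completely outside Dapr. In this sketch (names are illustrative, and it assumes an OpenTelemetry SDK or the Java agent is active), the span that is current on the receiving thread is invisible on the worker thread unless the context is explicitly carried across:
```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadHopDemo {
  public static void main(String[] args) {
    Tracer tracer = GlobalOpenTelemetry.getTracer("demo");
    ExecutorService worker = Executors.newSingleThreadExecutor();

    Span received = tracer.spanBuilder("work-item-received").startSpan();
    try (Scope ignored = received.makeCurrent()) {
      // Lost: the task runs on another thread, where Span.current() is invalid,
      // so any span started there becomes the root of a brand-new trace.
      worker.submit(() -> System.out.println(Span.current().getSpanContext().isValid()));

      // Preserved: wrapping the task carries the current context to the worker thread.
      worker.submit(Context.current().wrap(
          () -> System.out.println(Span.current().getSpanContext().isValid())));
    } finally {
      received.end();
      worker.shutdown();
    }
  }
}
```
This is exactly the hop between the gRPC receiver thread and the activity worker thread that durabletask-java has to bridge.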
We wanted to fix this without requiring developers to write custom tracing code. Ideally, enabling the OpenTelemetry Java agent should be enough.
The three key changes we contributed
1. Adding W3C Trace Context to activity messages inside the sidecar
The first challenge was giving the Java SDK a way to know the parent span for each activity. Since gRPC streaming cannot use metadata per message, we embedded the trace context directly into the ActivityRequest protobuf that the sidecar sends to the application.
In the durabletask-go library, we added logic that captures the current span in the sidecar when an activity is scheduled. From that span we extract the trace ID, span ID, trace flags, and tracestate. We serialize them into a standard W3C traceparent string and attach it to the ActivityRequest payload.
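The change itself is written in Go inside durabletask-go, but the logic is small enough to sketch in Java so you can see the format we emit. The helper below is ours for illustration, not part of any SDK:
```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;

public class TraceparentBuilder {
  // Builds a W3C traceparent string: version "00", 32-hex-char trace ID,
  // 16-hex-char span ID, and 2-hex-char trace flags. Any tracestate is
  // carried separately, mirroring the W3C tracestate header.
  static String traceparentFor(Span span) {
    SpanContext sc = span.getSpanContext();
    return String.format("00-%s-%s-%s",
        sc.getTraceId(), sc.getSpanId(), sc.getTraceFlags().asHex());
  }
}
```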
By treating the message body itself as the carrier, we avoid any transport limitations and stay fully aligned with OpenTelemetry and W3C specifications. This guarantees interoperability across languages and SDKs.
2. Repairing workflow span relationships inside Dapr
Next, we improved how Dapr creates spans for workflow orchestration. Workflows need a clear parent-child hierarchy to provide a good debugging experience.
In the dapr repository, we changed the workflow runtime so that:
- The orchestrator span becomes the parent of the activity consumer span
- The activity consumer span becomes the parent of the user activity’s work item
This hierarchy ensures that the trace context we embed in the ActivityRequest matches the actual parent span Dapr uses internally.
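Conceptually, the hierarchy looks like the sketch below. The real spans are created by the Go workflow runtime inside daprd; the Java code and span names here only illustrate the explicit parent-child chain:
```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;

public class WorkflowSpanHierarchy {
  public static void main(String[] args) {
    Tracer tracer = GlobalOpenTelemetry.getTracer("workflow-spans-demo");

    // Orchestrator span: the root of the workflow portion of the trace.
    Span orchestrator = tracer.spanBuilder("orchestration: OrderWorkflow").startSpan();

    // Activity consumer span, explicitly parented to the orchestrator span.
    Span consumer = tracer.spanBuilder("consume-activity: NotifyActivity")
        .setParent(Context.current().with(orchestrator))
        .startSpan();

    // User activity work item, explicitly parented to the consumer span.
    Span workItem = tracer.spanBuilder("activity: NotifyActivity")
        .setParent(Context.current().with(consumer))
        .startSpan();

    workItem.end();
    consumer.end();
    orchestrator.end();
  }
}
```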
3. Restoring trace context inside the Java workflow SDK
The final step happens inside the application, in durabletask-java. The SDK now looks for the W3C trace context inside the ActivityRequest sent by the sidecar. When it receives an activity to execute, it parses the traceparent string, reconstructs the SpanContext, and makes it current on the worker thread that will run the activity function.
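The core of that logic looks roughly like the sketch below. The traceparent accessor on the ActivityRequest and the helper name are placeholders for illustration; what matters is the extract-then-makeCurrent pattern wrapped around the activity execution:
```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import java.util.Map;
import java.util.function.Supplier;

public class ActivityTraceRestorer {
  // Minimal getter that reads keys out of a plain Map carrier.
  private static final TextMapGetter<Map<String, String>> GETTER = new TextMapGetter<>() {
    @Override public Iterable<String> keys(Map<String, String> carrier) { return carrier.keySet(); }
    @Override public String get(Map<String, String> carrier, String key) { return carrier.get(key); }
  };

  // Runs the activity body on the worker thread with the sidecar's trace context restored.
  // 'traceparent' is the W3C string carried inside the ActivityRequest.
  static Object runWithRestoredContext(String traceparent, Supplier<Object> activityBody) {
    Context parent = W3CTraceContextPropagator.getInstance()
        .extract(Context.root(), Map.of("traceparent", traceparent), GETTER);
    try (Scope ignored = parent.makeCurrent()) {
      // While this scope is open, the OpenTelemetry Java agent parents any new spans
      // (for example, outbound HTTP client spans) under the sidecar's activity span.
      return activityBody.get();
    }
  }
}
```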
This is the critical moment. Once the context is active, the OpenTelemetry Java agent sees it and automatically attaches the workflow activity span as the parent for outgoing HTTP calls. Developers do not need to add any tracing code. They write their activity function the same way they always have, and tracing simply works.
A demo version of this logic is included in the branch we used for real-world testing.
With these three pieces in place, the trace chain remains intact across all boundaries: sidecar to SDK, SDK to user code, and user code out to other services.
Seeing the results in a real workflow
To validate the changes end to end, we extended a demo that Mauricio originally created to showcase how Dapr Workflows operate in a real application. The screenshot shows the simple pizza-ordering UI which drives the workflow. Each transition in the order lifecycle triggers workflow steps and service calls, and with tracing enabled, every one of these transitions now appears as part of the same trace.
The workflow drives the full lifecycle of a pizza order, moving it forward step by step and invoking the necessary services as it progresses. Each of these workflow steps executes inside the Java application and may perform outbound calls as part of the business process. With our tracing improvements in place, all of those calls now appear as part of a single, continuous trace that follows the order end to end. The workflow diagram that follows makes this easier to reason about. It shows the order moving through each step of the process, from creation to completion, and gives you a structural view of the orchestration that now underpins the trace. Each box in the diagram corresponds to a workflow step that is also represented as a span in the final trace.
With the OpenTelemetry Operator auto-instrumenting the Java service and the sidecar exporting its built-in workflow spans via OTLP to the collector, we can now see the entire workflow as one continuous trace. It is important to distinguish between these two roles. Dapr does not rely on the OpenTelemetry Operator to produce workflow spans - those spans are emitted by Dapr itself using its built-in OpenTelemetry instrumentation. The Operator is used here only to automatically instrument the Java application, so that activity code and outbound HTTP calls participate in the same trace without requiring manual tracing code. In other words, Dapr workflows are traceable even without the OpenTelemetry Operator. The Operator simply ensures that user code seamlessly joins the trace that Dapr has already started.
The frontend request starts the trace. The workflow orchestration spans sit underneath it. Each activity sits under the workflow span, and the HTTP calls made by these activities appear as direct children of the activity span. Below is a screenshot that shows this structure as a waterfall trace view. At the top you see the incoming POST request, under it the workflow spans, and beneath those the spans for the HTTP calls made by each step. The waterfall makes the parent-child relationships obvious and turns what used to be disjoint traces into a single readable execution story.
What used to be three or four separate traces is now one clear narrative. It becomes possible to follow a customer order from the moment it is submitted all the way through the business logic that processes it. Below is another view of the trace which adds a performance angle to the story. Because the workflow spans, activity spans, and HTTP client spans now all share the same context, the flamegraph clearly shows which parts of the workflow dominate the total duration and which steps are essentially negligible, making it much easier to spot bottlenecks in the overall process.
This is exactly what we hoped to achieve when we started this work.
Why we chose standards over shortcuts
Throughout this effort, we deliberately avoided creating any new, bespoke tracing mechanism. Everything is based on standard W3C TraceContext and standard OpenTelemetry APIs.
We used W3C traceparent strings as the propagation format because they are vendor-neutral and already used by Dapr. We used OpenTelemetry’s context APIs because they are portable across languages. We relied on the OpenTelemetry Collector and Operator so that application developers do not need to think about tracing internals.
This makes the approach predictable, debuggable, and future-proof. It also means that other languages and SDKs can implement the same pattern. Nothing here is Go-specific or Java-specific. Any Dapr SDK could read a traceparent from the workflow message and make it current before running an activity.
What’s next
This work currently lives in feature branches, although our goal is to work with the Dapr community to upstream a clean and robust version. We still want to test multi-language workflows, multi-step orchestrations, and long-running workflows that span hours or days.
We also want to understand how these techniques apply to other async Dapr building blocks, such as pub/sub with dead-letter queues, stateful services, and orchestrations that combine workflows with events.
But the core problem is now solved. We have proven that Dapr workflows can participate fully in OpenTelemetry traces, even across process boundaries, queue-like communication, and thread hops. Most importantly, we did it without requiring the developer to pass trace IDs around manually.
As platform engineers, that is exactly the kind of invisible plumbing we want to take off the developers’ plate.
Want to follow along or contribute?
If you want to explore the code, all branches are publicly available:
This has been one of the most fun and rewarding explorations I’ve done recently. Huge thanks to Mauricio for diving deep into this with me. We learned a lot about Dapr, OpenTelemetry, and the subtle challenges of async workflow engines. And we are only getting started.
If you try this out, find something odd, or have an idea for extending the approach, I would love to hear from you.