Local activity executor pool exhaustion causes workflow to hang permanently, with no way to recover without a reset #2823

@tsurdilo

Description

Describe the bug

When the local activity executor thread pool (maxConcurrentLocalActivityExecutionSize) is fully exhausted by workflows whose local activities are blocked on a hung external call (e.g. a signalWithStart gRPC call that never returns), any subsequently scheduled workflow that attempts to execute a local activity hangs permanently and irrecoverably.
The only ways to recover are to set a ScheduleToClose timeout in LocalActivityOptions or to reset the workflow execution (a best-effort measure, since the same exhaustion can recur).
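For reference, the pool in question is sized per worker. A configuration sketch (requires the Temporal Java SDK; the value shown is illustrative, not from the repro):

```java
import io.temporal.worker.WorkerOptions;

// The thread pool whose exhaustion triggers the hang is sized here.
// Every local activity on this worker competes for these threads.
WorkerOptions workerOptions =
    WorkerOptions.newBuilder()
        .setMaxConcurrentLocalActivityExecutionSize(200) // example value
        .build();
```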

Root cause
Local activities are dispatched to an in-process thread pool. If all threads are occupied by blocked activities, newly dispatched local activities are queued but never executed. The workflow coroutine thread parks on CompletablePromiseImpl.get waiting for a result that will never arrive.

Executions in this state cannot be recovered because the workflow task never times out or fails, so it is never retried. This is very similar to awaiting a condition whose signal never arrives.
The only way to resume is to reset the execution.
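The mechanism can be reproduced without the SDK at all. A minimal sketch using a plain fixed thread pool standing in for the local activity executor (all names here are illustrative):

```java
import java.util.concurrent.*;

public class PoolExhaustionDemo {

    // Returns true when a freshly submitted task is queued but never
    // executed because every pool thread is blocked forever.
    static boolean reproduceHang() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            // Exhaust the pool: both workers block on a call that never
            // returns, like a local activity stuck on a hung gRPC call.
            for (int i = 0; i < 2; i++) {
                pool.submit(() -> {
                    neverReleased.await();
                    return null;
                });
            }
            // The "victim" task sits in the queue; its future never
            // completes, just as the workflow coroutine thread parks
            // on CompletablePromiseImpl.get.
            Future<String> victim = pool.submit(() -> "done");
            try {
                victim.get(2, TimeUnit.SECONDS);
                return false; // pool was not actually exhausted
            } catch (TimeoutException expected) {
                return true; // hang reproduced: task queued, never run
            }
        } finally {
            neverReleased.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reproduceHang()
            ? "victim task still queued: hang reproduced"
            : "victim task ran");
    }
}
```

The SDK-level behavior is the same shape: the queued local activity has no timeout tied to queue residency by default, so nothing ever unblocks the waiting workflow thread.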

Steps to reproduce
Self-contained JUnit repro test (two tests — one proves the hang, one verifies the scheduleToCloseTimeout workaround):
👉 https://gist.github.com/tsurdilo/ad0ab99dc36f63e26ea968ca25347763
The test uses ServerSocket.accept() to simulate a hung gRPC call without any real network dependency. It runs deterministically in ~20 seconds.

Test 1 — testPoolExhaustionCausesVictimWorkflowToHang
Confirms the hang. After the pool is exhausted, the victim workflow's event history ends at 6 events with no MARKER_RECORDED, and __stack_trace shows:
workflow-method-victim-workflow-...: (BLOCKED on Feature.get)
io.temporal.internal.sync.WorkflowThreadScheduler.yieldLocked(WorkflowThreadScheduler.java:37)
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:70)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:378)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:27)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:65)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:55)
io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:25)
io.temporal.internal.sync.LocalActivityInvocationHandler.lambda$getActivityFunc$0(LocalActivityInvocationHandler.java:59)
io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:48)

Test 2 — testScheduleToCloseTimeoutBypassesHang
Confirms the workaround. With scheduleToCloseTimeout(5s) set, the victim fails cleanly with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE exactly 5 seconds after scheduling, even though the pool is fully exhausted and the activity never gets a thread. The workflow does not hang.

Workaround

Set scheduleToCloseTimeout on LocalActivityOptions. The SDK tracks this deadline from the moment the activity is scheduled, independently of whether a pool thread is available. The workflow will fail with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE rather than hang.
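A configuration sketch of the workaround (requires the Temporal Java SDK; the 5-second value matches the repro test above):

```java
import io.temporal.activity.LocalActivityOptions;
import java.time.Duration;

// scheduleToCloseTimeout is enforced from schedule time, independent of
// whether an executor thread ever becomes available, so the activity
// fails with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE instead of hanging the
// workflow when the pool is exhausted.
LocalActivityOptions options =
    LocalActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofSeconds(5))
        .build();
```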
