Local activity executor pool exhaustion causes workflow to hang permanently, with no way to recover without a reset #2823

@tsurdilo

Description

Describe the bug

When the local activity executor thread pool (maxConcurrentLocalActivityExecutionSize) is fully exhausted by workflows whose local activities are blocked on a hung external call (e.g. a signalWithStart gRPC call that never returns), any subsequently scheduled workflow that attempts to execute a local activity hangs permanently and irrecoverably.
The only ways to recover are to set a ScheduleToClose timeout in LocalActivityOptions or to reset the workflow execution (a best-effort measure, since the same exhaustion can recur).
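For reference, the pool in question is sized per worker. A configuration sketch (requires the Temporal Java SDK; the value shown is illustrative, not from the repro):

```java
import io.temporal.worker.WorkerOptions;

// The thread pool whose exhaustion triggers the hang is sized here.
// Every local activity on this worker competes for these threads.
WorkerOptions workerOptions =
    WorkerOptions.newBuilder()
        .setMaxConcurrentLocalActivityExecutionSize(200) // example value
        .build();
```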

Root cause
Local activities are dispatched to an in-process thread pool. If all threads are occupied by blocked activities, newly dispatched local activities are queued but never executed. The workflow coroutine thread parks on CompletablePromiseImpl.get waiting for a result that will never arrive.

Executions in this state cannot be recovered because the workflow task never times out or fails, so it is never retried. This is very similar to awaiting a condition whose signal never arrives.
The only way to resume is to reset the execution.
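The mechanism can be reproduced without the SDK at all. A minimal sketch using a plain fixed thread pool standing in for the local activity executor (all names here are illustrative):

```java
import java.util.concurrent.*;

public class PoolExhaustionDemo {

    // Returns true when a freshly submitted task is queued but never
    // executed because every pool thread is blocked forever.
    static boolean reproduceHang() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            // Exhaust the pool: both workers block on a call that never
            // returns, like a local activity stuck on a hung gRPC call.
            for (int i = 0; i < 2; i++) {
                pool.submit(() -> {
                    neverReleased.await();
                    return null;
                });
            }
            // The "victim" task sits in the queue; its future never
            // completes, just as the workflow coroutine thread parks
            // on CompletablePromiseImpl.get.
            Future<String> victim = pool.submit(() -> "done");
            try {
                victim.get(2, TimeUnit.SECONDS);
                return false; // pool was not actually exhausted
            } catch (TimeoutException expected) {
                return true; // hang reproduced: task queued, never run
            }
        } finally {
            neverReleased.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reproduceHang()
            ? "victim task still queued: hang reproduced"
            : "victim task ran");
    }
}
```

The SDK-level behavior is the same shape: the queued local activity has no timeout tied to queue residency by default, so nothing ever unblocks the waiting workflow thread.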

Steps to reproduce
Self-contained JUnit repro test (two tests — one proves the hang, one verifies the scheduleToCloseTimeout workaround):
👉 https://gist.github.com/tsurdilo/ad0ab99dc36f63e26ea968ca25347763
The test uses ServerSocket.accept() to simulate a hung gRPC call without any real network dependency. It runs deterministically in ~20 seconds.

Test 1 — testPoolExhaustionCausesVictimWorkflowToHang
Confirms the hang. After the pool is exhausted, the victim workflow's event history ends at 6 events with no MARKER_RECORDED, and __stack_trace shows:
workflow-method-victim-workflow-...: (BLOCKED on Feature.get)
io.temporal.internal.sync.WorkflowThreadScheduler.yieldLocked(WorkflowThreadScheduler.java:37)
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:70)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:378)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:27)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:65)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:55)
io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:25)
io.temporal.internal.sync.LocalActivityInvocationHandler.lambda$getActivityFunc$0(LocalActivityInvocationHandler.java:59)
io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:48)

Test 2 — testScheduleToCloseTimeoutBypassesHang
Confirms the workaround. With scheduleToCloseTimeout(5s) set, the victim fails cleanly with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE exactly 5 seconds after scheduling, even though the pool is fully exhausted and the activity never gets a thread. The workflow does not hang.

Workaround

Set scheduleToCloseTimeout on LocalActivityOptions. The SDK tracks this deadline from the moment the activity is scheduled, independently of whether a pool thread is available. The workflow will fail with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE rather than hang.
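A configuration sketch of the workaround (requires the Temporal Java SDK; the 5-second value matches the repro test above):

```java
import io.temporal.activity.LocalActivityOptions;
import java.time.Duration;

// scheduleToCloseTimeout is enforced from schedule time, independent of
// whether an executor thread ever becomes available, so the activity
// fails with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE instead of hanging the
// workflow when the pool is exhausted.
LocalActivityOptions options =
    LocalActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofSeconds(5))
        .build();
```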
