Worker Heartbeating #2818

Open
yuandrew wants to merge 9 commits into temporalio:master from yuandrew:worker-heartbeat-final

Contributor

@yuandrew yuandrew commented Mar 27, 2026

What was changed

Implements periodic worker heartbeat RPC that reports worker status, slot usage, poller info, host metrics, and sticky cache counters to the Temporal server. Includes HeartbeatManager, PollerTracker, and integration tests covering all heartbeat fields.

  • HeartbeatManager — Per-namespace heartbeat scheduler. Workers register/unregister; scheduler fires at the configured interval. Gracefully shuts down if server returns UNIMPLEMENTED.
  • PollerTracker — Tracks in-flight poll count and last successful poll time per worker type. Only records success when a poll returns actual work.
  • WorkflowClientOptions.workerHeartbeatInterval — New option to configure heartbeat interval. Defaults to 60s. Can be set between 1-60s, or a negative duration to disable.
  • Worker.getActiveTaskQueueTypes() — Reports WORKFLOW, ACTIVITY, and NEXUS (only when Nexus services are registered, matching Go SDK).
  • Worker.buildHeartbeat() — Assembles the full WorkerHeartbeat proto with slot info, poller info, host metrics, sticky cache counters, and timestamps.
  • Shutdown: Each worker sends ShutdownWorkerRequest with a final SHUTTING_DOWN heartbeat and active task queue types.
  • NexusWorker task counters — Fixed totalProcessedTasks/totalFailedTasks to be properly incremented.
  • TrackingSlotSupplier.getSlotSupplierKind() — Reports FixedSize vs ResourceBased in heartbeats.
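The PollerTracker behavior described in the list above (in-flight poll count plus last successful poll time, with success recorded only when a poll returns actual work) could look roughly like the following. This is a minimal sketch, not the SDK's implementation: the class name `PollerTrackerSketch` and all of its method names are hypothetical, and one instance per worker type is assumed.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names and shape are assumptions, not the SDK's PollerTracker.
final class PollerTrackerSketch {
  private final AtomicInteger inFlightPolls = new AtomicInteger();
  private final AtomicReference<Instant> lastSuccessfulPoll = new AtomicReference<>();

  // Call immediately before issuing a long-poll request.
  void pollStarted() {
    inFlightPolls.incrementAndGet();
  }

  // Call when the poll returns. taskReceived is false for an empty response
  // (for example a long-poll timeout), so an idle queue never counts as success.
  void pollCompleted(boolean taskReceived) {
    inFlightPolls.decrementAndGet();
    if (taskReceived) {
      lastSuccessfulPoll.set(Instant.now());
    }
  }

  int getInFlightPolls() {
    return inFlightPolls.get();
  }

  // Returns null until the first poll that delivered a task.
  Instant getLastSuccessfulPoll() {
    return lastSuccessfulPoll.get();
  }
}
```

Keeping the two values in separate atomics means a heartbeat snapshot can read them without taking a lock, at the cost of the pair not being captured atomically together.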

Why?

New feature!

Checklist

  1. Closes Worker Heartbeating #2716

  2. How was this tested:

  • WorkerHeartbeatIntegrationTest (12 tests): End-to-end against test server — basic fields, slot info, task counters, poller info, failure metrics, interval counter reset, in-flight slot tracking, sticky cache counters/misses, multiple workers, resource-based tuner, shutdown status, disabled heartbeats
  • HeartbeatManagerTest: Unit tests for scheduler lifecycle, UNIMPLEMENTED handling, exception resilience
  • PollerTrackerTest: Unit tests for poll tracking and snapshot generation
  • TrackingSlotSupplierKindTest: Slot supplier kind detection
  • WorkerHeartbeatDeploymentVersionTest: Deployment version in heartbeats
  • All existing tests pass (./gradlew :temporal-sdk:test + spotlessCheck)
  3. Any docs updates needed?

Note

Medium Risk
Adds a new periodic heartbeat RPC path and threads that run in production workers, and changes worker shutdown requests/poller instrumentation, which could affect load and shutdown behavior if misconfigured or server capability detection is wrong.

Overview
Implements periodic worker heartbeating to Temporal Server: workers now register a per-namespace scheduled heartbeat callback that reports status plus runtime stats (slot usage, poller counts/last success, sticky cache hit/miss counts, host metrics, plugin info, and optional deployment version) and disables itself on UNIMPLEMENTED.

Introduces new tracking primitives (PollerTracker, TaskCounter, NamespaceCapabilities) and wires them through workflow/activity/nexus pollers and task handlers so heartbeats can report in-flight pollers and processed/failed task counts; TrackingSlotSupplier now exposes supplier kind and used-slot count for reporting.

Extends WorkflowClientOptions with experimental workerHeartbeatInterval (default 60s, 1–60s allowed, negative disables) and updates worker shutdown to send richer ShutdownWorkerRequest including taskQueue, workerInstanceKey, active task queue types, and a final shutting down heartbeat. CI dev-server startup enables frontend.ListWorkersEnabled to support new worker listing tests.

Reviewed by Cursor Bugbot for commit ff47c66. Bugbot is set up for automated code reviews on this repo.
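
The workerHeartbeatInterval rules stated above (default 60s, 1 to 60s allowed, negative disables) can be sketched as a small resolver. The class `HeartbeatIntervalSketch` and its `resolve` method are hypothetical, and two behaviors are assumptions the PR text does not confirm: an unset (null) value maps to the default, and any other out-of-range value, including zero, is rejected.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical sketch of interval resolution; not the SDK's actual validation code.
final class HeartbeatIntervalSketch {
  static final Duration DEFAULT_INTERVAL = Duration.ofSeconds(60);
  static final Duration MIN_INTERVAL = Duration.ofSeconds(1);
  static final Duration MAX_INTERVAL = Duration.ofSeconds(60);

  // Returns the effective interval, or Optional.empty() when heartbeating is disabled.
  static Optional<Duration> resolve(Duration configured) {
    if (configured == null) {
      return Optional.of(DEFAULT_INTERVAL); // assumption: unset means default
    }
    if (configured.isNegative()) {
      return Optional.empty(); // negative duration disables heartbeating
    }
    if (configured.compareTo(MIN_INTERVAL) < 0 || configured.compareTo(MAX_INTERVAL) > 0) {
      // assumption: out-of-range values (including zero) are rejected rather than clamped
      throw new IllegalArgumentException(
          "workerHeartbeatInterval must be between 1s and 60s, or negative to disable: "
              + configured);
    }
    return Optional.of(configured);
  }
}
```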

Implements periodic worker heartbeat RPCs that report worker status,
slot usage, poller info, and task counters to the server.

Key components:
- HeartbeatManager: per-namespace scheduler that aggregates heartbeats
  from all workers sharing that namespace
- PollerTracker: tracks in-flight poll count and last successful poll time
- WorkflowClientOptions.workerHeartbeatInterval: configurable interval
  (default 60s, range 1-60s, negative to disable)
- TrackingSlotSupplier: extended with slot type reporting
- Worker: builds SharedNamespaceWorker heartbeat data from activity,
  workflow, and nexus worker stats
- TestWorkflowService: implements recordWorkerHeartbeat, describeWorker,
  and shutdownWorker RPCs for testing
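
As an illustration of the per-namespace scheduler described in this commit message, here is a minimal sketch in which one scheduler collects a heartbeat from every registered worker, sends them in one call, and stops itself when the server reports the RPC as unimplemented. Everything in it is hypothetical: heartbeats are plain strings rather than WorkerHeartbeat protos, the RPC is a `Consumer`, and `UnimplementedException` stands in for a gRPC UNIMPLEMENTED status.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch of a per-namespace heartbeat scheduler; not the SDK's HeartbeatManager.
final class NamespaceHeartbeaterSketch {
  // Stand-in for a gRPC UNIMPLEMENTED status from a server without heartbeat support.
  static final class UnimplementedException extends RuntimeException {}

  private final Map<String, Supplier<String>> workers = new ConcurrentHashMap<>();
  private final Consumer<List<String>> rpc;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  NamespaceHeartbeaterSketch(Consumer<List<String>> rpc) {
    this.rpc = rpc;
  }

  void register(String workerKey, Supplier<String> heartbeatBuilder) {
    workers.put(workerKey, heartbeatBuilder);
  }

  void unregister(String workerKey) {
    workers.remove(workerKey);
  }

  void start(long intervalSeconds) {
    scheduler.scheduleAtFixedRate(this::tick, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  // One scheduler firing: build a heartbeat per worker and send them together.
  void tick() {
    if (workers.isEmpty()) {
      return;
    }
    List<String> batch = new ArrayList<>();
    for (Supplier<String> builder : workers.values()) {
      batch.add(builder.get());
    }
    try {
      rpc.accept(batch);
    } catch (UnimplementedException e) {
      // Server does not support heartbeats: stop scheduling further ticks.
      scheduler.shutdown();
    } catch (RuntimeException e) {
      // Swallow other failures; an exception escaping a scheduleAtFixedRate task
      // would suppress all subsequent executions.
    }
  }

  boolean isStopped() {
    return scheduler.isShutdown();
  }
}
```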
@yuandrew yuandrew force-pushed the worker-heartbeat-final branch from 9eeea9b to bfd3fc6 March 27, 2026 19:41
@yuandrew yuandrew marked this pull request as ready for review March 27, 2026 19:58
@yuandrew yuandrew requested a review from a team as a code owner March 27, 2026 19:58
Comment on lines +51 to +52
private final AtomicInteger totalProcessedTasks = new AtomicInteger();
private final AtomicInteger totalFailedTasks = new AtomicInteger();
Member

@Sushisource Sushisource Mar 30, 2026

All the workers now have something like this - can we abstract these out into something else? Seems like we could use a shared interface or base class.

Contributor Author

moved into a new TaskCounter class
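
For readers following the thread, a shared counter class of the kind discussed here might look like the sketch below. The field names come from the snippet under review; the class name `TaskCounterSketch`, the method names, and the choice to track processed and failed tasks independently are assumptions, not the actual TaskCounter from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a counter shared by workflow, activity, and nexus workers.
final class TaskCounterSketch {
  private final AtomicInteger totalProcessedTasks = new AtomicInteger();
  private final AtomicInteger totalFailedTasks = new AtomicInteger();

  // Call once for each task the worker finished handling.
  void recordProcessed() {
    totalProcessedTasks.incrementAndGet();
  }

  // Call once for each task whose handling failed.
  // Assumption: callers decide whether a failed task is also recorded as processed.
  void recordFailed() {
    totalFailedTasks.incrementAndGet();
  }

  int getTotalProcessedTasks() {
    return totalProcessedTasks.get();
  }

  int getTotalFailedTasks() {
    return totalFailedTasks.get();
  }
}
```

Each worker would hold one instance and pass it to its task handler, so the increment logic lives in a single place instead of being repeated per worker type.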

DescribeNamespaceRequest.newBuilder()
.setNamespace(workflowClient.getOptions().getNamespace())
.build());
boolean heartbeatsSupported =
Member

In my PR I've bundled a bunch of capability stuff up into a class, FYI, this will want to go in there

Contributor Author

Thanks for the heads up, had Claude pull in your new class, so merge conflict should be minimal now. Feel free to merge first

Member

Needs a review tho!

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

// Only signal shutdown — don't awaitTermination from within the scheduler's own thread
shuttingDown.set(true);
scheduler.shutdown();
return;

HeartbeatManager UNIMPLEMENTED handling prevents proper shutdown cleanup

Low Severity

When the server returns UNIMPLEMENTED, heartbeatTick calls shuttingDown.set(true) and scheduler.shutdown() (graceful). Later, when HeartbeatManager.shutdown() is called, SharedNamespaceWorker.shutdown() uses compareAndSet(false, true) which fails since shuttingDown is already true, causing it to return immediately without calling shutdownNow() or awaitTermination(). This means HeartbeatManager.shutdown() won't wait for the scheduler thread to fully terminate.
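
One possible way to address the issue described above is to track "heartbeats disabled by the server" separately from "shutdown requested by the owner", so the UNIMPLEMENTED path never consumes the flag that guards cleanup. The sketch below is a minimal illustration under that assumption, not the fix adopted in this PR; the class `SharedNamespaceWorkerSketch` and its method and field names are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the SharedNamespaceWorker from this PR.
final class SharedNamespaceWorkerSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicBoolean heartbeatsDisabled = new AtomicBoolean();
  private final AtomicBoolean shutdownRequested = new AtomicBoolean();

  // Runs on the scheduler thread when the server answers UNIMPLEMENTED.
  // It only signals: blocking in awaitTermination here would wait on its own thread.
  void onUnimplemented() {
    heartbeatsDisabled.set(true);
    scheduler.shutdown();
  }

  // Runs on the owner's thread. The compareAndSet guards only shutdownNow();
  // awaitTermination always executes, even if heartbeats were disabled earlier.
  boolean shutdown(long timeout, TimeUnit unit) {
    if (shutdownRequested.compareAndSet(false, true)) {
      scheduler.shutdownNow();
    }
    try {
      return scheduler.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  boolean isHeartbeatsDisabled() {
    return heartbeatsDisabled.get();
  }
}
```

With this split, calling `shutdown` after the UNIMPLEMENTED path still waits for the scheduler thread to terminate, which is the behavior the report says is currently skipped.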
