Skip to content

KAFKA-20247: Controller registration needs to retry after request timeout#21619

Merged
jsancio merged 9 commits intoapache:trunkfrom
kevin-wu24:KAFKA-20247
Mar 16, 2026
Merged

KAFKA-20247: Controller registration needs to retry after request timeout#21619
jsancio merged 9 commits intoapache:trunkfrom
kevin-wu24:KAFKA-20247

Conversation

@kevin-wu24
Copy link
Copy Markdown
Contributor

@kevin-wu24 kevin-wu24 commented Mar 3, 2026

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio jsancio@apache.org

@github-actions github-actions bot added triage PRs from the community core Kafka Broker small Small PRs labels Mar 3, 2026
Copy link
Copy Markdown
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @kevin-wu24

@jsancio jsancio self-assigned this Mar 3, 2026
@github-actions github-actions bot removed the triage PRs from the community label Mar 4, 2026
@github-actions github-actions bot removed the small Small PRs label Mar 9, 2026
@github-actions github-actions bot added the small Small PRs label Mar 9, 2026
Copy link
Copy Markdown
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution LGTM. Just some minor suggestions in the test.


private def newControllerRegistrationManager(
context: RegistrationTestContext,
exponentialBackoff: ExponentialBackoff = new ExponentialBackoff(1, 2, 100, 0.02)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid having default values in the parameter. There are many benefits to this including easier translation to Java in the near future. Looking at the imports it looks like this class can be easily translated to Java in a future PR.

val manager = newControllerRegistrationManager(context)
// Use a large retry backoff with no jitter so we can reliably observe the
// intermediate state after receiving the error response before the scheduled retry fires.
val retryBackoffMs = 1000L
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment true? You changed the time used by KafkaEventTime to be a mocked time.

Copy link
Copy Markdown
Contributor Author

@kevin-wu24 kevin-wu24 Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the retry timeout can be smaller, but the expression (long) (jitter * initial_interval) evaluated as part of ExponentialBackoff#backoff MUST be > 0 for the retry logic to be deterministic because we append an event to observe the intermediate rpcStats state. The values of initial_interval = 1 and jitter = 0.02 do not satisfy this, because half the time the event gets scheduled immediately with backoff = 0, so pendingRpc can get set to true before rpcStats event can observe it getting set to false.

Not really sure if this is a bug with the class. It doesn't seem to me that this behavior is "wrong," just annoying to write a test around if you pick initial_interval = 1 like was chosen here.

Comment on lines +293 to +295
// Use a large retry backoff with no jitter so we can reliably observe the
// intermediate state after timeout before the scheduled retry fires.
val retryBackoffMs = 1000L
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here. Now the time is mocked. Reschedule event should be more reliable.

Copy link
Copy Markdown
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix

@jsancio jsancio merged commit 32c0e1c into apache:trunk Mar 16, 2026
26 checks passed
jsancio pushed a commit that referenced this pull request Mar 16, 2026
…eout (#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
jsancio pushed a commit that referenced this pull request Mar 16, 2026
…eout (#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
jsancio pushed a commit that referenced this pull request Mar 16, 2026
…eout (#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
gabriellefu pushed a commit to gabriellefu/kafka that referenced this pull request Mar 30, 2026
…eout (apache#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
Shekharrajak pushed a commit to Shekharrajak/kafka that referenced this pull request Mar 31, 2026
…eout (apache#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
nileshkumar3 pushed a commit to nileshkumar3/kafka that referenced this pull request Apr 15, 2026
…eout (apache#21619)

Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.

Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.

Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.

Reviewers: José Armando García Sancio <jsancio@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved core Kafka Broker small Small PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants