KAFKA-20247: Controller registration needs to retry after request timeout#21619
KAFKA-20247: Controller registration needs to retry after request timeout#21619jsancio merged 9 commits intoapache:trunkfrom
Conversation
jsancio
left a comment
There was a problem hiding this comment.
Thanks for the fix @kevin-wu24
…on the registration manager's event queue
jsancio
left a comment
There was a problem hiding this comment.
The solution LGTM. Just some minor suggestions in the test.
|
|
||
| private def newControllerRegistrationManager( | ||
| context: RegistrationTestContext, | ||
| exponentialBackoff: ExponentialBackoff = new ExponentialBackoff(1, 2, 100, 0.02) |
There was a problem hiding this comment.
Let's avoid having default values in the parameter. There are many benefits to this including easier translation to Java in the near future. Looking at the imports it looks like this class can be easily translated to Java in a future PR.
| val manager = newControllerRegistrationManager(context) | ||
| // Use a large retry backoff with no jitter so we can reliably observe the | ||
| // intermediate state after receiving the error response before the scheduled retry fires. | ||
| val retryBackoffMs = 1000L |
There was a problem hiding this comment.
Is the comment true? You changed the time used by KafkaEventTime to be a mocked time.
There was a problem hiding this comment.
I think the retry timeout can be smaller, but the expression (long) (jitter * initial_interval) evaluated as part of ExponentialBackoff#backoff MUST be > 0 for the retry logic to be deterministic because we append an event to observe the intermediate rpcStats state. The values of initial_interval = 1 and jitter = 0.02 do not satisfy this, because half the time the event gets scheduled immediately with backoff = 0, so pendingRpc can get set to true before rpcStats event can observe it getting set to false.
Not really sure if this is a bug with the class. It doesn't seem to me that this behavior is "wrong," just annoying to write a test around if you pick initial_interval = 1 like was chosen here.
| // Use a large retry backoff with no jitter so we can reliably observe the | ||
| // intermediate state after timeout before the scheduled retry fires. | ||
| val retryBackoffMs = 1000L |
There was a problem hiding this comment.
Same comment here. Now the time is mocked. Reschedule event should be more reliable.
…eout (#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
…eout (#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
…eout (#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
…eout (apache#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
…eout (apache#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
…eout (apache#21619) Set pendingRpc to false when controller registration times out. If we do not set this, controllers cannot send ControllerRegistrationRequests thereafter. Instead, subsequent calls to maybeSendControllerRegistration() will always log: "maybeSendControllerRegistration: waiting for the previous RPC to complete." The asserts on L294 and L300 fail when pendingRpc does not get set in onTimeout. Previously, RegistrationResponseHandler callbacks were invoked from the NodeToControllerRequestThread. Instead, these callbacks should append an event to the ControllerRegistrationManger event queue. Added testRetransmitRegistrationAfterTimeout. This test times out a controller registration, and checks that the registration manager channel manager's request queue state is as expected. Reviewers: José Armando García Sancio <jsancio@apache.org>
Set pendingRpc to false when controller registration times out. If we do
not set this, controllers cannot send ControllerRegistrationRequests
thereafter. Instead, subsequent calls to
maybeSendControllerRegistration() will always log:
"maybeSendControllerRegistration: waiting for the previous RPC to
complete." The asserts on L294 and L300 fail when pendingRpc does not
get set in onTimeout.
Previously, RegistrationResponseHandler callbacks were invoked from the
NodeToControllerRequestThread. Instead, these callbacks should append an
event to the ControllerRegistrationManger event queue.
Added testRetransmitRegistrationAfterTimeout. This test times out a
controller registration, and checks that the registration manager
channel manager's request queue state is as expected.
Reviewers: José Armando García Sancio jsancio@apache.org