fix(pd): add timeout and null-safety to getLeaderGrpcAddress()#2961
bitflicker64 wants to merge 3 commits into apache:master
Conversation
The bolt RPC call in getLeaderGrpcAddress() returns null in Docker bridge network mode, causing NPE when a follower PD node attempts to discover the leader's gRPC address. This breaks store registration and partition distribution when any node other than pd0 wins the raft leader election. Add a bounded timeout using the configured rpc-timeout, null-check the RPC response, and fall back to deriving the address from the raft endpoint IP when the RPC fails. Closes apache#2959
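The approach described above (bounded timeout, null-check, fallback to the raft endpoint IP) can be sketched with plain `java.util.concurrent` types. Everything below — class name, method shape, addresses, ports — is a hypothetical stand-in, not the PR's actual `RaftEngine` code, and the review comments further down argue the IP+local-port fallback can misroute:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the PR's change: bound the bolt RPC wait, null-check
// the response, and fall back to raft endpoint IP + configured gRPC port.
public class LeaderAddressResolver {

    static String resolveLeaderAddress(CompletableFuture<String> rpcFuture,
                                       long rpcTimeoutMs,
                                       String raftEndpointIp,
                                       int grpcPort) {
        try {
            // Bounded wait instead of an unbounded get()
            String grpcAddress = rpcFuture.get(rpcTimeoutMs, TimeUnit.MILLISECONDS);
            if (grpcAddress != null) {   // null-safety before using the response
                return grpcAddress;
            }
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            // fall through to the best-effort fallback
        }
        // Fallback: derive from raft endpoint IP + local gRPC port (best effort)
        return raftEndpointIp + ":" + grpcPort;
    }

    public static void main(String[] args) {
        // RPC succeeds: use the returned address
        System.out.println(resolveLeaderAddress(
                CompletableFuture.completedFuture("10.0.0.2:8686"),
                1000, "10.0.0.2", 8686));   // 10.0.0.2:8686

        // RPC returns null (the Docker bridge symptom): fall back
        System.out.println(resolveLeaderAddress(
                CompletableFuture.completedFuture(null),
                1000, "10.0.0.3", 8686));   // 10.0.0.3:8686

        // RPC stalls: bounded timeout fires, then fall back
        System.out.println(resolveLeaderAddress(
                new CompletableFuture<>(),  // never completes
                50, "10.0.0.4", 8686));     // 10.0.0.4:8686
    }
}
```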
How I tested:

Results with pd1 as leader: confirmed the fallback triggered in pd1 logs.

Before this fix: RPC returns null → NPE → follower PDs can't redirect requests to the leader → the cluster only worked when pd0 won leader election, since it never hit the broken code path.

After this fix: RPC failure caught with a bounded timeout → fallback to endpoint IP + gRPC port derivation → follower PDs correctly redirect to the leader regardless of which PD node wins election.

Related docker bridge networking PR: #2952
Pull request overview
This PR addresses PD follower redirects failing in Docker bridge mode by making RaftEngine.getLeaderGrpcAddress() resilient to stalled/failed bolt RPC lookups, preventing NPEs and enabling store registration/partition distribution regardless of which PD becomes Raft leader.
Changes:
- Add a bounded timeout (config.getRpcTimeout()) to the bolt RPC CompletableFuture.get(...).
- Add null-safety around the RPC response before reading getGrpcAddress().
- Add a fallback that derives the leader gRPC address from the Raft endpoint IP/host plus the configured gRPC port.
hugegraph-pd/hg-pd-core/src/main/java/org/apache/hugegraph/pd/raft/RaftEngine.java
    // Fallback: derive from raft endpoint IP + local gRPC port (best effort)
    String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
    return leaderIp + ":" + config.getGrpcPort();
This fallback assumes every PD node is configured with the same grpc.port, but nodes can advertise different grpc.port values. In this repo's own multi-node test configs, application-server1.yml, application-server2.yml, and application-server3.yml advertise 8686, 8687, and 8688 respectively, so a follower on 8687 will redirect to leader-ip:8687 even when the elected leader is actually listening on 8686 or 8688. That turns the original NPE into a silent misroute.
If we can't recover the leader's advertised gRPC endpoint here, I think it's safer to fail fast than to synthesize an address from the local port, for example:
    - return leaderIp + ":" + config.getGrpcPort();
    + } catch (TimeoutException | ExecutionException e) {
    +     throw new ExecutionException(
    +         String.format("Failed to resolve leader gRPC address for %s", raftNode.getLeaderId()),
    +         e);
    + }
A more complete fix would need a source of truth for the leader's actual grpcAddress, not the local node's port.
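The misroute this reviewer describes can be illustrated with a toy calculation. The node names and bridge-network IP below are hypothetical; the port mapping follows the test configs cited in the comment:

```java
import java.util.Map;

// Toy illustration of why deriving the leader address from the *local* node's
// grpc.port can misroute: each PD advertises a different port, so a follower's
// own port need not match the leader's.
public class MisrouteDemo {
    public static void main(String[] args) {
        // grpc.port per node, as in application-server{1,2,3}.yml
        Map<String, Integer> grpcPorts = Map.of(
                "pd0", 8686, "pd1", 8687, "pd2", 8688);

        String leaderIp = "172.18.0.2";   // hypothetical bridge-network IP
        String leader   = "pd0";          // actually listening on 8686
        String follower = "pd1";          // local grpc.port is 8687

        // The criticized fallback: leaderIp + ":" + local config.getGrpcPort()
        String derived = leaderIp + ":" + grpcPorts.get(follower);
        String actual  = leaderIp + ":" + grpcPorts.get(leader);

        System.out.println(derived);   // 172.18.0.2:8687  (silent misroute)
        System.out.println(actual);    // 172.18.0.2:8686
    }
}
```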
- Cache leader PeerId after waitingForLeader() and null-check to avoid NPE when leader election times out
- Remove incorrect fallback that derived leader gRPC address from local node's port, causing silent misroutes in multi-node clusters
- Wire config.getRpcTimeout() into RaftRpcClient's RpcOptions so Bolt transport timeout is consistent with future.get() caller timeout
- Replace hardcoded 10000ms in waitingForLeader() with config.getRpcTimeout()
- Remove unused RaftOptions variable and dead imports (ReplicatorGroup, ThreadId)

Fixes apache#2959
Related to apache#2952, apache#2962
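The waitingForLeader() timeout change listed above can be sketched with stdlib types only. The method shape, supplier parameter, and names here are hypothetical stand-ins, not the actual RaftEngine code:

```java
import java.util.function.Supplier;

// Hypothetical sketch of waitingForLeader() with the hardcoded 10000 ms replaced
// by a configured rpc-timeout: poll for a leader until the deadline expires.
public class LeaderWait {
    static boolean waitingForLeader(Supplier<Object> leaderSupplier, long rpcTimeoutMs) {
        long deadline = System.currentTimeMillis() + rpcTimeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (leaderSupplier.get() != null) {
                return true;   // a leader was elected within the timeout
            }
            try {
                Thread.sleep(10);   // back off briefly between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // Per the commit above, the caller must null-check the cached leader
        // PeerId after a timeout instead of assuming election succeeded.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(waitingForLeader(() -> "pd1", 100));   // true
        System.out.println(waitingForLeader(() -> null, 50));     // false
    }
}
```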
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
    } catch (TimeoutException | ExecutionException e) {
        // TODO: a more complete fix would need a source of truth for the leader's
        // actual grpcAddress rather than deriving it from the local node's port config.
        throw new ExecutionException(
            String.format("Failed to resolve leader gRPC address for %s", leader), e);
    }
    log.warn("Leader gRPC address field is null in RPC response for {}", leader);
    throw new ExecutionException(
        new IllegalStateException(
            String.format("Leader gRPC address unavailable for %s", leader)));
    } catch (TimeoutException | ExecutionException e) {
        // TODO: a more complete fix would need a source of truth for the leader's
        // actual grpcAddress rather than deriving it from the local node's port config.
        throw new ExecutionException(
            String.format("Failed to resolve leader gRPC address for %s", leader), e);
    }
    nodeOptions.setEnableMetrics(true);
Purpose of the PR
In a 3-node PD cluster running in Docker bridge network mode, getLeaderGrpcAddress() makes a bolt RPC call to discover the leader's gRPC address when the current node is a follower. This call fails in bridge mode: the TCP connection establishes but the bolt RPC response never returns properly, causing CompletableFuture.get() to return null and throw an NPE.

This causes:
- redirectToLeader() fails with NPE
- partition distribution never completes (partitionCount:0)
- clients loop indefinitely on DEADLINE_EXCEEDED

The cluster only works when pd0 wins raft leader election (since isLeader() returns true and the broken code path is skipped). If pd1 or pd2 wins, the NPE fires on every redirect attempt.

Related PR: #2952
Main Changes
- Use config.getRpcTimeout() instead of an unbounded .get()
- Null-check the RPC response before reading getGrpcAddress()
- Add TimeUnit and TimeoutException imports

Verifying these changes
- partitionCount:12 on all 3 nodes when pd1 or pd2 is leader
- getLeaderGrpcAddress

Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need