Skip to content

fix(pd): resolve hostname entries in IpAuthHandler allowlist#2962

Open
bitflicker64 wants to merge 4 commits intoapache:masterfrom
bitflicker64:fix-ipauthhandler-dns-resolution
Open

fix(pd): resolve hostname entries in IpAuthHandler allowlist#2962
bitflicker64 wants to merge 4 commits intoapache:masterfrom
bitflicker64:fix-ipauthhandler-dns-resolution

Conversation

@bitflicker64
Copy link
Contributor

@bitflicker64 bitflicker64 commented Mar 5, 2026

Purpose of the PR

IpAuthHandler only compared the client IP with the allowlist entries directly.
When the allowlist contains hostnames, connections from their resolved IPs could be rejected.

Main Changes

  • Resolve hostname allowlist entries using InetAddress.getAllByName
  • Store resolved addresses in a set used for connection validation
  • Keep runtime validation as a simple Set.contains() lookup

Verifying these changes

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

@bitflicker64
Copy link
Contributor Author

How I tested:

  1. Built a local Docker image from source with this fix applied (alongside the getLeaderGrpcAddress fix from fix(pd): add timeout and null-safety to getLeaderGrpcAddress() #2961)
  2. Removed static IP workaround from docker-compose — switched back to hostnames (pd0:8610, pd1:8610, pd2:8610) for all raft peers, gRPC hosts, and PD addresses
  3. Brought up the full 3-node cluster (3 PD + 3 Store + 3 Server) in bridge network mode with no ipam static IP config
  4. pd2 won leader election on first boot

Results with pd2 as leader and hostnames only:

partitionCount:12 on all 3 stores ✅
leaderCount:12 on all 3 stores ✅
{"graphs":["hugegraph"]} ✅
All 9 containers healthy ✅
No "Blocked connection" warnings in any PD logs ✅

Before this fix: IpAuthHandler compared raw hostname strings against incoming connection IPs — always failed in bridge mode, requiring static IPs as a workaround.

After this fix: Hostnames are resolved to IPs at startup via InetAddress.getAllByName() — static IP workaround no longer needed.

Related PR: #2952, #2961

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes PD Raft inbound connection allowlisting when the configured allowlist contains hostnames (e.g., Docker bridge service names) by resolving those hostnames to IPs up front and validating connections via a simple set lookup.

Changes:

  • Resolve allowlist entries via InetAddress.getAllByName() at handler construction time
  • Cache resolved IPs in an immutable Set used by isIpAllowed()
  • Log a warning when an allowlist hostname can’t be resolved

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…r changes

- Resolve allowlist hostnames to IPs using InetAddress.getAllByName
- Add refresh() to update resolved IPs when Raft peer list changes
- Wire refresh into RaftEngine.changePeerList()
- Add IpAuthHandlerTest covering hostname resolution, refresh behavior, and failure cases
@bitflicker64
Copy link
Contributor Author

While working on this change, I noticed a couple of related behaviors that appear to predate this PR. The /v1/members/change endpoint relies on RestAuthentication, where Authentication.authenticate() currently validates only the username portion of Basic Auth. I also noticed that GrpcAuthentication does not appear to be wired into the gRPC server. These are outside the scope of this fix, but I thought it might be helpful to mention in case they are worth tracking separately.
Also worth noting: IpAuthHandler.refresh() is wired through RaftEngine.changePeerList(), but PDService.updatePdRaft() calls node.changePeers() directly, so the refresh won't fire for that path.

@bitflicker64 bitflicker64 requested a review from imbajin March 12, 2026 07:25
bitflicker64 added a commit to bitflicker64/incubator-hugegraph that referenced this pull request Mar 12, 2026
- Cache leader PeerId after waitingForLeader() and null-check to avoid
  NPE when leader election times out
- Remove incorrect fallback that derived leader gRPC address from local
  node's port, causing silent misroutes in multi-node clusters
- Wire config.getRpcTimeout() into RaftRpcClient's RpcOptions so Bolt
  transport timeout is consistent with future.get() caller timeout
- Replace hardcoded 10000ms in waitingForLeader() with config.getRpcTimeout()
- Remove unused RaftOptions variable and dead imports (ReplicatorGroup, ThreadId)

Fixes apache#2959
Related to apache#2952, apache#2962
@imbajin imbajin requested a review from Copilot March 13, 2026 09:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@imbajin
Copy link
Member

imbajin commented Mar 13, 2026

While working on this change, I noticed a couple of related behaviors that appear to predate this PR. The /v1/members/change endpoint relies on RestAuthentication, where Authentication.authenticate() currently validates only the username portion of Basic Auth. I also noticed that GrpcAuthentication does not appear to be wired into the gRPC server. These are outside the scope of this fix, but I thought it might be helpful to mention in case they are worth tracking separately. Also worth noting: IpAuthHandler.refresh() is wired through RaftEngine.changePeerList(), but PDService.updatePdRaft() calls node.changePeers() directly, so the refresh won't fire for that path.

Some context about the auth system in PD/Store:

To put it simply, PD/Store didn't have a strict auth mechanism due to legacy design decisions. Since these and other gRPC components were meant to be internal-only, only the graph server was originally built with a formal public-facing auth interface. We later added a basic, quick-and-dirty auth layer for security, but the current implementation remains incomplete. We plan to systematically refactor this later, but for now, let's just add TODOs in the relevant places

@bitflicker64
Copy link
Contributor Author

We plan to systematically refactor this later, but for now, let's just add TODOs in the relevant places

Thanks for the context, that makes sense. I’ll add the TODOs in the relevant places and work through the reviewed suggestions. Apologies for the delay as my Mac broke so I had to shift my setup to Ubuntu

- Wire IpAuthHandler.refresh() into PDService.updatePdRaft() which
  calls node.changePeers() directly, bypassing RaftEngine.changePeerList()
- Include learner IPs in refresh since updatePdRaft() supports learner peers
- Add @before setUp() to IpAuthHandlerTest to prevent stale singleton
  from earlier suite classes poisoning test assertions
- Strengthen testUnresolvableHostnameDoesNotCrash assertions
- Replace testGetInstanceReturnsNullBeforeInit with
  testGetInstanceReturnsSingletonIgnoresNewAllowlist
- Document missing auth check on MemberAPI changePeerList endpoint
- Note that password validation is currently skipped in Authentication
- Flag that GrpcAuthentication interceptor is not wired into the gRPC server
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +56 to +62
public void testHostnameResolvesToIp() {
// "localhost" should resolve to "127.0.0.1" via InetAddress.getAllByName()
// This verifies the core fix: hostname allowlists match numeric remote addresses
IpAuthHandler handler = IpAuthHandler.getInstance(
Collections.singleton("localhost"));
Assert.assertTrue(isIpAllowed(handler, "127.0.0.1"));
}
Comment on lines +65 to +73
public void testUnresolvableHostnameDoesNotCrash() {
// Should log a warning and skip — no exception thrown during construction
IpAuthHandler handler = IpAuthHandler.getInstance(
Collections.singleton("nonexistent.invalid.hostname"));
// Handler was still created successfully despite bad hostname
Assert.assertNotNull(handler);
// Unresolvable entry is skipped so no IPs should be allowed
Assert.assertFalse(isIpAllowed(handler, "127.0.0.1"));
Assert.assertFalse(isIpAllowed(handler, "192.168.0.1"));
*/
public void refresh(Set<String> newAllowedIps) {
this.resolvedIps = resolveAll(newAllowedIps);
log.info("IpAuthHandler allowlist refreshed, resolved {} entries", resolvedIps.size());
log.info("updatePdRaft, change peers success");
// Refresh IpAuthHandler so newly added peers are not blocked
IpAuthHandler handler = IpAuthHandler.getInstance();
if (handler != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ The new tests cover IpAuthHandler.refresh() in isolation, but they do not exercise the actual regression path here: changePeers(...) -> callback -> handler.refresh(...). Could we add one regression test through the peer-list update entrypoint as well? That would lock in the real behavior this PR is fixing and prevent a future refactor from accidentally dropping the callback refresh while the current tests still stay green.

Current coverage
+--------------------+      +-------------------+
| IpAuthHandlerTest  | ---> | handler.refresh() |
+--------------------+      +-------------------+

Missing regression path
+------------------+      +----------------+      +-------------------+
| changePeerList() | ---> | changePeers()  | ---> | handler.refresh() |
+------------------+      +----------------+      +-------------------+

A small integration-style test around RaftEngine.changePeerList() or PDService.updatePdRaft() would probably be enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] IpAuthHandler blocks all cross-node raft connections in Docker bridge mode — hostname vs IP mismatch

3 participants