feat: ETS-backed LB with per-request pick, remove persistent_term writes on reconcile #527
Open
cgreeno wants to merge 25 commits into
Conversation
…tes on reconcile
Moves the client load-balancer pick onto the RPC hot path via an ETS-backed
RoundRobin strategy, so gRPC clients now round-robin across backends per
request instead of caching one pick for 15 seconds.
Behaviour changes (GRPC.Client.LoadBalancing):
- init/1 now takes :channels (full Channel structs), not :addresses
- pick/1 returns {:ok, Channel.t(), new_state} directly — no second lookup
- adds optional update/2 for in-place reconciliation and shutdown/1 for cleanup
- @optional_callbacks keeps simple strategies compliant without stubs
RoundRobin rewrite:
- ETS table (:set, :public, read_concurrency: true) owned by the Connection
- pick: :ets.update_counter/3 + :ets.lookup + elem/2 — lock-free, no GenServer
- update: mutates channels tuple in place, no table churn across reconciles
- shutdown: :ets.delete, idempotent
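A minimal sketch of that strategy shape (illustrative only; `DemoRoundRobin` and the table name are made up, not the PR's actual module):

```elixir
defmodule DemoRoundRobin do
  # init/1 takes channels, stores them as a tuple plus a cursor row.
  def init(channels) when is_list(channels) do
    tid = :ets.new(:demo_lb, [:set, :public, read_concurrency: true])
    :ets.insert(tid, {:channels, List.to_tuple(channels)})
    :ets.insert(tid, {:index, 0})
    {:ok, %{tid: tid}}
  end

  # pick/1: lock-free rotation. update_counter wraps the cursor back to 1
  # once it passes n, then a plain lookup + elem/2 selects the channel.
  # No GenServer call anywhere on this path.
  def pick(%{tid: tid} = state) do
    case :ets.lookup(tid, :channels) do
      [{:channels, channels}] when tuple_size(channels) > 0 ->
        n = tuple_size(channels)
        i = :ets.update_counter(tid, :index, {2, 1, n, 1})
        {:ok, elem(channels, i - 1), state}

      _ ->
        {:error, :no_channels}
    end
  end

  # update/2: reconcile by overwriting the channels row in place.
  # Same table across reconciles, no churn.
  def update(%{tid: tid} = state, channels) do
    :ets.insert(tid, {:channels, List.to_tuple(channels)})
    {:ok, state}
  end

  # shutdown/1: idempotent table delete.
  def shutdown(%{tid: tid}) do
    :ets.delete(tid)
    :ok
  rescue
    ArgumentError -> :ok
  end
end
```

Note the `{2, 1, n, 1}` threshold tuple: position 2, increment 1, and once the result exceeds `n` the counter resets to 1, which is what makes the rotation cycle.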
PickFirst: trivially updated to the new channels-shaped signature.
Connection rewire:
- pick_channel/2 reads persistent_term once, dispatches to lb_mod.pick per request
- persistent_term is written exactly once (init) and erased once (disconnect) —
no more global GC pass across every BEAM process on DNS re-resolution
- :refresh timer removed (obsolete: per-request pick rotates automatically)
- build_balanced_state passes connected Channels to lb_mod.init
- build_direct_state uses PickFirst for uniformity (single persistent_term shape)
- reconcile calls lb_mod.update(lb_state, connected_channels) — update/2 allows
empty lists so pick returns :no_channels when all backends fail (LB's ETS is
now the source of truth; persistent_term stays stable)
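The read-once, dispatch-per-request flow can be sketched roughly like this (hypothetical names: `DemoConnection`, `FixedLB`, and the persistent_term key are assumptions, not the PR's code):

```elixir
defmodule DemoConnection do
  @pt_key {__MODULE__, :lb}

  # Written exactly once at init...
  def publish_lb(lb_mod, lb_state),
    do: :persistent_term.put(@pt_key, {lb_mod, lb_state})

  # ...and erased exactly once at disconnect.
  def erase_lb, do: :persistent_term.erase(@pt_key)

  # Hot path: one persistent_term read, then dispatch to the LB module,
  # which does the per-request rotation.
  def pick_channel do
    case :persistent_term.get(@pt_key, nil) do
      nil ->
        {:error, :no_connection}

      {lb_mod, lb_state} ->
        case lb_mod.pick(lb_state) do
          {:ok, channel, _new_state} -> {:ok, channel}
          {:error, reason} -> {:error, reason}
        end
    end
  end
end

# A stub strategy, just to show the dispatch:
defmodule FixedLB do
  def pick(state), do: {:ok, :chan_1, state}
end
```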
Architecture follows sleipnir's review of elixir-grpc#513:
- Connection stays as orchestrator, LB owns pick state via opaque tid
- No new processes — ETS table lifecycle piggybacks on Connection GenServer
- LB module is the only thing that touches the ETS table
Tests: 275 total (was 266), 0 failures across 3 seeds.
- 11 new RoundRobin unit tests (init/pick/update/shutdown/concurrency)
- 8 new PickFirst unit tests
- 1 new integration test proving per-request rotation end-to-end
Micro-benchmark (bench/lb_pick.exs, Apple M1 Max):
Single-process pick:
persistent_term.get 13.34M ips 75 ns (cached pick, no rotation)
ETS RoundRobin.pick 3.42M ips 293 ns (per-request rotation)
GenServer.call pick 0.98M ips 1020 ns (serialized via mailbox)
32-process concurrent pick:
persistent_term.get 6.6M ips 0.15 μs (read-scales linearly)
GenServer.call pick 26.6K ips 37 μs (mailbox-bound)
ETS RoundRobin.pick 9.9K ips 101 μs (counter contention)
Single-process reconciliation write:
persistent_term.put 5.95M ips 168 ns (+ global GC pass, not measured)
ETS RoundRobin.update 1.51M ips 663 ns (local, no side effects)
The comparison is not apples-to-apples: persistent_term.get returns the same
cached pick every call (no load distribution), ETS does per-request rotation.
At realistic gRPC traffic rates (1K-10K picks/sec bounded by network latency),
neither the 293ns pick cost nor the counter contention is observable — RPC
latency dominates by 4+ orders of magnitude.
Follow-up to elixir-grpc#520. Can drop the benchee dep + bench script if desired.
The benchmark served its purpose: informing the design choice in the previous commit. Keeping benchee as a first-party dep for one script isn't justified, and the numbers belong in the PR description anyway (preserved in c3c3ebe's commit message for history). No code or test changes. Dep tree is back to the pre-LB state.
If a caller reads {lb_mod, lb_state} from persistent_term and a concurrent
disconnect runs lb_mod.shutdown before the caller reaches RoundRobin.pick/1,
both :ets.lookup and :ets.update_counter raise ArgumentError on the deleted
table — crashing the calling RPC process rather than surfacing a tagged error.
The pre-refactor code couldn't hit this path (persistent_term.get with a
default of nil always returned safely), so this is a regression introduced
by moving the pick onto the hot path.
Rescue ArgumentError and map it to {:error, :no_channels}, matching the
empty-table branch. Adds a unit test that shuts the table down and then
picks, which reproduced the crash before the rescue.
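The rescue described above looks roughly like this (a sketch with invented names; the real module is the PR's RoundRobin):

```elixir
defmodule SafePick do
  # If disconnect deletes the ETS table between the persistent_term read
  # and the pick, :ets.lookup / :ets.update_counter raise ArgumentError.
  # Rescue it into the same tagged error as the empty-table branch so the
  # calling RPC process gets {:error, :no_channels} instead of crashing.
  def pick(%{tid: tid} = state) do
    case :ets.lookup(tid, :channels) do
      [{:channels, channels}] when tuple_size(channels) > 0 ->
        i = :ets.update_counter(tid, :index, {2, 1, tuple_size(channels), 1})
        {:ok, elem(channels, i - 1), state}

      _ ->
        {:error, :no_channels}
    end
  rescue
    # Table reclaimed by a concurrent shutdown: fail fast, don't crash.
    ArgumentError -> {:error, :no_channels}
  end
end
```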
The LB owns its ETS table; Connection.shutdown_lb/2 is the only place that
calls lb_mod.shutdown/1. Nothing in the suite currently breaks if that call
is removed, so a refactor could silently leak an ETS table per gracefully-
closed connection (and another per supervised-shutdown crash).
Two regression tests:
* disconnect/1 frees the table — captures the tid via :sys.get_state before
disconnecting, asserts :ets.info returns :undefined afterwards.
* terminate/2 frees the table — same shape, but stops the GenServer via
GenServer.stop to exercise the crash path.
Both rely on :round_robin being the only policy with an ETS table today;
if PickFirst grows one, these tests still pass as-is.
The previous commit that moved the pick onto the hot path removed the 15s
:refresh timer, but two pieces of module docs still described it:
* "The orchestrator periodically refreshes the LB decision to adapt to
changes." — replaced with a one-liner describing per-request dispatch.
* "The orchestrator refreshes the LB pick every 15 seconds." — removed
entirely along with its standalone Notes section.
Docs now match behaviour.
…tarted test
This test erases the persistent_term entry, confirms Connection.connect
returns :no_connection on the already_started branch, then puts an entry
back so the subsequent disconnect completes cleanly.
It used to restore a bare Channel struct, which predates the ETS refactor.
The current shape is {lb_mod, lb_state}. The test was passing by accident
because disconnect only erases the key (shape-agnostic) — but a future
refactor that actually reads the restored entry would trip on the mismatch.
Capture the real entry before erasing and restore it verbatim.
The persistent_term entry stores {lb_mod, lb_state} written once at init and
never rewritten during the connection's life (that's the whole point of the
ETS refactor). This works as long as lb_mod.update/2 mutates external state
and returns a state value the caller already holds — which is how RoundRobin
behaves (the tid is stable).
PickFirst used to return a NEW state map from update/2:
def update(_state, [first | _]), do: {:ok, %{current: first}}
The Connection GenServer's own state got the new map, but persistent_term
kept the original %{current: first_backend_at_connect_time}. So
pick_channel silently kept returning the dead backend forever after any
DNS change. Not covered by tests because every re_resolve_test scenario
passes lb_policy: :round_robin.
Fix: give PickFirst the same ETS-backed state shape as RoundRobin. init
creates a one-row table {:current, channel}, update mutates it in place,
pick reads it. State is %{tid: tid} — stable across update/2.
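A sketch of that shape (names invented for illustration): the key property is that `update/2` mutates the one-row table and hands back the *same* `%{tid: tid}` map, so a stale copy of the state, e.g. the one frozen in persistent_term, still observes the fresh channel.

```elixir
defmodule DemoPickFirst do
  def init([first | _]) do
    tid = :ets.new(:demo_pick_first, [:set, :public, read_concurrency: true])
    :ets.insert(tid, {:current, first})
    {:ok, %{tid: tid}}
  end

  # In-place mutation; the returned state is identical to the input state.
  def update(%{tid: tid} = state, [first | _]) do
    :ets.insert(tid, {:current, first})
    {:ok, state}
  end

  def pick(%{tid: tid} = state) do
    case :ets.lookup(tid, :current) do
      [{:current, channel}] -> {:ok, channel, state}
      [] -> {:error, :no_channels}
    end
  end
end
```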
Tests:
* Rewrites pick_first_test.exs against the new shape (11 tests).
* Adds a re-resolve integration test using the default LB (PickFirst):
connect with one backend, swap the DNS answer, assert pick_channel
returns the new backend. Reproduced the bug before this fix.
Follow-up on @sleipnir's comment: "we can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module." elixir-grpc#520 (comment)

His sample used ETS keyed by ref with no persistent_term. My earlier commit kept persistent_term as a write-once pointer — worked fine, but wasn't what he showed. This matches the sample.

New module: GRPC.Client.LoadBalancing.Registry. Small GenServer supervised by the client Application. Owns a named public ETS table (:grpc_client_lb_registry). Three functions — put/2, lookup/1, delete/1. Lookup rescues ArgumentError so a racing :ets.delete on shutdown gives :error instead of crashing the RPC.

Connection now:
* Registry.put on init (ref -> {lb_mod, lb_state})
* Registry.lookup in pick_channel
* Registry.delete in disconnect and terminate

No persistent_term left in the LB path. LB behaviour unchanged; per-LB ETS tables stay as they were. 282 tests pass. connection_test.exs pokes Registry directly where it used to poke persistent_term.

BTW — also considered putting {lb_mod, lb_state} straight on the Channel struct. Drops both persistent_term and the registry, same runtime behaviour, one fewer indirection on every pick. Flagged in the PR for @sleipnir — happy to swap if that reads cleaner to him.
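The registry shape described above, sketched with invented names (`DemoRegistry`, `:demo_lb_registry`; the real module is GRPC.Client.LoadBalancing.Registry):

```elixir
defmodule DemoRegistry do
  use GenServer

  @table :demo_lb_registry

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  # The GenServer exists only to own the named table; reads and writes
  # go straight to ETS, never through the mailbox.
  def init(_opts) do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    {:ok, %{}}
  end

  def put(key, value) do
    :ets.insert(@table, {key, value})
    :ok
  end

  def lookup(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  rescue
    # Table deleted by a racing shutdown: tagged error, not a crash.
    ArgumentError -> :error
  end

  def delete(key) do
    :ets.delete(@table, key)
    :ok
  end
end
```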
The Registry is only exercised through Connection today — if someone regresses lookup/1 or delete/1 in isolation, no test fails. Add direct round-trip tests for the public API (arbitrary key shapes, overwrite, missing key returns :error, delete on unknown key is safe). The ArgumentError rescue in lookup/1 is covered end-to-end by the concurrent pick+disconnect test — not duplicated here.
Concurrent RPCs must not crash when disconnect is tearing down LB state.
This is the scenario the ArgumentError rescues in Registry.lookup/1 and
the LB modules' pick/1 exist to handle — but nothing actually proved it
end-to-end.
Spawns 50 processes doing 100 picks each, fires disconnect mid-flight.
Every result must be either {:ok, %Channel{}} or {:error, :no_connection}.
Anything else means we crashed a caller.
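A standalone miniature of that race, assuming the rescue-into-tagged-error pick shape (table and variable names are inventions; the real test drives Connection and the LB modules):

```elixir
tid = :ets.new(:race_demo, [:set, :public, read_concurrency: true])
:ets.insert(tid, {:channels, {:a, :b}})
:ets.insert(tid, {:index, 0})

# Pick that maps a mid-flight table deletion to a tagged error.
safe_pick = fn ->
  try do
    case :ets.lookup(tid, :channels) do
      [{:channels, chs}] ->
        i = :ets.update_counter(tid, :index, {2, 1, tuple_size(chs), 1})
        {:ok, elem(chs, i - 1)}

      [] ->
        {:error, :no_channels}
    end
  rescue
    ArgumentError -> {:error, :no_channels}
  end
end

# Tear the table down mid-flight, like disconnect/1 does.
spawn(fn ->
  Process.sleep(5)
  :ets.delete(tid)
end)

results =
  1..50
  |> Task.async_stream(fn _ -> Enum.map(1..100, fn _ -> safe_pick.() end) end)
  |> Enum.flat_map(fn {:ok, picks} -> picks end)

# Every result is a tagged tuple; nothing crashed a caller.
true =
  Enum.all?(results, fn r ->
    match?({:ok, _}, r) or match?({:error, :no_channels}, r)
  end)
```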
…ycles

Mirrors the regression pattern from PR elixir-grpc#509, which added a 100-cycle test to prove the persistent_term leak fix. Same idea, applied to the tables the refactor introduced: the shared Registry and the per-LB ETS tables.

Snapshots :ets.all() and the Registry size before the loop, cycles 500 connect+disconnect pairs, then asserts the Registry returns to its starting size and no more than a handful of new tables exist (the VM may create a few incidentally during the run).
…ends
Derived from research into long-lived gRPC clients (GCP, Temporal,
LiveKit, our own unleash_fresha) — they all hit rapid DNS churn where
the backend count drops between picks. The risk is that a concurrent
RoundRobin pick indexes into a tuple that update/2 just replaced with
a smaller one.
Starts with 8 backends, launches 30 processes doing 200 picks each,
triggers a DNS reconcile down to 2 backends mid-flight. Every pick
must return {:ok, %Channel{}} — no :badarg from elem/2, no stale
index arithmetic, no crash. After the reconcile settles, all picks
must come from the two survivors only.
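Why the lookup-then-elem shape survives a shrink can be shown in isolation (a sketch, not the test itself): `elem/2` operates on the tuple captured by the lookup, and update_counter's threshold clamps a stale cursor back into range on the next pick.

```elixir
tid = :ets.new(:shrink_demo, [:set, :public])
:ets.insert(tid, {:channels, {:b1, :b2, :b3, :b4, :b5, :b6, :b7, :b8}})
# Cursor deep into the 8-wide tuple.
:ets.insert(tid, {:index, 7})

pick = fn ->
  [{:channels, chs}] = :ets.lookup(tid, :channels)
  # Threshold = current tuple size, so the cursor can never outrun
  # the tuple the SAME lookup just captured.
  i = :ets.update_counter(tid, :index, {2, 1, tuple_size(chs), 1})
  elem(chs, i - 1)
end

:b8 = pick.()

# DNS reconcile shrinks to 2 backends; the stale cursor (8) is clamped
# by update_counter's threshold on the next pick — no :badarg from elem/2.
:ets.insert(tid, {:channels, {:s1, :s2}})
:s1 = pick.()
:s2 = pick.()
```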
The earlier refactor marked update/2 in @optional_callbacks and added a reconcile_lb fallback that called init/1 when the LB skipped update/2. The fallback was broken: init/1 creates a fresh ETS table and the previous lb_state's table is orphaned, leaking one table per DNS reconcile.

For our own LBs the fallback was dead code (both implement update/2), but the trap was real for any third party writing a custom LB. update/2 is the only way to reconcile a stateful LB without leaking — "optional" was the wrong signal.

Make it required so the compiler warns anyone who skips it, delete the fallback, and let reconcile_lb be the single line it should have been. shutdown/1 stays optional — strategies with no external resources can genuinely omit it.
…on resolver worker crash

LB strategies' ETS tables are owned by the Connection GenServer (lb_mod.init now runs inside init/1), so BEAM reclaims them automatically on exit — no explicit shutdown callback needed. Merged the two connect paths through a single finalize_connection/2 helper.

On resolver worker death, stop the Connection instead of re-initializing. The old re-init path created a crash loop when init itself kept failing (bad config, unreachable control plane). Unrelated :EXIT signals are now logged and ignored so stray links from user code can't take Connection down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
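The mechanism this relies on, demonstrated standalone: an ETS table dies with its owner process, and `:ets.info/1` flips to `:undefined` (the same assertion the lifecycle regression tests use). The polling loop is there because table cleanup can lag the DOWN message by a moment.

```elixir
parent = self()

# A stand-in for the Connection GenServer owning an LB table.
owner =
  spawn(fn ->
    tid = :ets.new(:demo_lb, [:set, :public])
    send(parent, {:tid, tid})

    receive do
      :stop -> :ok
    end
  end)

tid =
  receive do
    {:tid, t} -> t
  end

# Table is alive while its owner is.
false = :ets.info(tid) == :undefined

# Owner exits; BEAM reclaims the table with no explicit :ets.delete.
ref = Process.monitor(owner)
send(owner, :stop)

receive do
  {:DOWN, ^ref, :process, _, _} -> :ok
end

# ETS cleanup can complete slightly after the DOWN arrives; poll briefly.
wait = fn wait ->
  case :ets.info(tid) do
    :undefined ->
      :undefined

    _ ->
      Process.sleep(10)
      wait.(wait)
  end
end

:undefined = wait.(wait)
```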
Rescope the catch-all :EXIT comment so the :normal clause's silent ignore isn't surprising, and replace "reconciles" with "DNS re-resolutions" in the PickFirst moduledoc for readers who aren't already deep in the LB code path.
…connect
Three small fixes from review feedback on the ETS LB work:
- init/1: publish to the Registry last. init crashes don't run
terminate/2, so registering before resolver.init/2 would strand the
entry if resolver init raised.
- rebalance_after_reconcile/3: republish {lb_mod, lb_state} after
lb_mod.update/2 so pick_channel/2 reads the fresh state returned by
update/2, not the state captured at init.
- handle_call({:disconnect, ...}): linearized to Registry.delete →
transport close → reply. Dropped the dead Map.drop branch; the struct
always has :real_channels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was a bad call on my part. Stopping the Connection when the
resolver worker dies means a network blip or brief DNS hiccup tears
down every transport channel and reconnects from scratch. We can't
have network jitter force a restart.
The reasoning I had behind the original change ("let-it-crash, trust
the supervisor") doesn't hold up either:
- The supervisor restart uses the original initial_state snapshot.
real_channels in there points to conn_pids from first connect, so
the new Connection inherits stale references.
- terminate/2 only deletes the Registry entry; transports leak on
abnormal exit.
- The "avoid crash-loop" justification was wrong. The old code had
a fallback to resolver_state: nil — it degraded gracefully, it
didn't crash-loop.
Restore the original in-place re-init.
hansihe reviewed Apr 29, 2026
Replace the shared ETS Registry with :persistent_term keyed by
{Connection, ref} and move the RoundRobin cursor from an ETS counter
row to :atomics. Three problems collapse into one design:
1. Hot-path counter contention. :ets.update_counter/3 locks the key's
hash bucket, so 32 concurrent picks bottleneck on a single counter
(~10K ips in the bench). :atomics.add_get/3 is a hardware CAS — no
table-level or key-level lock — and scales linearly across
schedulers.
2. Two ETS tables where one is enough. The shared registry table is
gone; the per-Connection LB table now holds only the channels
tuple. Lookup of {lb_mod, lb_state} is a :persistent_term.get,
which is faster than an :ets.lookup and allocation-free.
3. Steady-state global GC. The published lb_state contains a stable
tid + atomics ref for the life of the connection. Reconcile
mutates ETS + atomics in place and does NOT republish, so
:persistent_term.put fires once per connect — not every 30s — and
the global GC pass it triggers stays off the hot loop.
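The atomics-based cursor described above, sketched with invented names (not the PR's exact module): `:atomics.add_get/3` is a hardware fetch-and-add with no table- or key-level lock, and the channels tuple still lives in a plain ETS row.

```elixir
defmodule AtomicsRoundRobin do
  def init(channels) do
    tid = :ets.new(:demo_lb, [:set, :public, read_concurrency: true])
    :ets.insert(tid, {:channels, List.to_tuple(channels)})
    # One unsigned 64-bit cell; wraps at 2^64, which is fine for a cursor.
    cursor = :atomics.new(1, signed: false)
    {:ok, %{tid: tid, cursor: cursor}}
  end

  def pick(%{tid: tid, cursor: cursor} = state) do
    case :ets.lookup(tid, :channels) do
      [{:channels, chs}] when tuple_size(chs) > 0 ->
        # Monotonic counter; rem/2 keeps it in range even right after
        # a reconcile shrinks the tuple.
        n = :atomics.add_get(cursor, 1, 1)
        {:ok, elem(chs, rem(n - 1, tuple_size(chs))), state}

      _ ->
        {:error, :no_channels}
    end
  rescue
    # Table reclaimed mid-pick by disconnect: tagged error, not a crash.
    ArgumentError -> {:error, :no_channels}
  end

  # Reconcile mutates in place; tid + atomics ref stay stable, so the
  # {lb_mod, lb_state} published to persistent_term is never rewritten.
  def update(%{tid: tid} = state, channels) do
    :ets.insert(tid, {:channels, List.to_tuple(channels)})
    {:ok, state}
  end
end
```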
Lifecycle:
init/1 put once, after lb_mod.init + resolver.init succeed
pick_channel/2 get on every RPC
disconnect/1 erase first (in-flight picks fail fast), then close
terminate/2 erase (idempotent — safe even if disconnect erased)
The terminate/2 erase covers the abnormal-exit path; the only leak
window is brutal kill, which already existed with Registry.delete and
is not regressed by this change.
The failure modes the previous PR introduced are all preserved:
- rescue ArgumentError in lb_mod.pick — table reclaimed mid-pick
- persistent_term.get/2 fallback to nil — entry erased mid-pick
- erase BEFORE closing transports — picks fail with :no_connection
rather than racing a dying adapter
Tests: 286/286 pass (was 292; six Registry tests removed). The
500-cycle leak regression now counts {Connection, _}-keyed
persistent_term entries instead of registry-table size.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolver code from elixir-grpc#520 ships with essentially zero inline comments and tight one-line @moduledocs. Match that here:

- Slash multi-paragraph @moduledoc blocks down to one-liners on the LB behaviour, RoundRobin, PickFirst, and the LB test moduledocs.
- Remove the multi-line WHY comments in Connection (init publish ordering, disconnect erase ordering, reconcile in-place mutation).
- Remove the enumerate-every-test moduledoc from RoundRobin tests.
- Remove the "guards the race where..." race-test comments — the test name already says what it does, and the rescue is in the LB code.
- Remove the connection_test commentary on the already_started path, the LB ETS lifecycle setup, the concurrent-pick rationale, and the ETS-table-slop justification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
@sleipnir I got a friend to review it first to try and save you some time. Happy to go in whatever direction you want, though. :)
Collaborator
Ok, let me know when you are done!
Contributor
Author
@sleipnir - Done - he reviewed it earlier. :)
Collaborator
Okay, now I'm going to review it. |
…-path # Conflicts: # grpc/test/grpc/client/connection_test.exs
Contributor
Author
@sleipnir - how is it looking? No pressure ;)
Test leaked a Connection + DNSResolver pair with resolve_interval: 200ms. After the test exited, the leaked resolver kept calling MockResolver.resolve/1, which Mox rejected (UnexpectedCallError). Connection.handle_info re-init'd the resolver, MockResolver.init/2 also raised, Connection crashed, DynamicSupervisor restarted it, max_restarts tripped, and the supervisor shut down — breaking later tests that try to start_child (notably the new 500-cycle leak test).
…-path # Conflicts: # grpc/test/grpc/client/connection_test.exs
Summary
Follow-up to #520 putting the LB module in control of the pick decision.
`pick_channel/2` now goes through `lb_mod.pick(lb_state)` on every RPC. The hot path is fully lock-free: no GenServer call, and no `:persistent_term.put` on reconcile (and so no global GC pass when backends scale in/out).

The shape generalizes: weighted RR, least-requests, health-aware routing, and xDS all need richer per-backend state than a single atom — "LB owns its ETS table + atomics" carries them without forcing each strategy to invent state plumbing.
Architecture
- `:persistent_term` keyed by `{Connection, ref}` holds `{lb_mod, lb_state}`: put once in `init/1`, erased on `disconnect/1` + `terminate/2`
- the per-Connection LB ETS table holds only `{:channels, channels_tuple}`
- the RoundRobin cursor is an `:atomics` ref (carried in `lb_state`)

The `lb_state` published to `persistent_term` (the `tid` + `atomics` ref) is stable for the life of the connection — reconcile mutates ETS and atomics in place rather than republishing, so `:persistent_term.put` fires once per connect, not every 30s. That keeps the global GC pass `put` triggers off the steady-state path.

Micro-benchmark (Apple M1 Max)
TL;DR: the pick path adds ~290 ns over the pre-PR cached read (52 ns → 345 ns) in exchange for a real per-request RR rotation across N channels. The reconciliation write path replaces `:persistent_term.put` with `:ets.insert + :atomics.put`, eliminating the global GC pass that previously fired every 30s per Connection.

All comparisons are vs the pre-PR baseline: a single cached `:persistent_term.get` of one channel, no rotation — what shipped before #520.

Single-process pick

- baseline (`:persistent_term.get`, cached): 52 ns
- this PR (`persistent_term` + ETS + atomics): 345 ns

32-process concurrent pick

- baseline (`:persistent_term.get`): pure read
- this PR (`persistent_term` + ETS + atomics): ~6M picks/sec

The ~6× delta is the cost of moving from a pure read to a real RR rotation. Critically, `:atomics.add_get/3` is a hardware CAS with no key-level or table-level lock, so the contention story is just cache coherence rather than a serialization point. ~6M picks/sec at 32 procs is well above any realistic per-Connection RPC volume.

Single-process reconciliation write

- before (`:persistent_term.put`)
- this PR (`:ets.insert + :atomics.put`)

The headline `persistent_term.put` latency hides the real cost. Per the BEAM docs, every write forces a global GC pass — every process in the node has to scan its heap. With periodic DNS re-resolution this fires every 30s per Connection; the new design pays ~340 ns more per reconcile and zero GC.

Notable details
- `RoundRobin.pick/1` rescues `ArgumentError` — `disconnect/1` may delete the LB ETS table mid-pick and we want a tagged error rather than a propagated crash.
- `:persistent_term.put` runs last in `init/1` — `init/1` crashes skip `terminate/2`, so publishing earlier would leak entries on a failing `resolver.init/2`.
- `disconnect/1` calls `:persistent_term.erase` before closing transports, so in-flight picks fail fast with `:no_connection` rather than racing a dying adapter.
- `terminate/2` also erases (idempotent on the disconnect path — `:persistent_term.erase` returns `false` on a missing key without raising). Covers crash exits and supervisor stops.
- `rebalance_after_reconcile/3` does not republish to `persistent_term` — the published `lb_state` shape (tid + atomics ref) is stable for the life of the connection, so `update/2`'s in-place mutation is visible to picks immediately.

Test coverage
- `disconnect/1` and `terminate/2` both free the LB's ETS table.
- Concurrent picks racing `disconnect/1`: no crashes.
- Zero `{Connection, _}`-keyed `persistent_term` leak and zero ETS table leak across repeated connect/disconnect cycles (mirrors the #509 "erase persistent_term leak in GRPC.Client.Connection on disconnect" `persistent_term` leak regression).

Full suite: 286 tests, 0 failures.