
feat: ETS-backed LB with per-request pick, remove persistent_term writes on reconcile #527

Open

cgreeno wants to merge 25 commits into elixir-grpc:master from cgreeno:feature/ets-lb-pick-path

Conversation

@cgreeno cgreeno commented Apr 24, 2026

Summary

Follow-up to #520, putting the LB module in control of the pick decision.

pick_channel/2 now goes through lb_mod.pick(lb_state) on every RPC. The hot path is fully lock-free:

pick_channel/2
  → :persistent_term.get({Connection, ref})    # constant-time lookup
  → lb_mod.pick(lb_state)                       # :atomics.add_get + elem (RR) or :ets.lookup (PF)
  → {:ok, channel}

No GenServer call on the hot path. No :persistent_term.put on reconcile (and so no global GC pass when backends scale in/out).

The shape generalizes: weighted RR, least-requests, health-aware routing, and xDS all need richer per-backend state than a single atom — "LB owns its ETS table + atomics" carries them without forcing each strategy to invent state plumbing.
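
A minimal sketch of that dispatch, assuming the shapes above (module and function bodies are illustrative, not the PR's code; only the `:persistent_term` call is the real API):

```elixir
defmodule PickPathSketch do
  # Hypothetical wrapper: shows the lookup-then-dispatch shape only.
  def pick_channel(ref, _opts) do
    case :persistent_term.get({Connection, ref}, nil) do
      nil ->
        # entry already erased by disconnect/1: fail fast, don't crash
        {:error, :no_connection}

      {lb_mod, lb_state} ->
        # pick/1 returns {:ok, Channel.t(), new_state} per the behaviour
        with {:ok, channel, _state} <- lb_mod.pick(lb_state), do: {:ok, channel}
    end
  end
end
```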

Architecture

| Where | What it holds | Lifecycle |
| --- | --- | --- |
| `:persistent_term` keyed `{Connection, ref}` | `{lb_mod, lb_state}` | put once on `init/1`, erase on `disconnect/1` + `terminate/2` |
| Per-Connection ETS table (one per Connection, owned by the GenServer) | `{:channels, channels_tuple}` | replaced wholesale on reconcile |
| `:atomics` ref (in `lb_state`) | RR cursor (unsigned 64-bit) | reset to 0 on reconcile |

The lb_state published to persistent_term (the tid + atomics ref) is stable for the life of the connection: reconcile mutates ETS and atomics in place rather than republishing, so :persistent_term.put fires once per connect, not every 30s. That keeps the global GC pass that put triggers off the steady-state path.
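
A minimal sketch of that in-place reconcile, assuming lb_state is `%{tid: tid, cursor: cursor}` (field names are illustrative):

```elixir
defmodule ReconcileSketch do
  # Hypothetical module; mirrors the write path described above.
  def update(%{tid: tid, cursor: cursor} = state, connected_channels) do
    # Replace the channels tuple wholesale; tid and cursor never change,
    # so the {lb_mod, lb_state} already in :persistent_term stays valid.
    :ets.insert(tid, {:channels, List.to_tuple(connected_channels)})
    # Reset the RR cursor so subsequent picks index within the new tuple.
    :atomics.put(cursor, 1, 0)
    {:ok, state}
  end
end
```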

Micro-benchmark (Apple M1 Max)

TL;DR: the pick path adds ~290 ns over the pre-PR cached read (52 ns → 345 ns) in exchange for a real per-request RR rotation across N channels. The reconciliation write path replaces :persistent_term.put with :ets.insert + :atomics.put, eliminating the global GC pass that previously fired every 30s per Connection.

All comparisons are vs the pre-PR baseline: a single cached :persistent_term.get of one channel, no rotation — what shipped before #520.

Single-process pick

| Variant | Throughput | Latency | Notes |
| --- | --- | --- | --- |
| Pre-PR baseline (`:persistent_term.get`, cached) | 19.08M ips | 52 ns | no rotation, single channel |
| NEW pick (persistent_term + ets + atomics) | 2.90M ips | 345 ns | full per-request RR rotation |

32-process concurrent pick

| Variant | Total throughput | Notes |
| --- | --- | --- |
| Pre-PR baseline (`:persistent_term.get`) | ~35M picks/sec | reads scale linearly |
| NEW pick | ~6.0M picks/sec | atomics counter; limited by cache-line ping-pong, not a software lock |

The ~6× delta is the cost of moving from a pure read to a real RR rotation. Critically, :atomics.add_get/3 is a hardware CAS with no key-level or table-level lock, so the contention story is just cache coherence rather than a serialization point. ~6M picks/sec at 32 procs is well above any realistic per-Connection RPC volume.

Single-process reconciliation write

| Variant | Throughput | Latency | Notes |
| --- | --- | --- | --- |
| Pre-PR baseline (`:persistent_term.put`) | 5.37M ips | 186 ns | plus a global GC pass across every BEAM process |
| NEW reconcile (`:ets.insert` + `:atomics.put`) | 1.90M ips | 527 ns | local, no global side effects |

The headline persistent_term.put latency hides the real cost. Per the BEAM docs, every write forces a global GC pass — every process in the node has to scan its heap. With periodic DNS re-resolution this fires every 30s per Connection; the new design pays ~340 ns more per reconcile and zero GC.
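
To reproduce the flavor of this comparison, a minimal Benchee sketch of the two write paths (not the PR's bench/lb_pick.exs; table and key names are illustrative):

```elixir
# Requires {:benchee, "~> 1.0"} as a dependency.
tid = :ets.new(:lb_bench, [:set, :public])
cursor = :atomics.new(1, signed: false)
channels = List.to_tuple(Enum.map(1..8, &{:channel, &1}))

Benchee.run(%{
  "persistent_term.put (triggers global GC)" => fn ->
    :persistent_term.put({:lb_bench, :state}, channels)
  end,
  "ets.insert + atomics.put (local writes)" => fn ->
    :ets.insert(tid, {:channels, channels})
    :atomics.put(cursor, 1, 0)
  end
})
```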

Notable details

  • RoundRobin.pick/1 rescues ArgumentError: disconnect/1 may delete the LB ETS table mid-pick, and we want a tagged error rather than a propagated crash.
  • :persistent_term.put runs last in init/1: crashes in init/1 skip terminate/2, so publishing earlier would leak entries on a failing resolver.init/2.
  • disconnect/1 calls :persistent_term.erase before closing transports, so in-flight picks fail fast with :no_connection rather than racing a dying adapter (see the sketch after this list).
  • terminate/2 also erases (idempotent on the disconnect path — :persistent_term.erase returns false on a missing key without raising). Covers crash exits and supervisor stops.
  • rebalance_after_reconcile/3 does not republish to persistent_term — the published lb_state shape (tid + atomics ref) is stable for the life of the connection, so update/2's in-place mutation is visible to picks immediately.
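
A sketch of the teardown ordering from the disconnect/terminate bullets above (close_transports/1 is a hypothetical stand-in for the adapter shutdown):

```elixir
defmodule TeardownSketch do
  def disconnect(ref, state) do
    # Erase first: in-flight pick_channel/2 calls now hit the nil default
    # and return :no_connection instead of racing a dying transport.
    :persistent_term.erase({Connection, ref})
    close_transports(state)
  end

  def terminate(_reason, %{ref: ref}) do
    # Idempotent: erase/1 returns false on a missing key without raising,
    # so running after disconnect has already erased is safe.
    :persistent_term.erase({Connection, ref})
  end

  defp close_transports(_state), do: :ok
end
```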

Test coverage

  • 19 LB unit tests (PickFirst, RoundRobin including 16-process concurrency stress).
  • disconnect/1 and terminate/2 both free the LB's ETS table.
  • 50 concurrent pickers × 100 picks vs disconnect/1 — no crashes.
  • 30 concurrent pickers × 200 picks during a shrinking reconcile — no crashes; survivors-only after settle.
  • 500 connect/disconnect cycles — zero {Connection, _}-keyed persistent_term leak, zero ETS table leak (mirrors the regression fixed in #509, "erase persistent_term leak in GRPC.Client.Connection on disconnect").

Full suite: 286 tests, 0 failures.

cgreeno and others added 17 commits April 28, 2026 13:50
…tes on reconcile

Moves the client load-balancer pick onto the RPC hot path via an ETS-backed
RoundRobin strategy, so gRPC clients now round-robin across backends per
request instead of caching one pick for 15 seconds.

Behaviour changes (GRPC.Client.LoadBalancing):
  - init/1 now takes :channels (full Channel structs), not :addresses
  - pick/1 returns {:ok, Channel.t(), new_state} directly — no second lookup
  - adds optional update/2 for in-place reconciliation and shutdown/1 for cleanup
  - @optional_callbacks keeps simple strategies compliant without stubs

RoundRobin rewrite:
  - ETS table (:set, :public, read_concurrency: true) owned by the Connection
  - pick: :ets.update_counter/3 + :ets.lookup + elem/2 — lock-free, no GenServer (sketched after this list)
  - update: mutates channels tuple in place, no table churn across reconciles
  - shutdown: :ets.delete, idempotent
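
A sketch of that pick (row shapes {:cursor, n} and {:channels, tuple} are illustrative; a later commit in this PR swaps the counter row for :atomics):

```elixir
defmodule EtsPickSketch do
  def pick(%{tid: tid} = state) do
    n = :ets.update_counter(tid, :cursor, 1)   # atomic per-key increment
    [{:channels, channels}] = :ets.lookup(tid, :channels)
    {:ok, elem(channels, rem(n, tuple_size(channels))), state}
  end
end
```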

PickFirst: trivially updated to the new channels-shaped signature.

Connection rewire:
  - pick_channel/2 reads persistent_term once, dispatches to lb_mod.pick per request
  - persistent_term is written exactly once (init) and erased once (disconnect) —
    no more global GC pass across every BEAM process on DNS re-resolution
  - :refresh timer removed (obsolete: per-request pick rotates automatically)
  - build_balanced_state passes connected Channels to lb_mod.init
  - build_direct_state uses PickFirst for uniformity (single persistent_term shape)
  - reconcile calls lb_mod.update(lb_state, connected_channels) — update/2 allows
    empty lists so pick returns :no_channels when all backends fail (LB's ETS is
    now the source of truth; persistent_term stays stable)

Architecture follows sleipnir's review of elixir-grpc#513:
  - Connection stays as orchestrator, LB owns pick state via opaque tid
  - No new processes — ETS table lifecycle piggybacks on Connection GenServer
  - LB module is the only thing that touches the ETS table

Tests: 275 total (was 266), 0 failures across 3 seeds.
  - 11 new RoundRobin unit tests (init/pick/update/shutdown/concurrency)
  - 8 new PickFirst unit tests
  - 1 new integration test proving per-request rotation end-to-end

Micro-benchmark (bench/lb_pick.exs, Apple M1 Max):

  Single-process pick:
    persistent_term.get   13.34M ips   75 ns    (cached pick, no rotation)
    ETS RoundRobin.pick    3.42M ips  293 ns    (per-request rotation)
    GenServer.call pick    0.98M ips 1020 ns    (serialized via mailbox)

  32-process concurrent pick:
    persistent_term.get     6.6M ips  0.15 μs   (read-scales linearly)
    GenServer.call pick    26.6K ips   37 μs    (mailbox-bound)
    ETS RoundRobin.pick     9.9K ips  101 μs    (counter contention)

  Single-process reconciliation write:
    persistent_term.put    5.95M ips  168 ns    (+ global GC pass, not measured)
    ETS RoundRobin.update  1.51M ips  663 ns    (local, no side effects)

The comparison is not apples-to-apples: persistent_term.get returns the same
cached pick every call (no load distribution), ETS does per-request rotation.
At realistic gRPC traffic rates (1K-10K picks/sec bounded by network latency),
neither the 293ns pick cost nor the counter contention is observable — RPC
latency dominates by 4+ orders of magnitude.

Follow-up to elixir-grpc#520. Can drop the benchee dep + bench script if desired.
The benchmark served its purpose: informing the design choice in the previous
commit. Keeping benchee as a first-party dep for one script isn't justified,
and the numbers belong in the PR description anyway (preserved in c3c3ebe's
commit message for history).

No code or test changes. Dep tree is back to the pre-LB state.
If a caller reads {lb_mod, lb_state} from persistent_term and a concurrent
disconnect runs lb_mod.shutdown before the caller reaches RoundRobin.pick/1,
both :ets.lookup and :ets.update_counter raise ArgumentError on the deleted
table — crashing the calling RPC process rather than surfacing a tagged error.

The pre-refactor code couldn't hit this path (persistent_term.get with a
default of nil always returned safely), so this is a regression introduced
by moving the pick onto the hot path.

Rescue ArgumentError and map it to {:error, :no_channels}, matching the
empty-table branch. Adds a unit test that shuts the table down and then
picks, which reproduced the crash before the rescue.
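
The fix, as a sketch (same illustrative row shapes as the ETS pick above):

```elixir
defmodule RescuedPickSketch do
  def pick(%{tid: tid} = state) do
    n = :ets.update_counter(tid, :cursor, 1)
    [{:channels, channels}] = :ets.lookup(tid, :channels)
    {:ok, elem(channels, rem(n, tuple_size(channels))), state}
  rescue
    # disconnect/1 deleted the table mid-pick: return a tagged error
    # matching the empty-table branch instead of crashing the RPC caller.
    ArgumentError -> {:error, :no_channels}
  end
end
```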
The LB owns its ETS table; Connection.shutdown_lb/2 is the only place that
calls lb_mod.shutdown/1. Nothing in the suite currently breaks if that call
is removed, so a refactor could silently leak an ETS table per gracefully-
closed connection (and another per supervised-shutdown crash).

Two regression tests:

  * disconnect/1 frees the table — captures the tid via :sys.get_state before
    disconnecting, asserts :ets.info returns :undefined afterwards.
  * terminate/2 frees the table — same shape, but stops the GenServer via
    GenServer.stop to exercise the crash path.

Both rely on :round_robin being the only policy with an ETS table today;
if PickFirst grows one, these tests still pass as-is.
The previous commit that moved the pick onto the hot path removed the 15s
:refresh timer, but two pieces of module docs still described it:

  * "The orchestrator periodically refreshes the LB decision to adapt to
    changes." — replaced with a one-liner describing per-request dispatch.
  * "The orchestrator refreshes the LB pick every 15 seconds." — removed
    entirely along with its standalone Notes section.

Docs now match behaviour.
…tarted test

This test erases the persistent_term entry, confirms Connection.connect
returns :no_connection on the already_started branch, then puts an entry
back so the subsequent disconnect completes cleanly.

It used to restore a bare Channel struct, which predates the ETS refactor.
The current shape is {lb_mod, lb_state}. The test was passing by accident
because disconnect only erases the key (shape-agnostic) — but a future
refactor that actually reads the restored entry would trip on the mismatch.

Capture the real entry before erasing and restore it verbatim.
The persistent_term entry stores {lb_mod, lb_state} written once at init and
never rewritten during the connection's life (that's the whole point of the
ETS refactor). This works as long as lb_mod.update/2 mutates external state
and returns a state value the caller already holds — which is how RoundRobin
behaves (the tid is stable).

PickFirst used to return a NEW state map from update/2:

    def update(_state, [first | _]), do: {:ok, %{current: first}}

The Connection GenServer's own state got the new map, but persistent_term
kept the original %{current: first_backend_at_connect_time}. So
pick_channel silently kept returning the dead backend forever after any
DNS change. Not covered by tests because every re_resolve_test scenario
passes lb_policy: :round_robin.

Fix: give PickFirst the same ETS-backed state shape as RoundRobin. init
creates a one-row table {:current, channel}, update mutates it in place,
pick reads it. State is %{tid: tid} — stable across update/2.
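
A sketch of that shape (hypothetical module; not the PR's exact PickFirst):

```elixir
defmodule PickFirstSketch do
  def init(channels: [first | _]) do
    tid = :ets.new(:pick_first, [:set, :public, read_concurrency: true])
    :ets.insert(tid, {:current, first})
    {:ok, %{tid: tid}}
  end

  def update(%{tid: tid} = state, [first | _]) do
    # In place: tid never changes, so persistent_term's copy stays valid.
    :ets.insert(tid, {:current, first})
    {:ok, state}
  end

  def pick(%{tid: tid} = state) do
    [{:current, channel}] = :ets.lookup(tid, :current)
    {:ok, channel, state}
  end
end
```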

Tests:
  * Rewrites pick_first_test.exs against the new shape (11 tests).
  * Adds a re-resolve integration test using the default LB (PickFirst):
    connect with one backend, swap the DNS answer, assert pick_channel
    returns the new backend. Reproduced the bug before this fix.
Follow-up on @sleipnir's comment: "we can replace them with ETS, but we
can't forget that the responsibility for the pick lies with the load
balancing module."
elixir-grpc#520 (comment)

His sample used ETS keyed by ref with no persistent_term. My earlier
commit kept persistent_term as a write-once pointer — worked fine, but
wasn't what he showed. This matches the sample.

New module: GRPC.Client.LoadBalancing.Registry. Small GenServer
supervised by the client Application. Owns a named public ETS table
(:grpc_client_lb_registry). Three functions — put/2, lookup/1, delete/1.
Lookup rescues ArgumentError so a racing :ets.delete on shutdown gives
:error instead of crashing the RPC.
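
A sketch of the Registry's shape under that description (this intermediate design is replaced by :persistent_term in a follow-up commit below; bodies are illustrative):

```elixir
defmodule RegistrySketch do
  use GenServer

  @table :grpc_client_lb_registry

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # named public table owned by this GenServer
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    {:ok, %{}}
  end

  def put(ref, entry), do: :ets.insert(@table, {ref, entry})

  def lookup(ref) do
    case :ets.lookup(@table, ref) do
      [{^ref, entry}] -> {:ok, entry}
      [] -> :error
    end
  rescue
    # a racing table delete on shutdown must yield :error, not crash the RPC
    ArgumentError -> :error
  end

  def delete(ref), do: :ets.delete(@table, ref)
end
```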

Connection now:
  * Registry.put on init (ref -> {lb_mod, lb_state})
  * Registry.lookup in pick_channel
  * Registry.delete in disconnect and terminate

No persistent_term left in the LB path. LB behaviour unchanged, per-LB
ETS tables stay as they were. 282 tests pass. connection_test.exs pokes
Registry directly where it used to poke persistent_term.

BTW — also considered putting {lb_mod, lb_state} straight on the Channel
struct. Drops both persistent_term and the registry, same runtime
behaviour, one fewer indirection on every pick. Flagged in the PR for
@sleipnir — happy to swap if that reads cleaner to him.
The Registry is only exercised through Connection today — if someone
regresses lookup/1 or delete/1 in isolation, no test fails. Add direct
round-trip tests for the public API (arbitrary key shapes, overwrite,
missing key returns :error, delete on unknown key is safe).

The ArgumentError rescue in lookup/1 is covered end-to-end by the
concurrent pick+disconnect test — not duplicated here.
Concurrent RPCs must not crash when disconnect is tearing down LB state.
This is the scenario the ArgumentError rescues in Registry.lookup/1 and
the LB modules' pick/1 exist to handle — but nothing actually proved it
end-to-end.

Spawns 50 processes doing 100 picks each, fires disconnect mid-flight.
Every result must be either {:ok, %Channel{}} or {:error, :no_connection}.
Anything else means we crashed a caller.
…ycles

Mirrors the regression pattern from PR elixir-grpc#509, which added a 100-cycle test
to prove the persistent_term leak fix. Same idea, applied to the tables
the refactor introduced: the shared Registry and the per-LB ETS tables.

Snapshots :ets.all() and the Registry size before the loop, cycles 500
connect+disconnect pairs, asserts the Registry returns to its starting
size and no more than a handful of new tables exist (VM may create a
few incidentally during the run).
…ends

Derived from research into long-lived gRPC clients (GCP, Temporal,
LiveKit, our own unleash_fresha) — they all hit rapid DNS churn where
the backend count drops between picks. The risk is that a concurrent
RoundRobin pick indexes into a tuple that update/2 just replaced with
a smaller one.

Starts with 8 backends, launches 30 processes doing 200 picks each,
triggers a DNS reconcile down to 2 backends mid-flight. Every pick
must return {:ok, %Channel{}} — no :badarg from elem/2, no stale
index arithmetic, no crash. After the reconcile settles, all picks
must come from the two survivors only.
The earlier refactor marked update/2 as @optional_callbacks and added a
reconcile_lb fallback that called init/1 when the LB skipped update/2.
The fallback was broken: init/1 creates a fresh ETS table and the
previous lb_state's table orphans, leaking one table per DNS reconcile.
For our own LBs the fallback was dead code (both implement update/2),
but the trap was real for any third party writing a custom LB.

update/2 is the only way to reconcile a stateful LB without leaking —
"optional" was the wrong signal. Make it required so the compiler warns
anyone who skips it, delete the fallback, let reconcile_lb be the single
line it should have been.

shutdown/1 stays optional — strategies with no external resources can
genuinely omit it.
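
The resulting behaviour, sketched (callback specs paraphrased from this PR's description, not copied from the source):

```elixir
defmodule LoadBalancingSketch do
  @type state :: term()

  @callback init(keyword()) :: {:ok, state()}
  @callback pick(state()) :: {:ok, GRPC.Channel.t(), state()} | {:error, atom()}
  @callback update(state(), [GRPC.Channel.t()]) :: {:ok, state()}
  @callback shutdown(state()) :: :ok

  # update/2 is deliberately not optional: skipping it forces a re-init
  # fallback that orphans the previous lb_state's ETS table.
  @optional_callbacks shutdown: 1
end
```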
…on resolver worker crash

LB strategies' ETS tables are owned by the Connection GenServer (lb_mod.init
now runs inside init/1), so BEAM reclaims them automatically on exit — no
explicit shutdown callback needed. Merged the two connect paths through a
single finalize_connection/2 helper.

On resolver worker death, stop the Connection instead of re-initializing.
The old re-init path created a crash-loop when init itself kept failing
(bad config, unreachable control plane). Unrelated :EXIT signals are now
logged and ignored so stray links from user code can't take Connection down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rescope the catch-all :EXIT comment so the :normal clause's silent
ignore isn't surprising, and replace "reconciles" with "DNS
re-resolutions" in the PickFirst moduledoc for readers who aren't
already deep in the LB code path.
…connect

Three small fixes from review feedback on the ETS LB work:

- init/1: publish to the Registry last. init crashes don't run
  terminate/2, so registering before resolver.init/2 would strand the
  entry if resolver init raised.
- rebalance_after_reconcile/3: republish {lb_mod, lb_state} after
  lb_mod.update/2 so pick_channel/2 reads the fresh state returned by
  update/2, not the state captured at init.
- handle_call({:disconnect, ...}): linearized to Registry.delete →
  transport close → reply. Dropped the dead Map.drop branch; the struct
  always has :real_channels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cgreeno cgreeno force-pushed the feature/ets-lb-pick-path branch from 5d63ada to d1be507 on April 28, 2026 17:40
This was a bad call on my part. Stopping the Connection when the
resolver worker dies means a network blip or brief DNS hiccup tears
down every transport channel and reconnects from scratch. We can't
have network jitter force a restart.

The reasoning I had behind the original change ("let-it-crash, trust
the supervisor") doesn't hold up either:

- The supervisor restart uses the original initial_state snapshot.
  real_channels in there points to conn_pids from first connect, so
  the new Connection inherits stale references.
- terminate/2 only deletes the Registry entry; transports leak on
  abnormal exit.
- The "avoid crash-loop" justification was wrong. The old code had
  a fallback to resolver_state: nil — it degraded gracefully, it
  didn't crash-loop.

Restore the original in-place re-init.
@cgreeno cgreeno force-pushed the feature/ets-lb-pick-path branch from e63c20e to 648f80d on April 28, 2026 18:12
cgreeno and others added 4 commits April 30, 2026 13:15
Replace the shared ETS Registry with :persistent_term keyed by
{Connection, ref} and move the RoundRobin cursor from an ETS counter
row to :atomics. Three problems collapse into one design:

1. Hot-path counter contention. :ets.update_counter/3 locks the key's
   hash bucket, so 32 concurrent picks bottleneck on a single counter
   (~10K ips in the bench). :atomics.add_get/3 is a hardware CAS — no
   table-level or key-level lock — and scales linearly across
   schedulers (sketched after this list).

2. Two ETS tables where one is enough. The shared registry table is
   gone; the per-Connection LB table now holds only the channels
   tuple. Lookup of {lb_mod, lb_state} is a :persistent_term.get,
   which is faster than an :ets.lookup and allocation-free.

3. Steady-state global GC. The published lb_state contains a stable
   tid + atomics ref for the life of the connection. Reconcile
   mutates ETS + atomics in place and does NOT republish, so
   :persistent_term.put fires once per connect — not every 30s — and
   the global GC pass it triggers stays off the hot loop.
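
Point 1 as a sketch (binding names illustrative):

```elixir
cursor = :atomics.new(1, signed: false)   # one unsigned 64-bit cell
channels = {:a, :b, :c}

n = :atomics.add_get(cursor, 1, 1)        # hardware CAS, no lock to contend
channel = elem(channels, rem(n - 1, tuple_size(channels)))
```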

Lifecycle:

  init/1            put once, after lb_mod.init + resolver.init succeed
  pick_channel/2    get on every RPC
  disconnect/1      erase first (in-flight picks fail fast), then close
  terminate/2       erase (idempotent — safe even if disconnect erased)

The terminate/2 erase covers the abnormal-exit path; the only leak
window is brutal kill, which already existed with Registry.delete and
is not regressed by this change.

The failure modes the previous PR introduced are all preserved:

  - rescue ArgumentError in lb_mod.pick — table reclaimed mid-pick
  - persistent_term.get/2 fallback to nil — entry erased mid-pick
  - erase BEFORE closing transports — picks fail with :no_connection
    rather than racing a dying adapter

Tests: 286/286 pass (was 292; six Registry tests removed). The
500-cycle leak regression now counts {Connection, _}-keyed
persistent_term entries instead of registry-table size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolver code from elixir-grpc#520 ships with essentially zero inline
comments and tight one-line @moduledocs. Match that here:

- Slash multi-paragraph @moduledoc blocks down to one-liners on the
  LB behaviour, RoundRobin, PickFirst, and the LB test moduledocs.
- Remove the multi-line WHY comments in Connection (init publish
  ordering, disconnect erase ordering, reconcile in-place mutation).
- Remove enumerate-every-test moduledoc from RoundRobin tests.
- Remove the "guards the race where..." race-test comments — the
  test name already says what it does, and the rescue is in the LB
  code.
- Remove the connection_test commentary on the already_started path,
  the LB ETS lifecycle setup, the concurrent-pick rationale, and the
  ETS-table-slop justification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cgreeno cgreeno marked this pull request as ready for review May 5, 2026 16:07

cgreeno commented May 5, 2026

@sleipnir I got a friend to review it first to try and save you some time. Happy to go in whatever direction you want however. :)


sleipnir commented May 5, 2026

> @sleipnir I got a friend to review it first to try and save you some time. Happy to go in whatever direction you want however. :)

Ok, let me know when you are done!


cgreeno commented May 6, 2026

@sleipnir - Done - he reviewed it earlier. :)


sleipnir commented May 6, 2026

> @sleipnir - Done - he reviewed it earlier. :)

Okay, now I'm going to review it.

…-path

# Conflicts:
#	grpc/test/grpc/client/connection_test.exs

cgreeno commented May 13, 2026

@sleipnir - how is it looking? No Pressure ;)

cgreeno added 2 commits May 13, 2026 20:01
Test leaked a Connection + DNSResolver pair with resolve_interval: 200ms.
After the test exited, the leaked resolver kept calling MockResolver.resolve/1,
which Mox rejected (UnexpectedCallError). Connection.handle_info re-init'd the
resolver, MockResolver.init/2 also raised, Connection crashed, DynamicSupervisor
restarted it, max_restarts tripped, and the supervisor shut down — breaking
later tests that try to start_child (notably the new 500-cycle leak test).
…-path

# Conflicts:
#	grpc/test/grpc/client/connection_test.exs
