
feat: ETS-backed LB with per-request pick, remove persistent_term writes on reconcile #527

Open

cgreeno wants to merge 25 commits into elixir-grpc:master from cgreeno:feature/ets-lb-pick-path

Conversation

@cgreeno cgreeno commented Apr 24, 2026

Summary

Follow-up to #520, putting the LB module in control of the pick decision.

pick_channel/2 now goes through lb_mod.pick(lb_state) on every RPC. The hot path is fully lock-free:

pick_channel/2
  → :persistent_term.get({Connection, ref})    # constant-time lookup
  → lb_mod.pick(lb_state)                       # :atomics.add_get + elem (RR) or :ets.lookup (PF)
  → {:ok, channel}

No GenServer call on the hot path. No :persistent_term.put on reconcile (and so no global GC pass when backends scale in/out).

The shape generalizes: weighted RR, least-requests, health-aware routing, and xDS all need richer per-backend state than a single atom — "LB owns its ETS table + atomics" carries them without forcing each strategy to invent state plumbing.
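
A minimal sketch of that dispatch, assuming the shapes above (module and function bodies are illustrative, not the PR's code; only the `:persistent_term` call is the real API):

```elixir
defmodule PickPathSketch do
  # Hypothetical wrapper: shows the lookup-then-dispatch shape only.
  def pick_channel(ref, _opts) do
    case :persistent_term.get({Connection, ref}, nil) do
      nil ->
        # entry already erased by disconnect/1: fail fast, don't crash
        {:error, :no_connection}

      {lb_mod, lb_state} ->
        # pick/1 returns {:ok, Channel.t(), new_state} per the behaviour
        with {:ok, channel, _state} <- lb_mod.pick(lb_state), do: {:ok, channel}
    end
  end
end
```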

Architecture

| Where | What it holds | Lifecycle |
| --- | --- | --- |
| `:persistent_term` keyed `{Connection, ref}` | `{lb_mod, lb_state}` | put once on `init/1`, erase on `disconnect/1` + `terminate/2` |
| Per-Connection ETS table (one per Connection, owned by the GenServer) | `{:channels, channels_tuple}` | replaced wholesale on reconcile |
| `:atomics` ref (in `lb_state`) | RR cursor (unsigned 64-bit) | reset to 0 on reconcile |

The lb_state published to persistent_term (the tid + atomics ref) is stable for the life of the connection: reconcile mutates ETS and atomics in place rather than republishing, so :persistent_term.put fires once per connect, not every 30s. That keeps the global GC pass that put triggers off the steady-state path.
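
A minimal sketch of that in-place reconcile, assuming lb_state is `%{tid: tid, cursor: cursor}` (field names are illustrative):

```elixir
defmodule ReconcileSketch do
  # Hypothetical module; mirrors the write path described above.
  def update(%{tid: tid, cursor: cursor} = state, connected_channels) do
    # Replace the channels tuple wholesale; tid and cursor never change,
    # so the {lb_mod, lb_state} already in :persistent_term stays valid.
    :ets.insert(tid, {:channels, List.to_tuple(connected_channels)})
    # Reset the RR cursor so subsequent picks index within the new tuple.
    :atomics.put(cursor, 1, 0)
    {:ok, state}
  end
end
```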

Micro-benchmark (Apple M1 Max)

TL;DR: the pick path adds ~290 ns over the pre-PR cached read (52 ns → 345 ns) in exchange for a real per-request RR rotation across N channels. The reconciliation write path replaces :persistent_term.put with :ets.insert + :atomics.put, eliminating the global GC pass that previously fired every 30s per Connection.

All comparisons are vs the pre-PR baseline: a single cached :persistent_term.get of one channel, no rotation — what shipped before #520.

Single-process pick

| Variant | Throughput | Latency | Notes |
| --- | --- | --- | --- |
| Pre-PR baseline (`:persistent_term.get`, cached) | 19.08M ips | 52 ns | no rotation, single channel |
| NEW pick (persistent_term + ets + atomics) | 2.90M ips | 345 ns | full per-request RR rotation |

32-process concurrent pick

| Variant | Total throughput | Notes |
| --- | --- | --- |
| Pre-PR baseline (`:persistent_term.get`) | ~35M picks/sec | reads scale linearly |
| NEW pick | ~6.0M picks/sec | atomics counter; limited by cache-line ping-pong, not a software lock |

The ~6× delta is the cost of moving from a pure read to a real RR rotation. Critically, :atomics.add_get/3 is a hardware CAS with no key-level or table-level lock, so the contention story is just cache coherence rather than a serialization point. ~6M picks/sec at 32 procs is well above any realistic per-Connection RPC volume.

Single-process reconciliation write

| Variant | Throughput | Latency | Notes |
| --- | --- | --- | --- |
| Pre-PR baseline (`:persistent_term.put`) | 5.37M ips | 186 ns | plus a global GC pass across every BEAM process |
| NEW reconcile (`:ets.insert` + `:atomics.put`) | 1.90M ips | 527 ns | local, no global side effects |

The headline persistent_term.put latency hides the real cost. Per the BEAM docs, every write forces a global GC pass — every process in the node has to scan its heap. With periodic DNS re-resolution this fires every 30s per Connection; the new design pays ~340 ns more per reconcile and zero GC.
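
To reproduce the flavor of this comparison, a minimal Benchee sketch of the two write paths (not the PR's bench/lb_pick.exs; table and key names are illustrative):

```elixir
# Requires {:benchee, "~> 1.0"} as a dependency.
tid = :ets.new(:lb_bench, [:set, :public])
cursor = :atomics.new(1, signed: false)
channels = List.to_tuple(Enum.map(1..8, &{:channel, &1}))

Benchee.run(%{
  "persistent_term.put (triggers global GC)" => fn ->
    :persistent_term.put({:lb_bench, :state}, channels)
  end,
  "ets.insert + atomics.put (local writes)" => fn ->
    :ets.insert(tid, {:channels, channels})
    :atomics.put(cursor, 1, 0)
  end
})
```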

Notable details

  • RoundRobin.pick/1 rescues ArgumentError: disconnect/1 may delete the LB ETS table mid-pick, and we want a tagged error rather than a propagated crash.
  • :persistent_term.put runs last in init/1: crashes in init/1 skip terminate/2, so publishing earlier would leak entries on a failing resolver.init/2.
  • disconnect/1 calls :persistent_term.erase before closing transports, so in-flight picks fail fast with :no_connection rather than racing a dying adapter (see the sketch after this list).
  • terminate/2 also erases (idempotent on the disconnect path — :persistent_term.erase returns false on a missing key without raising). Covers crash exits and supervisor stops.
  • rebalance_after_reconcile/3 does not republish to persistent_term — the published lb_state shape (tid + atomics ref) is stable for the life of the connection, so update/2's in-place mutation is visible to picks immediately.
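
A sketch of the teardown ordering from the disconnect/terminate bullets above (close_transports/1 is a hypothetical stand-in for the adapter shutdown):

```elixir
defmodule TeardownSketch do
  def disconnect(ref, state) do
    # Erase first: in-flight pick_channel/2 calls now hit the nil default
    # and return :no_connection instead of racing a dying transport.
    :persistent_term.erase({Connection, ref})
    close_transports(state)
  end

  def terminate(_reason, %{ref: ref}) do
    # Idempotent: erase/1 returns false on a missing key without raising,
    # so running after disconnect has already erased is safe.
    :persistent_term.erase({Connection, ref})
  end

  defp close_transports(_state), do: :ok
end
```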

Test coverage

  • 19 LB unit tests (PickFirst, RoundRobin including 16-process concurrency stress).
  • disconnect/1 and terminate/2 both free the LB's ETS table.
  • 50 concurrent pickers × 100 picks vs disconnect/1 — no crashes.
  • 30 concurrent pickers × 200 picks during a shrinking reconcile — no crashes; survivors-only after settle.
  • 500 connect/disconnect cycles — zero {Connection, _}-keyed persistent_term leak, zero ETS table leak (mirrors the regression fixed in #509, "erase persistent_term leak in GRPC.Client.Connection on disconnect").

Full suite: 286 tests, 0 failures.

cgreeno and others added 17 commits April 28, 2026 13:50
…tes on reconcile

Moves the client load-balancer pick onto the RPC hot path via an ETS-backed
RoundRobin strategy, so gRPC clients now round-robin across backends per
request instead of caching one pick for 15 seconds.

Behaviour changes (GRPC.Client.LoadBalancing):
  - init/1 now takes :channels (full Channel structs), not :addresses
  - pick/1 returns {:ok, Channel.t(), new_state} directly — no second lookup
  - adds optional update/2 for in-place reconciliation and shutdown/1 for cleanup
  - @optional_callbacks keeps simple strategies compliant without stubs

RoundRobin rewrite:
  - ETS table (:set, :public, read_concurrency: true) owned by the Connection
  - pick: :ets.update_counter/3 + :ets.lookup + elem/2 — lock-free, no GenServer (sketched after this list)
  - update: mutates channels tuple in place, no table churn across reconciles
  - shutdown: :ets.delete, idempotent
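
A sketch of that pick (row shapes {:cursor, n} and {:channels, tuple} are illustrative; a later commit in this PR swaps the counter row for :atomics):

```elixir
defmodule EtsPickSketch do
  def pick(%{tid: tid} = state) do
    n = :ets.update_counter(tid, :cursor, 1)   # atomic per-key increment
    [{:channels, channels}] = :ets.lookup(tid, :channels)
    {:ok, elem(channels, rem(n, tuple_size(channels))), state}
  end
end
```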

PickFirst: trivially updated to the new channels-shaped signature.

Connection rewire:
  - pick_channel/2 reads persistent_term once, dispatches to lb_mod.pick per request
  - persistent_term is written exactly once (init) and erased once (disconnect) —
    no more global GC pass across every BEAM process on DNS re-resolution
  - :refresh timer removed (obsolete: per-request pick rotates automatically)
  - build_balanced_state passes connected Channels to lb_mod.init
  - build_direct_state uses PickFirst for uniformity (single persistent_term shape)
  - reconcile calls lb_mod.update(lb_state, connected_channels) — update/2 allows
    empty lists so pick returns :no_channels when all backends fail (LB's ETS is
    now the source of truth; persistent_term stays stable)

Architecture follows sleipnir's review of elixir-grpc#513:
  - Connection stays as orchestrator, LB owns pick state via opaque tid
  - No new processes — ETS table lifecycle piggybacks on Connection GenServer
  - LB module is the only thing that touches the ETS table

Tests: 275 total (was 266), 0 failures across 3 seeds.
  - 11 new RoundRobin unit tests (init/pick/update/shutdown/concurrency)
  - 8 new PickFirst unit tests
  - 1 new integration test proving per-request rotation end-to-end

Micro-benchmark (bench/lb_pick.exs, Apple M1 Max):

  Single-process pick:
    persistent_term.get   13.34M ips   75 ns    (cached pick, no rotation)
    ETS RoundRobin.pick    3.42M ips  293 ns    (per-request rotation)
    GenServer.call pick    0.98M ips 1020 ns    (serialized via mailbox)

  32-process concurrent pick:
    persistent_term.get     6.6M ips  0.15 μs   (read-scales linearly)
    GenServer.call pick    26.6K ips   37 μs    (mailbox-bound)
    ETS RoundRobin.pick     9.9K ips  101 μs    (counter contention)

  Single-process reconciliation write:
    persistent_term.put    5.95M ips  168 ns    (+ global GC pass, not measured)
    ETS RoundRobin.update  1.51M ips  663 ns    (local, no side effects)

The comparison is not apples-to-apples: persistent_term.get returns the same
cached pick every call (no load distribution), ETS does per-request rotation.
At realistic gRPC traffic rates (1K-10K picks/sec bounded by network latency),
neither the 293ns pick cost nor the counter contention is observable — RPC
latency dominates by 4+ orders of magnitude.

Follow-up to elixir-grpc#520. Can drop the benchee dep + bench script if desired.
The benchmark served its purpose: informing the design choice in the previous
commit. Keeping benchee as a first-party dep for one script isn't justified,
and the numbers belong in the PR description anyway (preserved in c3c3ebe's
commit message for history).

No code or test changes. Dep tree is back to the pre-LB state.
If a caller reads {lb_mod, lb_state} from persistent_term and a concurrent
disconnect runs lb_mod.shutdown before the caller reaches RoundRobin.pick/1,
both :ets.lookup and :ets.update_counter raise ArgumentError on the deleted
table — crashing the calling RPC process rather than surfacing a tagged error.

The pre-refactor code couldn't hit this path (persistent_term.get with a
default of nil always returned safely), so this is a regression introduced
by moving the pick onto the hot path.

Rescue ArgumentError and map it to {:error, :no_channels}, matching the
empty-table branch. Adds a unit test that shuts the table down and then
picks, which reproduced the crash before the rescue.
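
The fix, as a sketch (same illustrative row shapes as the ETS pick above):

```elixir
defmodule RescuedPickSketch do
  def pick(%{tid: tid} = state) do
    n = :ets.update_counter(tid, :cursor, 1)
    [{:channels, channels}] = :ets.lookup(tid, :channels)
    {:ok, elem(channels, rem(n, tuple_size(channels))), state}
  rescue
    # disconnect/1 deleted the table mid-pick: return a tagged error
    # matching the empty-table branch instead of crashing the RPC caller.
    ArgumentError -> {:error, :no_channels}
  end
end
```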
The LB owns its ETS table; Connection.shutdown_lb/2 is the only place that
calls lb_mod.shutdown/1. Nothing in the suite currently breaks if that call
is removed, so a refactor could silently leak an ETS table per gracefully-
closed connection (and another per supervised-shutdown crash).

Two regression tests:

  * disconnect/1 frees the table — captures the tid via :sys.get_state before
    disconnecting, asserts :ets.info returns :undefined afterwards.
  * terminate/2 frees the table — same shape, but stops the GenServer via
    GenServer.stop to exercise the crash path.

Both rely on :round_robin being the only policy with an ETS table today;
if PickFirst grows one, these tests still pass as-is.
The previous commit that moved the pick onto the hot path removed the 15s
:refresh timer, but two pieces of module docs still described it:

  * "The orchestrator periodically refreshes the LB decision to adapt to
    changes." — replaced with a one-liner describing per-request dispatch.
  * "The orchestrator refreshes the LB pick every 15 seconds." — removed
    entirely along with its standalone Notes section.

Docs now match behaviour.
…tarted test

This test erases the persistent_term entry, confirms Connection.connect
returns :no_connection on the already_started branch, then puts an entry
back so the subsequent disconnect completes cleanly.

It used to restore a bare Channel struct, which predates the ETS refactor.
The current shape is {lb_mod, lb_state}. The test was passing by accident
because disconnect only erases the key (shape-agnostic) — but a future
refactor that actually reads the restored entry would trip on the mismatch.

Capture the real entry before erasing and restore it verbatim.
The persistent_term entry stores {lb_mod, lb_state} written once at init and
never rewritten during the connection's life (that's the whole point of the
ETS refactor). This works as long as lb_mod.update/2 mutates external state
and returns a state value the caller already holds — which is how RoundRobin
behaves (the tid is stable).

PickFirst used to return a NEW state map from update/2:

    def update(_state, [first | _]), do: {:ok, %{current: first}}

The Connection GenServer's own state got the new map, but persistent_term
kept the original %{current: first_backend_at_connect_time}. So
pick_channel silently kept returning the dead backend forever after any
DNS change. Not covered by tests because every re_resolve_test scenario
passes lb_policy: :round_robin.

Fix: give PickFirst the same ETS-backed state shape as RoundRobin. init
creates a one-row table {:current, channel}, update mutates it in place,
pick reads it. State is %{tid: tid} — stable across update/2.
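
A sketch of that shape (hypothetical module; not the PR's exact PickFirst):

```elixir
defmodule PickFirstSketch do
  def init(channels: [first | _]) do
    tid = :ets.new(:pick_first, [:set, :public, read_concurrency: true])
    :ets.insert(tid, {:current, first})
    {:ok, %{tid: tid}}
  end

  def update(%{tid: tid} = state, [first | _]) do
    # In place: tid never changes, so persistent_term's copy stays valid.
    :ets.insert(tid, {:current, first})
    {:ok, state}
  end

  def pick(%{tid: tid} = state) do
    [{:current, channel}] = :ets.lookup(tid, :current)
    {:ok, channel, state}
  end
end
```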

Tests:
  * Rewrites pick_first_test.exs against the new shape (11 tests).
  * Adds a re-resolve integration test using the default LB (PickFirst):
    connect with one backend, swap the DNS answer, assert pick_channel
    returns the new backend. Reproduced the bug before this fix.
Follow-up on @sleipnir's comment: "we can replace them with ETS, but we
can't forget that the responsibility for the pick lies with the load
balancing module."
elixir-grpc#520 (comment)

His sample used ETS keyed by ref with no persistent_term. My earlier
commit kept persistent_term as a write-once pointer — worked fine, but
wasn't what he showed. This matches the sample.

New module: GRPC.Client.LoadBalancing.Registry. Small GenServer
supervised by the client Application. Owns a named public ETS table
(:grpc_client_lb_registry). Three functions — put/2, lookup/1, delete/1.
Lookup rescues ArgumentError so a racing :ets.delete on shutdown gives
:error instead of crashing the RPC.
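
A sketch of the Registry's shape under that description (this intermediate design is replaced by :persistent_term in a follow-up commit below; bodies are illustrative):

```elixir
defmodule RegistrySketch do
  use GenServer

  @table :grpc_client_lb_registry

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # named public table owned by this GenServer
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    {:ok, %{}}
  end

  def put(ref, entry), do: :ets.insert(@table, {ref, entry})

  def lookup(ref) do
    case :ets.lookup(@table, ref) do
      [{^ref, entry}] -> {:ok, entry}
      [] -> :error
    end
  rescue
    # a racing table delete on shutdown must yield :error, not crash the RPC
    ArgumentError -> :error
  end

  def delete(ref), do: :ets.delete(@table, ref)
end
```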

Connection now:
  * Registry.put on init (ref -> {lb_mod, lb_state})
  * Registry.lookup in pick_channel
  * Registry.delete in disconnect and terminate

No persistent_term left in the LB path. LB behaviour unchanged, per-LB
ETS tables stay as they were. 282 tests pass. connection_test.exs pokes
Registry directly where it used to poke persistent_term.

BTW — also considered putting {lb_mod, lb_state} straight on the Channel
struct. Drops both persistent_term and the registry, same runtime
behaviour, one fewer indirection on every pick. Flagged in the PR for
@sleipnir — happy to swap if that reads cleaner to him.
The Registry is only exercised through Connection today — if someone
regresses lookup/1 or delete/1 in isolation, no test fails. Add direct
round-trip tests for the public API (arbitrary key shapes, overwrite,
missing key returns :error, delete on unknown key is safe).

The ArgumentError rescue in lookup/1 is covered end-to-end by the
concurrent pick+disconnect test — not duplicated here.
Concurrent RPCs must not crash when disconnect is tearing down LB state.
This is the scenario the ArgumentError rescues in Registry.lookup/1 and
the LB modules' pick/1 exist to handle — but nothing actually proved it
end-to-end.

Spawns 50 processes doing 100 picks each, fires disconnect mid-flight.
Every result must be either {:ok, %Channel{}} or {:error, :no_connection}.
Anything else means we crashed a caller.
…ycles

Mirrors the regression pattern from PR elixir-grpc#509, which added a 100-cycle test
to prove the persistent_term leak fix. Same idea, applied to the tables
the refactor introduced: the shared Registry and the per-LB ETS tables.

Snapshots :ets.all() and the Registry size before the loop, cycles 500
connect+disconnect pairs, asserts the Registry returns to its starting
size and no more than a handful of new tables exist (VM may create a
few incidentally during the run).
…ends

Derived from research into long-lived gRPC clients (GCP, Temporal,
LiveKit, our own unleash_fresha) — they all hit rapid DNS churn where
the backend count drops between picks. The risk is that a concurrent
RoundRobin pick indexes into a tuple that update/2 just replaced with
a smaller one.

Starts with 8 backends, launches 30 processes doing 200 picks each,
triggers a DNS reconcile down to 2 backends mid-flight. Every pick
must return {:ok, %Channel{}} — no :badarg from elem/2, no stale
index arithmetic, no crash. After the reconcile settles, all picks
must come from the two survivors only.
The earlier refactor marked update/2 as @optional_callbacks and added a
reconcile_lb fallback that called init/1 when the LB skipped update/2.
The fallback was broken: init/1 creates a fresh ETS table and the
previous lb_state's table orphans, leaking one table per DNS reconcile.
For our own LBs the fallback was dead code (both implement update/2),
but the trap was real for any third party writing a custom LB.

update/2 is the only way to reconcile a stateful LB without leaking —
"optional" was the wrong signal. Make it required so the compiler warns
anyone who skips it, delete the fallback, let reconcile_lb be the single
line it should have been.

shutdown/1 stays optional — strategies with no external resources can
genuinely omit it.
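
The resulting behaviour, sketched (callback specs paraphrased from this PR's description, not copied from the source):

```elixir
defmodule LoadBalancingSketch do
  @type state :: term()

  @callback init(keyword()) :: {:ok, state()}
  @callback pick(state()) :: {:ok, GRPC.Channel.t(), state()} | {:error, atom()}
  @callback update(state(), [GRPC.Channel.t()]) :: {:ok, state()}
  @callback shutdown(state()) :: :ok

  # update/2 is deliberately not optional: skipping it forces a re-init
  # fallback that orphans the previous lb_state's ETS table.
  @optional_callbacks shutdown: 1
end
```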
…on resolver worker crash

LB strategies' ETS tables are owned by the Connection GenServer (lb_mod.init
now runs inside init/1), so BEAM reclaims them automatically on exit — no
explicit shutdown callback needed. Merged the two connect paths through a
single finalize_connection/2 helper.

On resolver worker death, stop the Connection instead of re-initializing.
The old re-init path created a crash-loop when init itself kept failing
(bad config, unreachable control plane). Unrelated :EXIT signals are now
logged and ignored so stray links from user code can't take Connection down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rescope the catch-all :EXIT comment so the :normal clause's silent
ignore isn't surprising, and replace "reconciles" with "DNS
re-resolutions" in the PickFirst moduledoc for readers who aren't
already deep in the LB code path.
…connect

Three small fixes from review feedback on the ETS LB work:

- init/1: publish to the Registry last. init crashes don't run
  terminate/2, so registering before resolver.init/2 would strand the
  entry if resolver init raised.
- rebalance_after_reconcile/3: republish {lb_mod, lb_state} after
  lb_mod.update/2 so pick_channel/2 reads the fresh state returned by
  update/2, not the state captured at init.
- handle_call({:disconnect, ...}): linearized to Registry.delete →
  transport close → reply. Dropped the dead Map.drop branch; the struct
  always has :real_channels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cgreeno cgreeno force-pushed the feature/ets-lb-pick-path branch from 5d63ada to d1be507 on April 28, 2026 17:40
This was a bad call on my part. Stopping the Connection when the
resolver worker dies means a network blip or brief DNS hiccup tears
down every transport channel and reconnects from scratch. We can't
have network jitter force a restart.

The reasoning I had behind the original change ("let-it-crash, trust
the supervisor") doesn't hold up either:

- The supervisor restart uses the original initial_state snapshot.
  real_channels in there points to conn_pids from first connect, so
  the new Connection inherits stale references.
- terminate/2 only deletes the Registry entry; transports leak on
  abnormal exit.
- The "avoid crash-loop" justification was wrong. The old code had
  a fallback to resolver_state: nil — it degraded gracefully, it
  didn't crash-loop.

Restore the original in-place re-init.
@cgreeno cgreeno force-pushed the feature/ets-lb-pick-path branch from e63c20e to 648f80d on April 28, 2026 18:12
cgreeno and others added 4 commits April 30, 2026 13:15
Replace the shared ETS Registry with :persistent_term keyed by
{Connection, ref} and move the RoundRobin cursor from an ETS counter
row to :atomics. Three problems collapse into one design:

1. Hot-path counter contention. :ets.update_counter/3 locks the key's
   hash bucket, so 32 concurrent picks bottleneck on a single counter
   (~10K ips in the bench). :atomics.add_get/3 is a hardware CAS — no
   table-level or key-level lock — and scales linearly across
   schedulers (sketched after this list).

2. Two ETS tables where one is enough. The shared registry table is
   gone; the per-Connection LB table now holds only the channels
   tuple. Lookup of {lb_mod, lb_state} is a :persistent_term.get,
   which is faster than an :ets.lookup and allocation-free.

3. Steady-state global GC. The published lb_state contains a stable
   tid + atomics ref for the life of the connection. Reconcile
   mutates ETS + atomics in place and does NOT republish, so
   :persistent_term.put fires once per connect — not every 30s — and
   the global GC pass it triggers stays off the hot loop.
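
Point 1 as a sketch (binding names illustrative):

```elixir
cursor = :atomics.new(1, signed: false)   # one unsigned 64-bit cell
channels = {:a, :b, :c}

n = :atomics.add_get(cursor, 1, 1)        # hardware CAS, no lock to contend
channel = elem(channels, rem(n - 1, tuple_size(channels)))
```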

Lifecycle:

  init/1            put once, after lb_mod.init + resolver.init succeed
  pick_channel/2    get on every RPC
  disconnect/1      erase first (in-flight picks fail fast), then close
  terminate/2       erase (idempotent — safe even if disconnect erased)

The terminate/2 erase covers the abnormal-exit path; the only leak
window is brutal kill, which already existed with Registry.delete and
is not regressed by this change.

The failure modes the previous PR introduced are all preserved:

  - rescue ArgumentError in lb_mod.pick — table reclaimed mid-pick
  - persistent_term.get/2 fallback to nil — entry erased mid-pick
  - erase BEFORE closing transports — picks fail with :no_connection
    rather than racing a dying adapter

Tests: 286/286 pass (was 292; six Registry tests removed). The
500-cycle leak regression now counts {Connection, _}-keyed
persistent_term entries instead of registry-table size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolver code from elixir-grpc#520 ships with essentially zero inline
comments and tight one-line @moduledocs. Match that here:

- Slash multi-paragraph @moduledoc blocks down to one-liners on the
  LB behaviour, RoundRobin, PickFirst, and the LB test moduledocs.
- Remove the multi-line WHY comments in Connection (init publish
  ordering, disconnect erase ordering, reconcile in-place mutation).
- Remove enumerate-every-test moduledoc from RoundRobin tests.
- Remove the "guards the race where..." race-test comments — the
  test name already says what it does, and the rescue is in the LB
  code.
- Remove the connection_test commentary on the already_started path,
  the LB ETS lifecycle setup, the concurrent-pick rationale, and the
  ETS-table-slop justification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cgreeno cgreeno marked this pull request as ready for review May 5, 2026 16:07

cgreeno commented May 5, 2026

@sleipnir I got a friend to review it first to try and save you some time. Happy to go in whatever direction you want however. :)


sleipnir commented May 5, 2026

> @sleipnir I got a friend to review it first to try and save you some time. Happy to go in whatever direction you want however. :)

Ok, let me know when you are done!


cgreeno commented May 6, 2026

@sleipnir - Done - he reviewed it earlier. :)


sleipnir commented May 6, 2026

> @sleipnir - Done - he reviewed it earlier. :)

Okay, now I'm going to review it.

…-path

# Conflicts:
#	grpc/test/grpc/client/connection_test.exs

cgreeno commented May 13, 2026

@sleipnir - how is it looking? No Pressure ;)

cgreeno added 2 commits May 13, 2026 20:01
Test leaked a Connection + DNSResolver pair with resolve_interval: 200ms.
After the test exited, the leaked resolver kept calling MockResolver.resolve/1,
which Mox rejected (UnexpectedCallError). Connection.handle_info re-init'd the
resolver, MockResolver.init/2 also raised, Connection crashed, DynamicSupervisor
restarted it, max_restarts tripped, and the supervisor shut down — breaking
later tests that try to start_child (notably the new 500-cycle leak test).
…-path

# Conflicts:
#	grpc/test/grpc/client/connection_test.exs
