lightningdevkit
diff --git a/fuzz/FC-INFO.md b/fuzz/FC-INFO.md
@@ -0,0 +1,287 @@
+# Force-Close Fuzzing Notes
+
+This document captures the practical lessons from stabilizing force-close coverage in
+`chanmon_consistency`.
+
+It is aimed at future changes to the harness, not at explaining force-close in Lightning
+generally.
+
+## Goal
+
+The goals of force-close fuzzing here are:
+
+- exercise realistic off-chain to on-chain transitions
+- keep races possible, but exceptional
+- ensure the harness actually drives the system to resolution
+- avoid spending most corpus coverage on harness-created nonsense
+
+The important distinction is:
+
+- a protocol race is useful
+- a harness starvation artifact is mostly noise
+
+## What Realism Means Here
+
+For this harness, realistic means:
+
+- queued messages get a chance to move before huge chain jumps
+- monitor updates complete in plausible places
+- payment claims propagate before we declare everything timed out
+- on-chain resolution follows the actual balances and broadcasts that exist
+
+The pattern that proved unrealistic was:
+
+- queue HTLC state changes
+- skip delivery for many blocks
+- jump height by 50, 100, or 200 in one shot
+- then observe a timeout force-close everywhere
+
+That created lots of fake claim-vs-timeout races.
+
+## Main Harness Rules That Worked
+
+### 1. Drain progress around height advancement
+
+Large height jumps now advance one block at a time, with bounded progress draining before and
+after each block.
+
+Why:
+
+- this gives queued off-chain state a chance to settle
+- it keeps timeout races possible, but not dominant
+- it is the smallest realism improvement that materially helped
+
+The key helper is `advance_chain_carefully!` in
+[chanmon_consistency.rs](src/chanmon_consistency.rs).
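+
+A minimal sketch of the per-block pattern, for orientation only. It is not the actual
+`advance_chain_carefully!` macro: the `Harness` trait, its methods, and `MAX_DRAIN_ROUNDS` are
+hypothetical names introduced here.
+
+```rust
+/// Hypothetical stand-in for the harness state; not a real LDK or harness type.
+pub trait Harness {
+    /// Run one round of message delivery, monitor completion, and broadcast
+    /// handling. Returns true if anything made progress.
+    fn drain_once(&mut self) -> bool;
+    /// Connect exactly one block on every node.
+    fn connect_block(&mut self);
+}
+
+/// Assumed bound on drain rounds per step, so a livelock cannot hang the fuzzer.
+pub const MAX_DRAIN_ROUNDS: usize = 32;
+
+/// Drain until nothing makes progress or the bound is hit.
+pub fn drain_bounded<H: Harness>(h: &mut H) {
+    for _ in 0..MAX_DRAIN_ROUNDS {
+        if !h.drain_once() {
+            break;
+        }
+    }
+}
+
+/// Advance `blocks` blocks one at a time, draining before and after each block.
+pub fn advance_carefully<H: Harness>(h: &mut H, blocks: u32) {
+    for _ in 0..blocks {
+        drain_bounded(h);
+        h.connect_block();
+        drain_bounded(h);
+    }
+}
+```
+
+With this shape, a `+200` opcode becomes 200 single-block steps, and queued off-chain state gets
+a bounded chance to move at every height.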
+
+### 2. Stop early if nothing actionable remains
+
+Blindly honoring a `+50` or `+200` block jump wastes time and creates noise.
+
+We now stop early when there is no pending work:
+
+- no peer-message backlog
+- no broadcaster backlog
+- no pending monitor updates
+- no timed balances that still need height progress
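+
+The early-stop check above could look roughly like this. `PendingWork`, its field names, and
+`advance_until_idle` are illustrative assumptions, not the harness's real types.
+
+```rust
+/// Hypothetical snapshot of outstanding work; field names are illustrative.
+#[derive(Debug, Default, Clone, PartialEq, Eq)]
+pub struct PendingWork {
+    pub queued_peer_messages: usize,
+    pub unconfirmed_broadcasts: usize,
+    pub pending_monitor_updates: usize,
+    pub timed_balances: usize,
+}
+
+impl PendingWork {
+    /// True when none of the four work sources has anything left.
+    pub fn is_idle(&self) -> bool {
+        self.queued_peer_messages == 0
+            && self.unconfirmed_broadcasts == 0
+            && self.pending_monitor_updates == 0
+            && self.timed_balances == 0
+    }
+}
+
+/// Connect up to `requested` blocks, stopping once `snapshot` reports no work.
+/// Returns the number of blocks actually connected.
+pub fn advance_until_idle(
+    requested: u32,
+    mut snapshot: impl FnMut() -> PendingWork,
+    mut connect_block: impl FnMut(),
+) -> u32 {
+    let mut connected = 0;
+    while connected < requested && !snapshot().is_idle() {
+        connect_block();
+        connected += 1;
+    }
+    connected
+}
+```
+
+The snapshot is re-evaluated before every block, so a jump ends as soon as the last piece of
+work resolves instead of running out the requested count.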
+
+### 3. Distinguish timed work from passive confirmations
+
+`ClaimableAwaitingConfirmations` should not, by itself, keep the pre-advance drain alive.
+
+That balance means:
+
+- something already happened on-chain
+- we are mostly waiting for confirmations
+
+It is not the same kind of "active timed work" as:
+
+- `ContentiousClaimable`
+- `MaybeTimeoutClaimableHTLC`
+- `MaybePreimageClaimableHTLC`
+- `CounterpartyRevokedOutputClaimable`
+
+Treating `ClaimableAwaitingConfirmations` as active timed work caused settle churn.
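+
+One way to express that split is sketched below. `BalanceKind` is a field-less stand-in that only
+mirrors the variant names listed above; it is not the real LDK `Balance` type, and the function
+names are assumptions.
+
+```rust
+/// Stand-in mirroring the balance variant names discussed above.
+/// It carries no fields and is NOT the real LDK `Balance` enum.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum BalanceKind {
+    ClaimableAwaitingConfirmations,
+    ContentiousClaimable,
+    MaybeTimeoutClaimableHTLC,
+    MaybePreimageClaimableHTLC,
+    CounterpartyRevokedOutputClaimable,
+}
+
+/// True for balances that still need height progress or an active claim.
+pub fn is_active_timed_work(kind: BalanceKind) -> bool {
+    match kind {
+        // Already on-chain; only waiting for confirmations.
+        BalanceKind::ClaimableAwaitingConfirmations => false,
+        BalanceKind::ContentiousClaimable
+        | BalanceKind::MaybeTimeoutClaimableHTLC
+        | BalanceKind::MaybePreimageClaimableHTLC
+        | BalanceKind::CounterpartyRevokedOutputClaimable => true,
+    }
+}
+
+/// The pre-advance drain stays alive only while some balance is active timed work.
+pub fn keeps_drain_alive(balances: &[BalanceKind]) -> bool {
+    balances.iter().copied().any(is_active_timed_work)
+}
+```
+
+An exhaustive `match` is used on purpose: adding a variant forces an explicit decision about
+which side it belongs to.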
+
+### 4. Reconcile harness bookkeeping with live node state
+
+The harness tracks `pending_payments`, but after restarts the real source of truth is the
+`ChannelManager` state exposed by `list_recent_payments()`.
+
+Important lesson:
+
+- local caches in the harness can become stale across restart/replay flows
+- final invariants should reflect live node state, not only local bookkeeping
+
+The harness now reconciles cached pending payments against `list_recent_payments()` before final
+assertions.
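+
+The reconciliation step can be sketched as a set intersection. This assumes the live pending ids
+have already been collected from `list_recent_payments()`; `PaymentKey` and `reconcile_pending`
+are hypothetical names, and the real harness uses LDK's own id types.
+
+```rust
+use std::collections::HashSet;
+
+/// Hypothetical payment identifier used only for this sketch.
+pub type PaymentKey = [u8; 32];
+
+/// Drop cached entries the node no longer reports as pending.
+/// `live_pending` stands in for ids derived from `list_recent_payments()`.
+/// Returns the stale entries that were removed, sorted, for logging.
+pub fn reconcile_pending(
+    cached: &mut HashSet<PaymentKey>,
+    live_pending: &HashSet<PaymentKey>,
+) -> Vec<PaymentKey> {
+    let mut stale: Vec<PaymentKey> = cached.difference(live_pending).copied().collect();
+    stale.sort();
+    cached.retain(|id| live_pending.contains(id));
+    stale
+}
+```
+
+Returning the stale ids makes it easy to log what the cache got wrong after a restart, which
+helps when separating harness bugs from protocol bugs.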
+
+## Force-Close Specific Pitfalls
+
+### Announcement preference mismatch
+
+The trusted-peer open path can override `announce_for_forwarding`, while node defaults still prefer
+announced channels.
+
+If `force_announced_channel_preference` remains enabled, channels can fail to open before the
+fuzzer ever reaches force-close resolution.
+
+For this harness, keeping:
+
+- `force_announced_channel_preference = false`
+
+is necessary to preserve the intended force-close scenarios.
+
+### Startup replay ordering for closed channels
+
+On restart, a regenerated startup force-close update can coexist with older in-flight closed-channel
+monitor updates.
+
+Watch out for:
+
+- assuming startup-generated force-close updates always have the highest update id
+
+That assumption was wrong. The correct behavior is:
+
+- do not assert on strict ordering
+- if needed, renumber the regenerated update after the already-pending one
+- drop duplicate force-close updates when one already exists
+
+This was fixed in
+[channelmanager.rs](../lightning/src/ln/channelmanager.rs).
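+
+The three rules above can be illustrated with a simplified queue. This is not the actual
+`ChannelManager` code: `QueuedUpdate` and `queue_startup_force_close` are hypothetical, and real
+monitor updates carry far more state.
+
+```rust
+/// Hypothetical, simplified view of a queued closed-channel monitor update.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct QueuedUpdate {
+    pub update_id: u64,
+    pub is_force_close: bool,
+}
+
+/// Queue a regenerated startup force-close update.
+/// Returns false if it was dropped because a force-close update is already pending.
+pub fn queue_startup_force_close(pending: &mut Vec<QueuedUpdate>, regenerated_id: u64) -> bool {
+    // Rule: drop duplicates instead of queuing a second force-close update.
+    if pending.iter().any(|u| u.is_force_close) {
+        return false;
+    }
+    // Rule: no strict-ordering assertion. If an older in-flight update has an id
+    // at or above ours, renumber to land after it (overflow handling omitted).
+    let highest_pending = pending.iter().map(|u| u.update_id).max();
+    let update_id = match highest_pending {
+        Some(highest) if highest >= regenerated_id => highest + 1,
+        _ => regenerated_id,
+    };
+    pending.push(QueuedUpdate { update_id, is_force_close: true });
+    true
+}
+```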
+
+### Wallet state must track confirmed transactions
+
+The harness uses a wallet source for bumping and external-funding claims.
+
+If confirmed transactions update `ChainState` but not the wallet view, then:
+
+- spent inputs may still look available
+- change outputs may be missing
+- rebroadcasted claim transactions can become nonsense
+
+That caused repeated claim rebroadcasts and fake non-quiescence.
+
+The wallet must be kept in sync with:
+
+- confirmed spends
+- newly created change outputs
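+
+A minimal sketch of that synchronization, assuming a wallet modeled as a map of spendable
+outputs. `WalletView`, `ConfirmedTx`, and `OutPoint` here are simplified hypothetical types, not
+the harness's wallet source or the `bitcoin` crate's types.
+
+```rust
+use std::collections::HashMap;
+
+/// Hypothetical outpoint: (txid, output index).
+pub type OutPoint = ([u8; 32], u32);
+
+/// Hypothetical minimal view of a confirmed transaction.
+pub struct ConfirmedTx {
+    pub txid: [u8; 32],
+    pub inputs: Vec<OutPoint>,
+    /// (output index, value in sats) for outputs paying back to the wallet.
+    pub wallet_outputs: Vec<(u32, u64)>,
+}
+
+/// Hypothetical wallet view: spendable outputs and their values in sats.
+#[derive(Default)]
+pub struct WalletView {
+    pub utxos: HashMap<OutPoint, u64>,
+}
+
+impl WalletView {
+    /// Apply a confirmed transaction: remove spent inputs, add change outputs.
+    pub fn apply_confirmed(&mut self, tx: &ConfirmedTx) {
+        for input in &tx.inputs {
+            self.utxos.remove(input);
+        }
+        for &(vout, value) in &tx.wallet_outputs {
+            self.utxos.insert((tx.txid, vout), value);
+        }
+    }
+
+    /// Total value the wallet currently considers spendable.
+    pub fn spendable_sats(&self) -> u64 {
+        self.utxos.values().sum()
+    }
+}
+```
+
+Calling `apply_confirmed` from the same place that updates `ChainState` keeps the two views from
+drifting apart.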
+
+### Child-before-parent vs stale conflicting broadcasts
+
+Not every failed confirmation should be retried forever.
+
+Two distinct cases exist:
+
+1. Child-before-parent
+   - retrying later is correct
+
+2. Stale conflicting commitment or claim tx
+   - retrying forever is wrong
+
+The confirmation drain must:
+
+- retry deferred transactions while something in the batch still makes progress
+- only keep transactions that could become valid later
+- drop permanently invalid stale conflicts
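+
+The drain rules above can be sketched as a retry loop that runs only while a pass makes progress.
+`ConfirmOutcome` and `drain_confirmations` are hypothetical names, and how the real harness
+classifies a failed confirmation is not shown here.
+
+```rust
+/// Hypothetical outcome of trying to confirm one transaction.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum ConfirmOutcome {
+    /// The transaction was confirmed.
+    Confirmed,
+    /// Child-before-parent: may become valid once the parent confirms.
+    MissingParent,
+    /// Stale conflict: an input is already spent elsewhere; never valid.
+    Conflict,
+}
+
+/// Retry deferred transactions while a pass confirms something.
+/// Returns the transactions that are still deferred once progress stops.
+pub fn drain_confirmations<T>(
+    mut batch: Vec<T>,
+    mut try_confirm: impl FnMut(&T) -> ConfirmOutcome,
+) -> Vec<T> {
+    loop {
+        let mut progressed = false;
+        let mut deferred = Vec::new();
+        for tx in batch {
+            match try_confirm(&tx) {
+                ConfirmOutcome::Confirmed => progressed = true,
+                ConfirmOutcome::MissingParent => deferred.push(tx),
+                // Permanently invalid: drop it instead of retrying forever.
+                ConfirmOutcome::Conflict => {}
+            }
+        }
+        if !progressed || deferred.is_empty() {
+            return deferred;
+        }
+        batch = deferred;
+    }
+}
+```
+
+The loop terminates because every extra pass requires at least one new confirmation, so the
+deferred set shrinks or the loop exits.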
+
+## Signature and Weight Lessons Under Fuzzing
+
+Two fuzz-specific adjustments were necessary and appear justified:
+
+### Skip low-R grinding under fuzzing
+
+Under `cfg(fuzzing)` and `cfg(secp256k1_fuzz)`, signature generation can behave unlike production.
+
+If low-R grinding stays enabled, signing can become pathologically expensive or effectively stall.
+
+So under fuzzing:
+
+- use plain signing, not low-R grinding
+
+This keeps the path under test while avoiding artificial hangs.
+
+### Relax signed-weight debug assertions under fuzzing
+
+Signed transaction weight depends on signature sizes.
+
+Under fuzzing, signature-size assumptions no longer match production distributions reliably.
+
+So fuzz-only relaxation of "actual signed weight must not exceed estimate" assertions is reasonable.
+
+This should stay fuzz-only.
+
+## What To Preserve
+
+If you change the force-close harness, preserve these behaviors:
+
+- per-block draining around large height advances
+- bounded settle loops with actionable assertions
+- wallet synchronization with confirmed transactions
+- reconciliation of cached payment bookkeeping against live node state
+- startup replay handling for closed-channel monitor updates
+- fuzz-only signing and weight accommodations
+
+These were all required to get the corpus stable.
+
+## What To Be Careful About
+
+### Do not overuse global modes
+
+A mode system was considered, but the harness improved enough without it.
+
+The current approach is better:
+
+- keep the opcode space broad
+- add realism locally where it matters
+- drive settlement more carefully
+
+That preserves coverage without over-structuring the fuzzer.
+
+### Do not make late-claim races impossible
+
+The harness should strongly prefer realistic progress before timeouts, but it should not make
+late-claim races impossible.
+
+The right balance is:
+
+- default behavior is realistic claim propagation
+- rare races still happen because of interleavings, async monitor behavior, and restart timing
+
+### Do not assume pending payments imply unresolved protocol state
+
+Sometimes they do.
+
+Sometimes they are just stale harness bookkeeping after restart and replay.
+
+Always check:
+
+- message queues
+- monitor updates
+- claimable balances
+- broadcaster state
+- live `list_recent_payments()`
+
+before concluding a payment is truly stuck.
+
+### Do not reintroduce big blind height jumps
+
+This is the easiest way to make the corpus noisy again.
+
+If height advancement changes, keep asking:
+
+- are we giving off-chain state a fair chance before timing HTLCs out?
+
+## Good Debugging Workflow
+
+When force-close fuzzing regresses:
+
+1. Run the corpus with `fuzz-runner`.
+2. Isolate one failing case.
+3. Check whether the failure is:
+   - real protocol or restart ordering
+   - stale harness bookkeeping
+   - wallet/view desynchronization
+   - over-broad settle criteria
+4. Prefer fixing harness realism before adding more timeouts or more iterations.
+5. Re-run the isolated case, then the full corpus.
+
+The runner is the preferred tool for this now:
+
+- [runner/README.md](runner/README.md)
+
+## Current Stable Verification
+
+The key verification command is:
+
+```bash
+cargo run --manifest-path fuzz/Cargo.toml -p fuzz-runner -- --timeout-secs 20
+```
+
+At the time these notes were written, the full corpus completed with:
+
+- `ok: 366`
+- `failed: 0`
+- `timed_out: 0`
+
+That should be rechecked after any meaningful force-close harness change.