|
| 1 | +# Force-Close Fuzzing Notes |
| 2 | + |
| 3 | +This document captures the practical lessons from stabilizing force-close coverage in |
| 4 | +`chanmon_consistency`. |
| 5 | + |
| 6 | +It is aimed at future changes to the harness, not at explaining force-close in Lightning |
| 7 | +generally. |
| 8 | + |
| 9 | +## Goal |
| 10 | + |
| 11 | +The goal of force-close fuzzing here is: |
| 12 | + |
| 13 | +- exercise realistic off-chain to on-chain transitions |
| 14 | +- keep races possible, but exceptional |
| 15 | +- ensure the harness actually drives the system to resolution |
| 16 | +- avoid spending most corpus coverage on harness-created nonsense |
| 17 | + |
| 18 | +The important distinction is: |
| 19 | + |
| 20 | +- a protocol race is useful |
| 21 | +- a harness starvation artifact is mostly noise |
| 22 | + |
| 23 | +## What Realism Means Here |
| 24 | + |
| 25 | +For this harness, realistic means: |
| 26 | + |
| 27 | +- queued messages get a chance to move before huge chain jumps |
| 28 | +- monitor updates complete in plausible places |
| 29 | +- payment claims propagate before we declare everything timed out |
| 30 | +- on-chain resolution follows the actual balances and broadcasts that exist |
| 31 | + |
| 32 | +What was unrealistic: |
| 33 | + |
| 34 | +- queue HTLC state changes |
| 35 | +- skip delivery for many blocks |
| 36 | +- jump height by 50, 100, or 200 in one shot |
| 37 | +- then observe a timeout force-close everywhere |
| 38 | + |
| 39 | +That created lots of fake claim-vs-timeout races. |
| 40 | + |
| 41 | +## Main Harness Rules That Worked |
| 42 | + |
| 43 | +### 1. Drain progress around height advancement |
| 44 | + |
| 45 | +Large height jumps now advance one block at a time, with bounded progress draining before and |
| 46 | +after each block. |
| 47 | + |
| 48 | +Why: |
| 49 | + |
| 50 | +- this gives queued off-chain state a chance to settle |
| 51 | +- it keeps timeout races possible, but not dominant |
| 52 | +- it is the smallest realism improvement that materially helped |
| 53 | + |
| 54 | +The key helper is `advance_chain_carefully!` in |
| 55 | +[chanmon_consistency.rs](/Users/joost/repo/rust-lightning-fuzz-force-close/fuzz/src/chanmon_consistency.rs). |
| 56 | + |
| 57 | +### 2. Stop early if nothing actionable remains |
| 58 | + |
| 59 | +Blindly honoring a `+50` or `+200` block jump wastes time and creates noise. |
| 60 | + |
| 61 | +We now stop early when there is no pending work: |
| 62 | + |
| 63 | +- no peer-message backlog |
| 64 | +- no broadcaster backlog |
| 65 | +- no pending monitor updates |
| 66 | +- no timed balances that still need height progress |
| 67 | + |
| 68 | +### 3. Distinguish timed work from passive confirmations |
| 69 | + |
| 70 | +`ClaimableAwaitingConfirmations` should not, by itself, keep the pre-advance drain alive. |
| 71 | + |
| 72 | +That balance means: |
| 73 | + |
| 74 | +- something already happened on-chain |
| 75 | +- we are mostly waiting for confirmations |
| 76 | + |
| 77 | +It is not the same kind of "active timed work" as: |
| 78 | + |
| 79 | +- `ContentiousClaimable` |
| 80 | +- `MaybeTimeoutClaimableHTLC` |
| 81 | +- `MaybePreimageClaimableHTLC` |
| 82 | +- `CounterpartyRevokedOutputClaimable` |
| 83 | + |
| 84 | +Treating `ClaimableAwaitingConfirmations` as active timed work caused settle churn. |
| 85 | + |
| 86 | +### 4. Reconcile harness bookkeeping with live node state |
| 87 | + |
| 88 | +The harness tracks `pending_payments`, but after restarts the real source of truth is the |
| 89 | +`ChannelManager` state exposed by `list_recent_payments()`. |
| 90 | + |
| 91 | +Important lesson: |
| 92 | + |
| 93 | +- local caches in the harness can become stale across restart/replay flows |
| 94 | +- final invariants should reflect live node state, not only local bookkeeping |
| 95 | + |
| 96 | +The harness now reconciles cached pending payments against `list_recent_payments()` before final |
| 97 | +assertions. |
| 98 | + |
| 99 | +## Force-Close Specific Pitfalls |
| 100 | + |
| 101 | +### Announcement preference mismatch |
| 102 | + |
| 103 | +The trusted-peer open path can override `announce_for_forwarding`, while node defaults still prefer |
| 104 | +announced channels. |
| 105 | + |
| 106 | +If `force_announced_channel_preference` remains enabled, channels can fail to open before the |
| 107 | +fuzzer ever reaches force-close resolution. |
| 108 | + |
| 109 | +For this harness, keeping: |
| 110 | + |
| 111 | +- `force_announced_channel_preference = false` |
| 112 | + |
| 113 | +is necessary to preserve the intended force-close scenarios. |
| 114 | + |
| 115 | +### Startup replay ordering for closed channels |
| 116 | + |
| 117 | +On restart, a regenerated startup force-close update can coexist with older in-flight closed-channel |
| 118 | +monitor updates. |
| 119 | + |
| 120 | +Watch out for: |
| 121 | + |
| 122 | +- assuming startup-generated force-close updates always have the highest update id |
| 123 | + |
| 124 | +That assumption was wrong. The correct behavior is: |
| 125 | + |
| 126 | +- do not assert on strict ordering |
| 127 | +- if needed, renumber the regenerated update after the already-pending one |
| 128 | +- drop duplicate force-close updates when one already exists |
| 129 | + |
| 130 | +This was fixed in |
| 131 | +[channelmanager.rs](/Users/joost/repo/rust-lightning-fuzz-force-close/lightning/src/ln/channelmanager.rs). |
| 132 | + |
| 133 | +### Wallet state must track confirmed transactions |
| 134 | + |
| 135 | +The harness uses a wallet source for bumping and external-funding claims. |
| 136 | + |
| 137 | +If confirmed transactions update `ChainState` but not the wallet view, then: |
| 138 | + |
| 139 | +- spent inputs may still look available |
| 140 | +- change outputs may be missing |
| 141 | +- rebroadcasted claim transactions can become nonsense |
| 142 | + |
| 143 | +That caused repeated claim rebroadcasts and fake non-quiescence. |
| 144 | + |
| 145 | +The wallet must be kept in sync with: |
| 146 | + |
| 147 | +- confirmed spends |
| 148 | +- newly created change outputs |
| 149 | + |
| 150 | +### Child-before-parent vs stale conflicting broadcasts |
| 151 | + |
| 152 | +Not every failed confirmation should be retried forever. |
| 153 | + |
| 154 | +Two distinct cases exist: |
| 155 | + |
| 156 | +1. Child-before-parent |
| 157 | +- retrying later is correct |
| 158 | + |
| 159 | +2. Stale conflicting commitment or claim tx |
| 160 | +- retrying forever is wrong |
| 161 | + |
| 162 | +The confirmation drain must: |
| 163 | + |
| 164 | +- retry deferred transactions while something in the batch still makes progress |
| 165 | +- only keep transactions that could become valid later |
| 166 | +- drop permanently invalid stale conflicts |
| 167 | + |
| 168 | +## Signature and Weight Lessons Under Fuzzing |
| 169 | + |
| 170 | +Two fuzz-specific adjustments were necessary and appear justified: |
| 171 | + |
| 172 | +### Skip low-R grinding under fuzzing |
| 173 | + |
| 174 | +Under `cfg(fuzzing)` and `cfg(secp256k1_fuzz)`, signature generation can behave unlike production. |
| 175 | + |
| 176 | +If low-R grinding stays enabled, signing can become pathologically expensive or effectively stall. |
| 177 | + |
| 178 | +So under fuzzing: |
| 179 | + |
| 180 | +- use plain signing, not low-R grinding |
| 181 | + |
| 182 | +This keeps the path under test while avoiding artificial hangs. |
| 183 | + |
| 184 | +### Relax signed-weight debug assertions under fuzzing |
| 185 | + |
| 186 | +Signed transaction weight depends on signature sizes. |
| 187 | + |
| 188 | +Under fuzzing, signature-size assumptions no longer match production distributions reliably. |
| 189 | + |
| 190 | +So fuzz-only relaxation of "actual signed weight must not exceed estimate" assertions is reasonable. |
| 191 | + |
| 192 | +This should stay fuzz-only. |
| 193 | + |
| 194 | +## What To Preserve |
| 195 | + |
| 196 | +If you change the force-close harness, preserve these behaviors: |
| 197 | + |
| 198 | +- per-block draining around large height advances |
| 199 | +- bounded settle loops with actionable assertions |
| 200 | +- wallet synchronization with confirmed transactions |
| 201 | +- reconciliation of cached payment bookkeeping against live node state |
| 202 | +- startup replay handling for closed-channel monitor updates |
| 203 | +- fuzz-only signing and weight accommodations |
| 204 | + |
| 205 | +These were all required to get the corpus stable. |
| 206 | + |
| 207 | +## What To Be Careful About |
| 208 | + |
| 209 | +### Do not overuse global modes |
| 210 | + |
| 211 | +A mode system was considered, but the harness improved enough without it. |
| 212 | + |
| 213 | +The current approach is better: |
| 214 | + |
| 215 | +- keep the opcode space broad |
| 216 | +- add realism locally where it matters |
| 217 | +- drive settlement more carefully |
| 218 | + |
| 219 | +That preserves coverage without over-structuring the fuzzer. |
| 220 | + |
| 221 | +### Do not make late-claim races impossible |
| 222 | + |
| 223 | +The harness should strongly prefer realistic progress before timeouts, but it should not make |
| 224 | +late-claim races impossible. |
| 225 | + |
| 226 | +The right balance is: |
| 227 | + |
| 228 | +- default behavior is realistic claim propagation |
| 229 | +- rare races still happen because of interleavings, async monitor behavior, and restart timing |
| 230 | + |
| 231 | +### Do not assume pending payments imply unresolved protocol state |
| 232 | + |
| 233 | +Sometimes they do. |
| 234 | + |
| 235 | +Sometimes they are just stale harness bookkeeping after restart and replay. |
| 236 | + |
| 237 | +Always check: |
| 238 | + |
| 239 | +- message queues |
| 240 | +- monitor updates |
| 241 | +- claimable balances |
| 242 | +- broadcaster state |
| 243 | +- live `list_recent_payments()` |
| 244 | + |
| 245 | +before concluding a payment is truly stuck. |
| 246 | + |
| 247 | +### Do not reintroduce big blind height jumps |
| 248 | + |
| 249 | +This is the easiest way to make the corpus noisy again. |
| 250 | + |
| 251 | +If height advancement changes, keep asking: |
| 252 | + |
| 253 | +- are we giving off-chain state a fair chance before timing HTLCs out? |
| 254 | + |
| 255 | +## Good Debugging Workflow |
| 256 | + |
| 257 | +When force-close fuzzing regresses: |
| 258 | + |
| 259 | +1. Run the corpus with `fuzz-runner`. |
| 260 | +2. Isolate one failing case. |
| 261 | +3. Check whether the failure is: |
| 262 | + - real protocol or restart ordering |
| 263 | + - stale harness bookkeeping |
| 264 | + - wallet/view desynchronization |
| 265 | + - over-broad settle criteria |
| 266 | +4. Prefer fixing harness realism before adding more timeouts or more iterations. |
| 267 | +5. Re-run the isolated case, then the full corpus. |
| 268 | + |
| 269 | +The runner is the preferred tool for this now: |
| 270 | + |
| 271 | +- [runner/README.md](/Users/joost/repo/rust-lightning-fuzz-force-close/fuzz/runner/README.md) |
| 272 | + |
| 273 | +## Current Stable Verification |
| 274 | + |
| 275 | +The key verification command is: |
| 276 | + |
| 277 | +```bash |
| 278 | +cargo run --manifest-path fuzz/Cargo.toml -p fuzz-runner -- --timeout-secs 20 |
| 279 | +``` |
| 280 | + |
| 281 | +At the time these notes were written, the full corpus completed with: |
| 282 | + |
| 283 | +- `ok: 366` |
| 284 | +- `failed: 0` |
| 285 | +- `timed_out: 0` |
| 286 | + |
| 287 | +That should be rechecked after any meaningful force-close harness change. |
0 commit comments