Skip to content

Commit a5443d8

Browse files
committed
Stabilize chanmon force-close fuzzing
AI tools were used in preparing this commit.
1 parent b25d429 commit a5443d8

File tree

6 files changed

+911
-104
lines changed

6 files changed

+911
-104
lines changed

fuzz/FC-INFO.md

Lines changed: 287 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,287 @@
1+
# Force-Close Fuzzing Notes
2+
3+
This document captures the practical lessons from stabilizing force-close coverage in
4+
`chanmon_consistency`.
5+
6+
It is aimed at future changes to the harness, not at explaining force-close in Lightning
7+
generally.
8+
9+
## Goal
10+
11+
The goal of force-close fuzzing here is:
12+
13+
- exercise realistic off-chain to on-chain transitions
14+
- keep races possible, but exceptional
15+
- ensure the harness actually drives the system to resolution
16+
- avoid spending most corpus coverage on harness-created nonsense
17+
18+
The important distinction is:
19+
20+
- a protocol race is useful
21+
- a harness starvation artifact is mostly noise
22+
23+
## What Realism Means Here
24+
25+
For this harness, realistic means:
26+
27+
- queued messages get a chance to move before huge chain jumps
28+
- monitor updates complete in plausible places
29+
- payment claims propagate before we declare everything timed out
30+
- on-chain resolution follows the actual balances and broadcasts that exist
31+
32+
What was unrealistic:
33+
34+
- queue HTLC state changes
35+
- skip delivery for many blocks
36+
- jump height by 50, 100, or 200 in one shot
37+
- then observe a timeout force-close everywhere
38+
39+
That created lots of fake claim-vs-timeout races.
40+
41+
## Main Harness Rules That Worked
42+
43+
### 1. Drain progress around height advancement
44+
45+
Large height jumps now advance one block at a time, with bounded progress draining before and
46+
after each block.
47+
48+
Why:
49+
50+
- this gives queued off-chain state a chance to settle
51+
- it keeps timeout races possible, but not dominant
52+
- it is the smallest realism improvement that materially helped
53+
54+
The key helper is `advance_chain_carefully!` in
55+
[chanmon_consistency.rs](/Users/joost/repo/rust-lightning-fuzz-force-close/fuzz/src/chanmon_consistency.rs).
56+
57+
### 2. Stop early if nothing actionable remains
58+
59+
Blindly honoring a `+50` or `+200` block jump wastes time and creates noise.
60+
61+
We now stop early when there is no pending work:
62+
63+
- no peer-message backlog
64+
- no broadcaster backlog
65+
- no pending monitor updates
66+
- no timed balances that still need height progress
67+
68+
### 3. Distinguish timed work from passive confirmations
69+
70+
`ClaimableAwaitingConfirmations` should not, by itself, keep the pre-advance drain alive.
71+
72+
That balance means:
73+
74+
- something already happened on-chain
75+
- we are mostly waiting for confirmations
76+
77+
It is not the same kind of "active timed work" as:
78+
79+
- `ContentiousClaimable`
80+
- `MaybeTimeoutClaimableHTLC`
81+
- `MaybePreimageClaimableHTLC`
82+
- `CounterpartyRevokedOutputClaimable`
83+
84+
Treating `ClaimableAwaitingConfirmations` as active timed work caused settle churn.
85+
86+
### 4. Reconcile harness bookkeeping with live node state
87+
88+
The harness tracks `pending_payments`, but after restarts the real source of truth is the
89+
`ChannelManager` state exposed by `list_recent_payments()`.
90+
91+
Important lesson:
92+
93+
- local caches in the harness can become stale across restart/replay flows
94+
- final invariants should reflect live node state, not only local bookkeeping
95+
96+
The harness now reconciles cached pending payments against `list_recent_payments()` before final
97+
assertions.
98+
99+
## Force-Close Specific Pitfalls
100+
101+
### Announcement preference mismatch
102+
103+
The trusted-peer open path can override `announce_for_forwarding`, while node defaults still prefer
104+
announced channels.
105+
106+
If `force_announced_channel_preference` remains enabled, channels can fail to open before the
107+
fuzzer ever reaches force-close resolution.
108+
109+
For this harness, keeping:
110+
111+
- `force_announced_channel_preference = false`
112+
113+
is necessary to preserve the intended force-close scenarios.
114+
115+
### Startup replay ordering for closed channels
116+
117+
On restart, a regenerated startup force-close update can coexist with older in-flight closed-channel
118+
monitor updates.
119+
120+
Watch out for:
121+
122+
- assuming startup-generated force-close updates always have the highest update id
123+
124+
That assumption was wrong. The correct behavior is:
125+
126+
- do not assert on strict ordering
127+
- if needed, renumber the regenerated update after the already-pending one
128+
- drop duplicate force-close updates when one already exists
129+
130+
This was fixed in
131+
[channelmanager.rs](/Users/joost/repo/rust-lightning-fuzz-force-close/lightning/src/ln/channelmanager.rs).
132+
133+
### Wallet state must track confirmed transactions
134+
135+
The harness uses a wallet source for bumping and external-funding claims.
136+
137+
If confirmed transactions update `ChainState` but not the wallet view, then:
138+
139+
- spent inputs may still look available
140+
- change outputs may be missing
141+
- rebroadcasted claim transactions can become nonsense
142+
143+
That caused repeated claim rebroadcasts and fake non-quiescence.
144+
145+
The wallet must be kept in sync with:
146+
147+
- confirmed spends
148+
- newly created change outputs
149+
150+
### Child-before-parent vs stale conflicting broadcasts
151+
152+
Not every failed confirmation should be retried forever.
153+
154+
Two distinct cases exist:
155+
156+
1. Child-before-parent
157+
- retrying later is correct
158+
159+
2. Stale conflicting commitment or claim tx
160+
- retrying forever is wrong
161+
162+
The confirmation drain must:
163+
164+
- retry deferred transactions while something in the batch still makes progress
165+
- only keep transactions that could become valid later
166+
- drop permanently invalid stale conflicts
167+
168+
## Signature and Weight Lessons Under Fuzzing
169+
170+
Two fuzz-specific adjustments were necessary and appear justified:
171+
172+
### Skip low-R grinding under fuzzing
173+
174+
Under `cfg(fuzzing)` and `cfg(secp256k1_fuzz)`, signature generation can behave unlike production.
175+
176+
If low-R grinding stays enabled, signing can become pathologically expensive or effectively stall.
177+
178+
So under fuzzing:
179+
180+
- use plain signing, not low-R grinding
181+
182+
This keeps the path under test while avoiding artificial hangs.
183+
184+
### Relax signed-weight debug assertions under fuzzing
185+
186+
Signed transaction weight depends on signature sizes.
187+
188+
Under fuzzing, signature-size assumptions no longer match production distributions reliably.
189+
190+
So fuzz-only relaxation of "actual signed weight must not exceed estimate" assertions is reasonable.
191+
192+
This should stay fuzz-only.
193+
194+
## What To Preserve
195+
196+
If you change the force-close harness, preserve these behaviors:
197+
198+
- per-block draining around large height advances
199+
- bounded settle loops with actionable assertions
200+
- wallet synchronization with confirmed transactions
201+
- reconciliation of cached payment bookkeeping against live node state
202+
- startup replay handling for closed-channel monitor updates
203+
- fuzz-only signing and weight accommodations
204+
205+
These were all required to get the corpus stable.
206+
207+
## What To Be Careful About
208+
209+
### Do not overuse global modes
210+
211+
A mode system was considered, but the harness improved enough without it.
212+
213+
The current approach is better:
214+
215+
- keep the opcode space broad
216+
- add realism locally where it matters
217+
- drive settlement more carefully
218+
219+
That preserves coverage without over-structuring the fuzzer.
220+
221+
### Do not make late-claim races impossible
222+
223+
The harness should strongly prefer realistic progress before timeouts, but it should not make
224+
late-claim races impossible.
225+
226+
The right balance is:
227+
228+
- default behavior is realistic claim propagation
229+
- rare races still happen because of interleavings, async monitor behavior, and restart timing
230+
231+
### Do not assume pending payments imply unresolved protocol state
232+
233+
Sometimes they do.
234+
235+
Sometimes they are just stale harness bookkeeping after restart and replay.
236+
237+
Always check:
238+
239+
- message queues
240+
- monitor updates
241+
- claimable balances
242+
- broadcaster state
243+
- live `list_recent_payments()`
244+
245+
before concluding a payment is truly stuck.
246+
247+
### Do not reintroduce big blind height jumps
248+
249+
This is the easiest way to make the corpus noisy again.
250+
251+
If height advancement changes, keep asking:
252+
253+
- are we giving off-chain state a fair chance before timing HTLCs out?
254+
255+
## Good Debugging Workflow
256+
257+
When force-close fuzzing regresses:
258+
259+
1. Run the corpus with `fuzz-runner`.
260+
2. Isolate one failing case.
261+
3. Check whether the failure is:
262+
- real protocol or restart ordering
263+
- stale harness bookkeeping
264+
- wallet/view desynchronization
265+
- over-broad settle criteria
266+
4. Prefer fixing harness realism before adding more timeouts or more iterations.
267+
5. Re-run the isolated case, then the full corpus.
268+
269+
The runner is the preferred tool for this now:
270+
271+
- [runner/README.md](/Users/joost/repo/rust-lightning-fuzz-force-close/fuzz/runner/README.md)
272+
273+
## Current Stable Verification
274+
275+
The key verification command is:
276+
277+
```bash
278+
cargo run --manifest-path fuzz/Cargo.toml -p fuzz-runner -- --timeout-secs 20
279+
```
280+
281+
At the time these notes were written, the full corpus completed with:
282+
283+
- `ok: 366`
284+
- `failed: 0`
285+
- `timed_out: 0`
286+
287+
That should be rechecked after any meaningful force-close harness change.

0 commit comments

Comments
 (0)