Fix IoTConsensusV2 receiver writer borrow race#17495
Fix IoTConsensusV2 receiver writer borrow race#17495Pengzna wants to merge 3 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a race condition in the IoTConsensusV2 receiver’s TsFile writer borrowing logic by ensuring writer lookup/claim happens under the writer-pool lock, preventing concurrent borrowers from selecting the same idle writer.
Changes:
- Wrap corresponding-writer lookup and idle-writer selection in the writer-pool lock within
borrowCorrespondingWriter. - Immediately mark an idle writer as used and bind it to the
commitIdonce selected, instead of delaying the “used” marking. - Only await when neither a matching in-use writer nor an idle writer is available.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| idleWriter.get().setUsed(true); | ||
| idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId); | ||
| return idleWriter.get().refreshLastUsedTs(); |
There was a problem hiding this comment.
borrowCorrespondingWriter is now protected by the pool lock, but tryToFindCorrespondingWriter still scans the pool without taking the same lock. Since claiming a writer updates isUsed and commitIdOfCorrespondingHolderEvent, an unlocked reader can observe an intermediate state and fail to find the writer (e.g., the preCheck failure path then calls releaseTsFileWriter(null) and the writer can remain stuck isUsed=true until zombie cleanup). Consider also guarding tryToFindCorrespondingWriter with this lock (or providing a single locked lookup/release-by-commitId API) so all pool state transitions are observed consistently.
| idleWriter.get().setUsed(true); | |
| idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId); | |
| return idleWriter.get().refreshLastUsedTs(); | |
| final IoTConsensusV2TsFileWriter writer = idleWriter.get(); | |
| writer.setCommitIdOfCorrespondingHolderEvent(commitId); | |
| writer.refreshLastUsedTs(); | |
| writer.setUsed(true); | |
| return writer; |
| idleWriter.get().setUsed(true); | ||
| idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId); | ||
| return idleWriter.get().refreshLastUsedTs(); |
There was a problem hiding this comment.
Minor maintainability: idleWriter.get() is called three times in the same block (setUsed / setCommitId / refresh). Assigning it to a local variable once will make the code easier to read and avoids repeated Optional dereferences.
| idleWriter.get().setUsed(true); | |
| idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId); | |
| return idleWriter.get().refreshLastUsedTs(); | |
| final IoTConsensusV2TsFileWriter writer = idleWriter.get(); | |
| writer.setUsed(true); | |
| writer.setCommitIdOfCorrespondingHolderEvent(commitId); | |
| return writer.refreshLastUsedTs(); |
Summary
Problem
This change fixes a receiver-side race that we hit while investigating unexpected
syncLagin IoTConsensusV2.In a local 3C3D reproduction with continuous writes and
flush, two different tsfile events could be assigned to the same receiver writer slot.A representative case on pipe
__consensus.DataRegion[3]_4_3looked like this:1776274436558-1-0-0.tsfilestarted writing into receiver slot/01776274436900-2-0-0.tsfilealso borrowed the same slot/0updateWritingFileIfNeeded(...)saw a different file name on the same writer, so it closed and deleted the previous temporary filewriting file null is not available, and the sender had to retryThis transient failure is enough to create unnecessary retries, and it can amplify the lag symptoms we were debugging.
Root Cause
The issue is in
IoTConsensusV2Receiver.borrowCorrespondingWriter(...).Previously, when the receiver did not find an existing writer for a
commitId, it:condition.await(RETRY_WAIT_TIME, ...)usedand bound thecommitIdThe problem is that
await(...)releases the pool lock.So another concurrent tsfile request can enter the same critical section during that 500ms window, observe the same writer as still idle, and select it as well.
That is how two different tsfile events can incorrectly share one writer slot.
Fix
The fix is intentionally small and local to writer-pool bookkeeping:
This removes the original "find first, claim later" window that caused the race, while preserving the hot-path behavior for already-bound writers.
Why This Is Safe
TCommitId; this PR does not change replicate index assignment or request ordering.