Fix IoTConsensusV2 receiver writer borrow race by Pengzna · Pull Request #17495 · apache/iotdb

Pengzna · 2026-04-16T02:32:08Z

Summary

keep the existing fast path for requests that already have a bound writer
claim an idle writer immediately after it is found instead of waiting before marking it used
keep waiting only when no matching writer and no idle writer are available

Problem

This change fixes a receiver-side race that we hit while investigating unexpected syncLag in IoTConsensusV2.

In a local 3C3D reproduction with continuous writes and flush, two different tsfile events could be assigned to the same receiver writer slot.
A representative case on pipe __consensus.DataRegion[3]_4_3 looked like this:

tsfile 1776274436558-1-0-0.tsfile started writing into receiver slot /0
before that slot was actually claimed, tsfile 1776274436900-2-0-0.tsfile also borrowed the same slot /0
when the second file arrived, updateWritingFileIfNeeded(...) saw a different file name on the same writer, so it closed and deleted the previous temporary file
later, the earlier file's seal request failed with writing file null is not available, and the sender had to retry

This transient failure is enough to create unnecessary retries, and it can amplify the lag symptoms we were debugging.

Root Cause

The issue is in IoTConsensusV2Receiver.borrowCorrespondingWriter(...).

Previously, when the receiver did not find an existing writer for a commitId, it:

found an idle writer
called condition.await(RETRY_WAIT_TIME, ...)
only after waking up marked that writer as used and bound the commitId

The problem is that await(...) releases the pool lock.
So another concurrent tsfile request can enter the same critical section during that 500ms window, observe the same writer as still idle, and select it as well.
That is how two different tsfile events can incorrectly share one writer slot.

Fix

The fix is intentionally small and local to writer-pool bookkeeping:

keep the existing fast path when a request can directly find its already-bound writer
when a new writer is needed, use the pool lock only to find an idle writer and claim it immediately
only wait when there is neither a matching writer on the fast path nor an idle writer under the lock

This removes the original "find first, claim later" window that caused the race, while preserving the hot-path behavior for already-bound writers.

Why This Is Safe

Different tsfile events are still distinguished by TCommitId; this PR does not change replicate index assignment or request ordering.
The lock only protects writer-slot ownership bookkeeping for newly borrowed writers. Actual file transfer and file IO remain outside this critical section.
The scope is deliberately narrow: no historical/realtime extraction logic is changed here.

Copilot

Pull request overview

This PR addresses a race condition in the IoTConsensusV2 receiver’s TsFile writer borrowing logic by ensuring writer lookup/claim happens under the writer-pool lock, preventing concurrent borrowers from selecting the same idle writer.

Changes:

Wrap corresponding-writer lookup and idle-writer selection in the writer-pool lock within borrowCorrespondingWriter.
Immediately mark an idle writer as used and bind it to the commitId once selected, instead of delaying the “used” marking.
Only await when neither a matching in-use writer nor an idle writer is available.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-16T02:35:56Z

+            idleWriter.get().setUsed(true);
+            idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId);
+            return idleWriter.get().refreshLastUsedTs();


borrowCorrespondingWriter is now protected by the pool lock, but tryToFindCorrespondingWriter still scans the pool without taking the same lock. Since claiming a writer updates isUsed and commitIdOfCorrespondingHolderEvent, an unlocked reader can observe an intermediate state and fail to find the writer (e.g., the preCheck failure path then calls releaseTsFileWriter(null) and the writer can remain stuck isUsed=true until zombie cleanup). Consider also guarding tryToFindCorrespondingWriter with this lock (or providing a single locked lookup/release-by-commitId API) so all pool state transitions are observed consistently.

Suggested change

idleWriter.get().setUsed(true);

idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId);

return idleWriter.get().refreshLastUsedTs();

final IoTConsensusV2TsFileWriter writer = idleWriter.get();

writer.setCommitIdOfCorrespondingHolderEvent(commitId);

writer.refreshLastUsedTs();

writer.setUsed(true);

return writer;

Copilot · 2026-04-16T02:35:56Z

+            idleWriter.get().setUsed(true);
+            idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId);
+            return idleWriter.get().refreshLastUsedTs();


Minor maintainability: idleWriter.get() is called three times in the same block (setUsed / setCommitId / refresh). Assigning it to a local variable once will make the code easier to read and avoids repeated Optional dereferences.

Suggested change

idleWriter.get().setUsed(true);

idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId);

return idleWriter.get().refreshLastUsedTs();

final IoTConsensusV2TsFileWriter writer = idleWriter.get();

writer.setUsed(true);

writer.setCommitIdOfCorrespondingHolderEvent(commitId);

return writer.refreshLastUsedTs();

Fix IoTConsensusV2 receiver writer borrow race

e1a694b

Copilot AI review requested due to automatic review settings April 16, 2026 02:32

Copilot started reviewing on behalf of Pengzna April 16, 2026 02:32 View session

Copilot AI reviewed Apr 16, 2026

View reviewed changes

jt2594838 approved these changes Apr 16, 2026

View reviewed changes

Pengzna added 2 commits April 16, 2026 10:45

Keep IoTConsensusV2 writer borrow fast path

1d4324a

fix data race

922620b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix IoTConsensusV2 receiver writer borrow race#17495

Fix IoTConsensusV2 receiver writer borrow race#17495
Pengzna wants to merge 3 commits intoapache:masterfrom
Pengzna:codex/iotv2-writer-borrow-fix

Pengzna commented Apr 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 16, 2026

Uh oh!

Copilot AI Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-            idleWriter.get().setUsed(true);
-            idleWriter.get().setCommitIdOfCorrespondingHolderEvent(commitId);
-            return idleWriter.get().refreshLastUsedTs();
+            final IoTConsensusV2TsFileWriter writer = idleWriter.get();
+            writer.setCommitIdOfCorrespondingHolderEvent(commitId);
+            writer.refreshLastUsedTs();
+            writer.setUsed(true);
+            return writer;

Conversation

Pengzna commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

Fix

Why This Is Safe

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Pengzna commented Apr 16, 2026 •

edited

Loading