[CELEBORN-2341] Preserve partially-committed partitions on CommitFiles timeout by shlomitubul · Pull Request #3706 · apache/celeborn

shlomitubul · 2026-05-27T16:47:51Z

What changes were proposed in this pull request?

When the worker-side celeborn.worker.commitFiles.timeout fires and future.cancel(true) interrupts the per-partition commit tasks, Controller's BiFunction has two issues that amplify data loss unnecessarily:

1. The response is built with List.empty.asJava for both committedPrimaryIds and committedReplicaIds, even though those concurrent sets may have been partially populated by tasks that finished before the timer fired. All successful commit work is silently thrown away.
2. context.reply() is never called on the error path, so the originating commit RPC sits unanswered until the driver's celeborn.client.rpc.commitFiles.askTimeout expires. The worker has already determined the outcome, so there is no reason to make the driver wait.
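The second issue can be illustrated with a minimal sketch that does not use Celeborn's RPC layer at all. `AskSketch`, `ask`, `silentHandler` and `replyingHandler` are hypothetical names; a plain `scala.concurrent.Promise` stands in for the RPC call context, and the timeout is shortened to milliseconds for illustration.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object AskSketch {
  /** Simulates a driver-side ask: hands a promise to the handler and
    * waits at most askTimeoutMs for it to be completed. */
  def ask(handler: Promise[String] => Unit, askTimeoutMs: Long): Either[String, String] = {
    val reply = Promise[String]()
    handler(reply)
    try Right(Await.result(reply.future, askTimeoutMs.millis))
    catch { case _: TimeoutException => Left("ask timed out") }
  }

  // Error path that decides the outcome but never replies (the behavior described above).
  val silentHandler: Promise[String] => Unit = _ => ()

  // Error path that replies as soon as the outcome is known.
  val replyingHandler: Promise[String] => Unit =
    p => { p.success("COMMIT_FILE_EXCEPTION"); () }
}
```

With the silent handler the caller always pays the full ask timeout before learning anything; with the replying handler it gets the verdict as soon as the handler has one.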

This PR:

- Builds the response from the actual state of committedPrimaryIds / committedReplicaIds / failedPrimaryIds / failedReplicaIds / committedPrimaryStorageInfos / committedReplicaStorageInfos / committedMapIdBitMap / partitionSizeList.
- Returns StatusCode.PARTIAL_SUCCESS when any partition committed before cancellation, with the populated lists. CommitHandler on the client already treats PARTIAL_SUCCESS as a terminal, non-retry status (alongside SUCCESS, SHUFFLE_UNREGISTERED, REQUEST_FAILED, WORKER_EXCLUDED, COMMIT_FILE_EXCEPTION), so no client-side change is required.
- Preserves the existing StatusCode.COMMIT_FILE_EXCEPTION response when nothing committed.
- Calls context.reply(response) so the driver's RPC ask receives the verdict immediately instead of timing out.

Why are the changes needed?

In production we hit celeborn.worker.commitFiles.timeout periodically on heavy shuffles where the per-worker partition count makes the close() work exceed the timeout. When that happens today:

1. The driver receives no reply and times out at commitFiles.askTimeout, logging Cannot receive any reply ... in 300000 milliseconds.
2. Even partitions whose close() ran to completion before the timer fired are reported as not committed (because of the empty lists).
3. The driver marks the entire shuffle as data-lost via dataLostShuffleSet.add(shuffleId), even when most partitions succeeded.
4. Every reducer for that shuffle hits SHUFFLE_DATA_LOST → FetchFailedException → DAGScheduler retries the whole map stage.

With this change, the driver receives a definitive PARTIAL_SUCCESS reply with the actual committed/failed split. The data-lost set is populated based on the partitions that genuinely didn't make it. Reducers for the partitions that did commit can fetch them normally.

Does this PR introduce any user-facing change?

No client API or wire format changes. StatusCode.PARTIAL_SUCCESS is already part of the CommitFilesResponse protocol and is already handled by the driver-side CommitHandler.parallelCommitFiles retry loop. The user-visible effect is reduced blast radius when a worker's commit times out under heavy load.

How was this patch tested?

- ./build/mvn -DskipTests install builds the full tree cleanly on main (verified locally with Java 17 / Maven 3.9); the file is byte-identical between main and branch-0.6, so the same patch applies and compiles on both.
- The change preserves the existing COMMIT_FILE_EXCEPTION path byte-for-byte for the "nothing committed" case; the new branch reuses the exact same construction pattern as the success-path reply() method (committedPrimaryStorageAndDiskHintList, committedReplicaStorageAndDiskHintList, committedMapIdBitMapList, etc.).
- All required imports (jArrayList, jHashMap, RoaringBitmap, StorageInfo) are already in scope at the top of Controller.scala.
- Unit/integration tests for the partial-success-on-timeout path are TBD; happy to add a test if reviewers point me at the right test class to extend.

codecov · 2026-05-27T17:28:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.01%. Comparing base (c4546ef) to head (cb74c9b).
⚠️ Report is 58 commits behind head on branch-0.6.

Additional details and impacted files

@@              Coverage Diff               @@
##           branch-0.6    #3706      +/-   ##
==============================================
+ Coverage       66.77%   67.01%   +0.24%     
==============================================
  Files             354      354              
  Lines           21565    21652      +87     
  Branches         1912     1911       -1     
==============================================
+ Hits            14397    14507     +110     
+ Misses           6155     6132      -23     
  Partials         1013     1013

When the worker-side `celeborn.worker.commitFiles.timeout` fires and `future.cancel(true)` interrupts the per-partition commit tasks, Controller's BiFunction does two things that amplify data loss unnecessarily:

1. The response is built with `List.empty.asJava` for both committed primary and replica id lists, even though `committedPrimaryIds` / `committedReplicaIds` may be partially populated from tasks that finished before the timer fired.
2. `context.reply()` is never called on the error path, so the driver waits the full `celeborn.client.rpc.commitFiles.askTimeout` for a reply that never comes, even though the worker has already determined the outcome.

This change preserves the SUCCESSFUL work and replies promptly:

- If at least one partition committed before cancellation, the response uses `StatusCode.PARTIAL_SUCCESS` with the actual `committedPrimaryIds` / `committedReplicaIds` and accompanying storage / map-id-bitmap / partition-size data. The driver's `CommitHandler` retry loop already treats `PARTIAL_SUCCESS` as a terminal, non-retry status (same as `SUCCESS`) and processes the committed/failed partition lists; no driver-side change is needed.
- If nothing committed (the catastrophic case), the legacy `COMMIT_FILE_EXCEPTION` response is preserved unchanged.
- `context.reply(response)` now fires immediately on the error path so the driver's RPC ask completes with the worker's verdict instead of hitting `RpcTimeoutException` at `commitFiles.askTimeout`.

Impact on the SHUFFLE_DATA_LOST -> FetchFailedException stage-retry path: when a worker hits its commit timeout under heavy load, the shuffle's reducer tasks can still fetch the partitions that did make it. Only the genuinely-failed partitions trigger data-lost handling, dramatically reducing the blast radius of any single slow commit.

Copilot

Pull request overview

This PR improves worker-side CommitFiles handling when celeborn.worker.commitFiles.timeout triggers cancellation, aiming to reduce unnecessary shuffle-wide data loss by returning a timely RPC reply that preserves any partitions that successfully committed before cancellation.

Changes:

- Build the error-path CommitFilesResponse from the actual committed* / failed* / committed*StorageInfos / bitmap / size state instead of always returning empty committed lists.
- Reply to the driver immediately on the exceptional/cancel path (instead of letting the client ask time out).
- Stop the COMMIT_FILES_TIME timer on the exceptional/cancel path.

+              val response =
+                if (committedPrimaryIds.isEmpty && committedReplicaIds.isEmpty) {
+                  CommitFilesResponse(
+                    StatusCode.COMMIT_FILE_EXCEPTION,
+                    List.empty.asJava,
+                    List.empty.asJava,
+                    primaryIds,
+                    replicaIds)
+                } else {
+                  CommitFilesResponse(
+                    StatusCode.PARTIAL_SUCCESS,
+                    new jArrayList[String](committedPrimaryIds),
+                    new jArrayList[String](committedReplicaIds),
+                    new jArrayList[String](failedPrimaryIds),
+                    new jArrayList[String](failedReplicaIds),
+                    new jHashMap[String, StorageInfo](committedPrimaryStorageInfos),
+                    new jHashMap[String, StorageInfo](committedReplicaStorageInfos),
+                    new jHashMap[String, RoaringBitmap](committedMapIdBitMap),
+                    partitionSizeList.asScala.sum,
+                    partitionSizeList.size())
+                }


SteNicholas

@shlomitubul, could you please create a new pull request whose target branch is main?

SteNicholas

Thanks for digging into this — the motivation is solid, and two of the three changes are clean wins. But I think the core "report partial success" change drops in-flight partitions in a way the driver can't detect, so I'd like to hold on this one.

Blocking: in-flight partitions become silent data loss

On the worker each commit task reaches a terminal state by adding its id to exactly one of committedIds / emptyFileIds / failedIds, and committedIds.add happens after fileWriter.close() — so the ordering correctly rules out reporting a non-durable commit. Good.

But when the timeout calls future.cancel(true) + task.cancel(true), tasks that were still queued / interrupted before reaching a terminal state land in none of those sets. This PR's PARTIAL_SUCCESS reports committed = real committed and failed = only the explicitly-failed ids, so the in-flight partitions are in neither list. Following it to the driver:
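A minimal sketch of how cancellation leaves tasks in no set, under simplifying assumptions: a single-threaded pool, one `committed` set only, and hypothetical names (`CancelSketch`, `commitTask`, ids `p0` to `p2`). It is not Controller's code; it only mirrors the "record the id after close() returns" ordering described above.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}

object CancelSketch {
  /** Returns (ids recorded as committed, requested ids that ended up in no set). */
  def run(): (Set[String], Set[String]) = {
    val requested = Set("p0", "p1", "p2")
    val committed = ConcurrentHashMap.newKeySet[String]()
    val pool = Executors.newSingleThreadExecutor()
    val p1Started = new CountDownLatch(1)
    val neverReleased = new CountDownLatch(1)

    // A commit task records its id only after its "close" work returns normally.
    def commitTask(id: String)(close: => Unit): Runnable = new Runnable {
      override def run(): Unit = { close; committed.add(id) }
    }

    val f0 = pool.submit(commitTask("p0")(()))   // finishes quickly
    val f1 = pool.submit(commitTask("p1") {      // stuck in a slow close()
      p1Started.countDown(); neverReleased.await()
    })
    val f2 = pool.submit(commitTask("p2")(()))   // still queued behind p1

    f0.get()            // p0 has committed
    p1Started.await()   // p1 is running, so p2 cannot have started
    f2.cancel(true)     // "timeout": the queued task is cancelled and never runs
    f1.cancel(true)     // the running task is interrupted before it records its id
    pool.shutdownNow()
    pool.awaitTermination(5, TimeUnit.SECONDS)

    val done = requested.filter(id => committed.contains(id))
    (done, requested -- done)
  }
}
```

In this sketch `p1` and `p2` never reach a terminal state, so a response built only from the populated sets says nothing about them.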

- processResponse records only committed and failed ids.
- checkDataLost keys only off the failed sets (!pushReplicateEnabled && failedPrimaries.nonEmpty, or failed-on-both-replicas); it never compares against the requested set.
- A normal successful commit already reports empty-file partitions in neither list, and the driver treats "absent from committed" as valid-but-empty.

So the driver can't distinguish an in-flight (uncommitted, has data) partition from an empty (no data) one — both are absent from committed and from failed. The in-flight partition is silently treated as empty: missing from the reducer file group, not flagged by checkDataLost, no recompute → the reducer silently produces wrong results, not even a FetchFailure.
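The ambiguity can be sketched as follows. This is a simplified model of the check as paraphrased in this review, not the driver's actual `checkDataLost`; `DataLostSketch` and its parameter names are hypothetical, and "failed on both replicas" is approximated as an id appearing in both failed sets.

```scala
object DataLostSketch {
  /** Simplified data-lost decision that looks only at the failed sets. */
  def dataLost(
      failedPrimary: Set[String],
      failedReplica: Set[String],
      pushReplicateEnabled: Boolean): Boolean =
    if (!pushReplicateEnabled) failedPrimary.nonEmpty
    else failedPrimary.intersect(failedReplica).nonEmpty

  // Requested: p0, p1, p2. Only p0 committed; p1 was in flight; p2 had an empty file.
  // Neither p1 nor p2 is reported as failed, so the check sees nothing wrong.
  val inFlightNotReported: Boolean =
    dataLost(failedPrimary = Set.empty, failedReplica = Set.empty, pushReplicateEnabled = false)

  // If the worker reports p1 as failed, the same check flags the loss.
  val inFlightReported: Boolean =
    dataLost(failedPrimary = Set("p1"), failedReplica = Set.empty, pushReplicateEnabled = false)
}
```

Because the requested set never enters the decision, the only way to surface `p1` is for the worker to put it in a failed list.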

That's worse than today's behavior in the dangerous direction: the current COMMIT_FILE_EXCEPTION reports all ids as failed (over-reporting → whole-shuffle recompute — wasteful but safe). And it bites exactly the target scenario: a heavy shuffle whose partition count exceeds the timeout will have many queued-but-unstarted tasks when the timer fires.

Suggested fix: in the PARTIAL_SUCCESS branch, compute the failed lists as requested − committed − emptyFile rather than just the explicitly-failed set, e.g. failedPrimaryIds = primaryIds \ committedPrimaryIds \ emptyFilePrimaryIds (same for replica). That keeps the real improvement (committed partitions preserved) while ensuring every not-actually-committed partition is reported as failed so checkDataLost recomputes it. The worker has to do this — the driver can't, since it can't tell empty from in-flight without help.
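A minimal sketch of that set arithmetic, using plain Scala collections rather than the concurrent sets in Controller. `FailedOnCancel` and `failedIds` are hypothetical names; the same function would be applied once to the primary ids and once to the replica ids.

```scala
object FailedOnCancel {
  /** failed = requested − committed − emptyFile, preserving the request order. */
  def failedIds(
      requested: Seq[String],
      committed: Set[String],
      emptyFile: Set[String]): Seq[String] =
    requested.filterNot(id => committed(id) || emptyFile(id))
}
```

For example, with requested ids `p0`, `p1`, `p2`, `p3`, where `p0` committed and `p3` had an empty file, the result is `p1` and `p2`: every partition that neither committed nor was known to be empty is reported as failed, whether it failed explicitly, was interrupted, or was still queued.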

The other two changes look good

- context.reply(response) on the error path is a real fix: the original error branch never replied, so the driver waited out commitFiles.askTimeout. I checked the dedup flow (COMMIT_FINISHED→reply / COMMIT_INPROCESS): the future handler replies to the original context exactly once and duplicate RPCs get their own context, so no double-reply.
- workerSource.stopTimer(...) fixes a latent timer leak on the error path, and is mutually exclusive with the success reply(), so no double-stop.

Secondary

- The response is snapshotted while interrupted-but-not-stopped tasks may still mutate the sets. The commit-then-add ordering prevents the dangerous case, but cross-collection snapshots aren't atomic (an id can be in committedIds before its committedStorageInfos entry is copied). This is benign (the driver just retries that one) but worth a comment.
- Given the severity, this needs a test: simulate a timeout with some partitions committed and some still queued, and assert the response marks the uncommitted ones as failed so the driver recomputes them.
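The shape of such a test can be sketched against a pure function, without starting a worker. Everything here is hypothetical (`OnCancelSketch`, `Response`, `build`, and the two status objects stand in for the real `CommitFilesResponse` and `StatusCode`), and only primary ids are modeled.

```scala
object OnCancelSketch {
  sealed trait Status
  case object PartialSuccess extends Status
  case object CommitFileException extends Status

  final case class Response(status: Status, committed: Set[String], failed: Set[String])

  /** Builds the cancel-path response from the state at the moment the timer fired. */
  def build(requested: Set[String], committed: Set[String], emptyFile: Set[String]): Response =
    if (committed.isEmpty) {
      // Nothing committed: keep the legacy behavior and report every requested id as failed.
      Response(CommitFileException, Set.empty, requested)
    } else {
      // Something committed: preserve it, and report everything else that is not a
      // known empty file as failed, including queued and interrupted partitions.
      Response(PartialSuccess, committed, requested -- committed -- emptyFile)
    }
}
```

A test would then feed in a state where one partition committed, one had an empty file and two were still queued, and assert that exactly the two queued ones come back as failed.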

With the failed-set computation changed to requested − committed − empty plus a test, I think this becomes a solid improvement. Happy to help with the test wiring if useful.

shlomitubul · 2026-06-07T10:18:56Z

Thanks @SteNicholas — your blocking analysis was exactly right. Tracing it through commitFiles and the driver confirmed that queued/interrupted tasks land in none of the committed / empty / failed sets, and checkDataLost keys only off the failed sets, so an in-flight partition was indistinguishable from an empty one and would silently produce wrong results with no FetchFailure.

I've opened #3721 targeting main with the fix:

- failed lists computed as requested − committed − empty (your suggested fix), so every not-actually-committed partition is recomputed by the driver;
- the response-building extracted into Controller.buildCommitFilesResponseOnCancel and covered by a new ControllerSuite (committed preserved, in-flight → failed, empty → not failed; and the nothing-committed → COMMIT_FILE_EXCEPTION case);
- context.reply() + stopTimer() on the error path, as in this PR.

Closing this one in favor of #3721. Thanks again for the careful review.

github-actions Bot added the module:worker label May 27, 2026

shlomitubul marked this pull request as ready for review May 27, 2026 18:15

shlomitubul changed the title ~~[0.6][WORKER] Preserve partially-committed partitions on CommitFiles timeout~~ [CELEBORN-2341] Preserve partially-committed partitions on CommitFiles timeout May 27, 2026

shlomitubul force-pushed the worker-preserve-partial-commit-on-timeout-0.6 branch from cb74c9b to 27e9fd4 Compare May 27, 2026 22:37

SteNicholas requested a review from Copilot June 1, 2026 07:01

Copilot started reviewing on behalf of SteNicholas June 1, 2026 07:01

Copilot AI reviewed Jun 1, 2026

SteNicholas requested changes Jun 3, 2026

shlomitubul mentioned this pull request Jun 7, 2026: [CELEBORN-2341] Preserve partially-committed partitions on CommitFiles timeout #3721 (Open)

shlomitubul closed this Jun 7, 2026

shlomitubul wants to merge 1 commit into apache:branch-0.6 from shlomitubul:worker-preserve-partial-commit-on-timeout-0.6