Skip to content

[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693

Open
afterincomparableyum wants to merge 2 commits into
apache:mainfrom
afterincomparableyum:celeborn-2332
Open

[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693
afterincomparableyum wants to merge 2 commits into
apache:mainfrom
afterincomparableyum:celeborn-2332

Conversation

@afterincomparableyum
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

push-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort.

The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader on the IO executor's own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process.

Hand the final reference off to the global CPU executor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback.

Why are the changes needed?

When running the bytedance bolt celeborn e2e tests for push-merged data I am working on, I run into this error.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

No

How was this patch tested?

I ran the Celeborn bolt e2e tests with this change and the tests passed with push merged data support.

@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

ping @SteNicholas for review please.

… callbacks

Surfaced by the Bolt e2e Celeborn tests while adding support for
push-merged data, which exercises this fetch path more aggressively
and reliably triggered an EDEADLK abort.

The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's
IO thread and capture a weak_ptr that is lifted to a shared_ptr inside
the callback body. When that local shared_ptr happens to hold the last
reference, dropping it inline runs WorkerPartitionReader's destructor on
the IO executor's own thread, which transitively destroys the embedded
TransportClient and its IOThreadPoolExecutor. The executor then attempts
to pthread_join the thread that is currently executing the callback and
the join fails with EDEADLK, aborting the process.

Hand the final reference off to a dedicated CPUThreadPoolExecutor so
destruction of the reader (and the IO executor underneath it) always
happens on a different thread than the one running the callback. The
executor is constructed directly rather than obtained from
folly::getGlobalCPUExecutor() to avoid pulling in folly's singleton
vault, which would otherwise require folly::init() and break unit tests.
@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11
@SteNicholas SteNicholas requested a review from Copilot May 27, 2026 05:08
@SteNicholas SteNicholas changed the title [CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader f… [CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks May 27, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a correctness issue in the C++ client fetch path where WorkerPartitionReader could be destroyed on the TransportClient IO thread, leading to a self-pthread_join (EDEADLK) abort when the embedded IO executor is torn down from within its own callback thread.

Changes:

  • Introduces an off-IO-thread folly::CPUThreadPoolExecutor to release the last shared_ptr<WorkerPartitionReader> asynchronously.
  • Updates onSuccess_ and onFailure_ fetch callbacks to hand off the final reference to that executor to ensure destruction happens on a different thread.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/celeborn/client/reader/WorkerPartitionReader.cpp
Comment thread cpp/celeborn/client/reader/WorkerPartitionReader.cpp
@SteNicholas
Copy link
Copy Markdown
Member

@afterincomparableyum, thanks for contribution. Please take a look at comment of copilot.

@SteNicholas
Copy link
Copy Markdown
Member

@afterincomparableyum, any update?

@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

Will address comments by tomorrow @SteNicholas

@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

regarding CoPilot comments @SteNicholas, it is right that this posts a task per callback. But I don't think the use_count() == 1 guard is safe, so I'd prefer to keep the unconditional handoff.

The whole point of the offload is that the reader's last reference must never be dropped on the fetch callback threaders it destroys the reader on the IO thread and triggers the EDEADLK issue this PR fixes.

The problem with use_count() is that it's a stale snapshot. The reader's owner lives on another thread and can drop its reference at any moment (cleanupReader() is just currReader_ = nullptr). So:

  1. Callback lifts shared_this --> refcount = 2
  2. Checks use_count() --> sees 2 --> decides to drop inline
  3. Owner thread drops its reference --> refcount = 1
  4. Callback returns --> shared_this destructs on the IO thread --> same EDEADLK crash

So the guard trades a correct path for a racy one whose failure mode is a process abort. With multiple chunks in flight, this interleaving is realistic.

Regarding overhead. I will update this PR and add comments with a TODO to explore optimizations if needed.

… behavior and a potential TODO optimization.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants