[CELEBORN-2340] Explicitly trim Flusher thread pool cache when flusher queue is empty by s0nskar · Pull Request #3696 · apache/celeborn

s0nskar · 2026-05-19T09:15:13Z

What changes were proposed in this pull request?

We are using 0.5.3 and noticed that after a worker processes a heavy shuffle workload, its Netty direct memory metric (usedDirectMemory) spikes to a high value and then gets stuck and never recovers — even after the all flush ends and the worker becomes completely idle. The memory stays stuck until the worker is restarted.

Although i noticed that there are fixes which tries to handle this by using pinnedMemory – #3018 and #3099 but i think that this PR is the correct way to handle this instead of relying on pinnedMemory.

Why are the changes needed?

PooledByteBufAllocator maintains a per-thread PoolThreadCache for each Flusher thread. When a ByteBuf is released, instead of immediately returning its slot to the PoolArena, it goes into this thread-local cache. But the PoolArena's view of the underlying PoolChunk is not updated. usedDirectMemory counts all PoolChunk native memory regardless of how much is truly in use vs. sitting in thread caches.

The thread cache is only swept when the owning thread crosses a allocation threshold, but since the worker is in pause state after all flush items end, all flush thread gets blocked on workingQueue(index).take(). The cache is never swept, PoolChunks in thread cache are never freed, and usedDirectMemory stays frozen.

In this PR, we are changing take() with poll() and on poll() timeout we are explicitly calling allocator.trimCurrentThreadCache() to flush cached PoolChunks back to the PoolArena, allowing fully-free chunks to be destroyed and usedDirectMemory to recover.

A similar pattern is already used by:

https://github.com/apache/celeborn/blob/main/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/ReadBufferDispatcher.java#L131

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

s0nskar · 2026-05-19T09:16:02Z

cc: @leixm @SteNicholas @FMX PTAL

zaynt4606

LGTM but we need a JIRA ticket to merge

Copilot

Pull request overview

This PR updates the worker-side storage flusher loop to proactively release Netty PooledByteBufAllocator per-thread cache memory when flusher threads are idle, aiming to allow usedDirectMemory to drop after heavy shuffle workloads complete.

Changes:

Replace workingQueues(index).take() with poll(..., 1000ms) in flusher worker threads.
On poll timeout (queue idle), call PooledByteBufAllocator.trimCurrentThreadCache() to flush thread-local cached buffers back to the arenas.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

s0nskar · 2026-05-27T10:59:18Z

Added the jira and handled copilot suggestion.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

+                      processIOException(e, DiskStatus.READ_OR_WRITE_FAILURE)
+                      logWarning(s"Flusher-$this-thread-$index encounter exception.", t)
                  }
-                } catch {
-                  case t: Throwable =>
-                    val e = ExceptionUtils.wrapThrowableToIOException(t)
-                    task.notifier.setException(e)
-                    processIOException(e, DiskStatus.READ_OR_WRITE_FAILURE)
-                    logWarning(s"Flusher-$this-thread-$index encounter exception.", t)
+                  lastBeginFlushTime.set(index, -1)
                }


+            } else {
+              allocator match {
+                case alloc: PooledByteBufAllocator =>
+                  // Free buffer pool memory to main direct memory when flush thread is idle.
+                  alloc.trimCurrentThreadCache
+                case _ =>
+              }


Fix Netty direct memory stuck after high shuffle workload on worker

6638309

github-actions Bot added the module:worker label May 19, 2026

zaynt4606 reviewed May 21, 2026

View reviewed changes

zaynt4606 changed the title ~~Explicitly trim Flusher thread pool cache when flusher queue is empty~~ [CELEBORN-xxxx] Explicitly trim Flusher thread pool cache when flusher queue is empty May 21, 2026

SteNicholas requested a review from Copilot May 25, 2026 06:20

Copilot started reviewing on behalf of SteNicholas May 25, 2026 06:20 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

Comment thread worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/Flusher.scala Outdated

SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11

s0nskar changed the title ~~[CELEBORN-xxxx] Explicitly trim Flusher thread pool cache when flusher queue is empty~~ [CELEBORN-2340] Explicitly trim Flusher thread pool cache when flusher queue is empty May 27, 2026

Copilot suggestions

a277a41

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

s0nskar mentioned this pull request May 28, 2026

[CELEBORN-2344] Support to force set serving state via HTTP APIs #3710

Open

2 tasks

SteNicholas requested a review from Copilot June 2, 2026 12:00

Copilot started reviewing on behalf of SteNicholas June 2, 2026 12:00 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CELEBORN-2340] Explicitly trim Flusher thread pool cache when flusher queue is empty#3696

[CELEBORN-2340] Explicitly trim Flusher thread pool cache when flusher queue is empty#3696
s0nskar wants to merge 2 commits into
apache:mainfrom
s0nskar:fix_memory_leak

s0nskar commented May 19, 2026

Uh oh!

s0nskar commented May 19, 2026

Uh oh!

zaynt4606 left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

s0nskar commented May 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

s0nskar commented May 19, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR resolve a correctness bug?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

s0nskar commented May 19, 2026

Uh oh!

zaynt4606 left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

s0nskar commented May 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants