[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite by pan3793 · Pull Request #3704 · apache/celeborn

pan3793 · 2026-05-26T13:02:26Z

What changes were proposed in this pull request?

Changes vs. before:

createWorker is now inside the retry loop, so each attempt gets a fresh Worker with a fresh random port.
On failure, the partially-initialized worker is cleaned up via worker.stop(EXIT_IMMEDIATELY) (which calls metricsSystem.stop()), instead of shutdownGracefully() which leaves MetricsSystem running.
Reinstated the exponential backoff (2s → 4s → 8s → 16s) that commit 2efdf75 accidentally dropped.
workerStarted = true is set only after initialize() succeeds (no need to reset it in the catch).
Lock release moved into a finally so an early failure can't leak the lock.

Why are the changes needed?

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

### What changes were proposed in this pull request? As the title, it's a packaging change. ### Why are the changes needed? I found that `celeborn-cli` always prints such warnings, but actually the slf4j-api and log4j2 jars are correctly present in classpath. ``` SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. ``` after some investigation, I found that `celeborn-openapi-client-*.jar` bundles shaded slf4j classes, which causes the issue. ``` $ jar tf $CELEBORN_HOME/cli-jars/celeborn-openapi-client-*.jar | grep slf4j ... org/apache/celeborn/shaded/org/slf4j/ org/apache/celeborn/shaded/org/slf4j/ILoggerFactory.class org/apache/celeborn/shaded/org/slf4j/IMarkerFactory.class org/apache/celeborn/shaded/org/slf4j/Logger.class ... ``` ### Does this PR resolve a correctness bug? - [ ] Yes ### Does this PR introduce _any_ user-facing change? - [ ] Yes ### How was this patch tested? Tested with `celeborn-cli`, `SLF4J` binding warnings have gone. Closes apache#3701 from pan3793/CELEBORN-2337. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>

### What changes were proposed in this pull request? Remove hardcoded `CELEBORN_PRINT_LAUNCH_COMMAND=0` in `bin/celeborn-class` ### Why are the changes needed? It should be picked from the environment variable. ### Does this PR resolve a correctness bug? - [ ] Yes ### Does this PR introduce _any_ user-facing change? - [ ] Yes ### How was this patch tested? ``` $ CELEBORN_PRINT_LAUNCH_COMMAND=1 sbin/celeborn-cli --version ... Start to launch /opt/java/openjdk/bin/java -XX:+IgnoreUnrecognizedVMOptions -cp /opt/celeborn/conf::/opt/celeborn/cli-jars/*: org.apache.celeborn.cli.CelebornCli --version ... ``` Closes apache#3702 from pan3793/CELEBORN-2338. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>

Mockito 4.x's mockConstruction is thread-scoped; moving the initial createWorker into the worker starter thread broke WorkerSuite's "CELEBORN-2257: Properly reports remote disks on worker registration" because the MasterClient construction was no longer intercepted. Construct the first Worker on the calling thread (preserving the Mockito contract) and only fall back to a fresh createWorker on the worker starter thread when a retry is actually needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Create a fresh Worker instance on each retry and stop with EXIT_IMMEDIATELY to avoid port-already-in-use failures on rapid restarts (pattern from apache#3704). Use exponential backoff between retries.

pan3793 added 3 commits May 26, 2026 20:14

[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite

6689aa6

github-actions Bot added the module:worker label May 26, 2026

SteNicholas force-pushed the main branch 3 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11

SteNicholas mentioned this pull request May 27, 2026

[CELEBORN-2303] Ignore some flaky test #3624

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite#3704

[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite#3704
pan3793 wants to merge 4 commits into
apache:mainfrom
pan3793:flaky-tests

pan3793 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pan3793 commented May 26, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR resolve a correctness bug?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant