[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite #3704
Draft
pan3793 wants to merge 4 commits into
### What changes were proposed in this pull request?

As the title, it's a packaging change.

### Why are the changes needed?

I found that `celeborn-cli` always prints the following warnings, even though the slf4j-api and log4j2 jars are correctly present on the classpath.

```
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
```

After some investigation, I found that `celeborn-openapi-client-*.jar` bundles shaded slf4j classes, which causes the issue.

```
$ jar tf $CELEBORN_HOME/cli-jars/celeborn-openapi-client-*.jar | grep slf4j
...
org/apache/celeborn/shaded/org/slf4j/
org/apache/celeborn/shaded/org/slf4j/ILoggerFactory.class
org/apache/celeborn/shaded/org/slf4j/IMarkerFactory.class
org/apache/celeborn/shaded/org/slf4j/Logger.class
...
```

### Does this PR resolve a correctness bug?

- [ ] Yes

### Does this PR introduce _any_ user-facing change?

- [ ] Yes

### How was this patch tested?

Tested with `celeborn-cli`; the `SLF4J` binding warnings are gone.

Closes apache#3701 from pan3793/CELEBORN-2337.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
### What changes were proposed in this pull request?

Remove hardcoded `CELEBORN_PRINT_LAUNCH_COMMAND=0` in `bin/celeborn-class`.

### Why are the changes needed?

It should be picked from the environment variable.

### Does this PR resolve a correctness bug?

- [ ] Yes

### Does this PR introduce _any_ user-facing change?

- [ ] Yes

### How was this patch tested?

```
$ CELEBORN_PRINT_LAUNCH_COMMAND=1 sbin/celeborn-cli --version
...
Start to launch /opt/java/openjdk/bin/java -XX:+IgnoreUnrecognizedVMOptions -cp /opt/celeborn/conf::/opt/celeborn/cli-jars/*: org.apache.celeborn.cli.CelebornCli --version
...
```

Closes apache#3702 from pan3793/CELEBORN-2338.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
Mockito 4.x's mockConstruction is thread-scoped; moving the initial createWorker into the worker starter thread broke WorkerSuite's "CELEBORN-2257: Properly reports remote disks on worker registration" because the MasterClient construction was no longer intercepted. Construct the first Worker on the calling thread (preserving the Mockito contract) and only fall back to a fresh createWorker on the worker starter thread when a retry is actually needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
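The thread-scoping problem described in this commit message can be illustrated with a small sketch. It does not use Mockito: a `ThreadLocal` hook stands in for a thread-scoped construction mock, and every name in it (`ThreadScopedConstructionDemo`, `intercept`, `createClient`, the `"mock-client"` and `"real-client"` strings) is hypothetical and chosen only for illustration. It shows why an object built on a second thread escapes the interception, and why building the first instance on the calling thread and handing it off keeps the interception intact.

```java
import java.util.function.Supplier;

final class ThreadScopedConstructionDemo {
    // Hypothetical stand-in for a thread-scoped construction hook (NOT Mockito's API):
    // the override is only visible to the thread that registered it.
    private static final ThreadLocal<Supplier<String>> OVERRIDE = new ThreadLocal<>();

    static void intercept(Supplier<String> fake) { OVERRIDE.set(fake); }

    static void clear() { OVERRIDE.remove(); }

    // "Constructs" a client; intercepted only if the current thread registered an override.
    static String createClient() {
        Supplier<String> fake = OVERRIDE.get();
        return fake != null ? fake.get() : "real-client";
    }

    // Runs the body on a separate thread and waits for it to finish.
    static void runAndJoin(Runnable body) {
        Thread t = new Thread(body, "worker-starter");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Construction happens on the starter thread, so the caller's override is not seen.
    static String createOnNewThread() {
        String[] out = new String[1];
        runAndJoin(() -> out[0] = createClient());
        return out[0];
    }

    // Construction happens on the calling thread; only the finished object is handed off.
    static String createOnCallerThenHandOff() {
        String first = createClient();
        String[] seen = new String[1];
        runAndJoin(() -> seen[0] = first);
        return seen[0];
    }

    public static void main(String[] args) {
        intercept(() -> "mock-client");
        try {
            System.out.println(createOnNewThread());         // prints "real-client"
            System.out.println(createOnCallerThenHandOff()); // prints "mock-client"
        } finally {
            clear();
        }
    }
}
```

In this sketch a retry on the starter thread would call `createClient()` there and get the non-intercepted result, which matches the commit's choice to fall back to a fresh construction on that thread only when a retry is actually needed.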
1d92a40 to
cf8d472
giggsoff pushed a commit to arenadata/celeborn that referenced this pull request Jun 5, 2026
Create a fresh Worker instance on each retry and stop with EXIT_IMMEDIATELY to avoid port-already-in-use failures on rapid restarts (pattern from apache#3704). Use exponential backoff between retries.
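The retry pattern described in this commit message could look roughly like the sketch below. It is not Celeborn's code: `Startable`, `startWithRetry`, `backoffMillis`, and the doubling delay schedule are hypothetical names and choices made for illustration, and `stopImmediately()` merely stands in for stopping with `EXIT_IMMEDIATELY`. Each attempt asks the factory for a new instance, a failed instance is stopped before the next attempt, and the pause doubles after every failure.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class FreshInstanceRetryDemo {
    /** Hypothetical stand-in for a worker; not Celeborn's Worker class. */
    interface Startable {
        void initialize() throws Exception;
        void stopImmediately();
    }

    /** Delay after the failed attempt with the given zero-based index: base, 2*base, 4*base, ... */
    static long backoffMillis(long baseMillis, int attemptIndex) {
        return baseMillis << attemptIndex; // assumes a small number of attempts (no overflow guard)
    }

    /** Returns the first instance whose initialize() succeeds, or throws after maxAttempts failures. */
    static Startable startWithRetry(Supplier<Startable> factory, int maxAttempts, long baseMillis) {
        Exception lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Startable candidate = factory.get(); // fresh instance per attempt, never reused
            try {
                candidate.initialize();
                return candidate;
            } catch (Exception e) {
                lastFailure = e;
                candidate.stopImmediately(); // release whatever the failed instance acquired
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis(baseMillis, attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", lastFailure);
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        Supplier<Startable> factory = () -> {
            final int id = created.incrementAndGet();
            return new Startable() {
                @Override public void initialize() throws Exception {
                    // Simulated failure for the first two instances.
                    if (id < 3) throw new IOException("simulated bind failure");
                }
                @Override public void stopImmediately() { }
            };
        };
        startWithRetry(factory, 5, 1L);
        System.out.println("instances created: " + created.get()); // prints "instances created: 3"
    }
}
```

Taking a factory rather than an instance is the point of the design: a caller cannot accidentally reuse an object that already failed once.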
What changes were proposed in this pull request?
Changes vs. before:
- `createWorker` is now inside the retry loop, so each attempt gets a fresh Worker with a fresh random port.
- A failed attempt is cleaned up with `worker.stop(EXIT_IMMEDIATELY)` (which calls `metricsSystem.stop()`), instead of `shutdownGracefully()`, which leaves `MetricsSystem` running.
- `workerStarted = true` is set only after `initialize()` succeeds (no need to reset it in the catch).
- The lock is released in `finally` so an early failure can't leak it.

Why are the changes needed?
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?