Skip to content

[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite#3704

Draft
pan3793 wants to merge 4 commits into
apache:mainfrom
pan3793:flaky-tests
Draft

[CELEBORN-XXXX] Fix flaky CelebornHashCheckDiskSuite#3704
pan3793 wants to merge 4 commits into
apache:mainfrom
pan3793:flaky-tests

Conversation

@pan3793
Copy link
Copy Markdown
Member

@pan3793 pan3793 commented May 26, 2026

What changes were proposed in this pull request?

Changes vs. before:

  • createWorker is now inside the retry loop, so each attempt gets a fresh Worker with a fresh random port.
  • On failure, the partially-initialized worker is cleaned up via worker.stop(EXIT_IMMEDIATELY) (which calls metricsSystem.stop()), instead of shutdownGracefully() which leaves MetricsSystem running.
  • Reinstated the exponential backoff (2s → 4s → 8s → 16s) that commit 2efdf75 accidentally dropped.
  • workerStarted = true is set only after initialize() succeeds (no need to reset it in the catch).
  • Lock release moved into a finally so an early failure can't leak the lock.

Why are the changes needed?

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

pan3793 added 3 commits May 26, 2026 20:14
### What changes were proposed in this pull request?

As the title, it's a packaging change.

### Why are the changes needed?

I found that `celeborn-cli` always prints such warnings, but actually the slf4j-api and log4j2 jars are correctly present in classpath.

```
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
```

after some investigation, I found that `celeborn-openapi-client-*.jar` bundles shaded slf4j classes, which causes the issue.

```
$ jar tf $CELEBORN_HOME/cli-jars/celeborn-openapi-client-*.jar | grep slf4j
...
org/apache/celeborn/shaded/org/slf4j/
org/apache/celeborn/shaded/org/slf4j/ILoggerFactory.class
org/apache/celeborn/shaded/org/slf4j/IMarkerFactory.class
org/apache/celeborn/shaded/org/slf4j/Logger.class
...
```

### Does this PR resolve a correctness bug?

- [ ] Yes

### Does this PR introduce _any_ user-facing change?

- [ ] Yes

### How was this patch tested?

Tested with `celeborn-cli`, `SLF4J` binding warnings have gone.

Closes apache#3701 from pan3793/CELEBORN-2337.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
### What changes were proposed in this pull request?

Remove hardcoded `CELEBORN_PRINT_LAUNCH_COMMAND=0` in `bin/celeborn-class`

### Why are the changes needed?

It should be picked from the environment variable.

### Does this PR resolve a correctness bug?

- [ ] Yes

### Does this PR introduce _any_ user-facing change?

- [ ] Yes

### How was this patch tested?

```
$ CELEBORN_PRINT_LAUNCH_COMMAND=1 sbin/celeborn-cli --version
...
Start to launch /opt/java/openjdk/bin/java -XX:+IgnoreUnrecognizedVMOptions -cp /opt/celeborn/conf::/opt/celeborn/cli-jars/*: org.apache.celeborn.cli.CelebornCli --version
...
```

Closes apache#3702 from pan3793/CELEBORN-2338.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
Mockito 4.x's mockConstruction is thread-scoped; moving the initial
createWorker into the worker starter thread broke WorkerSuite's
"CELEBORN-2257: Properly reports remote disks on worker registration"
because the MasterClient construction was no longer intercepted.

Construct the first Worker on the calling thread (preserving the
Mockito contract) and only fall back to a fresh createWorker on the
worker starter thread when a retry is actually needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SteNicholas SteNicholas force-pushed the main branch 3 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11
giggsoff pushed a commit to arenadata/celeborn that referenced this pull request Jun 5, 2026
Create a fresh Worker instance on each retry and stop with EXIT_IMMEDIATELY
to avoid port-already-in-use failures on rapid restarts (pattern from
apache#3704). Use exponential backoff between retries.
giggsoff pushed a commit to arenadata/celeborn that referenced this pull request Jun 5, 2026
Create a fresh Worker instance on each retry and stop with EXIT_IMMEDIATELY
to avoid port-already-in-use failures on rapid restarts (pattern from
apache#3704). Use exponential backoff between retries.
giggsoff pushed a commit to arenadata/celeborn that referenced this pull request Jun 5, 2026
Create a fresh Worker instance on each retry and stop with EXIT_IMMEDIATELY
to avoid port-already-in-use failures on rapid restarts (pattern from
apache#3704). Use exponential backoff between retries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant