(fix) retry transient Ray ActorUnavailableError during rollout engine bringup #2059

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:upstream-retry-bringup

EazyReal commented Jun 12, 2026

Contributor

Problem

During colocated rollout-engine startup, the SGLang engine actors can already be healthy while Ray briefly reports a peer actor as unavailable because of a transient control-plane miss (a missed heartbeat or a gRPC UNAVAILABLE response):

ray.exceptions.ActorUnavailableError: ... RpcError ... UNAVAILABLE

Current main starts the rollout engines first and then waits on the collected engine.init refs in RolloutManager.__init__. Encoder-disaggregated startup also has a synchronous encoder wait, because the encoder URLs are needed before the language workers can start. A transient ActorUnavailableError at either wait boundary can abort startup, even though re-awaiting the same already-submitted refs is safe.

Fix

Add one shared bounded retry primitive:

  • slime.utils.retry.retry_with_backoff(thunk, *, should_retry, what, ...)
  • The helper owns retry count, backoff, jitter, logging, and exhaustion behavior.
  • The caller owns the retry predicate, so backend-specific classification stays at the call site.
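
The division of responsibility above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the `thunk`, `should_retry`, and `what` parameters come from the description, while `max_attempts`, `base_delay`, `max_delay`, `jitter`, and `sleep` are hypothetical names with assumed defaults, and re-raising the last error on exhaustion is also an assumption.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    what: str,
    max_attempts: int = 5,    # hypothetical parameter and default
    base_delay: float = 1.0,  # hypothetical: seconds before the first retry
    max_delay: float = 30.0,  # hypothetical: cap on the exponential delay
    jitter: float = 0.5,      # hypothetical: up to +50% random extra delay
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Call `thunk` until it succeeds, the predicate rejects the error,
    or `max_attempts` is reached. `thunk` must be idempotent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        # `except Exception` deliberately leaves KeyboardInterrupt and
        # SystemExit alone: both derive from BaseException, not Exception.
        except Exception as exc:
            if not should_retry(exc):
                raise  # predicate-rejected errors propagate immediately
            if attempt == max_attempts:
                logger.error("%s: giving up after %d attempts", what, attempt)
                raise  # assumed exhaustion behavior: re-raise the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= 1.0 + random.uniform(0.0, jitter)
            logger.warning(
                "%s failed (attempt %d/%d): %r; retrying in %.2fs",
                what, attempt, max_attempts, exc, delay,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps CPU-only tests fast: a test can pass `sleep=delays.append` and assert on the recorded delays instead of actually waiting.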

Use that helper for rollout engine bringup waits:

  • RolloutManager.__init__ waits on deferred rollout init handles through the helper.
  • Encoder-disaggregated startup waits on encoder init handles through the helper.
  • Only ray.exceptions.ActorUnavailableError is retried.
  • Real init failures such as CUDA OOM/config/assert errors and permanent actor death still propagate immediately.

The important invariant is: retry the wait, not engine creation. The same ObjectRefs are re-awaited; no engine actors are recreated.
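
A small sketch of that invariant, written so it runs without a Ray cluster. The exception class and `FakeGet` are stand-ins for `ray.exceptions.ActorUnavailableError` and `ray.get`, the body of `_is_transient_ray_unavailable` is an assumption about how the predicate classifies errors, and `wait_for_init` is a hypothetical helper that omits backoff for brevity.

```python
from typing import Callable, List, Sequence


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError (no Ray needed)."""


def _is_transient_ray_unavailable(exc: Exception) -> bool:
    # Assumed classification: only the transient "actor unavailable" signal
    # is retried; CUDA OOM, config, and assert errors are not.
    return isinstance(exc, ActorUnavailableError)


def wait_for_init(
    init_refs: Sequence[object],
    get: Callable[[Sequence[object]], List[object]],
    max_attempts: int = 3,  # hypothetical bound
) -> List[object]:
    """Re-await the SAME already-submitted refs; never resubmit engine.init."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get(init_refs)  # in a real call site this would be ray.get
        except Exception as exc:
            if not _is_transient_ray_unavailable(exc) or attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


class FakeGet:
    """Fake ray.get that fails transiently `failures` times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.seen: List[tuple] = []  # records which refs each wait received

    def __call__(self, refs: Sequence[object]) -> List[object]:
        self.seen.append(tuple(refs))
        if self.failures > 0:
            self.failures -= 1
            raise ActorUnavailableError("RpcError: UNAVAILABLE")
        return [f"ready:{ref}" for ref in refs]
```

With `FakeGet(failures=2)`, every entry in `seen` is the same tuple of refs, which is the property the call-site tests check: the wait is repeated, while engine creation happens exactly once, before the wait.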

Consistency with follow-up retry PRs

This establishes the retry contract used by the onload follow-up: every bounded retry uses retry_with_backoff; rollout/Ray call sites only provide an idempotent thunk plus _is_transient_ray_unavailable.

Tests

CPU-only tests:

  • tests/test_retry.py covers the generic helper: success after transient failures, exhaustion, immediate propagation of predicate-rejected errors, and not catching KeyboardInterrupt / SystemExit.
  • tests/test_rollout_bringup_retry.py covers the rollout call site: transient actor-unavailable is retried on the same handles, non-transient errors are not retried, and the encoder wait path uses the same helper.

Validation

  • PYTHONPATH=. uv run --with pytest pytest tests/test_retry.py tests/test_rollout_bringup_retry.py
  • uv run --with pre-commit pre-commit run --files .github/workflows/pr-test.yml .github/workflows/pr-test.yml.j2 slime/ray/rollout.py slime/utils/retry.py tests/test_retry.py tests/test_rollout_bringup_retry.py

EazyReal changed the title from "Retry transient Ray ActorUnavailableError during rollout engine bringup" to "(fix) retry transient Ray ActorUnavailableError during rollout engine bringup" on Jun 12, 2026
EazyReal force-pushed the upstream-retry-bringup branch 2 times, most recently from f456935 to 52d247d, on June 12, 2026 08:51
EazyReal force-pushed the upstream-retry-bringup branch from 52d247d to 63bebe6 on June 19, 2026 03:42
