(fix) retry transient Ray ActorUnavailableError during rollout engine bringup by EazyReal · Pull Request #2059 · THUDM/slime

EazyReal · 2026-06-12T01:14:30Z

Problem

During colocated rollout-engine startup, the SGLang engine actors can already be healthy while Ray briefly reports a peer actor as unavailable because of a transient control-plane failure, such as a missed heartbeat surfacing as gRPC UNAVAILABLE:

ray.exceptions.ActorUnavailableError: ... RpcError ... UNAVAILABLE

Current main starts rollout engines first and then waits for the collected engine.init refs in RolloutManager.__init__. Encoder-disaggregated startup also has a synchronous encoder wait because encoder URLs are needed before language workers can start. A transient ActorUnavailableError at either wait boundary can abort startup even though re-awaiting the same already-submitted refs is safe.

Fix

Add one shared bounded retry primitive:

slime.utils.retry.retry_with_backoff(thunk, *, should_retry, what, ...)
The helper owns retry count, backoff, jitter, logging, and exhaustion behavior.
The caller owns the retry predicate, so backend-specific classification stays at the call site.
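
A minimal sketch of what such a helper could look like, to make the division of responsibility concrete. Only `thunk`, `should_retry`, and `what` come from the signature above; `max_attempts`, `base_delay`, `max_delay`, and `sleep` are hypothetical names chosen for illustration and are not taken from the PR.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    what: str,
    max_attempts: int = 5,  # hypothetical parameter name and default
    base_delay: float = 1.0,  # hypothetical: first backoff in seconds
    max_delay: float = 30.0,  # hypothetical: cap on the un-jittered backoff
    sleep: Callable[[float], None] = time.sleep,  # hypothetical: injectable for tests
) -> T:
    """Call thunk() until it succeeds, the predicate rejects the error, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        except Exception as exc:
            # KeyboardInterrupt and SystemExit derive from BaseException,
            # so they are never caught here and always propagate.
            if not should_retry(exc) or attempt == max_attempts:
                raise
            # Exponential backoff, capped, then scaled by a random jitter factor.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= random.uniform(0.5, 1.5)
            logger.warning(
                "%s failed (attempt %d/%d): %r; retrying in %.2fs",
                what, attempt, max_attempts, exc, delay,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

The helper never inspects the exception type itself; classification is delegated entirely to `should_retry`, which is what keeps backend-specific knowledge at the call site.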

Use that helper for rollout engine bringup waits:

RolloutManager.__init__ waits on deferred rollout init handles through the helper.
Encoder-disaggregated startup waits on encoder init handles through the helper.
Only ray.exceptions.ActorUnavailableError is retried.
Real init failures such as CUDA OOM/config/assert errors and permanent actor death still propagate immediately.

The important invariant is: retry the wait, not engine creation. The same ObjectRefs are re-awaited; no engine actors are recreated.
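
The invariant can be illustrated with a small self-contained simulation. Everything here is a stand-in so the sketch runs without Ray: the local `ActorUnavailableError` class substitutes for `ray.exceptions.ActorUnavailableError`, `FakeInitHandles` substitutes for the already-submitted `engine.init` refs, and the inline loop in `wait_for_init` (a hypothetical name) substitutes for the shared helper. Only the predicate name `_is_transient_ray_unavailable` comes from the PR description.

```python
import time


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError (assumption for this sketch)."""


def _is_transient_ray_unavailable(exc: Exception) -> bool:
    # Only the transient "actor unavailable" signal is retryable; OOM,
    # config, assert, and actor-death errors fall through and propagate.
    return isinstance(exc, ActorUnavailableError)


class FakeInitHandles:
    """Stand-in for init refs that were submitted exactly once."""

    def __init__(self, transient_failures: int):
        self.transient_failures = transient_failures
        self.wait_calls = 0

    def wait(self):
        # With Ray this would be a ray.get(...) on the same ObjectRefs.
        self.wait_calls += 1
        if self.wait_calls <= self.transient_failures:
            raise ActorUnavailableError("RpcError ... UNAVAILABLE")
        return ["engine-0 ready", "engine-1 ready"]


def wait_for_init(handles, *, max_attempts: int = 4, delay: float = 0.0):
    """Re-await the same handles; engine creation is never repeated."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handles.wait()
        except Exception as exc:
            if not _is_transient_ray_unavailable(exc) or attempt == max_attempts:
                raise
            time.sleep(delay)
```

Because the thunk only waits on existing handles, calling it again after a transient error has no side effects, which is what makes the retry safe.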

Consistency with follow-up retry PRs

This establishes the retry contract used by the onload follow-up: every bounded retry uses retry_with_backoff; rollout/Ray call sites only provide an idempotent thunk plus _is_transient_ray_unavailable.

Tests

CPU-only tests:

tests/test_retry.py covers the generic helper: success after transient failures, exhaustion, immediate propagation of predicate-rejected errors, and not catching KeyboardInterrupt / SystemExit.
tests/test_rollout_bringup_retry.py covers the rollout call site: transient actor-unavailable is retried on the same handles, non-transient errors are not retried, and the encoder wait path uses the same helper.

Validation

PYTHONPATH=. uv run --with pytest pytest tests/test_retry.py tests/test_rollout_bringup_retry.py
uv run --with pre-commit pre-commit run --files .github/workflows/pr-test.yml .github/workflows/pr-test.yml.j2 slime/ray/rollout.py slime/utils/retry.py tests/test_retry.py tests/test_rollout_bringup_retry.py

EazyReal changed the title ~~Retry transient Ray ActorUnavailableError during rollout engine bringup~~ (fix) retry transient Ray ActorUnavailableError during rollout engine bringup Jun 12, 2026

EazyReal force-pushed the upstream-retry-bringup branch 2 times, most recently from f456935 to 52d247d on June 12, 2026 08:51

Retry transient rollout engine bringup waits

63bebe6

EazyReal force-pushed the upstream-retry-bringup branch from 52d247d to 63bebe6 on June 19, 2026 03:42

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:upstream-retry-bringup
