(fix) retry transient Ray ActorUnavailableError during rollout engine bringup #2059
Open
EazyReal wants to merge 1 commit into
Problem
During colocated rollout-engine startup, the SGLang engine actors can already be healthy while Ray briefly reports a peer actor as unavailable because of a transient control-plane heartbeat / gRPC `UNAVAILABLE` miss.
Current `main` starts rollout engines first and then waits for the collected `engine.init` refs in `RolloutManager.__init__`. Encoder-disaggregated startup also has a synchronous encoder wait because encoder URLs are needed before language workers can start. A transient `ActorUnavailableError` at either wait boundary can abort startup even though re-awaiting the same already-submitted refs is safe.
Fix
Add one shared bounded retry primitive:
`slime.utils.retry.retry_with_backoff(thunk, *, should_retry, what, ...)`
Use that helper for rollout engine bringup waits:
- `RolloutManager.__init__` waits on deferred rollout init handles through the helper.
- `ray.exceptions.ActorUnavailableError` is retried.
The important invariant is: retry the wait, not engine creation. The same ObjectRefs are re-awaited; no engine actors are recreated.
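For reviewers who want a concrete picture of the contract, here is a minimal sketch of what a bounded retry helper with this signature might look like. Only `thunk`, `should_retry`, and `what` come from the PR description; `max_attempts`, `base_delay_s`, `max_delay_s`, and `sleep` are assumptions standing in for the elided `...`, and the body is illustrative rather than the code in `slime/utils/retry.py`.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_with_backoff(
    thunk: Callable[[], T],
    *,
    should_retry: Callable[[Exception], bool],
    what: str,
    max_attempts: int = 5,        # assumption: not named in the PR
    base_delay_s: float = 0.5,    # assumption
    max_delay_s: float = 8.0,     # assumption
    sleep: Callable[[float], None] = time.sleep,  # assumption: injectable for tests
) -> T:
    """Call `thunk` until it succeeds, retrying only errors accepted by `should_retry`."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return thunk()
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # so `except Exception` lets them propagate untouched.
        except Exception as exc:
            if attempt == max_attempts or not should_retry(exc):
                raise
            logger.warning(
                "%s: attempt %d/%d failed (%r); retrying in %.2fs",
                what, attempt, max_attempts, exc, delay,
            )
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap
    raise AssertionError("unreachable")
```

At a rollout call site the thunk would close over refs that were already submitted, for example `lambda: ray.get(init_refs)`, so each attempt re-awaits the same ObjectRefs instead of resubmitting `engine.init`.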
Consistency with follow-up retry PRs
This establishes the retry contract used by the onload follow-up: every bounded retry uses `retry_with_backoff`; rollout/Ray call sites only provide an idempotent thunk plus `_is_transient_ray_unavailable`.
Tests
CPU-only tests:
- `tests/test_retry.py` covers the generic helper: success after transient failures, exhaustion, immediate propagation of predicate-rejected errors, and not catching `KeyboardInterrupt` / `SystemExit`.
- `tests/test_rollout_bringup_retry.py` covers the rollout call site: transient actor-unavailable is retried on the same handles, non-transient errors are not retried, and the encoder wait path uses the same helper.
Validation
PYTHONPATH=. uv run --with pytest pytest tests/test_retry.py tests/test_rollout_bringup_retry.py
uv run --with pre-commit pre-commit run --files .github/workflows/pr-test.yml .github/workflows/pr-test.yml.j2 slime/ray/rollout.py slime/utils/retry.py tests/test_retry.py tests/test_rollout_bringup_retry.py
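The call-site property listed under Tests (retry the wait on the same handles, never recreate engines) can be illustrated with a CPU-only sketch that needs no Ray install. Everything here is hypothetical scaffolding: `ActorUnavailableError` is a local stand-in for `ray.exceptions.ActorUnavailableError`, the body of `_is_transient_ray_unavailable` is a guess at what the PR's predicate checks, and `FakeEngineFleet`, `wait_with_retry`, and `bring_up` are invented names, not code from `slime/ray/rollout.py`.

```python
class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.ActorUnavailableError so the sketch runs without Ray."""


def _is_transient_ray_unavailable(exc: Exception) -> bool:
    # Hypothetical body: the PR names this predicate but does not show it.
    return isinstance(exc, ActorUnavailableError)


class FakeEngineFleet:
    """Counts engine creations and waits separately so the invariant can be checked."""

    def __init__(self, failures_before_ready: int) -> None:
        self.created = 0
        self.waits = 0
        self._failures_left = failures_before_ready

    def start_engines(self) -> list:
        self.created += 1
        return ["init-ref-0", "init-ref-1"]

    def wait(self, refs: list) -> list:
        self.waits += 1
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ActorUnavailableError("transient gRPC UNAVAILABLE")
        return [f"{ref}:ready" for ref in refs]


def wait_with_retry(wait_once, *, should_retry, max_attempts=3):
    """Local stand-in for the shared helper; no backoff sleep, to keep the check fast."""
    for attempt in range(1, max_attempts + 1):
        try:
            return wait_once()
        except Exception as exc:
            if attempt == max_attempts or not should_retry(exc):
                raise


def bring_up(fleet: FakeEngineFleet) -> list:
    refs = fleet.start_engines()  # engines are submitted exactly once
    # Only the wait is wrapped, so every attempt re-awaits the same refs.
    return wait_with_retry(
        lambda: fleet.wait(refs),
        should_retry=_is_transient_ray_unavailable,
    )
```

With a fleet that fails twice before becoming ready, `bring_up` performs three waits against a single creation, which is the "same handles" behaviour the rollout test describes.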