Attach the failed actor's log tail when inference engines die during init #1771
jamesbraza wants to merge 1 commit into
Code Review
This pull request introduces helpers for Ray actor log handling to capture and recover log tails for diagnostics when an actor fails. It adds get_actor_logs_tail and reraise_with_actor_diagnostics to skyrl/train/utils/ray_logging.py and integrates them into inference engine initialization and server group startup. Feedback on these changes suggests: 1) using collections.deque when reading log tails to prevent memory overhead or OOM crashes on large files, 2) catching both RayTaskError and ActorDiedError during server group startup to ensure diagnostics are always collected, and 3) safely converting actor_id attributes to hex strings to ensure compatibility across different Ray versions.
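The first and third suggestions can be illustrated with a minimal sketch. This is not the code in the PR: `tail_lines` and `actor_id_hex` are hypothetical helper names, and the assumption that an actor ID is either already a string or an object exposing a `hex()` method is made here for illustration only.

```python
from collections import deque


def tail_lines(path, max_lines=200):
    """Return up to the last `max_lines` lines of `path` as one string.

    Iterating the file object streams it line by line, and a deque with
    `maxlen` discards older lines as it goes, so memory stays bounded by
    the tail size rather than by the size of the log file.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return "".join(deque(f, maxlen=max_lines))


def actor_id_hex(actor_id):
    """Best-effort conversion of an actor ID to a hex string.

    Assumption: the ID is None, already a str, or an object with a
    callable `hex()` method; anything else falls back to str().
    """
    if actor_id is None:
        return None
    if isinstance(actor_id, str):
        return actor_id
    hex_fn = getattr(actor_id, "hex", None)
    return hex_fn() if callable(hex_fn) else str(actor_id)
```

The whole file is still read from start to end, so this bounds memory but not I/O; seeking backwards from the end of the file would be the next step if very large logs are common.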
…init
When a vLLM v1 engine-core child process dies during engine init, the
driver-side exception bottoms out at vLLM's wait_for_engine_startup with
just "Engine core initialization failed. ... Failed core proc(s): {}".
The root-cause traceback lands only in the actor's stderr: the shared
SKYRL_LOG_FILE when set (actors dup2 their output there), else the Ray
session's worker-*.err on the actor's node.
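
The lookup order described above could be sketched as follows. This is an illustration only, not the PR's implementation: `candidate_log_paths` and its `session_log_dir` parameter are hypothetical, and the sketch only looks at files on the local node, whereas the PR fetches across nodes.

```python
import glob
import os


def candidate_log_paths(session_log_dir, environ=None):
    """List the files where a failed actor's stderr may have landed.

    Prefers the shared SKYRL_LOG_FILE when it is set; otherwise falls
    back to the worker-*.err files under `session_log_dir` (assumed here
    to be the Ray session's log directory on this node).
    """
    environ = os.environ if environ is None else environ
    shared = environ.get("SKYRL_LOG_FILE")
    if shared:
        return [shared]
    return sorted(glob.glob(os.path.join(session_log_dir, "worker-*.err")))
```
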
Add get_actor_logs_tail() + reraise_with_actor_diagnostics() to
ray_logging.py, collecting both locations best-effort (Ray state API for
cross-node fetch of the failed actor named by the exception, bounded
timeouts, never masks the original error), and wire them into the two
engine-init blocking ray.get() sites:
- create_ray_wrapped_inference_engines (legacy path; catches
ActorDiedError from the sleep barrier)
- create_inference_servers (new stack; also catches RayTaskError since
engine init runs inside the still-alive actor's start())
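
The "never masks the original error" behaviour around a blocking call can be sketched generically. This is a hedged illustration rather than the PR's `reraise_with_actor_diagnostics`: `attach_diagnostics` and `collect_tail` are hypothetical names, the sketch wraps the failure in a `RuntimeError` (the PR may do this differently), and it omits the bounded timeouts mentioned above.

```python
from contextlib import contextmanager


@contextmanager
def attach_diagnostics(collect_tail, exc_types=(Exception,)):
    """Re-raise failures from the wrapped block with a log tail attached.

    `collect_tail(exc)` is a caller-supplied function returning a string.
    Collection is best-effort: if it raises or returns nothing, the
    original exception propagates unchanged.
    """
    try:
        yield
    except exc_types as exc:
        try:
            tail = collect_tail(exc)
        except Exception:
            tail = None  # diagnostics must never hide the real failure
        if not tail:
            raise  # re-raise the original exception untouched
        message = f"{exc}\n--- failed actor log tail ---\n{tail}"
        raise RuntimeError(message) from exc
```

At the two call sites, the wrapped block would be the blocking `ray.get()` and `exc_types` would be the Ray exception classes named above; `from exc` keeps the original traceback reachable through `__cause__`.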
Fixes NovaSky-AI#1673
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2564fa5 to e6218e4
Fixes #1673.
When a vLLM v1 engine-core child process dies during engine init, this is everything the driver surfaces (1×H100; gpu_memory_utilization=0.999 forces the death; paths trimmed):
The "See root cause above" message in that output does not actually contain the root cause. Now, after this PR:
Now issues can be root-caused immediately.