Attach the failed actor's log tail when inference engines die during init by jamesbraza · Pull Request #1771 · NovaSky-AI/SkyRL

jamesbraza · 2026-06-10T03:52:57Z

When a vLLM v1 engine-core child process dies during engine init, this is everything the driver surfaces (1×H100; gpu_memory_utilization=0.999 forces the death; paths trimmed):

  File ".../skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py", line 325, in create_ray_wrapped_inference_engines
    ray.get(sleep_refs)
  ...
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() (pid=434344, ip=192.168.21.141, actor_id=61c0b935aa9d7e8671d62b4c01000000, repr=<...AsyncVLLMInferenceEngine object at 0xc635b8d3d70>)
  ...
  File ".../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 115, in __init__
    self.llm = self._create_engine(*args, **kwargs)
  ...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Despite the "See root cause above" message, the root cause appears nowhere in this traceback. After this PR, the driver surfaces this instead:

Traceback (most recent call last):
  ...
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() (pid=434344, ip=192.168.21.141, actor_id=61c0b935aa9d7e8671d62b4c01000000, repr=<...AsyncVLLMInferenceEngine object at 0xc635b8d3d70>)
  ...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
RuntimeError: Inference engine actor(s) died during engine initialization. The root-cause traceback usually lives in the failed actor's logs, not in the current Ray exception.

--- stderr tail of actor 61c0b935aa9d7e8671d62b4c01000000 (full log: `ray logs actor --id 61c0b935aa9d7e8671d62b4c01000000 --err`) ---
...
(EngineCore pid=435556)   File ".../vllm/v1/worker/gpu_worker.py", line 300, in init_device
(EngineCore pid=435556)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=435556)   File ".../vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=435556)     raise ValueError(
(EngineCore pid=435556) ValueError: Free memory on device cuda:0 (78.67/79.18 GiB) on startup is less than desired GPU memory utilization (0.999, 79.1 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

The root cause is now visible directly in the driver's traceback, so these failures can be diagnosed immediately.
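The "direct cause" chaining shown in the traceback above comes from Python's `raise ... from ...`. A minimal sketch of that pattern follows. It assumes the log tail has already been fetched as a string; `reraise_with_log_tail` and `demo` are hypothetical names for illustration, not the helpers added in this PR.

```python
def reraise_with_log_tail(exc: BaseException, log_tail: str) -> None:
    """Raise a new RuntimeError chained to `exc`, with the log tail in its message.

    Hypothetical helper: the PR's real function may differ in name and signature.
    """
    message = (
        "Inference engine actor(s) died during engine initialization.\n"
        "--- stderr tail ---\n"
        f"{log_tail}"
    )
    # `from exc` sets __cause__, which makes Python print
    # "The above exception was the direct cause of the following exception".
    raise RuntimeError(message) from exc


def demo() -> RuntimeError:
    """Trigger the chaining once and return the wrapped exception for inspection."""
    try:
        try:
            raise RuntimeError("Engine core initialization failed.")
        except RuntimeError as original:
            # Placeholder text stands in for the real actor stderr.
            reraise_with_log_tail(original, "ValueError: <root cause from actor stderr>")
    except RuntimeError as wrapped:
        return wrapped
    raise AssertionError("unreachable: reraise_with_log_tail always raises")
```

Because the original exception is kept as `__cause__`, nothing from the earlier traceback is lost; the log tail is purely additive.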

gemini-code-assist

Code Review

This pull request introduces helpers for Ray actor log handling to capture and recover log tails for diagnostics when an actor fails. It adds get_actor_logs_tail and reraise_with_actor_diagnostics to skyrl/train/utils/ray_logging.py and integrates them into inference engine initialization and server group startup. Feedback on these changes suggests: 1) using collections.deque when reading log tails to prevent memory overhead or OOM crashes on large files, 2) catching both RayTaskError and ActorDiedError during server group startup to ensure diagnostics are always collected, and 3) safely converting actor_id attributes to hex strings to ensure compatibility across different Ray versions.
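Two of the review suggestions can be sketched without Ray at all. The snippet below is an illustration under assumptions, not the code in `ray_logging.py`: `tail_lines` and `actor_id_hex` are hypothetical names, and the assumption that an actor ID may arrive either as an object with a `.hex()` method or as a plain string is ours.

```python
from collections import deque
from pathlib import Path


def tail_lines(path: Path, max_lines: int = 50) -> str:
    """Return the last `max_lines` lines of a text file.

    Iterating the file through a bounded deque keeps at most `max_lines`
    lines in memory, however large the log file is.
    """
    with path.open("r", errors="replace") as f:
        return "".join(deque(f, maxlen=max_lines))


def actor_id_hex(actor_id: object) -> str:
    """Best-effort conversion of an actor ID to a hex string.

    Assumption: the ID is either an object exposing `.hex()` or already
    a string; anything else falls back to `str()`.
    """
    hex_method = getattr(actor_id, "hex", None)
    return hex_method() if callable(hex_method) else str(actor_id)
```

The deque approach still reads the file from the beginning, so it bounds memory rather than I/O; seeking from the end would be needed to bound both.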


…init

When a vLLM v1 engine-core child process dies during engine init, the driver-side exception bottoms out at vLLM's wait_for_engine_startup with just "Engine core initialization failed. ... Failed core proc(s): {}". The root-cause traceback lands only in the actor's stderr: the shared SKYRL_LOG_FILE when set (actors dup2 their output there), else the Ray session's worker-*.err on the actor's node.

Add get_actor_logs_tail() + reraise_with_actor_diagnostics() to ray_logging.py, collecting both locations best-effort (Ray state API for cross-node fetch of the failed actor named by the exception, bounded timeouts, never masks the original error), and wire them into the two engine-init blocking ray.get() sites:

- create_ray_wrapped_inference_engines (legacy path; catches ActorDiedError from the sleep barrier)
- create_inference_servers (new stack; also catches RayTaskError since engine init runs inside the still-alive actor's start())

Fixes NovaSky-AI#1673

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
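The "best-effort, never masks the original error" collection described in the commit message can be sketched as follows. This is an assumption-laden illustration: `collect_best_effort` is a hypothetical name, each log location is modeled as a zero-argument callable, and the bounded timeouts mentioned above are left out.

```python
from typing import Callable, Iterable


def collect_best_effort(fetchers: Iterable[tuple[str, Callable[[], str]]]) -> str:
    """Gather diagnostics from several sources, tolerating failures in any of them.

    Each fetcher is a (label, zero-argument callable) pair. A fetcher that
    raises is reported inline rather than propagated, so collecting
    diagnostics can never replace the error being diagnosed.
    """
    sections = []
    for label, fetch in fetchers:
        try:
            text = fetch()
        except Exception as fetch_error:
            # Record the failure instead of raising it.
            text = f"<could not read: {fetch_error!r}>"
        if text:
            sections.append(f"--- {label} ---\n{text}")
    return "\n".join(sections)
```

In this sketch one fetcher would read the shared log file and another would query the node-local stderr file; the caller then appends the returned string to the message of the re-raised exception.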

gemini-code-assist Bot reviewed Jun 10, 2026

Comment thread skyrl/train/utils/ray_logging.py Outdated

Comment thread skyrl/backends/skyrl_train/inference_servers/setup.py

Comment thread skyrl/train/utils/ray_logging.py

Comment thread skyrl/backends/skyrl_train/inference_servers/setup.py

jamesbraza force-pushed the fix/1673-engine-init-stderr branch from 2564fa5 to e6218e4 Compare June 10, 2026 05:27

jamesbraza wants to merge 1 commit into NovaSky-AI:main from EdisonScientific:fix/1673-engine-init-stderr
