Attach the failed actor's log tail when inference engines die during init#1771

Open
jamesbraza wants to merge 1 commit into NovaSky-AI:main from EdisonScientific:fix/1673-engine-init-stderr

@jamesbraza

Contributor

Fixes #1673.

When a vLLM v1 engine-core child process dies during engine init, this is everything the driver surfaces (1×H100; gpu_memory_utilization=0.999 forces the death; paths trimmed):

  File ".../skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py", line 325, in create_ray_wrapped_inference_engines
    ray.get(sleep_refs)
  ...
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() (pid=434344, ip=192.168.21.141, actor_id=61c0b935aa9d7e8671d62b4c01000000, repr=<...AsyncVLLMInferenceEngine object at 0xc635b8d3d70>)
  ...
  File ".../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 115, in __init__
    self.llm = self._create_engine(*args, **kwargs)
  ...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Despite the "See root cause above" message, the root cause appears nowhere in this output. After this PR, the driver surfaces:

Traceback (most recent call last):
  ...
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() (pid=434344, ip=192.168.21.141, actor_id=61c0b935aa9d7e8671d62b4c01000000, repr=<...AsyncVLLMInferenceEngine object at 0xc635b8d3d70>)
  ...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
RuntimeError: Inference engine actor(s) died during engine initialization. The root-cause traceback usually lives in the failed actor's logs, not in the current Ray exception.

--- stderr tail of actor 61c0b935aa9d7e8671d62b4c01000000 (full log: `ray logs actor --id 61c0b935aa9d7e8671d62b4c01000000 --err`) ---
...
(EngineCore pid=435556)   File ".../vllm/v1/worker/gpu_worker.py", line 300, in init_device
(EngineCore pid=435556)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=435556)   File ".../vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=435556)     raise ValueError(
(EngineCore pid=435556) ValueError: Free memory on device cuda:0 (78.67/79.18 GiB) on startup is less than desired GPU memory utilization (0.999, 79.1 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

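The root cause (here, the `ValueError` about GPU memory utilization) now appears directly in the driver-side exception, so init failures can be diagnosed without hunting through actor logs.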
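
The "The above exception was the direct cause of the following exception" line in the new output is what Python prints for explicit exception chaining (`raise ... from ...`). As a minimal sketch of that pattern, and not the PR's actual implementation: `reraise_with_diagnostics` and its `fetch_tail` callable are hypothetical names chosen for illustration, and the log tail is a made-up string rather than real Ray output.

```python
from typing import Callable, NoReturn


def reraise_with_diagnostics(exc: BaseException, fetch_tail: Callable[[], str]) -> NoReturn:
    """Raise a RuntimeError chained to `exc`, appending a log tail when one is available.

    Hypothetical helper for illustration only; it does not call Ray.
    """
    message = "Inference engine actor(s) died during engine initialization."
    try:
        tail = fetch_tail()
    except Exception:
        # Best-effort: a failure while collecting diagnostics must not mask `exc`.
        tail = ""
    if tail:
        message += "\n--- stderr tail ---\n" + tail
    # `from exc` sets __cause__, which produces the "direct cause" traceback section.
    raise RuntimeError(message) from exc


# Usage: wrap a stand-in for the original engine-init error.
original_error = RuntimeError("Engine core initialization failed.")
try:
    reraise_with_diagnostics(original_error, lambda: "ValueError: example root cause")
except RuntimeError as wrapped:
    caught = wrapped
```

Chaining with `from` keeps the original exception reachable as `caught.__cause__`, so nothing from the first traceback is lost when the diagnostics are attached.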

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces helpers for Ray actor log handling to capture and recover log tails for diagnostics when an actor fails. It adds get_actor_logs_tail and reraise_with_actor_diagnostics to skyrl/train/utils/ray_logging.py and integrates them into inference engine initialization and server group startup. Feedback on these changes suggests: 1) using collections.deque when reading log tails to prevent memory overhead or OOM crashes on large files, 2) catching both RayTaskError and ActorDiedError during server group startup to ensure diagnostics are always collected, and 3) safely converting actor_id attributes to hex strings to ensure compatibility across different Ray versions.
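
The first suggestion, reading a log tail with `collections.deque`, might look roughly like the sketch below. It is an illustration of the general technique only: `tail_lines` is a hypothetical name, not the PR's `get_actor_logs_tail`, and it reads a local file rather than fetching through Ray.

```python
import os
import tempfile
from collections import deque


def tail_lines(path: str, max_lines: int = 50) -> str:
    """Return the last `max_lines` lines of a text file (hypothetical helper)."""
    with open(path, "r", errors="replace") as handle:
        # deque(maxlen=...) drops the oldest line as each new one arrives,
        # so at most `max_lines` lines are held in memory at any time.
        return "".join(deque(handle, maxlen=max_lines))


# Usage: write a 1000-line stand-in log and keep only its last three lines.
with tempfile.NamedTemporaryFile("w", suffix=".err", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(1000))
last_three = tail_lines(tmp.name, max_lines=3)
os.remove(tmp.name)
```

The file is still scanned from the start, so this bounds memory rather than I/O; seeking backwards from the end would be needed to avoid reading very large logs in full.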


…init

When a vLLM v1 engine-core child process dies during engine init, the
driver-side exception bottoms out at vLLM's wait_for_engine_startup with
just "Engine core initialization failed. ... Failed core proc(s): {}".
The root-cause traceback lands only in the actor's stderr: the shared
SKYRL_LOG_FILE when set (actors dup2 their output there), else the Ray
session's worker-*.err on the actor's node.

Add get_actor_logs_tail() + reraise_with_actor_diagnostics() to
ray_logging.py, collecting both locations best-effort (Ray state API for
cross-node fetch of the failed actor named by the exception, bounded
timeouts, never masks the original error), and wire them into the two
engine-init blocking ray.get() sites:

- create_ray_wrapped_inference_engines (legacy path; catches
  ActorDiedError from the sleep barrier)
- create_inference_servers (new stack; also catches RayTaskError since
  engine init runs inside the still-alive actor's start())
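
As a rough sketch of the two stderr locations described above, assuming local file access only: `candidate_stderr_paths` is a hypothetical name, the logs directory is passed in by the caller rather than discovered, and the cross-node fetch through the Ray state API that the commit mentions is not shown.

```python
import glob
import os
import tempfile
from typing import List, Mapping


def candidate_stderr_paths(env: Mapping[str, str], ray_logs_dir: str) -> List[str]:
    """List local files that may hold the failed actor's stderr (hypothetical helper)."""
    shared = env.get("SKYRL_LOG_FILE")
    if shared:
        # When set, actors redirect their output to this shared file.
        return [shared]
    # Otherwise fall back to the per-worker stderr files in the Ray session logs.
    return sorted(glob.glob(os.path.join(ray_logs_dir, "worker-*.err")))


# Usage: a temporary directory stands in for a Ray session's logs directory.
with tempfile.TemporaryDirectory() as logs_dir:
    for name in ("worker-abc.err", "worker-abc.out"):
        open(os.path.join(logs_dir, name), "w").close()
    from_ray = [os.path.basename(p) for p in candidate_stderr_paths({}, logs_dir)]
from_shared = candidate_stderr_paths({"SKYRL_LOG_FILE": "/tmp/skyrl.log"}, "unused")
```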

Fixes NovaSky-AI#1673

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jamesbraza jamesbraza force-pushed the fix/1673-engine-init-stderr branch from 2564fa5 to e6218e4 Compare June 10, 2026 05:27

Development

Successfully merging this pull request may close these issues.

create_ray_wrapped_inference_engines drops the engine-core child's root cause on init failure