Skip to content

fix: release slot on CancelledError instead of hard-exit at epoch boundaries#1733

Open
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-cancelled-error-epoch-boundary
Open

fix: release slot on CancelledError instead of hard-exit at epoch boundaries#1733
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-cancelled-error-epoch-boundary

Conversation

@dinhxuanvu

Copy link
Copy Markdown
Contributor

Summary

At epoch boundaries, the trainer cancels in-flight generation workers via task.cancel(). Workers that have already acquired a submission slot (slot_acquired=True) receive CancelledError — this is expected behavior during normal epoch transitions, not a fatal error.

Previously (#1691), os._exit(1) was called in this case, which killed the entire process during normal epoch transitions and prevented proper error log propagation.

Changes

  • Add on_submission_cancelled() to _AsyncStalenessManager that properly undoes acquire_submission_slot() by decrementing both submitted and running counters
  • Replace os._exit(1) in the CancelledError handler with on_submission_cancelled() + return
  • The os._exit(1) in the Exception handler (for EngineDeadError and other real errors) remains unchanged

Why

acquire_submission_slot() increments both submitted and running. When a worker is cancelled mid-flight:

  • on_rollout_rejected() only decrements running, leaving submitted != accepted at epoch end
  • os._exit(1) kills the process entirely (too aggressive for normal behavior)
  • on_submission_cancelled() correctly undoes both increments, so validate_state_at_epoch_end passes

Test plan

  • Verify multi-epoch training completes without premature exit
  • Verify validate_state_at_epoch_end assertions pass at epoch boundaries
  • Verify EngineDeadError still triggers os._exit(1) (Exception handler unchanged)

At epoch boundaries, the trainer cancels in-flight generation workers.
Workers that have already acquired a submission slot (slot_acquired=True)
will receive CancelledError — this is expected behavior, not a fatal error.

Previously, os._exit(1) was called in this case, killing the entire process
during normal epoch transitions.

Add on_submission_cancelled() to _AsyncStalenessManager that properly undoes
acquire_submission_slot() by decrementing both submitted and running counters.
This ensures validate_state_at_epoch_end assertions pass correctly.

The os._exit(1) in the Exception handler (for EngineDeadError and other real
errors) remains unchanged.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces a hard process exit with a graceful cancellation handler (on_submission_cancelled) when a generation worker is cancelled mid-flight. The reviewer recommends adding defensive assertions to prevent the submitted and running counters from underflowing.

Comment thread skyrl/train/fully_async_trainer.py
Prevent silent counter underflow by asserting submitted and running
are positive before decrementing.
@dinhxuanvu

Copy link
Copy Markdown
Contributor Author

@SumanthRH Hey Sunmanth, sorry to ping you directly. This is a follow-up PR from #1691 that we were discussing. The original fix to use os.exit() on CancelledError may have been a bit too aggressive. This PR is to do a more gentle handle + return on those cases instead. PTAL. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant