Skip to content

[fix] Use masked mean in advantage batch normalization#1782

Open
discobot wants to merge 1 commit into
NovaSky-AI:mainfrom
discobot:fix/1305-masked-adv-norm-mean
Open

[fix] Use masked mean in advantage batch normalization#1782
discobot wants to merge 1 commit into
NovaSky-AI:mainfrom
discobot:fix/1305-masked-adv-norm-mean

Conversation

@discobot

Copy link
Copy Markdown
Contributor

Fixes #1305.

The issue predates a refactor: normalize_advantages_dict has since moved byte-for-byte from ppo_utils.py into RayPPOTrainer._normalize_advantages on current main, bug intact, so the fix lands there. FullyAsyncRayPPOTrainer inherits the method, so the one change covers both the sync and fully-async paths.

The new regression test normalizes the same padded batch twice — once with zeros and once with garbage in the padding positions — and asserts identical results on valid positions, with masked mean 0 and masked std 1. It fails before the fix (padding garbage completely rewrites the normalized advantages on valid positions) and passes after, along with the rest of tests/train/test_trainer.py and tests/backends/skyrl_train/utils/test_ppo_utils.py.

_normalize_advantages computed the z-score mean over all positions including padding, while the variance was already masked by response_mask. Padding contents (zeros, or non-zero GAE leftovers) biased the mean and corrupted both the centering and the variance estimate. Compute the mean with masked_mean so both statistics use only valid response positions, and add a regression test covering padding invariance and the normalized statistics.

Fixes NovaSky-AI#1305

Signed-off-by: Philipp Sinitsin <ph.sinitsyn@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes an issue where advantage batch normalization was biased by padding positions by using masked_mean instead of a simple mean. It also adds a regression test to verify this behavior. A review comment points out a potential division-by-zero issue when response_mask has only zeros, which would make num_actions zero, and suggests clamping num_actions to a minimum of 1.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread skyrl/train/trainer.py
@@ -1236,7 +1236,7 @@ def _normalize_advantages(
# Step 1: Z-score normalization (if enabled)
if self.cfg.trainer.algorithm.advantage_batch_normalize:
num_actions = response_mask.sum()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If response_mask contains only zeros (e.g., in an all-padded batch or during certain edge-case evaluations), num_actions will be 0. This leads to a division by zero (0 / 0) when computing std / num_actions, resulting in NaN values. Since clamp on NaN still returns NaN, this propagates NaN to the advantages and subsequently to the loss and gradients, which can corrupt the model weights.\n\nTo prevent this, clamp num_actions to a minimum of 1.

Suggested change
num_actions = response_mask.sum()
num_actions = response_mask.sum().clamp(min=1)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The std / num_actions division is pre-existing and unchanged by this PR; the line this PR does change is already guarded, since masked_mean clamps its denominator to min=1.0 (torch_utils.py:180), same as the other ~20 call sites. An all-zero batch-wide response_mask means zero trainable tokens, which masked_var already treats as a hard error elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[train][bug] normalize_advantages_dict computes mean over padding positions

1 participant