[fix] Use masked mean in advantage batch normalization by discobot · Pull Request #1782 · NovaSky-AI/SkyRL

discobot · 2026-06-12T08:15:50Z

The issue predates a refactor: normalize_advantages_dict has since moved byte-for-byte from ppo_utils.py into RayPPOTrainer._normalize_advantages on current main, bug intact, so the fix lands there. FullyAsyncRayPPOTrainer inherits the method, so the one change covers both the sync and fully-async paths.

The new regression test normalizes the same padded batch twice — once with zeros and once with garbage in the padding positions — and asserts identical results on valid positions, with masked mean 0 and masked std 1. It fails before the fix (padding garbage completely rewrites the normalized advantages on valid positions) and passes after, along with the rest of tests/train/test_trainer.py and tests/backends/skyrl_train/utils/test_ppo_utils.py.

_normalize_advantages computed the z-score mean over all positions including padding, while the variance was already masked by response_mask. Padding contents (zeros, or non-zero GAE leftovers) biased the mean and corrupted both the centering and the variance estimate. Compute the mean with masked_mean so both statistics use only valid response positions, and add a regression test covering padding invariance and the normalized statistics. Fixes NovaSky-AI#1305 Signed-off-by: Philipp Sinitsin <ph.sinitsyn@gmail.com>

gemini-code-assist

Code Review

This pull request fixes an issue where advantage batch normalization was biased by padding positions by using masked_mean instead of a simple mean. It also adds a regression test to verify this behavior. A review comment points out a potential division-by-zero issue when response_mask has only zeros, which would make num_actions zero, and suggests clamping num_actions to a minimum of 1.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T08:32:51Z

@@ -1236,7 +1236,7 @@ def _normalize_advantages(
        # Step 1: Z-score normalization (if enabled)
        if self.cfg.trainer.algorithm.advantage_batch_normalize:
            num_actions = response_mask.sum()


If response_mask contains only zeros (e.g., in an all-padded batch or during certain edge-case evaluations), num_actions will be 0. This leads to a division by zero (0 / 0) when computing std / num_actions, resulting in NaN values. Since clamp on NaN still returns NaN, this propagates NaN to the advantages and subsequently to the loss and gradients, which can corrupt the model weights.\n\nTo prevent this, clamp num_actions to a minimum of 1.

Suggested change

num_actions = response_mask.sum()

num_actions = response_mask.sum().clamp(min=1)

The std / num_actions division is pre-existing and unchanged by this PR; the line this PR does change is already guarded, since masked_mean clamps its denominator to min=1.0 (torch_utils.py:180), same as the other ~20 call sites. An all-zero batch-wide response_mask means zero trainable tokens, which masked_var already treats as a hard error elsewhere.

discobot mentioned this pull request Jun 12, 2026

[train][bug] normalize_advantages_dict computes mean over padding positions #1305

Open

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Use masked mean in advantage batch normalization#1782

[fix] Use masked mean in advantage batch normalization#1782
discobot wants to merge 1 commit into
NovaSky-AI:mainfrom
discobot:fix/1305-masked-adv-norm-mean

discobot commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

discobot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	num_actions = response_mask.sum()
	num_actions = response_mask.sum().clamp(min=1)

Conversation

discobot commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

discobot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant