[fix] Use masked mean in advantage batch normalization #1782
_normalize_advantages computed the z-score mean over all positions including padding, while the variance was already masked by response_mask. Padding contents (zeros, or non-zero GAE leftovers) biased the mean and corrupted both the centering and the variance estimate.

Compute the mean with masked_mean so both statistics use only valid response positions, and add a regression test covering padding invariance and the normalized statistics.

Fixes NovaSky-AI#1305

Signed-off-by: Philipp Sinitsin <ph.sinitsyn@gmail.com>
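As a rough sketch of why an unmasked mean damages both statistics, let $A_i$ be the flattened advantages, $m_i \in \{0, 1\}$ the entries of response_mask, and $N$ the total number of positions including padding. These symbols, and the small floor $\varepsilon$ below, are notation introduced here for illustration and are not taken from the code.

$$
\mu_{\text{all}} = \frac{1}{N}\sum_{i=1}^{N} A_i,
\qquad
\mu = \frac{\sum_i m_i A_i}{\sum_i m_i},
\qquad
\sigma^2 = \frac{\sum_i m_i (A_i - \mu)^2}{\sum_i m_i}
$$

For any centering constant $c$, the masked second moment decomposes as

$$
\frac{\sum_i m_i (A_i - c)^2}{\sum_i m_i} = \sigma^2 + (\mu - c)^2 .
$$

Centering at $c = \mu_{\text{all}} \neq \mu$ therefore shifts every valid position by $\mu - \mu_{\text{all}}$ and inflates the variance estimate by $(\mu - \mu_{\text{all}})^2$. With the masked mean, the normalized advantage on valid positions becomes

$$
\hat{A}_i = \frac{A_i - \mu}{\sqrt{\max(\sigma^2, \varepsilon)}} ,
$$

which does not depend on the values stored in the padding.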
Code Review
This pull request fixes an issue where advantage batch normalization was biased by padding positions by using masked_mean instead of a simple mean. It also adds a regression test to verify this behavior. A review comment points out a potential division-by-zero issue when response_mask has only zeros, which would make num_actions zero, and suggests clamping num_actions to a minimum of 1.
@@ -1236,7 +1236,7 @@ def _normalize_advantages(
 # Step 1: Z-score normalization (if enabled)
 if self.cfg.trainer.algorithm.advantage_batch_normalize:
     num_actions = response_mask.sum()
If response_mask contains only zeros (e.g., in an all-padded batch or during certain edge-case evaluations), num_actions will be 0. This leads to a division by zero (0 / 0) when computing std / num_actions, resulting in NaN values. Since clamp on NaN still returns NaN, this propagates NaN to the advantages and subsequently to the loss and gradients, which can corrupt the model weights.

To prevent this, clamp num_actions to a minimum of 1.
-    num_actions = response_mask.sum()
+    num_actions = response_mask.sum().clamp(min=1)
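The failure mode described in this comment can be reproduced in isolation. The snippet below is a minimal sketch rather than the trainer code: `mean_square_over_mask` is a hypothetical helper, and the `1e-8` floor is an assumed value chosen only to show that clamping after the division does not remove a NaN.

```python
import torch

advantages = torch.tensor([[0.5, -0.5, 0.0]])
empty_mask = torch.zeros_like(advantages)  # all-padding batch: no valid positions


def mean_square_over_mask(values, mask, min_count=None):
    """Hypothetical helper: masked sum of squares divided by the mask count."""
    count = mask.sum()
    if min_count is not None:
        # Guard the denominator before dividing.
        count = count.clamp(min=min_count)
    return (values.pow(2) * mask).sum() / count


unguarded = mean_square_over_mask(advantages, empty_mask)             # 0 / 0
guarded = mean_square_over_mask(advantages, empty_mask, min_count=1)  # 0 / 1

# Clamping the result afterwards does not help: NaN passes through clamp.
unguarded_clamped = unguarded.clamp(min=1e-8)

print(unguarded, unguarded_clamped, guarded)
```

Clamping the count before the division is what keeps the result finite; whether an all-zero mask should be guarded or treated as a hard error is a separate design question.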
The std / num_actions division is pre-existing and unchanged by this PR; the line this PR does change is already guarded, since masked_mean clamps its denominator to min=1.0 (torch_utils.py:180), same as the other ~20 call sites. An all-zero batch-wide response_mask means zero trainable tokens, which masked_var already treats as a hard error elsewhere.
Fixes #1305.
The issue predates a refactor: normalize_advantages_dict has since moved byte-for-byte from ppo_utils.py into RayPPOTrainer._normalize_advantages on current main, bug intact, so the fix lands there. FullyAsyncRayPPOTrainer inherits the method, so the one change covers both the sync and fully-async paths.

The new regression test normalizes the same padded batch twice (once with zeros and once with garbage in the padding positions) and asserts identical results on valid positions, with masked mean 0 and masked std 1. It fails before the fix (padding garbage completely rewrites the normalized advantages on valid positions) and passes after, along with the rest of tests/train/test_trainer.py and tests/backends/skyrl_train/utils/test_ppo_utils.py.
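The padding-invariance check described above can be sketched standalone. This is not the test added in the PR: `masked_mean` and `normalize_advantages` here are simplified stand-ins written for illustration, the `use_masked_mean` flag exists only to compare the old and new behaviour, and the tensor values and `eps` are arbitrary assumptions.

```python
import torch


def masked_mean(values, mask):
    # Stand-in helper: mean over mask == 1 positions, denominator clamped to >= 1.
    return (values * mask).sum() / mask.sum().clamp(min=1.0)


def normalize_advantages(advantages, response_mask, use_masked_mean=True, eps=1e-8):
    # Simplified stand-in for the z-score step; the variance is always masked.
    if use_masked_mean:
        mean = masked_mean(advantages, response_mask)
    else:
        mean = advantages.mean()  # old behaviour: padding leaks into the mean
    var = masked_mean((advantages - mean).pow(2), response_mask)
    return (advantages - mean) * var.clamp(min=eps).rsqrt()


response_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
valid = response_mask.bool()

zeros_padded = torch.tensor([[1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 0.0, 0.0]])
garbage_padded = zeros_padded.clone()
garbage_padded[~valid] = 123.0  # arbitrary non-zero leftovers in the padding

# With the masked mean, valid positions do not depend on padding contents.
out_zeros = normalize_advantages(zeros_padded, response_mask)
out_garbage = normalize_advantages(garbage_padded, response_mask)
same_after_fix = torch.allclose(out_zeros[valid], out_garbage[valid])

# With the unmasked mean, the same comparison breaks.
old_zeros = normalize_advantages(zeros_padded, response_mask, use_masked_mean=False)
old_garbage = normalize_advantages(garbage_padded, response_mask, use_masked_mean=False)
same_before_fix = torch.allclose(old_zeros[valid], old_garbage[valid])

# Statistics of the normalized advantages over valid positions only.
normalized = out_garbage[valid]
masked_mean_value = normalized.mean()
masked_std_value = (normalized - masked_mean_value).pow(2).mean().sqrt()

print(same_after_fix, same_before_fix)
```

The statistics are computed over `out_garbage[valid]` rather than the full tensor, since padding positions of the output still carry meaningless values even after the fix.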