[PyTorch] Add distributed Muon optimizer#2920
vcherepanov-nv wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR adds a distributed Muon optimizer.

Confidence Score: 5/5. Safe to merge; all new findings are P2 suggestions and the core distributed math is correct. The optimizer's distributed normalization, transpose handling, Nesterov/heavy-ball update, and weight-decay branches are all correct and consistent with the reference implementation in the test. Previously flagged P1s are either fixed (closure/enable_grad) or noted in prior threads. The only new findings are P2: a documentation gap about rank-symmetric gradient availability and incomplete scale-mode test coverage. Neither blocks correctness in the intended tensor-parallel use case.

Important Files Changed: transformer_engine/pytorch/optimizers/muon.py (collective-deadlock documentation); tests/pytorch/distributed/run_muon_optimizer.py (scale_mode coverage gap)
Sequence Diagram

```mermaid
sequenceDiagram
participant Caller
participant MuonOptimizer
participant _orthogonalize
participant _distributed_normalize_p2_
participant newton_schulz
Caller->>MuonOptimizer: step()
loop for each param with grad
MuonOptimizer->>MuonOptimizer: apply weight decay (decoupled or L2)
MuonOptimizer->>MuonOptimizer: momentum_buffer.lerp_(grad, 1-β)
MuonOptimizer->>MuonOptimizer: compute nesterov/non-nesterov update
MuonOptimizer->>_orthogonalize: update, partition_dim, ...
_orthogonalize->>_orthogonalize: clone + optional transpose
_orthogonalize->>_distributed_normalize_p2_: orth_grad
_distributed_normalize_p2_-->>_distributed_normalize_p2_: dist.all_reduce(norm_sq)
_distributed_normalize_p2_->>_orthogonalize: x /= global_norm
_orthogonalize->>newton_schulz: orth_grad, CusolverMpCtx
newton_schulz-->>newton_schulz: distributed NS iterations
newton_schulz->>_orthogonalize: orth_grad (orthogonalized)
_orthogonalize->>_orthogonalize: optional un-transpose + scale
_orthogonalize->>MuonOptimizer: orth_update
MuonOptimizer->>MuonOptimizer: p.add_(orth_update, alpha=-lr)
end
MuonOptimizer->>Caller: loss
```
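Pieced together from the hunks reviewed below, a hypothetical usage sketch; the import path, the `lr` argument, and the shard shapes are assumptions, not taken from the PR:

```python
# Hypothetical usage sketch; launch with torchrun so NCCL env vars are set.
import torch
import torch.distributed as dist

# Assumed import path, based on the file this PR touches.
from transformer_engine.pytorch.optimizers.muon import MuonOptimizer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Local shard of a logical (96, 128) matrix, split along dim 1 across 2 ranks.
shard = torch.nn.Parameter(torch.randn(96, 64, device="cuda"))

opt = MuonOptimizer(
    [shard],
    lr=0.02,                         # illustrative value
    process_group=dist.group.WORLD,  # pass the TP group explicitly
    partition_dim=1,                 # column-parallel sharding
)

loss = (shard**2).sum()
loss.backward()
opt.step()
opt.zero_grad()
```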
Reviews (2). Last reviewed commit: "Fix Muon closure and reference test"
```python
def step(self, closure=None):
    """Perform a single optimization step."""
    loss = None
    if closure is not None:
        loss = closure()
```
**Closure called inside `@torch.no_grad()`, preventing gradient computation**

`closure()` is invoked while `torch.no_grad()` is active. Any `loss.backward()` call inside the closure will silently produce zero/no gradients. The standard PyTorch pattern (used in SGD, Adam, etc.) is to wrap the closure call in `with torch.enable_grad():`.
Suggested change:

```diff
 @torch.no_grad()
 def step(self, closure=None):
     """Perform a single optimization step."""
     loss = None
     if closure is not None:
-        loss = closure()
+        with torch.enable_grad():
+            loss = closure()
```
```python
    scale_mode: str,
    extra_scale_factor: float,
    eps: float,
) -> torch.Tensor:
    global_shape = [grad.size(0), grad.size(1)]
    global_shape[partition_dim] *= world_size
```
**Reference `global_shape` incorrectly scales an already-full tensor**

`_reference_orthogonalize` receives the full matrix (shape `full_shape`) but then multiplies `global_shape[partition_dim]` by `world_size` a second time. For `partition_dim=1` with `world_size=2` and `full_shape=(96, 128)` this gives `global_shape=[96, 256]`, so `get_muon_scale_factor` returns `max(96, 256) ** 0.5 = 16`. The optimizer, operating on the shard `(96, 64)`, correctly reconstructs `global_shape=[96, 128]` and computes `max(96, 128) ** 0.5 ≈ 11.3`. This √2 discrepancy means the reference cannot correctly validate the optimizer's output.

The `global_shape[partition_dim] *= world_size` line should be removed, since the input is already the full matrix.
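The arithmetic, as a standalone sketch (shapes taken from the example above; the spectral scale is assumed to be `max(size_out, size_in) ** 0.5` as described in this finding):

```python
# Illustration of the double-scaling described above.
world_size = 2
partition_dim = 1
full_shape = [96, 128]       # reference already holds the full matrix

# Buggy reference path: scales the already-full dim again -> [96, 256]
buggy_global = list(full_shape)
buggy_global[partition_dim] *= world_size
print(max(buggy_global) ** 0.5)   # 16.0

# Optimizer path: reconstructs the global shape from the shard (96, 64)
shard_shape = [96, 64]
true_global = list(shard_shape)
true_global[partition_dim] *= world_size   # -> [96, 128]
print(max(true_global) ** 0.5)    # ~11.31, hence the sqrt(2) discrepancy
```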
```python
if mode == "unit_rms_norm":
    return (size_out / size_in) ** 0.5
```
**`unit_rms_norm` mode can divide by zero when `size_in == 0`**

`(size_out / size_in) ** 0.5` raises `ZeroDivisionError` when `size_in` is 0. While the optimizer validates that the partition dimension is non-empty, it doesn't ensure the other dimension is non-zero. Consider adding a guard or documenting that both dimensions must be strictly positive.
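One possible guard, as a sketch; the message wording and placement are assumptions:

```python
if mode == "unit_rms_norm":
    if size_in == 0:
        raise ValueError("unit_rms_norm requires size_in > 0")
    return (size_out / size_in) ** 0.5
```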
```python
if group["nesterov"]:
    update = grad.lerp(momentum_buffer, group["momentum"])
else:
    update = momentum_buffer
```
**Non-Nesterov `update` is an alias to `momentum_buffer`, not a copy**

`update = momentum_buffer` holds a reference. If `_orthogonalize` ever modifies its input in-place in a future refactor, the momentum buffer will be silently corrupted. `_orthogonalize` currently clones the input immediately, so this is safe today, but a defensive `.clone()` or a comment would make the intent explicit.
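A sketch of the defensive variant; the comment wording is illustrative:

```python
if group["nesterov"]:
    update = grad.lerp(momentum_buffer, group["momentum"])
else:
    # Defensive copy: _orthogonalize clones its input today, but aliasing
    # the momentum buffer here would corrupt it if that ever changes.
    update = momentum_buffer.clone()
```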
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
skyw left a comment:
I'd advise NOT exposing it in the public API; keep it test-only if that is the purpose. Having an optimizer with most of its code copied invites fragmentation.

Before this, all the optimizers TE provides are more-optimized fused versions. A highly optimized fused Muon built on the same concept could be justified, but it would need more consideration, because it has more dependencies on other parts of the training pipeline than elementwise optimizers do.
```python
    on tensor-parallel parameter shards. The local parameter shard must represent a
    partition of a logical 2D matrix across the provided NCCL process group.

    Args:
```
Q: Does TE use NumPy-style docstrings instead of Google style?
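For context, the two conventions differ mainly in how parameters are listed; a minimal illustration with hypothetical function and parameter names:

```python
def numpy_style(x, eps):
    """NumPy style.

    Parameters
    ----------
    x : torch.Tensor
        Input shard.
    eps : float
        Numerical-stability term.
    """

def google_style(x, eps):
    """Google style.

    Args:
        x (torch.Tensor): Input shard.
        eps (float): Numerical-stability term.
    """
```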
```python
def __init__(
    self,
    params: Iterable[torch.nn.Parameter | dict],
```
Nit: The type here doesn't match PyTorch's internal one. Should be fine for the purpose of this class.
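For comparison, a tighter annotation in the spirit of PyTorch's convention might look like this; the alias name `ParamsLike` is illustrative, and PyTorch's own internal alias varies across versions:

```python
from typing import Any, Dict, Iterable, Union

import torch

# Illustrative alias; PyTorch's internal equivalent differs across versions.
ParamsLike = Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]]
```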
```python
    scale_mode: MuonScaleT = "spectral",
    extra_scale_factor: float = 1.0,
    process_group: Optional[dist.ProcessGroup] = None,
    partition_dim: int = 1,
```
```python
    raise ValueError(f"Invalid weight_decay value: {weight_decay}")
if num_ns_steps < 1:
    raise ValueError(f"num_ns_steps must be at least 1, got {num_ns_steps}")
if partition_dim not in (0, 1):
```
Q: Does this class intend to support the non-distributed case? partition_dim would be -1 in TE in such a case.
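One way the non-distributed case could be accommodated, purely as a sketch of the idea (not what this PR implements; assumes `torch.distributed as dist` is in scope):

```python
# Sketch: treat partition_dim == -1 (TE's non-distributed convention) as
# "not sharded" and skip collectives entirely.
if partition_dim == -1:
    world_size = 1                  # the local tensor is the full matrix
elif partition_dim in (0, 1):
    world_size = dist.get_world_size(process_group)
else:
    raise ValueError(f"Invalid partition_dim: {partition_dim}")
```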
```python
if process_group is None:
    if not dist.is_initialized():
        raise RuntimeError("MuonOptimizer requires torch.distributed to be initialized.")
```
Same question as above regarding single-GPU support.
```python
if process_group is None:
    if not dist.is_initialized():
        raise RuntimeError("MuonOptimizer requires torch.distributed to be initialized.")
    process_group = dist.group.WORLD
```
Suggestion: This silent behavior is dangerous. If the user forgot to pass the correct TP group, the wrong group will be used.
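A defensive alternative, sketched with illustrative message wording:

```python
# Sketch: fail loudly instead of silently defaulting to the global group.
if process_group is None:
    raise ValueError(
        "MuonOptimizer: pass the tensor-parallel process group explicitly; "
        "defaulting to dist.group.WORLD can silently use the wrong group."
    )
```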
```python
    eps: float,
) -> torch.Tensor:
    self._validate_param(grad, partition_dim)
    world_size = dist.get_world_size(self.process_group)
```
Same suggestion as above. The silent behavior of a None process group falling back to the default is dangerous. (Understood that this comes from PyTorch for historical reasons.)
```python
global_shape[partition_dim] *= world_size

orth_grad = grad.clone()
transposed = partition_dim == 0
```
Attn: This follows the common row- and column-wise tensor parallelism used by most LLMs. It would be suboptimal for anything other than that. Add a comment if the assumption is made.
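A sketch of the kind of comment being requested here (wording is illustrative):

```python
# Assumes the standard row-/column-parallel layouts used by most LLMs:
# for partition_dim == 0 (row-parallel) the matrix is transposed before
# Newton-Schulz so the sharded dimension becomes the inner one. Sharding
# schemes other than plain row/column TP may be suboptimal here.
transposed = partition_dim == 0
```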
Description
Add a distributed Muon optimizer, based on `newton_schulz` orthogonalization.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: