guard fuser grad checks on non-leaf nodes #2919

CarlosGomes98 wants to merge 1 commit into NVIDIA:main
Conversation
I'm not sure if this is really addressing the root cause of the issue. Two problems:
- We aren't actually protecting against setting `requires_grad` on non-leaf nodes. We're just skipping `requires_grad` logic when `torch.is_grad_enabled() == True`. (A minimal repro of the non-leaf constraint is sketched right after this list.)
- Do we even want to skip setting `requires_grad` on non-leaf nodes? The backward expects grads from each of the outputs, so we need `requires_grad` for autograd to do the right thing.
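For context, here is a minimal standalone PyTorch snippet (not from the PR) illustrating the constraint under discussion: `requires_grad` can only be cleared on leaf tensors, which is why detaching comes up below.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # non-leaf: y has a grad_fn

try:
    # Clearing the flag on a non-leaf tensor is rejected by autograd.
    y.requires_grad_(False)
except RuntimeError as err:
    print(err)  # "you can only change requires_grad flags of leaf variables ..."

# Detaching first gives a leaf tensor whose flag can be toggled freely.
y_leaf = y.detach()
y_leaf.requires_grad_(True)
```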
I think the right solution is smarter logic when setting `requires_grad_`. Maybe something like:

```python
x_requires_grad = fuser.first_op_requiring_backward < fuser._num_basic_ops
if x_requires_grad != x.requires_grad:
    x = x.detach()
    if x_requires_grad:
        x.requires_grad_()
    # Or maybe only detach if x is a non-leaf node?
    # Need to check if the CPU overhead of checking
    # is worth saving the CPU overhead of detaching.
...
return x
```

Another approach would be changing our ops to always return leaf nodes. For example, here is the forward pass of `MakeExtraOutput`:
This would be changed to:
```python
out = input_.detach()
return out, [(out,)]
```

The next comment refers to these lines in `fuser.py`:

```python
for idx, ys in zip(basic_op_idxs, fused_op_extra_outputs):
    for y in ys:
        y.requires_grad_(idx >= fuser.first_op_requiring_backward)
if func_ctx is not None:
```
This logic is not intuitive. `func_ctx` is `None` when `torch.is_grad_enabled() == False`:

`TransformerEngine/transformer_engine/pytorch/ops/fuser.py`, lines 504 to 509 in `0c2e7b0`

It would be better to pass in `is_grad_enabled` as an arg so that we can be explicit and not rely on secret contracts.
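A hedged sketch of that suggestion, assuming a hypothetical helper (`_mark_extra_output_grads` and its signature are illustrative, not the real `fuser.py` code): the grad-enabled state is threaded through as an argument instead of being inferred from `func_ctx is None`.

```python
# Hypothetical helper, not the actual fuser.py implementation.
def _mark_extra_output_grads(basic_op_idxs, fused_op_extra_outputs, fuser, is_grad_enabled):
    if not is_grad_enabled:
        # No autograd node is being recorded for this forward pass, so avoid
        # mutating requires_grad on outputs that may be non-leaf tensors.
        return
    for idx, ys in zip(basic_op_idxs, fused_op_extra_outputs):
        for y in ys:
            y.requires_grad_(idx >= fuser.first_op_requiring_backward)
```

The call site would then pass `is_grad_enabled=torch.is_grad_enabled()` (or whatever value the fuser has already computed) explicitly, making the contract visible in the signature.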
As I understand it, the real issue is that when the `forward_func` is `.apply`, we are free to set `requires_grad_` on returned tensors. But when it is `.forward`, we cannot mutate this state on non-leaf tensors.

When `torch.is_grad_enabled()` is false, we bypass `.apply` and call `.forward` directly with no `func_ctx`. In that path there is no `OperationFuserAutogradFunction` node registered, so no fuser backward will run. So I think this

> The backward expects grads from each of the outputs, so we need `requires_grad` for autograd to do the right thing

is not true, because we cannot run `backward()` through it.

I think it makes sense to pass this in as an explicit argument, as you say, instead of relying on `func_ctx` being `None`. But I think the current logic is correct.
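For readers outside the codebase, a minimal self-contained sketch of the two call paths described above (the class below is illustrative, not TransformerEngine's actual `OperationFuserAutogradFunction`):

```python
import torch

class _ToyFuserFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # When invoked via .apply(), ctx is a real autograd context; a caller
        # that bypasses autograd can call forward() directly with ctx=None.
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2

x = torch.ones(3, requires_grad=True)

if torch.is_grad_enabled():
    # .apply() records an autograd node, so backward() can run through it.
    y = _ToyFuserFunction.apply(x)
    y.sum().backward()
else:
    # Direct .forward() call: no autograd node is registered, so the custom
    # backward will never run for this op -- the point made in the reply above.
    y = _ToyFuserFunction.forward(None, x)
```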
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: