90 commits
4058842
Changed VERSION to 2.15.0.dev0
ptrendx Mar 16, 2026
a945846
[Common] Fix linker error for to_string(DType) in distributed tests (…
vcherepanov-nv Mar 16, 2026
523801d
[NVFP4][Dense/MoE] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose…
zhongbozhu Mar 16, 2026
4017565
[PyTorch] Backwards compatible single param checkpointing in `Grouped…
ksivaman Mar 16, 2026
128f22e
[JAX][Core] Fix Grouped GEMM cuBLAS version and SM arch checks (#2765)
jberchtold-nvidia Mar 17, 2026
4e339a5
Update vermin version to fix precommit CI error with python 3.14 (#2773)
ksivaman Mar 17, 2026
53a41b2
Update cudnnFE to v1.20.0 (#2774)
ksivaman Mar 18, 2026
3e61687
[PyTorch] torch.compile support for permutation functions (#2686)
pggPL Mar 18, 2026
15760a5
[PyTorch] Add an API restore from function context to ensure tensors …
kainzhong Mar 19, 2026
b7598aa
[PyT] Install pytest in onnx L1 test as Pyt container no longer packa…
KshitijLakhani Mar 19, 2026
f11789e
[Core] Fix MXFP8 grouped quantize for zero-sized groups in update_tma…
jberchtold-nvidia Mar 19, 2026
487d68c
[PyT] [Common] Enable sm120 support for fused attn if cuDNN is 9.18.1…
KshitijLakhani Mar 22, 2026
f2a1a3e
[PyTorch Debug] Support tensor dump (#2645)
pggPL Mar 23, 2026
d2625e5
Optimize FSDP2 Pytest Timings (12 -> 2 mins) (#2787)
vthumbe1503 Mar 24, 2026
8477d3d
Enable fused RMSNorm dLN + add through CUDNN (#2778)
CarlosGomes98 Mar 24, 2026
4013c6c
add blackwell support filter for 9.7<=cudnn<9.18.1 (#2775)
sudhakarsingh27 Mar 24, 2026
4ead776
[PyT][Commong] Disable fused attention for sm120 if determinism is re…
KshitijLakhani Mar 25, 2026
e879bf8
[PyTorch][Fused Attn] Add support for cuDNN to return Softmax `Stats`…
sudhakarsingh27 Mar 25, 2026
15cf65a
Upgrade cuDNN FE to v1.21.0 (#2799)
ksivaman Mar 25, 2026
f4debf6
[JAX] Add warning if using BSHD and max_segments_per_seq > 1 (#2796)
jberchtold-nvidia Mar 30, 2026
bce4181
[JAX] Grouped GEMM Refactor to use first_dims and last_dims (#2749)
jberchtold-nvidia Apr 1, 2026
3af8792
Pass input_output_alias to TritonAutotunedKernelCall (#2814)
tdophung Apr 2, 2026
281ff06
Remove integration test for Lightning-Thunder (#2822)
timmoon10 Apr 2, 2026
4bf1c1c
Optimize fp8 block scaling Allgather for FSDP2 (#2789)
vthumbe1503 Apr 2, 2026
b048869
[PyTorch] Fix bug with PR 2677 (#2819)
sudhakarsingh27 Apr 2, 2026
42267ec
[Common] Persistent Grouped MXFP8 quantization kernel (#2738)
Oleg-Goncharov Apr 2, 2026
9d77dcb
[JAX] Fix: Use jitted kernels for generating THD (and BSHD) segment p…
KshitijLakhani Apr 3, 2026
29a8c2f
GEMM + Swiglu fused Grouped MLP for MXFP8 (#2769)
ksivaman Apr 3, 2026
8cf3c16
[PyT][Test] Add xfailing FSDP2 memory leak detection tests (#2803)
pstjohn Apr 3, 2026
85f5a84
Refactor Amax Kernel ldmatrix loads, TMA/compute barriers, swizzle_i…
cael-ling Apr 3, 2026
a88fdc1
[PyTorch] [CI] Capture subprocess stderr in distributed tests for bet…
sudhakarsingh27 Apr 3, 2026
509614d
Feature/unswizzle (#2732)
int-smart Apr 3, 2026
e83c097
Fix nvshmem build (#2815)
GaetanLepage Apr 3, 2026
5abadf4
[FSDP2/Megatron-FSDP/DCP] If model parameters are DTensors, optimizer…
cspades Apr 4, 2026
ac96651
Fix memory overheads with FP4 native weights (#2834)
WanZzzzzz Apr 6, 2026
86edac4
Comm gemm fixes (#2818)
almogsegal Apr 6, 2026
5f9550f
CPU offloading fix: If Data and Transpose is None depend on super Tor…
vthumbe1503 Apr 7, 2026
fdf9fb1
Add `NVTE_BACKWARD_OVERRIDE=high_precision|dequantized` (#2644)
zianglih Apr 7, 2026
edf10bb
Update the error message for cublas version check (#2843)
yaox12 Apr 7, 2026
a10b0b1
guard rmsnorm fused add tests behind appropriate cudnn version (#2844)
CarlosGomes98 Apr 7, 2026
e2470a7
[JAX] Use avg m,n,k heuristics for Grouped GEMM (#2840)
jberchtold-nvidia Apr 8, 2026
d3f88ee
[PyTorch][Flash Attn] Add fallback import for FA3 (#2806)
eattia-nvidia Apr 8, 2026
77b8681
add mark_not_offload() interface for cpu_offload_v1 (#2770)
lhb8125 Apr 8, 2026
a30a126
Fix zero input shape for bgrad_group_quantize (#2854)
vthumbe1503 Apr 8, 2026
0aea85f
[Common] Fix: IMA in `register_user_buffer_collective` on non-SM90 GP…
phu0ngng Apr 9, 2026
181322e
Simplify FA3 discovery (#2849)
vcherepanov-nv Apr 9, 2026
64bb9a2
[PyTorch] Support scaled + clamped SwiGLU in `te.ops` and enable fuse…
ksivaman Apr 9, 2026
ac73538
[JAX] Fix BF16 tolerance for CGEMM + RS + BF16 test (#2860)
phu0ngng Apr 9, 2026
53fefa4
add high precision init weights to fully_shard example (#2785)
pstjohn Apr 9, 2026
2f17c9b
Enforce minimum NCCL version for cuBLASMp (#2857)
vcherepanov-nv Apr 10, 2026
580e7aa
Bias Prob Scaling for GroupedLinear and Fused MOE Layers (#2864)
vthumbe1503 Apr 10, 2026
323582f
Add Megatron-FSDP E2E integration test to TE CI/CD (L1). (#2845)
cspades Apr 11, 2026
2dd31bb
Fix JAX extension build with NVTE_UB_WITH_MPI=1 (#2835)
GaetanLepage Apr 11, 2026
2b78e55
[PyTorch] Remove unnecessary save of weights (#2549)
pggPL Apr 13, 2026
9f5fde1
[PyTorch] Relax dimension constraints for using fused grouped MLP (#2…
ksivaman Apr 13, 2026
491c597
[PyTorch] Cache alpha and beta for cublas ggemm (#2870)
ksivaman Apr 13, 2026
d7c43bb
comm_gemm_test fixes (#2839)
almogsegal Apr 13, 2026
dc92b39
docs(readme): update convergence table, latest news, and outdated lin…
sbhavani Apr 13, 2026
72328b3
Cute Dsl kernel for Wgrad for Fused MOE Layer (#2869)
vthumbe1503 Apr 13, 2026
31f8ab4
Current Stream for Wgrad kernel (#2873)
vthumbe1503 Apr 14, 2026
4e57c21
[PyTorch] Avoid autograd's gradient accumulation in grouped MLP if po…
ksivaman Apr 14, 2026
c7205a7
Strip local version labels from package version checks (#2858)
pstjohn Apr 14, 2026
5d5065f
Reduce number of C++ test cases for MXFP8 cast and activation kernels…
timmoon10 Apr 14, 2026
70af730
[JAX] MXFP8 Grouped Quant+GEMM (#2763)
jberchtold-nvidia Apr 15, 2026
52d6e8b
Test Fused MOE with padded tokens (#2880)
vthumbe1503 Apr 15, 2026
17aa2e4
[PyTorch] [torch.compile] transformer_engine.pytorch.autocast suport …
pggPL Apr 15, 2026
c6853b6
[PyTorch] [torch.compile] Remove module reference from autograd funct…
pggPL Apr 15, 2026
a073ad5
Newton-Schulz via cuSOLVERMp (#2706)
vcherepanov-nv Apr 15, 2026
a817b60
[JAX] Tighten Triton autotuning version gate + autotuning enforce env…
tdophung Apr 15, 2026
a347e09
Add grouped unswizzle functionality for MXFP8 scaling factors (#2837)
int-smart Apr 15, 2026
92b0370
[Pytorch][JAX] Guard against invalid num_out_tokens in permute_with_m…
tdophung Apr 15, 2026
51d9eeb
[PyTorch] [torch.compile] Split linear forward into forward and setup…
pggPL Apr 16, 2026
3a78e15
[PyTorch] Add method for mcore to register wgrad accumulation hook (#…
ksivaman Apr 16, 2026
c9035a4
[PyTorch] Minor optimizations in fused grouped MLP (#2888)
ksivaman Apr 16, 2026
58a008f
[PyTorch] Add test to compare single vs multi-param fused GMLP (#2893)
ksivaman Apr 16, 2026
1e9e48c
[Common] Fix fused router for large top-K and expert counts (#2821)
harryzhou2000 Apr 16, 2026
fca261e
fix CUDA architectures cmake logic (#2832)
GaetanLepage Apr 16, 2026
be593b1
[Common, pyTorch] Grouped MXFP8 dequantize support (#2722)
ptrendx Apr 17, 2026
c5a4fd5
[PyTorch] Add FA4 Support (#2432)
yaox12 Apr 17, 2026
262bc6c
[JAX] Fix grouped quant checkpointing (#2889)
jberchtold-nvidia Apr 17, 2026
549f5ba
adds NVFP4 Fused Adam support (#2797)
jomitchellnv Apr 20, 2026
8d31dcc
auto-merge basis for IFU-dev-260419-v2.15 (raw, conflicts present)
VeeraRajasekhar Jun 3, 2026
a02dba7
[ROCm] IFU-dev-260419-v2.15: Resolve merge conflicts
VeeraRajasekhar Jun 3, 2026
50a837f
[ROCm] IFU-dev-260419-v2.15: Fix non-conflicting upstream changes
VeeraRajasekhar Jun 3, 2026
3d3f9e0
[ROCm] IFU-dev-260419-v2.15: Fix build errors
VeeraRajasekhar Jun 3, 2026
9285929
[ROCm] IFU-dev-260419-v2.15: Fix runtime errors in Linear module
VeeraRajasekhar Jun 4, 2026
5104a38
[ROCm] IFU-dev-260419-v2.15: Fix module reference and backward_overri…
VeeraRajasekhar Jun 4, 2026
a0cc937
[ROCm] IFU-dev-260419-v2.15: Fix build errors
VeeraRajasekhar Jun 10, 2026
5dd2c0c
[ROCm] IFU-dev-260419-v2.15: Fix JAX MXFP8 grouped quantize test fail…
VeeraRajasekhar Jun 11, 2026
0e97e1a
[ROCm] IFU-dev-260419-v2.15: Fix torch mGPU test failures
VeeraRajasekhar Jun 12, 2026
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -40,7 +40,7 @@ repos:
         files: ^transformer_engine.*\.(c|cc|cxx|cpp|cu|cuh|h|hpp)$

   - repo: https://github.com/netromdk/vermin
-    rev: c75aca72f4e85c6e47252139e8695f1c8b5f9ae3
+    rev: b70ff9611a01a2bf2f702aa537d14e71e330edba
     hooks:
       - id: vermin
         args: ['-t=3.10-', '--violations']
44 changes: 19 additions & 25 deletions README.rst
@@ -320,11 +320,15 @@ upstream CUTLASS implementation:
Transformer Engine
******************

`Quickstart <#examples>`_ | `Installation <#installation>`_ | `User Guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html>`_ | `Examples <https://github.com/NVIDIA/TransformerEngine/tree/main/examples>`_ | `FP8 Convergence <#fp8-convergence>`_ | `Integrations <#integrations>`_ | `Release notes <https://docs.nvidia.com/deeplearning/transformer-engine/documentation-archive.html>`_
`Quickstart <#examples>`_ | `Installation <#installation>`_ | `User Guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html>`_ | `Examples <https://github.com/NVIDIA/TransformerEngine/tree/main/examples>`_ | `Convergence <#convergence>`_ | `Integrations <#integrations>`_ | `Release notes <https://docs.nvidia.com/deeplearning/transformer-engine/documentation-archive.html>`_

Latest News
===========

* [12/2025] `NVIDIA Nemotron 3: Efficient and Open Intelligence <https://arxiv.org/abs/2512.20856>`_ - trained with NVFP4 on Transformer Engine
* [11/2025] `NVIDIA Blackwell Architecture Sweeps MLPerf Training v5.1 Benchmarks <https://developer.nvidia.com/blog/nvidia-blackwell-architecture-sweeps-mlperf-training-v5-1-benchmarks/>`_
* [11/2025] `Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes <https://developer.nvidia.com/blog/scale-biology-transformer-models-with-pytorch-and-nvidia-bionemo-recipes/>`_
* [11/2025] `FP8 Training of Large-Scale RL Models <https://lmsys.org/blog/2025-11-25-fp8-rl/>`_
* [09/2025] `Pretraining Large Language Models with NVFP4 <https://www.arxiv.org/pdf/2509.25149>`_
* [09/2025] `Native FP8 Mixed Precision Training for Ling 2.0, Open Sourced! <https://huggingface.co/blog/im0qianqian/ling-mini-2-fp8-mixed-precision-training-solution>`_
* [09/2025] `Faster Training Throughput in FP8 Precision with NVIDIA NeMo <https://developer.nvidia.com/blog/faster-training-throughput-in-fp8-precision-with-nvidia-nemo/>`_
@@ -351,7 +355,8 @@ What is Transformer Engine?

Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including
using 8-bit floating point (FP8) precision on Hopper, Ada, and Blackwell GPUs, to provide better
performance with lower memory utilization in both training and inference. TE provides a collection
performance with lower memory utilization in both training and inference. On Blackwell GPUs, TE also
supports MXFP8 (Microscaling FP8) and NVFP4 formats for even greater efficiency. TE provides a collection
of highly optimized building blocks for popular Transformer architectures and an automatic mixed
precision-like API that can be used seamlessly with your framework-specific code. TE also includes a
framework agnostic C++ API that can be integrated with other deep learning libraries to enable FP8
@@ -379,6 +384,7 @@ Highlights
* Easy-to-use modules for building Transformer layers with FP8 support
* Optimizations (e.g. fused kernels) for Transformer models
* Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
* Support for MXFP8 and NVFP4 on NVIDIA Blackwell GPUs
* Support for optimizations across all precisions (FP16, BF16) on NVIDIA Ampere GPU architecture generations and later
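
As a rough illustration of what "microscaling" means for the MXFP8 support listed above, the following pure-Python sketch emulates block scaling in the style of the OCP Microscaling (MX) format: each block of 32 values shares one power-of-two scale, and each scaled element is rounded to FP8 E4M3. It does not call Transformer Engine. The function names (``round_to_e4m3``, ``quantize_block``, ``mx_quantize``) are hypothetical, and TE's actual kernels may select scales and rounding differently.

```python
import math

BLOCK = 32          # MX block size: elements that share one scale
E4M3_MAX = 448.0    # largest finite E4M3 magnitude
E4M3_EMAX = 8       # exponent of the largest E4M3 binade (2**8 = 256)


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value, saturating at +/-448 (float emulation)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # -6 is the smallest normal exponent; smaller values use the subnormal grid.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_block(block):
    """Return (shared power-of-two scale, E4M3-rounded elements) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared exponent: floor(log2(amax)) minus the element format's max exponent.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    return scale, [round_to_e4m3(v / scale) for v in block]


def mx_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, q = quantize_block(values[i:i + BLOCK])
        out.extend(scale * e for e in q)
    return out
```

Calling ``mx_quantize`` on a list of floats returns the values as they would look after a quantize/dequantize round trip under these assumptions, which makes it easy to inspect the per-block error without a GPU.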

Examples
@@ -511,12 +517,11 @@ We recommend updating to the latest NGC container available here:
* https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
* https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax

If you run any examples, please ensure you are using a matching version of TransformerEngine. TransformerEngine is pre-built and packaged inside the containers with examples available at ``/opt/transformerengine`` or ``/opt/transformer-engine``. If you would like to use examples from TE main branch and are running into import errors, please try the latest pip package or building from source, although NGC containers are recommended for ease-of-use for most users.
If you run any examples, please ensure you are using a matching version of TransformerEngine. TransformerEngine is pre-built and packaged inside the containers with examples available at ``/opt/transformerengine`` or ``/opt/transformer-engine``.

**Benefits of using NGC containers:**

* All dependencies pre-installed with compatible versions and optimized configurations
* NGC PyTorch 23.08+ containers include FlashAttention-2

pip Installation
^^^^^^^^^^^^^^^^
@@ -693,54 +698,43 @@ An example of this change is,
False, False, True, True, True,
False, False, False, False, True]

FP8 Convergence
===============
Convergence
===========

FP8 has been tested extensively across different model architectures and configurations and we found **no significant difference** between FP8 and BF16 training loss curves. FP8 has also been validated for accuracy on downstream LLM tasks (e.g. LAMBADA and WikiText). Below are examples of models tested for convergence across different frameworks.
FP8 and MXFP8 have been tested extensively across different model architectures and configurations and we found **no significant difference** between FP8/MXFP8 and BF16 training loss curves. FP8 and MXFP8 have also been validated for accuracy on downstream LLM tasks (e.g. LAMBADA and WikiText). Below are examples of models tested for convergence across different frameworks.

+------------+------------------+---------------------------------------------------------------------------------------------------------+
| Model | Framework | Source |
+============+==================+=========================================================================================================+
| T5-770M | JAX/T5x | https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/t5x#convergence-and-performance|
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| MPT-1.3B | Mosaic Composer | https://www.mosaicml.com/blog/coreweave-nvidia-h100-part-1 |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| GPT-5B | JAX/Paxml | https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/pax#h100-results |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| GPT-5B | NeMo Framework | Available on request |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| LLama2-7B | Alibaba Pai | https://mp.weixin.qq.com/s/NQT0uKXLbXyh5031zBdeBQ |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| T5-11B | JAX/T5x | Available on request |
| LLM-8B | Megatron Core | https://arxiv.org/abs/2506.08027 |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| MPT-13B | Mosaic Composer | https://www.databricks.com/blog/turbocharged-training-optimizing-databricks-mosaic-ai-stack-fp8 |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| GPT-22B | NeMo Framework | Available on request |
| MoE-16B | Megatron Core | https://arxiv.org/abs/2506.08027 |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| LLama2-70B | Alibaba Pai | https://mp.weixin.qq.com/s/NQT0uKXLbXyh5031zBdeBQ |
+------------+------------------+---------------------------------------------------------------------------------------------------------+
| GPT-175B | JAX/Paxml | https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/pax#h100-results |
+------------+------------------+---------------------------------------------------------------------------------------------------------+

Integrations
============

Transformer Engine has been integrated with popular LLM frameworks such as:

* `DeepSpeed <https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/half_precision/test_fp8.py>`_
* `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`_
* `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/main/en/usage_guides/low_precision_training#configuring-transformersengine>`_
* `Lightning <https://github.com/Lightning-AI/lightning/issues/17172>`_
* `Lightning <https://lightning.ai/docs/pytorch/stable/common/precision.html>`_
* `MosaicML Composer <https://github.com/mosaicml/composer/releases/tag/v0.13.1>`_
* `NVIDIA JAX Toolbox <https://github.com/NVIDIA/JAX-Toolbox>`_
* `NVIDIA Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_
* `NVIDIA NeMo Framework <https://github.com/NVIDIA/NeMo-Megatron-Launcher>`_
* `NVIDIA NeMo Megatron Bridge <https://github.com/NVIDIA-NeMo/Megatron-Bridge>`_
* `Amazon SageMaker Model Parallel Library <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features-v2-tensor-parallelism.html>`_
* `Levanter <https://github.com/stanford-crfm/levanter>`_
* `GPT-NeoX <https://github.com/EleutherAI/gpt-neox>`_
* `Hugging Face Nanotron <https://github.com/huggingface/nanotron>`_ - Coming soon!
* `Colossal-AI <https://github.com/hpcaitech/ColossalAI>`_ - Coming soon!
* `PeriFlow <https://github.com/friendliai/periflow-python-sdk>`_ - Coming soon!

* `Hugging Face Nanotron <https://github.com/huggingface/nanotron>`_

Contributing
============
@@ -759,7 +753,7 @@ Papers
Videos
======

* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`__
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72778/>`_
* `Blackwell Numerics for AI | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72458/>`_
* `Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision | GTC 2025 <https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=zoho#/session/1726152813607001vnYK>`_
* `From FP8 LLM Training to Inference: Language AI at Scale | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72799/>`_