
Correctly pad scaling factor inverses to satisfy cuteDSL requirements #2924

Open

ksivaman wants to merge 8 commits into NVIDIA:main from ksivaman:pad_weight_scale_inv

Conversation

@ksivaman (Member)

Description

Fix the grouped MXFP8 swizzle when per-expert rows are not a multiple of 128, and pad each expert's scale inverses to multiples of (128, 4).
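For reference, a minimal sketch of the per-tensor padded shape the scale inverses are brought to (the helper name padded_scale_shape and the 32-element scaling block are assumptions for illustration, not code from this PR):

#include <cstddef>
#include <utility>

// Per-tensor padded shape of a rowwise MXFP8 scale-inverse buffer for an M x K
// expert, assuming one scale factor per 32 contiguous elements.
inline std::pair<size_t, size_t> padded_scale_shape(size_t M, size_t K) {
  constexpr size_t kScaleBlock = 32;
  const size_t scale_K  = (K + kScaleBlock - 1) / kScaleBlock;   // DIVUP(K, 32)
  const size_t padded_m = ((M + 127) / 128) * 128;               // roundup(M, 128)
  const size_t padded_k = ((scale_K + 3) / 4) * 4;               // roundup(scale_K, 4)
  return {padded_m, padded_k};
}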

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Ensure scaling factor inverses are padded to multiples of (128, 4) per tensor.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman (Member Author)

/te-ci

@greptile-apps (Contributor) commented Apr 24, 2026

Greptile Summary

This PR fixes grouped MXFP8 swizzle when per-expert rows are not a multiple of 128. The core change introduces a "compact" vs "per-tensor-padded" layout distinction: the quantize kernel writes a compact buffer (no padding between experts), while the swizzle output must be padded to (roundup(M,128), roundup(DIVUP(K,32),4)) per tensor for cuDNN consumption. The fix detects the input layout by comparing buffer sizes, sets separate input_stride_bytes/output_stride_bytes for the grouped kernels, and adds IS_PADDED_K/IS_PADDED_M compile-time template flags to prevent out-of-bounds loads past the compact per-tensor extent.
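A rough sketch of that size-based layout detection, written as standalone code with assumed names (not the PR's actual implementation):

#include <cstddef>
#include <stdexcept>

enum class ScaleLayout { PerTensorPadded, Compact };

// Classify the incoming grouped scale_inv buffer purely by its element count:
// per-tensor padded (num_tensors copies of the padded extent) or compact
// (experts packed back to back with only trailing group-level alignment).
inline ScaleLayout detect_scale_layout(size_t input_numel, size_t num_tensors,
                                       size_t padded_scale_elems,
                                       size_t compact_total_scale_elems) {
  if (input_numel == num_tensors * padded_scale_elems) return ScaleLayout::PerTensorPadded;
  if (input_numel == compact_total_scale_elems) return ScaleLayout::Compact;
  throw std::runtime_error("Grouped scale_inv size matches neither layout");
}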

Confidence Score: 5/5

Safe to merge; no P0/P1 issues found; logic is correct and well-tested across edge cases.

All findings are P2 or below. The compact-layout detection, OOB-load prevention, and output buffer allocation are logically correct and consistent between swizzle.cu and swizzle.cpp. The test suite covers aligned, unaligned, and mixed shapes including the originally-failing workload shape.

No files require special attention.

Important Files Changed

  • transformer_engine/common/swizzle/swizzle.cu - Core fix: adds compact-layout detection, separate input/output strides, and IS_PADDED_K/IS_PADDED_M template specializations to avoid OOB loads in the grouped uniform-shape kernels. Logic is correct; earlier padding_m/padding_k compound checks are cleanly separated into orthogonal guards.
  • transformer_engine/pytorch/csrc/extensions/swizzle.cpp - Python-facing layer: allocates the output buffer in the correct per-tensor padded shape (roundup(M,128), roundup(DIVUP(K,32),4)) instead of using the compact input shape, so cuDNN sees the right strides between experts. The compute_padded_grouped_scale_shape lambda correctly mirrors the swizzle.cu padding formulas for both rowwise and colwise directions.
  • tests/cpp/operator/test_swizzle.cu - Adds SwizzleGroupedCompactInputTestSuite covering aligned/unaligned M, unaligned K, and combinations; includes the originally failing shape (2, 2880, 2880). The gather_compact_grouped_scale helper faithfully replicates the quantize kernel's layout, including the trailing group-level alignment.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["maybe_swizzle_grouped_tensor (swizzle.cpp)"]
    A -->|allocate output| B["compute_padded_grouped_scale_shape\nnum_tensors × roundup(M,128) × roundup(⌈K/32⌉,4)"]
    A --> C["nvte_swizzle_grouped_scaling_factors (swizzle.cu)"]

    C --> D{Detect input layout}
    D -->|numel == num_tensors × padded_scale_elems| E["input_is_compact = false\ninput_stride = padded_m × padded_k"]
    D -->|numel == compact_total_scale_elems| F["input_is_compact = true\ninput_stride = m × padded_k (rowwise)\nor ⌈M/32⌉ × padded_m (colwise)"]
    D -->|mismatch| G[NVTE_ERROR]

    E --> H[dispatch_swizzle_*_kernel_impl]
    F --> H

    H -->|IS_PADDED_M=true, row ≥ original_M| I[Zero register, skip __ldg]
    H -->|IS_PADDED_K=true, k_coord ≥ original_K| J[Zero register, skip __ldg]
    H -->|in-bounds| K["__ldg + per-byte boundary zeroing"]

    I --> L[Output: per-tensor padded layout\noutput_stride = padded_m × padded_k]
    J --> L
    K --> L

Reviews (3): Last reviewed commit: "Add test for swizzle + padding fusion"

const auto logical_shape_nvte = input.logical_shape();
NVTE_CHECK(logical_shape_nvte.ndim >= 2,
"Grouped GEMM swizzle expects logical_shape with ndim >= 2.");
const size_t per_tensor_first_dim = logical_shape_nvte.data[0] / num_tensors;

P2 Silent truncation when logical_shape_nvte.data[0] is not divisible by num_tensors

per_tensor_first_dim is computed with plain integer division. If logical_shape_nvte.data[0] is not an exact multiple of num_tensors (e.g. due to a caller bug or unexpected grouped layout), the result is silently truncated, causing padded_m to be underestimated and the output buffer to be too small. A divisibility assertion would catch this much earlier with a clear error message.

Suggested change (adds a divisibility check after the existing line):

const size_t per_tensor_first_dim = logical_shape_nvte.data[0] / num_tensors;
NVTE_CHECK(logical_shape_nvte.data[0] % num_tensors == 0,
           "Grouped GEMM swizzle expects logical_shape first dim to be divisible by num_tensors.");

Comment on lines +2077 to 2087
bool input_is_compact;
if (input_scale_numel == input->num_tensors * padded_scale_elems) {
input_is_compact = false;
} else if (input_scale_numel == compact_total_scale_elems) {
input_is_compact = true;
} else {
NVTE_CHECK(input->columnwise_scale_inv.numel() == input->num_tensors * scale_elems,
"Grouped input columnwise_scale_inv size does not match expected packed size.");
NVTE_CHECK(output->columnwise_scale_inv.numel() == output->num_tensors * scale_elems,
"Grouped output columnwise_scale_inv size does not match expected packed size.");
NVTE_ERROR("Grouped input ", (rowwise ? "scale_inv" : "columnwise_scale_inv"),
" size does not match expected packed size (got ", input_scale_numel,
", expected either ", input->num_tensors * padded_scale_elems,
" (per-tensor padded) or ", compact_total_scale_elems, " (compact)).");
}

P2 Implicit contract on compact-buffer alignment is not validated

The compact_total_scale_elems formula assumes the upstream quantize kernel allocates the compact scale buffer with its total first dim rounded up to 128 (rowwise) or 4 (colwise). If a caller passes a "plain compact" buffer of size exactly num_tensors * m * padded_k (without trailing alignment slack), neither branch matches and NVTE_ERROR fires with a size-mismatch message that may be hard to diagnose.

Consider also accepting num_tensors * compact_scale_elems as a valid compact size, or documenting this alignment requirement in the error message.
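A rough sketch of that relaxation, with identifiers borrowed from the snippet above (compact_scale_elems is assumed to be the unaligned per-tensor compact size; this is not the PR's code):

bool input_is_compact;
if (input_scale_numel == input->num_tensors * padded_scale_elems) {
  input_is_compact = false;
} else if (input_scale_numel == compact_total_scale_elems ||
           input_scale_numel == input->num_tensors * compact_scale_elems) {
  // Also accept a plain compact buffer without trailing group-level alignment slack.
  input_is_compact = true;
} else {
  NVTE_ERROR("Grouped scale_inv size matches neither the per-tensor padded, the aligned compact, "
             "nor the plain compact layout (got ", input_scale_numel, ").");
}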

@ptrendx (Member) commented Apr 24, 2026

@ksivaman Could you add a test exercising the change?

@ksivaman (Member Author)

/te-ci

@Oleg-Goncharov (Collaborator) left a comment

LGTM overall

size_t group_first_align;
if (rowwise) {
per_tensor_first_unpadded = M_per_tensor;
const size_t scale_K = (K_per_tensor + BLOCK - 1) / BLOCK;

We already have divide_round_up and round_up_to_nearest_multiple helpers for this. Could you please use them instead?
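For illustration, the last line of the snippet above rewritten with those helpers (exact helper signatures assumed):

// Assuming size_t divide_round_up(size_t x, size_t y) from the common utilities.
const size_t scale_K = divide_round_up(K_per_tensor, BLOCK);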

const NVTEShape rs = input->rowwise_scale_inv_shape();
zero_scale_inv_padding(input->rowwise_cpu_scale_inv_ptr<uint8_t>(),
rs.data[0], rs.data[1],
M, (K + BLOCK_SIZE - 1) / BLOCK_SIZE);

Here too

const NVTEShape cs = input->columnwise_scale_inv_shape();
zero_scale_inv_padding(input->columnwise_cpu_scale_inv_ptr<uint8_t>(),
cs.data[0], cs.data[1],
(M + BLOCK_SIZE - 1) / BLOCK_SIZE, K);

And here

void* output_ptr = rowwise ? output->scale_inv.dptr : output->columnwise_scale_inv.dptr;

if (rowwise) {
switch (vec_load_size) {

To avoid code duplication, I'd suggest replacing this switch with a macro similar to the one below:

#define TRANSFORMER_ENGINE_VECTORIZED_LOAD_INTEGER_TYPE_SWITCH(INTEGER_ELTS_NUM, type, ...) \
  switch (INTEGER_ELTS_NUM) {                                                               \
    case 1: {                                                                               \
      using type = int;                                                                     \
      { __VA_ARGS__ }                                                                       \
    } break;                                                                                \
    case 2: {                                                                               \
      using type = int2;                                                                    \
      { __VA_ARGS__ }                                                                       \
    } break;                                                                                \
    case 4: {                                                                               \
      using type = int4;                                                                    \
      { __VA_ARGS__ }                                                                       \
    } break;                                                                                \
    default: {                                                                              \
      NVTE_ERROR("Unsupported number of integer elements ", INTEGER_ELTS_NUM,               \
                 ". Expected one of: 1, 2, or 4.");                                         \
    }                                                                                       \
  }
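A hypothetical use of this macro in place of the rowwise switch; the launch function and its arguments are placeholders, not the PR's actual kernel call:

// The macro binds the selected vector type (int/int2/int4) to LType and runs the body once.
TRANSFORMER_ENGINE_VECTORIZED_LOAD_INTEGER_TYPE_SWITCH(vec_load_size, LType,
  launch_rowwise_swizzle<LType>(input_ptr, output_ptr, stream);  // placeholder call
);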

NVTE_ERROR("Not valid vec_load_size.");
}
} else {
switch (vec_load_size) {

And this switch too

// Per-byte K masking is still needed when only part of the register is past
// original_K (i.e. row is in range but the K position spans the boundary).
if constexpr (IS_PADDED_K) {
for (int j = 0; j < N_TILE_PER_TD * sizeof(int); j++) {

Adding #pragma unroll here would help performance.
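That is, something along these lines (loop body elided; identifiers taken from the snippet above):

if constexpr (IS_PADDED_K) {
  #pragma unroll
  for (int j = 0; j < N_TILE_PER_TD * sizeof(int); j++) {
    // per-byte K masking, unchanged
  }
}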

