HipKittens MXFP8 GEMM Support by alextmagro · Pull Request #566 · ROCm/TransformerEngine

alextmagro · 2026-04-28T05:16:00Z

Creates an MXFP8 GEMM with HipKittens that outperforms hipBLASLt and offers additional epilogues such as BIAS and GELU AUX.

Requires a workspace sized relative to the model, often larger than the hipBLASLt workspace, in exchange for significant performance improvements. Only builds for gfx950, and requires M and N to be multiples of 256.
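
The constraints above could be checked before dispatching to the HipKittens path. The following is a minimal sketch, reading "M / 256 and N / 256" as a divisibility requirement. The helper name `hipkittens_mxfp8_supported` and the plain architecture string are hypothetical and do not come from the PR. No requirement on K is stated in the description, so K is not checked here.

```python
TILE = 256  # M and N must be multiples of this, per the PR description


def hipkittens_mxfp8_supported(m: int, n: int, arch: str) -> bool:
    """Hypothetical pre-dispatch check; not the PR's actual code."""
    # The PR description says the kernel only builds for gfx950.
    if arch != "gfx950":
        return False
    # Both output dimensions must be positive multiples of the tile size.
    return m > 0 and n > 0 and m % TILE == 0 and n % TILE == 0
```

A caller would presumably fall back to the hipBLASLt path whenever a check like this returns False.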

Adds the HipKittens header library as a submodule.

ipanfilo · 2026-05-08T16:46:34Z

                         [](const testing::TestParamInfo<DqGEMMTestSuite::ParamType>& info) {
-                           return MKN(std::get<0>(info.param)) + "x" + TN(std::get<3>(info.param));
+                           return MKN(std::get<0>(info.param)) + "x" +
+                                  std::to_string(std::get<1>(info.param)) + "x" +


What is the point of these? They are only ever set to false.

ipanfilo · 2026-05-08T17:15:28Z


-    return torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)
+    key = (device, ub, grouped_gemm)
+    ws = _workspace_cache.get(key)


Why don't we rely on torch memory caching?

I have made this change. I will need to do an E2E run to make sure that performance isn't affected, but it should be OK given my understanding of torch.empty().

ipanfilo · 2026-05-14T23:09:35Z

+  if (use_hipkittens) {
+    auto param = CanonicalizeGemmInput(*inputA, transa, *inputB, transb, m, n, k);
+
+    hipStream_t s = use_service_stream ? ss_ctl.stream : stream;


Same as with is_mxfp8: there is no point in having it defined for one branch only.

ipanfilo · 2026-05-15T00:21:26Z

@@ -743,12 +786,15 @@ MAKE_DQ_GEMM_TEST(Testfp8xfp8xfp16, fp8, fp8, fp16)

 INSTANTIATE_TEST_SUITE_P(OperatorTest, DqGEMMTestSuite,


If you end up having a separate prefix for MXFP8, it has to be used for this suite as well, for consistency.

ipanfilo · 2026-05-15T00:37:41Z

@@ -30,7 +30,9 @@ std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes = {

 std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes_mxfp8 = {


test_case_sizes_mxfp8 is only used for DqGEMMTest. Is it intentional to add sizes there?

Yes, I wanted to add the minimum possible size that HipKittens supports, which is 256x256x256.

ipanfilo · 2026-05-15T21:40:23Z

+
+    is_mxfp8 = isinstance(A, MXFP8TensorStorage) or isinstance(B, MXFP8TensorStorage)
+    if is_mxfp8 and _use_hipkittens():
+        a_size = A.size() if hasattr(A, "size") and callable(A.size) else A.shape


MXFP8TensorStorage has a callable size(). What other object could be here that requires this condition?

I was considering a scenario where A or B was not MXFP8, but we always have them both as MXFP8, so I think it is OK to simplify the logic.

ipanfilo · 2026-06-04T19:58:05Z

+
+static int fp8_code(int dt) {
+    switch (dt) {
+    case KITTENS_FP8E4M3: return 0;


Are these codes and the ones below just arbitrary indexes, or special values?

The fp8_code values are the ones used within v_mfma_scale_f32_16x16x128_f8f6f4 to designate whether we are using e5m2 or e4m3.
outcode is arbitrary, and is used for the switch to cast to 16-bit dtypes when needed.

I found it: they go down to dispatch_fp8_types. Please add a similar comment here about the meaning of 0/1 or, better, replace 0/1 with an enum or defines so they are easier to track.

I have added a comment there to help keep things clear.
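
For illustration, the enum alternative suggested above might look like the sketch below. It is written in Python rather than the kernel's C++ so it can be run standalone. `Fp8Code` and the dtype strings are hypothetical stand-ins for the KITTENS_* constants. Only E4M3 = 0 appears in the diff; E5M2 = 1 is an assumption inferred from the "0/1" wording in this thread.

```python
from enum import IntEnum


class Fp8Code(IntEnum):
    """Named selector codes for the FP8 format (illustrative only)."""
    E4M3 = 0  # matches `case KITTENS_FP8E4M3: return 0;` in the diff
    E5M2 = 1  # assumption: inferred from the "0/1" discussion, not shown in the diff


def fp8_code(dtype_name: str) -> Fp8Code:
    # The string keys are hypothetical stand-ins for the KITTENS_* constants.
    table = {"fp8e4m3": Fp8Code.E4M3, "fp8e5m2": Fp8Code.E5M2}
    try:
        return table[dtype_name]
    except KeyError:
        raise ValueError(f"unsupported FP8 dtype: {dtype_name}") from None
```

Named members keep the numeric values the instruction expects while making each call site self-describing, which is the traceability the reviewer asked for.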

ipanfilo · 2026-06-18T18:54:22Z

+
+static int fp8_code(int dt) {
+    switch (dt) {
+    case KITTENS_FP8E4M3: return 0;


I found it: they go down to dispatch_fp8_types. Please add a similar comment here about the meaning of 0/1 or, better, replace 0/1 with an enum or defines so they are easier to track.

ipanfilo · 2026-06-18T18:58:42Z

+    return ws
+
+
+def _use_hipkittens() -> bool:


Maybe add a cache so the env is read only once.

Done. Cached both jax and pytorch.

ipanfilo · 2026-06-18T19:02:31Z

        raise


+def _use_hipkittens() -> bool:


Maybe decorate it with a cache so the env is not re-read every time.

HipKittens MXFP8 GEMM Support

f9d5ce2

alextmagro requested review from aris134, matthiasdiener and zstreet87 April 28, 2026 05:16

alextmagro requested review from ipanfilo, wangye805 and wenchenvincent as code owners April 28, 2026 05:16

alextmagro added the ci-level 1 CI test level 1 label Apr 28, 2026

wangye805 requested changes May 1, 2026

alextmagro added 3 commits May 5, 2026 15:05

Update HipKittens branch after upstream MXFP8 merge

aac5860

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

c917ed0

Update HipKittens commit and address PR comments

3a91321

alextmagro requested a review from wangye805 May 5, 2026 20:26

alextmagro added 5 commits May 5, 2026 20:26

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8 with conflicts

cc719fe

Resolve conflicts, ensure fp4 workspace changes are harmonious

fcda154

min workspace size guaranteed

70fba6d

add hipkittens to wheels

455002e

fix issue with gfx942 for unified build

ba60ef5

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

ipanfilo requested changes May 8, 2026

alextmagro added 2 commits May 12, 2026 02:59

Cleanup and workspace changes

f72b7b8

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

731640a

alextmagro requested review from aris134 and ipanfilo May 12, 2026 13:24

alextmagro added 3 commits May 12, 2026 16:56

fix jax import issue

1960c06

Fix autotuning bug

320152e

fix pytorch import

a280cf7

Cleanup style and build_tools relics

824841d

alextmagro requested a review from ipanfilo May 14, 2026 17:18

alextmagro added ci-level 3 CI test level 3 and removed ci-level 1 CI test level 1 labels May 14, 2026

matthiasdiener reviewed May 14, 2026

Comment thread transformer_engine/jax/cpp_extensions/gemm.py

Fix whitespaces and comment issues

f66f77c

ipanfilo reviewed May 15, 2026

alextmagro added 5 commits May 18, 2026 17:52

Kernel optimizations

0b6e702

Add use_hipkittens_mxfp8 bool to test_cublaslt_gemm.cu

816c752

rocm_gemm.cu cleanup

aaa88d7

Add env check to jax file

e2203c0

Simplify Workspace Check

7648594

alextmagro requested a review from ipanfilo May 18, 2026 20:43

alextmagro added 5 commits May 18, 2026 22:49

Revert kernel optimizations

03f675b

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

f852c22

Readd dropped test code

3b307bb

Skip unsupported MXFP8 FSDP tests

33a5c45

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

cfa5fac

ipanfilo reviewed May 29, 2026

Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated

Fix inverted workspace logic in tests

44ea357

alextmagro requested a review from ipanfilo May 29, 2026 19:34

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

d76a8d2

ipanfilo reviewed Jun 4, 2026

alextmagro added 2 commits June 17, 2026 20:36

HK commit update and kernel optimization

495e271

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

14fb3e1

alextmagro requested a review from ipanfilo June 17, 2026 21:09

ipanfilo reviewed Jun 18, 2026

Add comments and caching

01b5203

alextmagro requested a review from ipanfilo June 18, 2026 22:47
