[TRITON/GLUON]: Add moe_a16w4 gluon(gfx1250) kernel by rahulbatra85 · Pull Request #3277 · ROCm/aiter

rahulbatra85 · 2026-05-19T21:04:27Z

Motivation

Adds a moe_a16w4 Gluon kernel for gfx1250.

Technical Details

Adds the moe_a16w4 Gluon kernel. The 4-bit weights are first scaled and upcast to bf16, and then a bf16 GEMM is performed. The kernel also uses gfx1250 TDM ops and pipelining.
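For readers unfamiliar with the a16w4 flow, here is a rough NumPy reference of the "scale and upcast, then GEMM" step described above. It is only a sketch under assumptions that this PR does not state: the 4-bit codes are taken to be FP4 E2M1, two codes are packed per byte along K with the low nibble first, scales are per group of 32 along K, and float32 stands in for bf16 because NumPy has no bf16 type. The names `dequant_w4` and `gemm_a16w4_ref` are hypothetical and are not part of aiter.

```python
import numpy as np

# Assumed 4-bit format: FP4 E2M1 code -> value lookup table (not confirmed by the PR).
FP4_E2M1 = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)


def dequant_w4(packed, scales, group_size=32):
    """Unpack and scale 4-bit weights.

    packed: (K // 2, N) uint8, two 4-bit codes per byte along K,
            low nibble first (assumed layout).
    scales: (K // group_size, N) float32, one scale per group along K (assumed).
    Returns a (K, N) float32 matrix (float32 stands in for bf16 here).
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.uint8)
    codes[0::2] = lo  # even K rows come from the low nibble
    codes[1::2] = hi  # odd K rows come from the high nibble
    w = FP4_E2M1[codes]
    # Broadcast each group scale over its group_size rows along K.
    return w * np.repeat(scales, group_size, axis=0)


def gemm_a16w4_ref(a, packed, scales, group_size=32):
    """Reference path: dequantize the weights, then run a plain GEMM."""
    w = dequant_w4(packed, scales, group_size)
    return a.astype(np.float32) @ w
```

A reference like this is mainly useful for checking kernel output in unit tests; the actual kernel performs the upcast on the fly inside the pipelined loop rather than materializing the full weight matrix.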

Test Plan

Unit tests

Test Result

All tests should pass

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-05-19T21:04:58Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3277 --add-label <label>

lburzawa · 2026-06-19T07:26:06Z

@rahulbatra85 check out some gluon gfx1250 optimizations I added to the a8w4 kernel that should also help in your case.

Simpler prologue with fewer loads/waits when not using XCD swizzle

aiter/aiter/ops/triton/_gluon_kernels/gfx1250/moe/moe_op_gemm_a8w4.py

Line 177 in a40c487

else:

Using 16-bit gather indices where available

aiter/aiter/ops/triton/_gluon_kernels/gfx1250/moe/moe_op_gemm_a8w4.py

Line 197 in a40c487

if GatherIndx.dtype.element_ty == gl.uint16:

Setting an OOB index for gather indices that represent padding so we don't issue loads there

aiter/aiter/ops/triton/_gluon_kernels/gfx1250/moe/moe_op_gemm_a8w4.py

Line 201 in a40c487

oob_idx = (num_tokens).to(gl.uint16)

Using TDM store

aiter/aiter/ops/triton/_gluon_kernels/gfx1250/moe/moe_op_gemm_a8w4.py

Line 608 in a40c487

gl.amd.gfx1250.tdm.async_store(
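To illustrate the gather-index suggestions above (16-bit indices and an out-of-bounds sentinel for padding), here is a host-side NumPy sketch. It does not reproduce the a8w4 kernel code. It assumes padding slots are marked with -1 before remapping and that a row index equal to `num_tokens` is treated as out of bounds and skipped. The helper names `prepare_gather_indices` and `gather_rows` are hypothetical.

```python
import numpy as np


def prepare_gather_indices(gather_idx, num_tokens, pad_value=-1):
    """Remap padding entries to an out-of-bounds index and narrow the dtype.

    pad_value=-1 as the padding marker is an assumption for this sketch.
    uint16 is used only when the sentinel (num_tokens) itself fits in 16 bits.
    """
    idx = np.asarray(gather_idx, dtype=np.int64)
    dtype = np.uint16 if num_tokens <= np.iinfo(np.uint16).max else np.int32
    # Padding slots point one past the last valid row, i.e. out of bounds.
    return np.where(idx == pad_value, num_tokens, idx).astype(dtype)


def gather_rows(x, idx):
    """Gather rows of x; out-of-bounds indices are skipped and yield zero rows."""
    valid = idx < x.shape[0]
    out = np.zeros((idx.shape[0], x.shape[1]), dtype=x.dtype)
    out[valid] = x[idx[valid]]  # loads happen only for in-bounds rows
    return out
```

The point of the sentinel is that padded slots never touch memory: a bounds check that is already needed for correctness doubles as the padding filter.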

rahulbatra85 · 2026-06-19T17:03:13Z

Ok, will update my code.

lburzawa · 2026-06-19T17:14:33Z

One more thing: we can use TDM for the bias too. Just be careful with the waits in the epilogue if you issue the load beforehand. You can also take a look at how it's done in the a8w4 kernel.

rahulbatra85 requested a review from a team May 19, 2026 21:04

rahulbatra85 marked this pull request as draft May 19, 2026 21:08

rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from c5aa5db to f62226f Compare June 8, 2026 23:39

lburzawa force-pushed the gluon_moe_a16w4_batra branch from f62226f to fed066b Compare June 10, 2026 02:15

rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch 2 times, most recently from 8a6cabc to aaefe90 Compare June 10, 2026 20:15

rahulbatra85 marked this pull request as ready for review June 10, 2026 20:18

rahulbatra85 requested review from brunomazzottiamd and lburzawa June 10, 2026 20:19

rahulbatra85 changed the title ~~[TRITON/GLUON]: Add moe_a16w4 gluon kernel~~ [TRITON/GLUON]: Add moe_a16w4 gluon(gfx1250) kernel Jun 10, 2026

rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch 6 times, most recently from 309ab0b to c036420 Compare June 12, 2026 04:04

rahulbatra85 requested a review from azaidy June 12, 2026 16:45

[TRITON/GLUON]: Add moe_a16w4 gluon kernel

8a08545

rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from c036420 to 8a08545 Compare June 19, 2026 17:00

Added optimizations from moe a8w4 kernel

a0f5bd0

rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from 97aceaa to a0f5bd0 Compare June 19, 2026 18:32

Use TDM for bias

7bf710a

rahulbatra85 wants to merge 3 commits into main from gluon_moe_a16w4_batra
