[TRITON/GLUON]: Add moe_a16w4 gluon(gfx1250) kernel #3277

Open
rahulbatra85 wants to merge 3 commits into main from gluon_moe_a16w4_batra

@rahulbatra85

Contributor

Motivation

Adds moe_a16w4 gluon kernel

Technical Details

Adds the moe_a16w4 Gluon kernel. The 4-bit weights are first scaled and upcast to bf16, and then a bf16 GEMM is performed. The Gluon kernel also uses gfx1250 TDM ops and pipelining.
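
For readers who want the dequantize-then-GEMM flow spelled out, here is a minimal pure-Python reference sketch. It is not the PR's kernel code. The PR text does not state the 4-bit encoding, packing order, or scale layout, so the sketch assumes FP4 (E2M1) codes packed two per byte with the low nibble first and one float scale per group of consecutive K elements. All function names are hypothetical, and Python floats stand in for bf16.

```python
# Hypothetical reference sketch of "scale and upcast 4-bit weights, then GEMM".
# Assumptions (not confirmed by the PR): E2M1 codes, two per byte, low nibble
# first, one scale per `group` consecutive K elements. Floats stand in for bf16.

# Magnitudes representable by the 3 low bits of an E2M1 code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code):
    """Decode one 4-bit E2M1 code (bit 3 is the sign) to a float."""
    magnitude = E2M1_VALUES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_row(packed_row):
    """Split each byte into two 4-bit codes, low nibble first."""
    codes = []
    for byte in packed_row:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def dequantize(packed, scales, group):
    """packed: N rows of K/2 bytes; scales: N rows of K/group floats.

    Returns an N x K list of floats (the upcast weights).
    """
    weights = []
    for packed_row, scale_row in zip(packed, scales):
        codes = unpack_row(packed_row)
        weights.append(
            [decode_fp4(c) * scale_row[k // group] for k, c in enumerate(codes)]
        )
    return weights


def gemm_a16w4(a, packed_w, scales, group):
    """a: M x K activations; packed_w/scales describe N x K weights.

    Returns the M x N product a @ w.T after dequantizing w.
    """
    w = dequantize(packed_w, scales, group)
    return [[sum(x * y for x, y in zip(a_row, w_row)) for w_row in w] for a_row in a]


# One activation row against one weight row: the bytes 0x21, 0xA4 decode to
# 0.5, 1.0, 2.0, -1.0, which the group scales 2.0 and 0.5 turn into
# 1.0, 2.0, 1.0, -0.5.
result = gemm_a16w4([[1.0, 2.0, 3.0, 4.0]], [[0x21, 0xA4]], [[2.0, 0.5]], group=2)
```

A sketch like this can serve as a slow host-side reference when checking kernel output on small shapes, provided the assumed encoding is swapped for whatever the kernel actually uses.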

Test Plan

Unit tests

Test Result

All tests should pass

Submission Checklist

@rahulbatra85 rahulbatra85 requested a review from a team May 19, 2026 21:04
@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3277 --add-label <label>

@rahulbatra85 rahulbatra85 marked this pull request as draft May 19, 2026 21:08
@rahulbatra85 rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from c5aa5db to f62226f June 8, 2026 23:39
@lburzawa lburzawa force-pushed the gluon_moe_a16w4_batra branch from f62226f to fed066b June 10, 2026 02:15
@rahulbatra85 rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch 2 times, most recently from 8a6cabc to aaefe90 June 10, 2026 20:15
@rahulbatra85 rahulbatra85 marked this pull request as ready for review June 10, 2026 20:18
@rahulbatra85 rahulbatra85 changed the title [TRITON/GLUON]: Add moe_a16w4 gluon kernel [TRITON/GLUON]: Add moe_a16w4 gluon(gfx1250) kernel Jun 10, 2026
@rahulbatra85 rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch 6 times, most recently from 309ab0b to c036420 June 12, 2026 04:04
@rahulbatra85 rahulbatra85 requested a review from azaidy June 12, 2026 16:45
@lburzawa

lburzawa commented Jun 19, 2026

Contributor

@rahulbatra85 check out some gluon gfx1250 optimizations I added to the a8w4 kernel; they should also help in your case.

@rahulbatra85 rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from c036420 to 8a08545 June 19, 2026 17:00
@rahulbatra85

Contributor Author

> @rahulbatra85 check out some gluon gfx1250 optimizations I added to a8w4 kernel that should also help in your case.

Ok, will update my code.

@lburzawa

Contributor

One more thing: we can use TDM for the bias too. We just have to be careful with the waits in the epilogue if the load is issued beforehand. You can also take a look at how it's done in the a8w4 kernel.
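
To make the concern about epilogue waits concrete, here is a toy Python model of an asynchronous load queue. It does not use any real TDM or Gluon API. It assumes loads complete in issue order and that a wait takes the number of loads allowed to remain in flight. Neither assumption is confirmed in this thread, and the class and function names are hypothetical.

```python
from collections import deque


class LoadQueue:
    """Toy model of an in-order async load queue (hypothetical, not a real API)."""

    def __init__(self):
        self.in_flight = deque()  # loads issued but not yet known complete
        self.done = set()         # loads guaranteed complete

    def issue(self, name):
        """Start an asynchronous load; it is not safe to read yet."""
        self.in_flight.append(name)

    def wait(self, max_outstanding):
        """Block until at most max_outstanding loads remain in flight.

        Assumes in-order completion, so the oldest loads retire first.
        """
        while len(self.in_flight) > max_outstanding:
            self.done.add(self.in_flight.popleft())


def simulate(epilogue_waits_for_bias):
    """Return (tile_ready, bias_ready) at the point the epilogue reads the bias."""
    q = LoadQueue()
    q.issue("last_tile")
    q.issue("bias")   # bias load issued early, queued behind the last tile load
    q.wait(1)         # main loop: only the tile must be complete here
    tile_ready = "last_tile" in q.done
    if epilogue_waits_for_bias:
        q.wait(0)     # epilogue: drain everything before touching the bias
    bias_ready = "bias" in q.done
    return tile_ready, bias_ready
```

In this model, a wait tuned for the main loop leaves the early bias load in flight, so the epilogue needs its own wait before reading the bias: `simulate(False)` returns `(True, False)` and `simulate(True)` returns `(True, True)`.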

@rahulbatra85 rahulbatra85 force-pushed the gluon_moe_a16w4_batra branch from 97aceaa to a0f5bd0 June 19, 2026 18:32
2 participants