[fea] Add fp32 RMSNorm output for fused qk group quant by wuhuikx · Pull Request #3785 · ROCm/aiter

wuhuikx · 2026-06-18T02:16:53Z

Summary

Add an optional q_out_unquantized_fp32 output to fused_qk_rmsnorm_group_quant so callers can reuse the kernel's internal fp32 RMSNorm result.
Wire the Python wrapper, pybind binding, C++ declaration, dispatch macros, and kernel store path for the new output while preserving existing call sites.
Keep per-token quant forwarding compatible by passing std::nullopt for the new optional output.
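
The optional-output pattern described above can be sketched in plain Python. This is a reference illustration only, not the aiter kernel or its signature: `rmsnorm_group_quant_ref`, `out_unquantized`, `group_size`, and the int8-style symmetric quantization are all hypothetical stand-ins, and Python floats stand in for the kernel's fp32 intermediate.

```python
import math
from typing import List, Optional, Tuple


def rmsnorm_group_quant_ref(
    x: List[float],
    weight: List[float],
    eps: float = 1e-6,
    group_size: int = 4,
    out_unquantized: Optional[List[float]] = None,  # hypothetical optional output
) -> Tuple[List[int], List[float]]:
    """Reference RMSNorm followed by per-group symmetric quantization."""
    # RMSNorm in full precision: x * rsqrt(mean(x^2) + eps) * weight.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(x, weight)]

    # Optional store: only written when the caller supplies a buffer,
    # so existing call sites that omit it behave exactly as before.
    if out_unquantized is not None:
        out_unquantized[:] = normed

    # Per-group quantization (int8-style here purely for illustration).
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(normed), group_size):
        group = normed[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round(v / scale) for v in group)
    return quantized, scales


x = [1.0, -2.0, 3.0, -4.0]
w = [1.0, 1.0, 1.0, 1.0]
buf = [0.0] * len(x)
q, s = rmsnorm_group_quant_ref(x, w, group_size=2, out_unquantized=buf)
q_plain, s_plain = rmsnorm_group_quant_ref(x, w, group_size=2)
```

Passing no buffer yields the same quantized values and scales, which mirrors how the per-token quant forwarding path stays unchanged by passing `std::nullopt`.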

Performance

Measured with MiniMax-M3-MXFP4 on ATOM, using the same aiter build for both runs and toggling only the ATOM router path: the baseline takes the bf16 norm output and calls .float(), while the optimized path reuses the fp32 norm output. Workload: ISL=8000, OSL=1000, TP=4, max_num_seqs=32.

CON=4: output tok/s 395.33 -> 407.61 (+3.11%), mean TPOT 9.45 -> 9.14 ms (-3.28%)
CON=8: output tok/s 707.74 -> 745.23 (+5.30%), mean TPOT 10.32 -> 9.97 ms (-3.39%)
CON=16: output tok/s 1217.98 -> 1254.47 (+3.00%), mean TPOT 12.23 -> 11.99 ms (-1.96%)
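
The percentages above are relative changes against the baseline. A short script that recomputes them from the reported before/after values (the `runs` layout and `pct_change` helper are just illustrative names):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


# (baseline, optimized) pairs copied from the figures reported above.
runs = {
    4: {"tok_s": (395.33, 407.61), "tpot_ms": (9.45, 9.14)},
    8: {"tok_s": (707.74, 745.23), "tpot_ms": (10.32, 9.97)},
    16: {"tok_s": (1217.98, 1254.47), "tpot_ms": (12.23, 11.99)},
}

for con, metrics in runs.items():
    tok = pct_change(*metrics["tok_s"])
    tpot = pct_change(*metrics["tpot_ms"])
    print(f"CON={con}: output tok/s {tok:+.2f}%, mean TPOT {tpot:+.2f}%")
```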

Test plan

python3 -m py_compile aiter/ops/fused_qk_rmsnorm_group_quant.py
JIT smoke test for fused_qk_rmsnorm_group_quant(..., q_out_unquantized_fp32=...) on CUDA/bf16 input.
MiniMax-M3 ATOM serving benchmark for CON=4,8,16 with the above workload.

Expose the internal fp32 RMSNorm result as an optional output so callers can reuse it for fp32 router projections without an extra cast.
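
For context on the baseline path (bf16 norm output followed by .float()), the sketch below emulates a bf16 round trip with the standard library. It is a general observation about the number formats, not something measured in this PR: a value stored as bf16 keeps 8 bits of significand precision, so casting it back to fp32 cannot restore the bits that were rounded away. The helper names are hypothetical, and NaN/infinity handling is omitted.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a finite value to bf16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of the fp32 encoding; add a bias so that
    # truncation implements round-to-nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = to_fp32(0.36514837)      # an fp32 intermediate, e.g. a normalized element
via_bf16 = round_to_bf16(value)  # what survives a bf16 store plus .float()
print(value, via_bf16, abs(value - via_bf16))
```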

github-actions · 2026-06-18T02:17:17Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3785 --add-label <label>`.

[fea] Add fp32 RMSNorm output for fused qk group quant

6a057ea

Expose the internal fp32 RMSNorm result as an optional output so callers can reuse it for fp32 router projections without an extra cast.

wuhuikx requested a review from a team June 18, 2026 02:16

Merge branch 'main' into xiaobing/norm_fp32

32424a8

XiaobingSuper requested a review from valarLip June 18, 2026 02:56

wuhuikx wants to merge 2 commits into main from xiaobing/norm_fp32
