idxsknorm_shuffle_layout support shuffle kv cache layout #3795

Open: ganyi1996ppo wants to merge 1 commit into main from ganyi/idxqknorm_shuffle_layout

@ganyi1996ppo

Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@ganyi1996ppo ganyi1996ppo requested review from a team and Copilot June 18, 2026 06:25
@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3795 --add-label <label>

Copilot AI left a comment

Contributor


Pull request overview

Adds support for a SHUFFLE (“asm_layout”) KV-cache layout in the fused_qknorm_idxrqknorm ROCm op. The op API now accepts separate K and V cache tensors, and a new asm_layout flag selects the addressing logic. Tests cover the new layout, including fp8 variants.

Changes:

  • Updated the fused kernel + C++/pybind API to take kv_cache_k and kv_cache_v separately and added asm_layout to enable SHUFFLE (page-16) cache addressing.
  • Extended the Python wrapper to pass the new arguments through and added new asm_layout test modes.
  • Added test-side reference validation for SHUFFLE caches using reshape_and_cache(asm_layout=True) as ground truth.
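The page sizes named above (page-16 for SHUFFLE, page-128 for the plain layout) can be illustrated with a minimal sketch of generic paged addressing. The helper `slot_to_page` and the constants are hypothetical names, not part of the PR, and the sketch only splits a flat token slot into a page index and an in-page offset. Whatever element reordering the real SHUFFLE layout performs inside a page is not shown on this page and is not modeled here.

```python
def slot_to_page(slot, page_size):
    """Split a flat token slot into (page index, offset within the page).

    Hypothetical helper for illustration; not the kernel's addressing code.
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return divmod(slot, page_size)


# Page sizes taken from the PR overview; the constant names are assumptions.
PAGE_SHUFFLE = 16   # asm_layout=True
PAGE_PLAIN = 128    # asm_layout=False

slot = 300
shuffle_addr = slot_to_page(slot, PAGE_SHUFFLE)  # page 18, offset 12
plain_addr = slot_to_page(slot, PAGE_PLAIN)      # page 2, offset 44
print(shuffle_addr, plain_addr)
```

The same slot lands on a different page depending on the flag, which is why the caller has to state the layout explicitly rather than have the kernel guess it.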

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| op_tests/test_fused_qknorm_idxrqknorm.py | Adds SHUFFLE cache allocation + reference comparisons; adds asm_layout parameterized cases. |
| csrc/kernels/fused_qknorm_idxrqknorm.cu | Implements SHUFFLE K/V cache addressing and updates kernel launcher/API to use separate K/V caches + asm_layout. |
| csrc/include/rocm_ops.hpp | Updates pybind signatures/args for the new cache parameters and asm_layout. |
| csrc/include/fused_qknorm_idxrqknorm.h | Updates public C++ header signatures to kv_cache_k/kv_cache_v and adds asm_layout. |
| aiter/ops/fused_qknorm_idxrqknorm.py | Updates Python entrypoint signatures to kv_cache_k/kv_cache_v and forwards asm_layout. |


Comment thread csrc/include/rocm_ops.hpp
  py::arg("k_scale"), \
- py::arg("v_scale"))
+ py::arg("v_scale"), \
+ py::arg("asm_layout") = false)
Comment on lines +102 to 111
+ # The main K/V caches are always passed as separate kv_cache_k / kv_cache_v
+ # tensors. asm_layout selects the in-cache addressing: page-16 SHUFFLE
+ # (asm_layout=True) vs plain page-128 (asm_layout=False, where kv_cache_k /
+ # kv_cache_v are typically the key/value slices of a fused
+ # [num_blocks, 2, block_size, num_kv_heads, head_dim] cache).
  if (
-     kv_cache is not None
+     kv_cache_k is not None
      and isinstance(kv_cache_dtype, str)
      and kv_cache_dtype.startswith("fp8")
  ):
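The comment and condition in this fragment can be sketched in plain Python. The sketch assumes the fused cache is contiguous and row-major, which the PR page does not confirm, and the names `fused_offset` and `uses_fp8_cache` are hypothetical. It shows where the key slice (index 0 on the second axis) and the value slice (index 1) of a `[num_blocks, 2, block_size, num_kv_heads, head_dim]` cache would start, and restates the fp8 check from the diff as a standalone predicate.

```python
def fused_offset(b, kv, t, h, d, block_size, num_kv_heads, head_dim):
    """Flat element offset in a [num_blocks, 2, block_size, num_kv_heads, head_dim]
    cache, ASSUMING contiguous row-major storage (an assumption of this sketch).
    kv=0 selects the key slice, kv=1 the value slice.
    """
    return (((b * 2 + kv) * block_size + t) * num_kv_heads + h) * head_dim + d


def uses_fp8_cache(kv_cache_k, kv_cache_dtype):
    """Mirror of the condition in the diff: a K cache is present and the
    dtype is a string starting with "fp8"."""
    return (
        kv_cache_k is not None
        and isinstance(kv_cache_dtype, str)
        and kv_cache_dtype.startswith("fp8")
    )


# Illustrative dimensions only; not taken from the PR.
dims = dict(block_size=128, num_kv_heads=2, head_dim=4)
k_start_block0 = fused_offset(0, 0, 0, 0, 0, **dims)
v_start_block0 = fused_offset(0, 1, 0, 0, 0, **dims)
k_start_block1 = fused_offset(1, 0, 0, 0, 0, **dims)
print(k_start_block0, v_start_block0, k_start_block1)
```

Under these assumptions the key and value slices of one block sit back to back, so passing them as separate tensors amounts to passing two views of the same storage.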
