[fea]: gfx1250 allreduce poc by TennyWang1223 · Pull Request #3802 · ROCm/aiter

TennyWang1223 · 2026-06-18T12:44:47Z

Motivation

Add a proof-of-concept (PoC) allreduce for gfx1250.

Technical Details

Naive implementation with no performance optimization yet.

Test Plan

Run the test script at tensor-parallel sizes 4 and 2 (tp4 and tp2).

Test Result

Passed.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: HaonanWang98 <hwang@amd.com>

github-actions · 2026-06-18T12:45:19Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3802 --add-label <label>`.

Signed-off-by: HaonanWang98 <hwang@amd.com>

Split gfx1250 (MI450) allreduce kernel and dispatch logic out of the shared custom_all_reduce.cuh into a self-contained compilation unit (custom_all_reduce_gfx1250.cuh/.cu) with its own JIT module (module_custom_all_reduce_gfx1250). This avoids CK header incompatibility on gfx1250 and lets each arch own its Signal struct size (gfx1250: kMaxBlocks=256, old arch: kMaxBlocks=80).

Old arch code is unchanged except for removing gfx1250-specific code and reverting kMaxBlocks/kMaxBlocksLegacy back to a single kMaxBlocks=80.

Python side selects the correct module at runtime based on gcnArchName.

The gfx1250 C++ API uses direct device pointers (no hipIpc) in preparation for a VMM-based IPC implementation (hipIpcOpenMemHandle is not available on gfx1250).

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: HaonanWang98 <hwang@amd.com>
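
The runtime selection described in the commit message above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper names `base_arch` and `select_allreduce_module` are hypothetical, and `module_custom_all_reduce` is an assumed name for the shared (older-arch) module. Only `module_custom_all_reduce_gfx1250` and the two `kMaxBlocks` values come from the commit message. The sketch also assumes `gcnArchName` may carry feature suffixes such as `:sramecc+:xnack-`, which are stripped before comparison.

```python
# Sketch only: helper names and DEFAULT_MODULE are assumptions for illustration.
GFX1250_MODULE = "module_custom_all_reduce_gfx1250"  # named in the commit message
DEFAULT_MODULE = "module_custom_all_reduce"  # assumed name of the shared module

# kMaxBlocks values quoted in the commit message, keyed by module.
K_MAX_BLOCKS = {GFX1250_MODULE: 256, DEFAULT_MODULE: 80}


def base_arch(gcn_arch_name: str) -> str:
    """Drop feature flags, e.g. 'gfx942:sramecc+:xnack-' -> 'gfx942'."""
    return gcn_arch_name.split(":", 1)[0].strip().lower()


def select_allreduce_module(gcn_arch_name: str) -> str:
    """Pick the JIT module name for the given gcnArchName string."""
    if base_arch(gcn_arch_name) == "gfx1250":
        return GFX1250_MODULE
    return DEFAULT_MODULE


if __name__ == "__main__":
    for name in ("gfx1250", "gfx942:sramecc+:xnack-", "gfx950"):
        module = select_allreduce_module(name)
        print(name, "->", module, "kMaxBlocks =", K_MAX_BLOCKS[module])
```

Comparing the base arch exactly, rather than with a substring check, keeps a future arch whose name merely starts with `gfx1250` from silently picking up the larger Signal struct.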

[fea]: gfx1250 allreduce poc

9dc59f8

Signed-off-by: HaonanWang98 <hwang@amd.com>

TennyWang1223 requested a review from a team June 18, 2026 12:44

HaonanWang98 and others added 6 commits June 18, 2026 14:25

[fix]: fix old arch block num

2f50c4a

Signed-off-by: HaonanWang98 <hwang@amd.com>

[fix] test script format

fe517f6

Signed-off-by: HaonanWang98 <hwang@amd.com>

[fix]: test script no use var

3a5876c

Signed-off-by: HaonanWang98 <hwang@amd.com>

[fix]: cross device comm replace by vmm on gfx1250

7e241ae

Signed-off-by: HaonanWang98 <hwang@amd.com>

[fix]: code format

00e5965

Signed-off-by: HaonanWang98 <hwang@amd.com>

TennyWang1223 wants to merge 7 commits into main from ar_poc_450

2 participants
