
[fea]: gfx1250 allreduce poc #3802

Open. TennyWang1223 wants to merge 7 commits into main from ar_poc_450



@TennyWang1223

Contributor

Motivation

Allreduce proof of concept (PoC) for gfx1250.

Technical Details

Naive implementation; no performance optimization yet.

Test Plan

Run the test script with tp4 and tp2.

Test Result

Passed.

Submission Checklist

Signed-off-by: HaonanWang98 <hwang@amd.com>
TennyWang1223 requested a review from a team on June 18, 2026 12:44
@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3802 --add-label <label>

HaonanWang98 and others added 6 commits June 18, 2026 14:25
Signed-off-by: HaonanWang98 <hwang@amd.com>
Signed-off-by: HaonanWang98 <hwang@amd.com>
Signed-off-by: HaonanWang98 <hwang@amd.com>
Split gfx1250 (MI450) allreduce kernel and dispatch logic out of the
shared custom_all_reduce.cuh into a self-contained compilation unit
(custom_all_reduce_gfx1250.cuh/.cu) with its own JIT module
(module_custom_all_reduce_gfx1250). This avoids CK header
incompatibility on gfx1250 and lets each arch own its Signal struct
size (gfx1250: kMaxBlocks=256, old arch: kMaxBlocks=80).

Old arch code is unchanged except for removing gfx1250-specific code
and reverting kMaxBlocks/kMaxBlocksLegacy back to a single kMaxBlocks=80.

Python side selects the correct module at runtime based on gcnArchName.
The gfx1250 C++ API uses direct device pointers (no hipIpc) in
preparation for a VMM-based IPC implementation (hipIpcOpenMemHandle is
not available on gfx1250).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: HaonanWang98 <hwang@amd.com>
Signed-off-by: HaonanWang98 <hwang@amd.com>
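
The commit message above says the Python side picks the JIT module at runtime from gcnArchName. The sketch below shows one way such a selection could look. It is not the code in this PR: only the name module_custom_all_reduce_gfx1250 comes from the commit message, while DEFAULT_MODULE, base_arch, and select_allreduce_module are hypothetical names chosen for illustration.

```python
# Hypothetical sketch of arch-based module selection; not the PR's code.
GFX1250_MODULE = "module_custom_all_reduce_gfx1250"  # named in the commit message
DEFAULT_MODULE = "module_custom_all_reduce"  # assumed name for the shared older-arch module


def base_arch(gcn_arch_name: str) -> str:
    """Strip feature suffixes, e.g. 'gfx942:sramecc+:xnack-' -> 'gfx942'."""
    return gcn_arch_name.split(":", 1)[0].strip().lower()


def select_allreduce_module(gcn_arch_name: str) -> str:
    """Return the JIT module name to load for the given gcnArchName string."""
    if base_arch(gcn_arch_name) == "gfx1250":
        return GFX1250_MODULE
    return DEFAULT_MODULE
```

On ROCm builds of PyTorch, the arch string is typically available as torch.cuda.get_device_properties(device).gcnArchName. Comparing only the base arch keeps the check stable when the string carries feature suffixes such as ":xnack-".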