[gfx1250] Add cluster launch support with TDM multicast bandwidth tests #699

Open
jli-melchior wants to merge 4 commits into main from feat/gfx1250-cluster-launch

@jli-melchior jli-melchior commented Jun 17, 2026

Motivation

gfx1250 (MI450) introduces cluster launch and TDM (Tensor Data Mover) hardware features for inter-workgroup async DMA and L2 multicast. The existing mgpuLaunchClusterKernel in the FlyDSL runtime guards the cluster path with #ifdef hipLaunchAttributeClusterDimension, but that name is an enum value, not a preprocessor macro, so the preprocessor cannot see it and the guard is unreliable. There are also no end-to-end tests or performance benchmarks for cluster launch and TDM multicast.
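To make the #ifdef problem concrete, here is a minimal standalone sketch. It does not include any HIP header: `DemoLaunchAttributeID`, `demoLaunchAttributeClusterDimension`, and `DEMO_VERSION` are hypothetical stand-ins that only mirror the shape of the real enum and of a version macro. It shows that a guard on an enumerator name is not taken, while a guard on a macro is.

```cpp
#include <cassert>

// Hypothetical stand-in for a HIP-style attribute enum. The enumerator is a
// C++ name, so it only exists after preprocessing has finished.
enum DemoLaunchAttributeID {
  demoLaunchAttributeClusterDimension = 4,
};

// Hypothetical stand-in for a version macro such as HIP_VERSION.
#define DEMO_VERSION 70200000

// The preprocessor knows nothing about enumerators, so this guard is false
// even though the enumerator is declared just above.
#ifdef demoLaunchAttributeClusterDimension
constexpr bool kEnumGuardTaken = true;
#else
constexpr bool kEnumGuardTaken = false;
#endif

// A guard on a real macro behaves as intended.
#if defined(DEMO_VERSION) && (DEMO_VERSION >= 70200000)
constexpr bool kVersionGuardTaken = true;
#else
constexpr bool kVersionGuardTaken = false;
#endif

static_assert(!kEnumGuardTaken, "enum names are invisible to #ifdef");
static_assert(kVersionGuardTaken, "macro-based guard is evaluated normally");
```

If a header happened to also define a macro with the same name as the enumerator, the first guard would be taken, which is why the behavior depends on header details rather than on the API actually being present.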

Technical Details

Runtime fix (lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp):

  • Replace #ifdef hipLaunchAttributeClusterDimension with #if defined(HIP_VERSION) && (HIP_VERSION >= 70200000)
  • Simplify error handling: remove the cluster=(1,1,1) fallback logic — hipDrvLaunchKernelEx should work correctly on ROCm 7.2+
  • The #else branch for HIP < 7.2 retains the fallback to hipModuleLaunchKernel with no behavioral change
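For readers unfamiliar with the threshold value, the sketch below decodes it under the assumption that HIP_VERSION follows the usual `major * 10000000 + minor * 100000 + patch` layout from `hip_version.h`. The helper names (`encodeHipVersion`, `decodeHipVersion`, `HipVersionParts`) are illustrative only and are not part of HIP or FlyDSL.

```cpp
#include <cassert>

// Illustrative container for the three components of a HIP version number.
struct HipVersionParts {
  int majorPart;
  int minorPart;
  int patchPart;
};

// Assumed encoding: major * 10^7 + minor * 10^5 + patch.
constexpr int encodeHipVersion(int majorPart, int minorPart, int patchPart) {
  return majorPart * 10000000 + minorPart * 100000 + patchPart;
}

// Inverse of encodeHipVersion under the same assumed layout.
constexpr HipVersionParts decodeHipVersion(int version) {
  return HipVersionParts{version / 10000000, (version / 100000) % 100,
                         version % 100000};
}

// The threshold used in the guard corresponds to version 7.2.0.
static_assert(encodeHipVersion(7, 2, 0) == 70200000, "7.2.0 encodes to 70200000");
static_assert(decodeHipVersion(70200000).minorPart == 2, "minor component is 2");
```

Under this layout, `HIP_VERSION >= 70200000` selects any HIP release at or above 7.2.0, regardless of patch level.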

Cluster launch tests (tests/unit/test_cluster_launch_gfx1250.py):

  • vec_add smoke test: verifies hipDrvLaunchKernelEx + cluster dims end-to-end correctness
  • cluster_barrier test: verifies cluster_barrier() cross-WG synchronization

TDM multicast correctness tests (tests/unit/test_tdm_mcast_add_gfx1250.py):

  • TDM 2D async load + LDS add + buffer store, parametrized over cluster configs: (2,1), (1,2), (2,2)
  • Bandwidth comparison benchmark (@pytest.mark.benchmark)

TDM bandwidth benchmark (tests/perf/bench_tdm_bandwidth_gfx1250.py):

  • Three modes: read-only (pure TDM HBM read BW), unique (TDM load+add+store R/W BW), multicast (cluster multicast L2→LDS throughput)
  • L2 flush + CUDA event timing + IQR median, sweeping multiple grid and cluster configurations
  • Default --mode all runs all modes in sequence
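The "IQR median" statistic is not spelled out here, so the following is one plausible reading, not the benchmark's actual code: discard samples outside the Tukey fences (1.5 × IQR beyond the quartiles), then take the median of what remains. The benchmark itself is a Python script; the sketch uses C++ only to show the arithmetic. All function names are hypothetical, and the TB/s conversion assumes decimal units (1 TB = 10^12 bytes).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly interpolated quantile of a sorted, non-empty vector (0 <= q <= 1).
double quantileSorted(const std::vector<double>& sorted, double q) {
  const double pos = q * static_cast<double>(sorted.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
  const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
  return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Assumed meaning of "IQR median": drop samples outside
// [Q1 - 1.5*IQR, Q3 + 1.5*IQR], then return the median of the rest.
double iqrFilteredMedian(std::vector<double> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  const double q1 = quantileSorted(samples, 0.25);
  const double q3 = quantileSorted(samples, 0.75);
  const double iqr = q3 - q1;
  std::vector<double> kept;
  for (double x : samples) {
    if (x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr) kept.push_back(x);
  }
  return quantileSorted(kept, 0.5);  // kept is sorted because samples is
}

// Convert bytes moved and elapsed milliseconds to TB/s (decimal terabytes).
double bandwidthTBps(double bytesMoved, double elapsedMs) {
  return bytesMoved / (elapsedMs * 1e-3) / 1e12;
}
```

With per-iteration times of 1.0, 1.1, 0.9, 1.0 and one 10.0 ms outlier, the outlier falls outside the fences and the reported time is the median of the remaining four samples, so a single slow iteration does not skew the bandwidth figure.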

Cluster multicast GEMM (tests/unit/test_cluster_mcast_gemm_gfx1250.py):

  • WMMA GEMM + TDM multicast prototype, currently @pytest.mark.skip (JIT compilation hangs with cluster params, deferred to a follow-up PR)

Test Plan

  • python -m pytest tests/unit/test_cluster_launch_gfx1250.py -v --tb=short — cluster launch smoke + barrier
  • python -m pytest tests/unit/test_tdm_mcast_add_gfx1250.py -v --tb=short — TDM multicast correctness
  • python tests/perf/bench_tdm_bandwidth_gfx1250.py — three-mode bandwidth benchmark
  • Verify mgpuLaunchClusterKernel does not affect existing kernel launches on non-gfx1250 architectures

Test Result

Tested on gfx1250 hardware:

  • Cluster launch: vec_add and barrier tests pass
  • TDM multicast: correctness verified for (2,1), (1,2), (2,2) cluster configs
  • Bandwidth benchmark: read-only mode reaches ~19.9 TB/s at 256x256 grid (90.3% of 22 TB/s peak); unique mode reaches ~16 TB/s (72.8%)

Submission Checklist

…tests

Replace the broken #ifdef on the enum hipLaunchAttributeClusterDimension with a
HIP_VERSION >= 70200000 check in mgpuLaunchClusterKernel, matching the CUDA-side
CUDA_VERSION pattern.

Add cluster launch tests (vec_add smoke + cluster_barrier) that
exercise the hipDrvLaunchKernelEx runtime path end-to-end.

TDM multicast GEMM test split to a separate file with
@pytest.mark.skip (JIT compilation hangs with cluster params,
deferred to another PR).
@jli-melchior jli-melchior force-pushed the feat/gfx1250-cluster-launch branch from 7d0cd77 to cc06947 Compare June 17, 2026 04:01
Add test_tdm_mcast_add_gfx1250.py exercising TDM async DMA loads with
cluster multicast masks and elementwise add. Includes correctness tests
parametrized over cluster configs and a bandwidth comparison test.

Add bench_tdm_load_gfx1250.py with three modes:
  - read-only: pure TDM HBM read bandwidth (no store)
  - unique: TDM load + add + store with unique tiles per WG
  - shared: GEMM-like shared tiles with cluster multicast throughput
@jli-melchior jli-melchior force-pushed the feat/gfx1250-cluster-launch branch from cc06947 to 54ea818 Compare June 17, 2026 06:08
@jli-melchior jli-melchior requested a review from aoli26 June 17, 2026 06:09
Comment thread lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp Outdated
Comment thread tests/unit/test_cluster_launch_gfx1250.py
…tion for cluster launch

The HIP_VERSION >= 70200000 threshold could not be verified from public
ROCm releases — hipLaunchAttributeClusterDimension is absent from the
public ROCm 7.2 headers. Use check_cxx_source_compiles to detect the
API at build time, matching the approach used by CK and Tensile.
hipLaunchAttributeID id = hipLaunchAttributeClusterDimension;
return 0;
}
" HIP_HAS_CLUSTER_LAUNCH)

Not a good idea. We use the same wheel for all hip versions.
