
Add: a5 chevron launch path for AICore SIMT validation #734

Open

ChaoZheng109 wants to merge 2 commits into hw-native-sys:main from ChaoZheng109:fix-a5-aicore-local-memory-size


Conversation


ChaoZheng109 (Collaborator) commented May 11, 2026

Summary

Two commits that together unblock SIMT execution on a5:

  1. Fix: set AICore localMemorySize for a5 SIMT launch — define PLATFORM_AICORE_LOCAL_MEMORY_SIZE = 229376 (224 KB) and pass it through rtTaskCfgInfo_t::localMemorySize in the existing rtKernelLaunchWithHandleV2 path.
  2. Add: a5 chevron launch path for AICore SIMT validation — parallel launch path using the bisheng chevron syntax (kernel<<<numBlocks, dynamic_shmem_sz, stream>>>). The kernel body delegates to the same aicore_execute() handshake loop so AICPU keeps dispatching tasks via registers. Enabled with SIMPLER_USE_CHEVRON_LAUNCH=1; default behavior unchanged.

The chevron form is required for SIMT because it lowers to LaunchAscendKernel, which programs the AICore local memory window through the SIMT-specific runtime path.

aicore_execute moved from .cpp to header

The chevron mix kernel needs aicore_execute visible in the same TU it is called from — ld.lld (inside bisheng) cannot resolve .cube/.vector references across separately compiled TUs, and bisheng -c -o file.o rejects multiple input sources. So:

  • src/a5/runtime/{tensormap_and_ringbuffer,host_build_graph}/aicore/aicore_executor.cpp → renamed to aicore_executor.h, signature changed to inline __aicore__.
  • The legacy kernel.cpp and the new chevron_launch.cpp both #include "aicore_executor.h" and emit their own instantiation. The host SO uses one launch path at a time, so the device-side duplication is benign.
  • build_config.py exposes each runtime's aicore/ directory to the aicore and host targets so both compilers find the header. No Python build-driver plumbing needed.

Testing

  • Build on a5 hardware (CANN 9.x): `pip install --no-build-isolation .` produces chevron_launch.o and host_runtime.so
  • `nm libhost_runtime.so | grep launch_aicore_chevron` shows the symbol
  • The existing a5 hardware test still passes without the env var
  • The same test with `SIMPLER_USE_CHEVRON_LAUNCH=1` exercises the chevron path; the log shows `using chevron (<<<>>>) launch path`


gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new constant PLATFORM_AICORE_LOCAL_MEMORY_SIZE (224 KB) in platform_config.h and updates the DeviceRunner::launch_aicore_kernel function in device_runner.cpp to utilize this constant when configuring the AICore task's local memory size. I have no feedback to provide.

cfg.localMemorySize was left unset in launch_aicore_kernel, leaving the
field at zero so the runtime allocated no local memory for the AICore
task and SIMT execution failed. Define
PLATFORM_AICORE_LOCAL_MEMORY_SIZE (224 KB) in platform_config.h and pass
it through rtTaskCfgInfo_t so every a5 AICore launch reserves the
required local memory.
ChaoZheng109 force-pushed the fix-a5-aicore-local-memory-size branch from 6eeb806 to 0e39aa9 on May 11, 2026 at 07:30
ChaoZheng109 changed the title from "Fix: set AICore localMemorySize for a5 SIMT launch" to "Add: a5 chevron launch path for AICore SIMT validation" on May 11, 2026
ChaoZheng109 force-pushed the fix-a5-aicore-local-memory-size branch 9 times, most recently from a5316aa to 7724cb5 on May 11, 2026 at 11:11
Introduce a parallel AICore launch path on a5 that uses the bisheng
chevron syntax (`kernel<<<numBlocks, dynamic_shmem_sz, stream>>>`)
instead of rtKernelLaunchWithHandleV2. The chevron form is required
for SIMT kernels because it lowers to LaunchAscendKernel, which in
turn programs the AICore local memory window via the SIMT-specific
runtime path.

- chevron_launch.cpp: __global__ __aicore__ entry that delegates to
  the existing runtime aicore_execute() handshake loop, plus an
  extern "C" host wrapper that calls
  `aicore_chevron_entry<<<blockDim, PLATFORM_AICORE_LOCAL_MEMORY_SIZE,
  stream>>>(runtime)`. Compiled with `bisheng --asc-aicore-lang
  --npu-arch=dav-c310` into a host-linkable .o whose .ascend.kernel.*
  section embeds the AICore ELF.
- host/CMakeLists.txt: add a custom command that runs bisheng on
  chevron_launch.cpp and links the resulting object into
  host_runtime.so.
- device_runner.cpp: env-gated branch in launch_aicore_kernel —
  `SIMPLER_USE_CHEVRON_LAUNCH=1` selects the chevron path; default
  behavior unchanged.

Refactor aicore_execute into a header so chevron_launch.cpp can link
cleanly. The chevron mix kernel needs `aicore_execute` visible on the
same TU it is called from — the device-side linker (ld.lld inside
bisheng) cannot resolve .cube/.vector references across separately
compiled TUs, and bisheng `-c -o file.o` rejects multiple inputs. Move
the body from runtime/{tensormap_and_ringbuffer,host_build_graph}/
aicore/aicore_executor.cpp into a new aicore_executor.h with
`inline __aicore__` linkage. Both the legacy AICore kernel.cpp and
chevron_launch.cpp #include the header and emit their own
instantiation; the host SO uses one launch path at a time, so the
device-side duplication is benign. build_config.py exposes each
runtime's aicore/ dir to the aicore and host targets so both
compilers find the header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoZheng109 force-pushed the fix-a5-aicore-local-memory-size branch from 7724cb5 to 45d8d4c on May 11, 2026 at 11:37