Add: validate block_dim against stream resource limit via aclrtGetStreamResLimit #760

Merged
ChaoWao merged 1 commit into hw-native-sys:main from doraemonmj:partgoodaicore
May 13, 2026

doraemonmj commented May 12, 2026

Summary

  • Add DeviceRunner::validate_block_dim() (a2a3 + a5 onboard): query the
    stream's CUBE/VECTOR core limits via aclrtGetStreamResLimit and reject a
    block_dim that exceeds hardware capacity, with a clear error showing
    max_block_dim, cube, and vector counts. max_block_dim is derived
    from PLATFORM_AIC_CORES_PER_BLOCKDIM / PLATFORM_AIV_CORES_PER_BLOCKDIM
    (not a hardcoded ratio), and a zero core count is treated as "query
    unavailable".
  • When aclrtGetStreamResLimit is unavailable (older firmware) or reports no
    cores, fall back to the static PLATFORM_MAX_BLOCKDIM cap so block_dim
    stays bounded; the error is still phrased in block_dim terms.
  • Consolidate the onboard block_dim validation (lower bound + capacity
    check) into validate_block_dim(), called from DeviceRunner::run() once
    the device is initialized; removes the old inline checks.
  • Drop the block_dim % scheduler_thread_num divisibility check from both
    onboard and sim DeviceRunner::run(). Both runtimes already tolerate an
    uneven split: host_build_graph (AicpuExecutor::assign_cores_to_threads)
    and tensormap_and_ringbuffer (SchedulerContext::assign_cores_to_threads
    / reassign_cores_for_all_threads) assign cores to scheduler threads
    cluster-aligned round-robin and size each thread's tracker from its actual
    cluster count, so block_dim need not divide evenly. Sim keeps the static
    PLATFORM_MAX_BLOCKDIM bound (it has no stream resource query), so onboard
    and sim validation stay consistent.
  • Prevents handshake deadlock by failing fast with actionable diagnostics.
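
For illustration only, the capacity check described in the summary can be sketched as below. This is not the code from this PR: `check_block_dim`, `StreamCoreLimits`, and the three `k...` constants are hypothetical names, the constant values are assumed, and the result of aclrtGetStreamResLimit is modeled as plain input data rather than called directly. The sketch also assumes that a zero in either core count means the query gave no usable answer. It needs C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-ins for PLATFORM_AIC_CORES_PER_BLOCKDIM,
// PLATFORM_AIV_CORES_PER_BLOCKDIM and PLATFORM_MAX_BLOCKDIM. The values are
// assumptions chosen for illustration, not taken from the platform headers.
constexpr std::uint32_t kAicCoresPerBlockDim = 1;
constexpr std::uint32_t kAivCoresPerBlockDim = 2;
constexpr std::uint32_t kStaticMaxBlockDim = 24;

// Result of the stream resource query, modeled as plain data. std::nullopt
// stands for "the query is unavailable or failed".
struct StreamCoreLimits {
    std::uint32_t cube;
    std::uint32_t vector;
};

// Returns an empty string when block_dim is acceptable, otherwise an error
// message phrased in block_dim terms.
std::string check_block_dim(int block_dim,
                            const std::optional<StreamCoreLimits>& limits) {
    if (block_dim < 1) {
        return "block_dim=" + std::to_string(block_dim) + " must be >= 1";
    }
    const std::uint32_t requested = static_cast<std::uint32_t>(block_dim);

    // Assumption: a zero in either count is treated as "query unavailable".
    const bool usable =
        limits.has_value() && limits->cube > 0 && limits->vector > 0;
    if (!usable) {
        // Fallback path: bound block_dim by the static cap.
        if (requested > kStaticMaxBlockDim) {
            return "block_dim=" + std::to_string(block_dim) +
                   " exceeds max_block_dim=" +
                   std::to_string(kStaticMaxBlockDim) + " (static fallback)";
        }
        return "";
    }

    // Each block_dim unit consumes a fixed number of cube and vector cores,
    // so the scarcer resource determines the limit.
    const std::uint32_t max_block_dim =
        std::min(limits->cube / kAicCoresPerBlockDim,
                 limits->vector / kAivCoresPerBlockDim);
    if (requested > max_block_dim) {
        return "block_dim=" + std::to_string(block_dim) +
               " exceeds max_block_dim=" + std::to_string(max_block_dim) +
               " cube=" + std::to_string(limits->cube) +
               " vector=" + std::to_string(limits->vector);
    }
    return "";
}
```

With these assumed constants, a stream reporting cube=20 and vector=40 gives max_block_dim=20, so block_dim=21 is rejected while block_dim=20 passes. A `{0, 0}` result takes the static fallback path instead of reporting max_block_dim=0.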

Testing

  • Sim runtimes (a2a3sim / a5sim libhost_runtime.so) rebuild cleanly
    with the changes; pre-commit (clang-format, clang-tidy, cpplint) passes.
  • Hardware: confirm an over-capacity block_dim is rejected with the new
    max_block_dim=... cube=... vector=... diagnostic instead of a
    handshake hang; confirm the PLATFORM_MAX_BLOCKDIM fallback path on
    firmware without aclrtGetStreamResLimit.
  • Hardware: a non-divisible block_dim (e.g. block_dim=5,
    aicpu_thread_num=4) runs to completion (sanity-check the round-robin
    core split now that the divisibility gate is gone).
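
As a rough picture of why a non-divisible block_dim is tolerated, here is a minimal sketch of a round-robin split. It is not the actual `assign_cores_to_threads` from either runtime: the function name is hypothetical, and it assumes one cluster per block_dim unit and ignores the cores inside each cluster.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: hands out cluster indices 0..block_dim-1 to scheduler
// threads in round-robin order and returns the clusters owned by each thread.
std::vector<std::vector<int>> assign_clusters_round_robin(int block_dim,
                                                          int thread_num) {
    assert(block_dim >= 1 && thread_num >= 1);
    std::vector<std::vector<int>> owned(static_cast<std::size_t>(thread_num));
    for (int cluster = 0; cluster < block_dim; ++cluster) {
        // Thread index wraps around, so leftover clusters go to the first threads.
        owned[static_cast<std::size_t>(cluster % thread_num)].push_back(cluster);
    }
    return owned;
}
```

For block_dim=5 and 4 threads, thread 0 owns clusters 0 and 4, and threads 1 to 3 own one cluster each. A per-thread tracker sized from `owned[t].size()` rather than from `block_dim / thread_num` never needs the split to be even.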

Review notes addressed

  • Magic /2 AIC:AIV ratio → platform constants.
  • Lost PLATFORM_MAX_BLOCKDIM fallback restored (was only a LOG_WARN).
  • Zero core count from a "successful" query no longer reports max_block_dim=0.
  • Divisibility check removed from sim too, so onboard/sim no longer diverge.
  • validate_block_dim doc comment updated to match the actual contract.

Notes

  • PR-title typo "GetSteamResLimit" fixed.
  • Runtime::get_orch_built_on_host() is kept — this PR only drops its one
    caller in the platform layer (the divisibility check). The flag is still
    load-bearing inside tensormap_and_ringbuffer (aicpu_executor.cpp,
    scheduler_cold_path.cpp orchestrator_done_, runtime_maker.cpp
    set_orch_built_on_host) and host_build_graph (runtime.h returns
    true); removing it is out of scope and would change runtime behavior.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors block_dim validation into a dedicated validate_block_dim method across DeviceRunner implementations, introducing dynamic resource limit checks for Cube and Vector cores via aclrtGetStreamResLimit. Review feedback points out that critical divisibility checks for scheduler threads were lost during the consolidation, which could lead to handshake deadlocks or logic errors. Additionally, the reviewer noted that error logs may be misleading by reporting zero available cores if the hardware resource query fails, suggesting more conditional logging.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
Comment thread src/a5/platform/onboard/host/device_runner.cpp
Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated
doraemonmj force-pushed the partgoodaicore branch 2 times, most recently from aad8b9a to 25d9766 on May 12, 2026 11:57
…eamResLimit

- Add DeviceRunner::validate_block_dim() (a2a3 + a5 onboard): query the
  stream's CUBE/VECTOR core limits via aclrtGetStreamResLimit and reject
  block_dim that exceeds hardware capacity, with a clear error showing
  max_block_dim, cube and vector counts. Derive max_block_dim from
  PLATFORM_AIC_CORES_PER_BLOCKDIM / PLATFORM_AIV_CORES_PER_BLOCKDIM rather
  than a hardcoded ratio, and treat a zero core count as "query unavailable".
- When aclrtGetStreamResLimit is unavailable (older firmware) or reports no
  cores, fall back to the static PLATFORM_MAX_BLOCKDIM cap so block_dim stays
  bounded, with the error still phrased in block_dim terms.
- Consolidate the onboard block_dim validation (lower bound + capacity check)
  into validate_block_dim(), called from DeviceRunner::run() once the device
  is initialized; remove the old inline checks.
- Drop the block_dim % scheduler_thread_num divisibility check from both
  onboard and sim DeviceRunner::run(): the scheduler assigns cores to threads
  cluster-aligned round-robin, so block_dim need not divide evenly. Sim keeps
  the static PLATFORM_MAX_BLOCKDIM bound (it has no stream resource query).
- Prevents handshake deadlock by failing fast with actionable diagnostics.
ChaoWao changed the title from "Add: validate block_dim against stream resource limit via aclrtGetSteamResLimit" to "Add: validate block_dim against stream resource limit via aclrtGetStreamResLimit" on May 13, 2026
ChaoWao merged commit e39bfd4 into hw-native-sys:main on May 13, 2026
40 of 42 checks passed
poursoul pushed a commit that referenced this pull request May 13, 2026
`Runtime::orch_built_on_host_` distinguished the host-built-graph runtime
from the (removed) aicpu_build_graph one. With only host_build_graph and
tensormap_and_ringbuffer left, the flag is a per-runtime constant:

- host_build_graph hard-coded `get_orch_built_on_host()` to `true` and, since
  #760 removed the last platform-layer caller, nothing reads it — delete it.
- tensormap_and_ringbuffer always sets it to `false` in
  bind_prepared_to_runtime_impl (the runtime ctor's `= true` was overwritten
  before any reader saw it), so every `get_orch_built_on_host()` site there
  takes the device-orchestration branch. Inline that: drop the field, getter,
  setter; the AICPU executor's "host orchestration, no-op" dead branch
  becomes a plain scope; the SM-header spin-wait and the rt-destroy guard
  lose their always-true `!get_orch_built_on_host()` prefix; the scheduler's
  `orchestrator_done_` init becomes a literal `false`.

No behavior change — every removed branch was statically unreachable.