Add: validate block_dim against stream resource limit via aclrtGetStreamResLimit by doraemonmj · Pull Request #760 · hw-native-sys/simpler

doraemonmj · 2026-05-12T11:26:58Z

Summary

- Add DeviceRunner::validate_block_dim() (a2a3 + a5 onboard): query the
  stream's CUBE/VECTOR core limits via aclrtGetStreamResLimit and reject a
  block_dim that exceeds hardware capacity, with a clear error showing
  max_block_dim, cube, and vector counts. max_block_dim is derived
  from PLATFORM_AIC_CORES_PER_BLOCKDIM / PLATFORM_AIV_CORES_PER_BLOCKDIM
  (not a hardcoded ratio), and a zero core count is treated as "query
  unavailable".
- When aclrtGetStreamResLimit is unavailable (older firmware) or reports no
  cores, fall back to the static PLATFORM_MAX_BLOCKDIM cap so block_dim
  stays bounded; the error stays phrased in block_dim terms.
- Consolidate the onboard block_dim validation (lower bound + capacity
  check) into validate_block_dim(), called from DeviceRunner::run() once
  the device is initialized; this removes the old inline checks.
- Drop the block_dim % scheduler_thread_num divisibility check from both
  onboard and sim DeviceRunner::run(). Both runtimes already tolerate an
  uneven split: host_build_graph (AicpuExecutor::assign_cores_to_threads)
  and tensormap_and_ringbuffer (SchedulerContext::assign_cores_to_threads
  / reassign_cores_for_all_threads) assign cores to scheduler threads
  cluster-aligned round-robin and size each thread's tracker from its actual
  cluster count, so block_dim need not divide evenly. Sim keeps the static
  PLATFORM_MAX_BLOCKDIM bound (it has no stream resource query), so onboard
  and sim validation stay consistent.
- Prevents handshake deadlock by failing fast with actionable diagnostics.
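
For readers skimming the description, the capacity rule above can be sketched roughly as follows. This is an illustration only, not the code in device_runner.cpp: the constant values, the `StreamCoreLimits` struct, and both function names are hypothetical, the call to aclrtGetStreamResLimit is deliberately left out (its result is modeled as an optional), and treating *either* core count being zero as "query unavailable" is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the platform constants named in the PR.
// The real values live in the platform headers and may differ.
constexpr int kAicCoresPerBlockDim = 1;   // PLATFORM_AIC_CORES_PER_BLOCKDIM (assumed value)
constexpr int kAivCoresPerBlockDim = 2;   // PLATFORM_AIV_CORES_PER_BLOCKDIM (assumed value)
constexpr int kPlatformMaxBlockDim = 24;  // PLATFORM_MAX_BLOCKDIM (assumed value)

// Hypothetical result of the stream resource query.
// std::nullopt models "the query failed or is not supported".
struct StreamCoreLimits {
    int cube;
    int vector;
};

// Largest block_dim the stream can host under the assumptions above.
int max_block_dim_for(const std::optional<StreamCoreLimits>& limits) {
    // A failed query or a zero core count falls back to the static cap.
    if (!limits || limits->cube <= 0 || limits->vector <= 0) {
        return kPlatformMaxBlockDim;
    }
    // Each block needs its share of both core types, so the scarcer one wins.
    const int max_dim = std::min(limits->cube / kAicCoresPerBlockDim,
                                 limits->vector / kAivCoresPerBlockDim);
    assert(max_dim >= 0);
    return max_dim;
}

// Lower bound plus capacity check, as one predicate.
bool block_dim_is_valid(int block_dim, const std::optional<StreamCoreLimits>& limits) {
    return block_dim >= 1 && block_dim <= max_block_dim_for(limits);
}
```

With these assumed constants, a stream reporting 20 cube and 30 vector cores would cap block_dim at 15, because the vector cores run out first.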

Testing

- Sim runtimes (a2a3sim / a5sim libhost_runtime.so) rebuild cleanly
  with the changes; pre-commit (clang-format, clang-tidy, cpplint) passes.
- Hardware: confirm an over-capacity block_dim is rejected with the new
  max_block_dim=... cube=... vector=... diagnostic instead of a
  handshake hang; confirm the PLATFORM_MAX_BLOCKDIM fallback path on
  firmware without aclrtGetStreamResLimit.
- Hardware: a non-divisible block_dim (e.g. block_dim=5,
  aicpu_thread_num=4) runs to completion (sanity-check the round-robin
  core split now that the divisibility gate is gone).
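
To make the non-divisible case concrete, here is a minimal sketch of a round-robin split. It is a simplification, not the logic in assign_cores_to_threads: the function name is hypothetical, it assumes one cluster per block_dim unit, and it assumes every AICPU thread acts as a scheduler thread.

```cpp
#include <cassert>
#include <vector>

// Hands out `block_dim` clusters to `thread_num` scheduler threads
// round-robin and returns how many clusters each thread ends up owning.
std::vector<int> clusters_per_thread(int block_dim, int thread_num) {
    assert(block_dim >= 0 && thread_num > 0);
    std::vector<int> counts(thread_num, 0);
    for (int cluster = 0; cluster < block_dim; ++cluster) {
        counts[cluster % thread_num] += 1;  // cluster i goes to thread i mod N
    }
    return counts;
}
```

Under those assumptions, block_dim=5 across 4 threads gives the first thread two clusters and the other three one each. No thread is left waiting on a cluster it was never assigned, provided each thread sizes its tracker from its own count.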

Review notes addressed

- Magic /2 AIC:AIV ratio → platform constants.
- Lost PLATFORM_MAX_BLOCKDIM fallback restored (was only a LOG_WARN).
- Zero core count from a "successful" query no longer reports max_block_dim=0.
- Divisibility check removed from sim too, so onboard/sim no longer diverge.
- validate_block_dim doc comment updated to match the actual contract.
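
The zero-core-count item is about keeping the diagnostic honest. A rough sketch of the idea, with a hypothetical function name and hypothetical message wording (the real log text in the PR may differ): core counts are printed only when the query actually produced them, and the static cap is named otherwise.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Builds the rejection message for an over-capacity block_dim.
// `cube` and `vector` are the queried core counts; non-positive values
// mean the query gave nothing usable and `max_block_dim` is the static cap.
std::string capacity_error(int block_dim, int max_block_dim, int cube, int vector) {
    assert(block_dim > max_block_dim);  // only called on the failure path
    std::ostringstream msg;
    msg << "block_dim=" << block_dim << " exceeds ";
    if (cube > 0 && vector > 0) {
        msg << "max_block_dim=" << max_block_dim
            << " cube=" << cube << " vector=" << vector;
    } else {
        // Avoid printing misleading zeros; name the static bound instead.
        msg << "PLATFORM_MAX_BLOCKDIM=" << max_block_dim
            << " (stream core limits unavailable)";
    }
    return msg.str();
}
```

Either branch keeps the message in block_dim terms, so the caller sees the same kind of number they passed in.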

Notes

- PR-title typo "GetSteamResLimit" fixed.
- Runtime::get_orch_built_on_host() is kept: this PR only drops its one
  caller in the platform layer (the divisibility check). The flag is still
  load-bearing inside tensormap_and_ringbuffer (aicpu_executor.cpp,
  scheduler_cold_path.cpp orchestrator_done_, runtime_maker.cpp
  set_orch_built_on_host) and host_build_graph (runtime.h returns
  true); removing it is out of scope and would change runtime behavior.

gemini-code-assist

Code Review

This pull request refactors block_dim validation into a dedicated validate_block_dim method across DeviceRunner implementations, introducing dynamic resource limit checks for Cube and Vector cores via aclrtGetStreamResLimit. Review feedback points out that critical divisibility checks for scheduler threads were lost during the consolidation, which could lead to handshake deadlocks or logic errors. Additionally, the reviewer noted that error logs may be misleading by reporting zero available cores if the hardware resource query fails, suggesting more conditional logging.

…eamResLimit

- Add DeviceRunner::validate_block_dim() (a2a3 + a5 onboard): query the stream's CUBE/VECTOR core limits via aclrtGetStreamResLimit and reject block_dim that exceeds hardware capacity, with a clear error showing max_block_dim, cube and vector counts. Derive max_block_dim from PLATFORM_AIC_CORES_PER_BLOCKDIM / PLATFORM_AIV_CORES_PER_BLOCKDIM rather than a hardcoded ratio, and treat a zero core count as "query unavailable".
- When aclrtGetStreamResLimit is unavailable (older firmware) or reports no cores, fall back to the static PLATFORM_MAX_BLOCKDIM cap so block_dim stays bounded, with the error still phrased in block_dim terms.
- Consolidate the onboard block_dim validation (lower bound + capacity check) into validate_block_dim(), called from DeviceRunner::run() once the device is initialized; remove the old inline checks.
- Drop the block_dim % scheduler_thread_num divisibility check from both onboard and sim DeviceRunner::run(): the scheduler assigns cores to threads cluster-aligned round-robin, so block_dim need not divide evenly. Sim keeps the static PLATFORM_MAX_BLOCKDIM bound (it has no stream resource query).
- Prevents handshake deadlock by failing fast with actionable diagnostics.

`Runtime::orch_built_on_host_` distinguished the host-built-graph runtime from the (removed) aicpu_build_graph one. With only host_build_graph and tensormap_and_ringbuffer left, the flag is a per-runtime constant:

- host_build_graph hard-coded `get_orch_built_on_host()` to `true` and, since #760 removed the last platform-layer caller, nothing reads it, so delete it.
- tensormap_and_ringbuffer always sets it to `false` in bind_prepared_to_runtime_impl (the runtime ctor's `= true` was overwritten before any reader saw it), so every `get_orch_built_on_host()` site there takes the device-orchestration branch. Inline that: drop the field, getter, setter; the AICPU executor's "host orchestration, no-op" dead branch becomes a plain scope; the SM-header spin-wait and the rt-destroy guard lose their always-true `!get_orch_built_on_host()` prefix; the scheduler's `orchestrator_done_` init becomes a literal `false`.

No behavior change: every removed branch was statically unreachable.

gemini-code-assist Bot reviewed May 12, 2026


Comment thread src/a2a3/platform/onboard/host/device_runner.cpp

Comment thread src/a5/platform/onboard/host/device_runner.cpp

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated

Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated

doraemonmj force-pushed the partgoodaicore branch 2 times, most recently from aad8b9a to 25d9766 Compare May 12, 2026 11:57

ChaoWao force-pushed the partgoodaicore branch from 25d9766 to 983337f Compare May 13, 2026 02:10

ChaoWao changed the title ~~Add: validate block_dim against stream resource limit via aclrtGetSteamResLimit~~ Add: validate block_dim against stream resource limit via aclrtGetStreamResLimit May 13, 2026

ChaoWao approved these changes May 13, 2026


ChaoWao merged commit e39bfd4 into hw-native-sys:main May 13, 2026
40 of 42 checks passed

doraemonmj mentioned this pull request May 13, 2026

[Bug] Hard-coded block_dim=24 in execute_on_device hangs runs on devices with fewer usable cores hw-native-sys/pypto#1173

Open

ChaoWao mentioned this pull request May 13, 2026

Refactor: drop vestigial orch_built_on_host flag #766

Merged

2 tasks

ChaoWao merged 1 commit into hw-native-sys:main from doraemonmj:partgoodaicore
