Add: validate block_dim against stream resource limit via aclrtGetStreamResLimit #760
Merged
Code Review
This pull request refactors block_dim validation into a dedicated validate_block_dim method across DeviceRunner implementations, introducing dynamic resource limit checks for Cube and Vector cores via aclrtGetStreamResLimit. Review feedback points out that critical divisibility checks for scheduler threads were lost during the consolidation, which could lead to handshake deadlocks or logic errors. Additionally, the reviewer noted that error logs may be misleading by reporting zero available cores if the hardware resource query fails, suggesting more conditional logging.
aad8b9a to 25d9766
…eamResLimit

- Add DeviceRunner::validate_block_dim() (a2a3 + a5 onboard): query the stream's CUBE/VECTOR core limits via aclrtGetStreamResLimit and reject block_dim that exceeds hardware capacity, with a clear error showing max_block_dim, cube and vector counts. Derive max_block_dim from PLATFORM_AIC_CORES_PER_BLOCKDIM / PLATFORM_AIV_CORES_PER_BLOCKDIM rather than a hardcoded ratio, and treat a zero core count as "query unavailable".
- When aclrtGetStreamResLimit is unavailable (older firmware) or reports no cores, fall back to the static PLATFORM_MAX_BLOCKDIM cap so block_dim stays bounded, with the error still phrased in block_dim terms.
- Consolidate the onboard block_dim validation (lower bound + capacity check) into validate_block_dim(), called from DeviceRunner::run() once the device is initialized; remove the old inline checks.
- Drop the block_dim % scheduler_thread_num divisibility check from both onboard and sim DeviceRunner::run(): the scheduler assigns cores to threads cluster-aligned round-robin, so block_dim need not divide evenly. Sim keeps the static PLATFORM_MAX_BLOCKDIM bound (it has no stream resource query).
- Prevents handshake deadlock by failing fast with actionable diagnostics.
ChaoWao approved these changes on May 13, 2026
poursoul pushed a commit that referenced this pull request on May 13, 2026
`Runtime::orch_built_on_host_` distinguished the host-built-graph runtime from the (removed) aicpu_build_graph one. With only host_build_graph and tensormap_and_ringbuffer left, the flag is a per-runtime constant:

- host_build_graph hard-coded `get_orch_built_on_host()` to `true` and, since #760 removed the last platform-layer caller, nothing reads it; delete it.
- tensormap_and_ringbuffer always sets it to `false` in bind_prepared_to_runtime_impl (the runtime ctor's `= true` was overwritten before any reader saw it), so every `get_orch_built_on_host()` site there takes the device-orchestration branch. Inline that: drop the field, getter, setter; the AICPU executor's "host orchestration, no-op" dead branch becomes a plain scope; the SM-header spin-wait and the rt-destroy guard lose their always-true `!get_orch_built_on_host()` prefix; the scheduler's `orchestrator_done_` init becomes a literal `false`.

No behavior change: every removed branch was statically unreachable.
Summary
- Add `DeviceRunner::validate_block_dim()` (a2a3 + a5 onboard): query the stream's CUBE/VECTOR core limits via `aclrtGetStreamResLimit` and reject a `block_dim` that exceeds hardware capacity, with a clear error showing `max_block_dim`, `cube`, and `vector` counts. `max_block_dim` is derived from `PLATFORM_AIC_CORES_PER_BLOCKDIM` / `PLATFORM_AIV_CORES_PER_BLOCKDIM` (not a hardcoded ratio), and a zero core count is treated as "query unavailable".
- When `aclrtGetStreamResLimit` is unavailable (older firmware) or reports no cores, fall back to the static `PLATFORM_MAX_BLOCKDIM` cap so `block_dim` stays bounded; the error stays phrased in `block_dim` terms.
- Consolidate the onboard `block_dim` validation (lower bound + capacity check) into `validate_block_dim()`, called from `DeviceRunner::run()` once the device is initialized; this removes the old inline checks.
- Drop the `block_dim % scheduler_thread_num` divisibility check from both onboard and sim `DeviceRunner::run()`. Both runtimes already tolerate an uneven split: `host_build_graph` (`AicpuExecutor::assign_cores_to_threads`) and `tensormap_and_ringbuffer` (`SchedulerContext::assign_cores_to_threads` / `reassign_cores_for_all_threads`) assign cores to scheduler threads in cluster-aligned round-robin order and size each thread's tracker from its actual cluster count, so `block_dim` need not divide evenly. Sim keeps the static `PLATFORM_MAX_BLOCKDIM` bound (it has no stream resource query), so onboard and sim validation stay consistent.
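The capacity check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `check_block_dim`, `BlockDimCheck`, the message wording, and the constant values are hypothetical, and the cube/vector counts are passed in as plain parameters rather than obtained from `aclrtGetStreamResLimit`, whose exact signature is not shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical values; the real constants are defined by the platform headers.
constexpr std::uint32_t PLATFORM_AIC_CORES_PER_BLOCKDIM = 1;
constexpr std::uint32_t PLATFORM_AIV_CORES_PER_BLOCKDIM = 2;
constexpr std::uint32_t PLATFORM_MAX_BLOCKDIM = 24;

// Hypothetical result type for this sketch.
struct BlockDimCheck {
    bool ok;                      // true if block_dim is accepted
    std::uint32_t max_block_dim;  // the bound that was applied
    bool used_fallback;           // true if the static cap was used
    std::string message;          // diagnostic, empty when ok
};

// cube_cores / vector_cores stand in for the stream resource query result;
// a zero count is treated as "query unavailable".
BlockDimCheck check_block_dim(int block_dim, std::uint32_t cube_cores, std::uint32_t vector_cores) {
    std::uint32_t limit = 0;
    if (cube_cores > 0 && vector_cores > 0) {
        limit = std::min(cube_cores / PLATFORM_AIC_CORES_PER_BLOCKDIM,
                         vector_cores / PLATFORM_AIV_CORES_PER_BLOCKDIM);
    }
    // No usable dynamic limit: fall back to the static cap so block_dim stays bounded.
    const bool used_fallback = (limit == 0);
    if (used_fallback) {
        limit = PLATFORM_MAX_BLOCKDIM;
    }

    BlockDimCheck result{true, limit, used_fallback, ""};
    if (block_dim < 1) {
        result.ok = false;
        result.message = "block_dim=" + std::to_string(block_dim) + " must be >= 1";
    } else if (static_cast<std::uint32_t>(block_dim) > limit) {
        result.ok = false;
        result.message = "block_dim=" + std::to_string(block_dim) +
                         " exceeds max_block_dim=" + std::to_string(limit);
        if (used_fallback) {
            // Avoid printing misleading zero core counts when the query gave nothing.
            result.message += " (static PLATFORM_MAX_BLOCKDIM, stream limit unavailable)";
        } else {
            result.message += " (cube=" + std::to_string(cube_cores) +
                              " vector=" + std::to_string(vector_cores) + ")";
        }
    }
    return result;
}
```

With these assumed constants, `check_block_dim(25, 24, 48)` is rejected against a dynamic limit of 24, while `check_block_dim(30, 0, 0)` is rejected against the static cap and its message omits the zero core counts.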
Testing
- `a2a3sim` / `a5sim` (`libhost_runtime.so`) rebuild cleanly with the changes; pre-commit (clang-format, clang-tidy, cpplint) passes.
- Confirm that a `block_dim` exceeding hardware capacity is rejected with the new `max_block_dim=... cube=... vector=...` diagnostic instead of a handshake hang; confirm the `PLATFORM_MAX_BLOCKDIM` fallback path on firmware without `aclrtGetStreamResLimit`.
- Confirm that a `block_dim` not divisible by the thread count (e.g. `block_dim=5`, `aicpu_thread_num=4`) runs to completion (sanity-check the round-robin core split now that the divisibility gate is gone).
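To see why an uneven split such as `block_dim=5` with four threads is tolerable, here is a minimal sketch of a round-robin assignment. It assumes one cluster per `block_dim` unit and that cluster `c` goes to thread `c % thread_num`; the function name `assign_clusters_round_robin` is hypothetical and the real `assign_cores_to_threads` implementations may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: distribute cluster indices 0..block_dim-1 across
// scheduler threads round-robin. Each thread's workload is sized from the
// clusters it actually received, so block_dim need not divide evenly.
std::vector<std::vector<int>> assign_clusters_round_robin(int block_dim, int thread_num) {
    assert(block_dim >= 0 && thread_num > 0);
    std::vector<std::vector<int>> per_thread(static_cast<std::size_t>(thread_num));
    for (int cluster = 0; cluster < block_dim; ++cluster) {
        per_thread[static_cast<std::size_t>(cluster % thread_num)].push_back(cluster);
    }
    return per_thread;
}
```

For `assign_clusters_round_robin(5, 4)`, thread 0 receives clusters 0 and 4 and threads 1 through 3 receive one cluster each, so every cluster has exactly one owner even though 5 is not a multiple of 4.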
Review notes addressed
- Hardcoded `/2` AIC:AIV ratio → platform constants.
- `PLATFORM_MAX_BLOCKDIM` fallback restored (was only a `LOG_WARN`).
- Misleading `max_block_dim=0` in the error message when the resource query fails.
- `validate_block_dim` doc comment updated to match the actual contract.

Notes
- `Runtime::get_orch_built_on_host()` is kept: this PR only drops its one caller in the platform layer (the divisibility check). The flag is still load-bearing inside `tensormap_and_ringbuffer` (`aicpu_executor.cpp`, `scheduler_cold_path.cpp` `orchestrator_done_`, `runtime_maker.cpp` `set_orch_built_on_host`) and `host_build_graph` (`runtime.h` returns `true`); removing it is out of scope and would change runtime behavior.