Add: ST exercising AICore op-execution timeout#762
Merged
ChaoWao merged 1 commit on May 13, 2026
Conversation
PR hw-native-sys#718 added a 3-layer timeout chain (STARS op watchdog, AICPU deinit watchdog, host aclrtSynchronizeStreamWithTimeout) but shipped no coverage because the behavior is hardware-only. This adds a negative ST that hangs an AICore op on real silicon and asserts the host comes back with a RuntimeError in bounded time instead of deadlocking.

- AIC kernel kernel_hang.cpp: touches args[0] then spins forever (volatile sink defeats DCE); STARS reaps it after ~1 s
- Orchestration aicore_op_timeout_orch.cpp: allocates a 1-element INT32 output and dispatches one AIC task via rt_submit_aic_task
- test_aicore_op_timeout.py: drives the raw Worker API (mirrors tests/st/explicit_fatal/), asserts RuntimeError('run_prepared failed with code 507046') (ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s

Scope: a2a3 onboard only. The simulator has no STARS watchdog, so a while(true) kernel would wedge it; a5 has no equivalent timeout chain yet — mirror this test there once it does.

Verified on Ascend910 / a2a3 onboard: observed elapsed ~6.3 s, captured exception RuntimeError('run_prepared failed with code 507046'). pytest-timeout (@pytest.mark.timeout(60)) is already pulled in by the .[test] extra on the hardware runner.
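The bounded-time check the ST performs can be sketched as follows. This is a minimal stand-in, not the real Worker API: run_hanging_op, its sleep, and check_bounded_timeout are illustrative names modeled on the description above; only the error string and the 10 s budget come from the PR.

```python
import time

TIMEOUT_BUDGET_S = 10.0  # upper bound the ST asserts on

def run_hanging_op():
    """Stand-in for dispatching the hang kernel through the Worker API.

    On real silicon the STARS watchdog reaps the op after ~1 s and the host
    stream sync surfaces ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046); here we
    simulate that with a short sleep and the same RuntimeError message.
    """
    time.sleep(0.05)
    raise RuntimeError("run_prepared failed with code 507046")

def check_bounded_timeout():
    """Return (error_message, elapsed_s); fail if the op 'deadlocks'."""
    start = time.monotonic()
    try:
        run_hanging_op()
    except RuntimeError as exc:
        return str(exc), time.monotonic() - start
    raise AssertionError("expected RuntimeError, got a clean return")

msg, elapsed = check_bounded_timeout()
assert "507046" in msg             # correct ACL error code surfaced
assert elapsed < TIMEOUT_BUDGET_S  # host came back in bounded time
```

In the real test, @pytest.mark.timeout(60) from pytest-timeout wraps the whole test as an outer bound, so even a deadlocked host sync cannot hang the runner.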
Code Review
This pull request introduces a regression test for the AICore op-execution timeout mechanism. It includes a C++ kernel designed to hang indefinitely, orchestration code to dispatch this kernel, and a Python test script that verifies the system correctly identifies the hang and surfaces a RuntimeError within the expected time frame. I have no feedback to provide as there were no review comments.
ChaoWao
approved these changes
May 13, 2026
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request on May 13, 2026
PR hw-native-sys#723 (collapsed ChipWorker init/set_device into a single simpler_init) flipped the order of attach_current_thread and the dlog_setlevel block inside simpler_init on both a2a3 and a5 onboard. CANN snapshots the device-side log session's level at device-context open time (rtSetDevice inside attach_current_thread), so a dlog_setlevel issued after that is a no-op for the device side.

Net effect: when ASCEND_GLOBAL_LOG_LEVEL is not set in the environment, the log_level the user passed to Worker(...) / configure_logging(...) silently fails to reach the device-side filter, and ~/ascend/log/{debug,run}/device-N/*.log files are either missing or pinned at CANN's default (level 3 / ERROR). Pre-hw-native-sys#715/hw-native-sys#723 the order was correct because init and set_device were two separate C entries called in the right sequence; hw-native-sys#723 merged them and the dlog reordering was silent collateral. Sim has no CANN dlog and is unaffected.

The fix: hoist the existing dlog_setlevel block above attach_current_thread in both onboard simpler_init's. HostLogger is already seeded by libsimpler_log.so's simpler_log_init() (which runs earlier in ChipWorker::init), so HostLogger::get_instance().level() already holds the user's choice at this point — no new plumbing. A comment on the hoisted block explains the rtSetDevice ordering constraint so this doesn't silently regress again. The header doc (pto_runtime_c_api.h) reorders the three responsibilities, and docs (logging.md, dynamic-linking.md, chip-level-arch.md) update their call-flow diagrams to match the new order.

Hardware verification on Ascend910 / a2a3 onboard (ASCEND_GLOBAL_LOG_LEVEL unset, configure_logging("debug")):
- before: ~/ascend/log/run/device-2/device-845511_*.log: logLevel=3 (no DEBUG entries, debug/ dir empty)
- after: ~/ascend/log/run/device-2/device-856602_*.log: logLevel=0 (76 KB of DEBUG entries)

With ASCEND_GLOBAL_LOG_LEVEL=1 set, the device shows logLevel=1 regardless of configure_logging — the env-var path is unchanged. The existing onboard ST (tests/st/aicore_op_timeout, PR hw-native-sys#762) still passes after rebuild.
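The ordering constraint can be illustrated with a toy model. FakeDeviceLog is an illustration of the snapshot-at-open behavior described above, not the CANN dlog API; only the level values (3 = ERROR default, 0 = DEBUG) come from the commit message.

```python
class FakeDeviceLog:
    """Toy model: the device-side filter level is snapshotted when the
    device context opens (rtSetDevice inside attach_current_thread)."""

    DEFAULT_LEVEL = 3  # CANN default: ERROR

    def __init__(self):
        self.pending_level = self.DEFAULT_LEVEL
        self.device_level = None  # unset until the context opens

    def dlog_setlevel(self, level):
        # Only reaches the device side if called before open_device().
        self.pending_level = level

    def open_device(self):
        self.device_level = self.pending_level  # snapshot happens here

# Broken order (post-#723): setlevel after open is a no-op device-side.
broken = FakeDeviceLog()
broken.open_device()
broken.dlog_setlevel(0)          # user asked for DEBUG...
assert broken.device_level == 3  # ...but the device stays pinned at ERROR

# Fixed order: hoist dlog_setlevel above attach_current_thread.
fixed = FakeDeviceLog()
fixed.dlog_setlevel(0)
fixed.open_device()
assert fixed.device_level == 0   # DEBUG reaches the device-side filter
```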
ChaoWao added a commit that referenced this pull request on May 13, 2026
#763) Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Negative ST that exercises the 3-layer AICore op-execution timeout chain added in #718. The original PR shipped no coverage because the behavior is hardware-only — this adds a regression test that hangs an AICore op on real silicon and asserts the host returns a RuntimeError in bounded time instead of deadlocking.

- kernels/aic/kernel_hang.cpp — AIC kernel that touches args[0] (volatile sink defeats DCE) then spins forever; STARS reaps it after ~1 s.
- kernels/orchestration/aicore_op_timeout_orch.cpp — allocates a 1-element INT32 output and dispatches one AIC task via rt_submit_aic_task.
- test_aicore_op_timeout.py — drives the raw Worker API (mirrors tests/st/explicit_fatal/), asserts RuntimeError('run_prepared failed with code 507046') (= ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s, and is wrapped in @pytest.mark.timeout(60) as a hard belt-and-suspenders bound.

Scope

a2a3 onboard only. The simulator has no STARS watchdog, so a while(true) kernel would wedge it. a5: mirror this test once Fix: add AICore task execution timeout mechanism #718's mechanism is ported.
Verification
Hardware: Ascend910 / a2a3 onboard, tensormap_and_ringbuffer, device 2.
check-headers — all green, no SKIP=/--no-verify.
pytest-timeout is already pulled in by the .[test] extra on the hardware runner; no new dependency.
Test plan
pytest tests/st/aicore_op_timeout --platform a2a3 --device 2 -v passes on Ascend910 in ~8 s outer / ~6 s inner.
The inner assertion (elapsed < 10) proves the timeout chain fired and didn't deadlock.
Ascend686ai runner in CI.