Add: ST exercising AICore op-execution timeout#762
Merged
ChaoWao merged 1 commit on May 13, 2026
Conversation
PR hw-native-sys#718 added a 3-layer timeout chain (STARS op watchdog, AICPU deinit watchdog, host aclrtSynchronizeStreamWithTimeout) but shipped no coverage because the behavior is hardware-only. This adds a negative ST that hangs an AICore op on real silicon and asserts the host comes back with a RuntimeError in bounded time instead of deadlocking.

- AIC kernel kernel_hang.cpp: touches args[0] then spins forever (volatile sink defeats DCE); STARS reaps it after ~1 s
- Orchestration aicore_op_timeout_orch.cpp: allocates a 1-element INT32 output and dispatches one AIC task via rt_submit_aic_task
- test_aicore_op_timeout.py: drives the raw Worker API (mirrors tests/st/explicit_fatal/), asserts RuntimeError('run_prepared failed with code 507046') (ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s

Scope: a2a3 onboard only. The simulator has no STARS watchdog, so a while(true) kernel would wedge it; a5 has no equivalent timeout chain yet — mirror this test there once it does.

Verified on Ascend910 / a2a3 onboard: observed elapsed ~6.3 s, captured exception RuntimeError('run_prepared failed with code 507046'). pytest-timeout (@pytest.mark.timeout(60)) is already pulled in by the .[test] extra on the hardware runner.
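The bounded-time check the ST performs can be sketched as follows. This is a minimal stand-in, not the real Worker API: run_hanging_op, its sleep, and check_bounded_timeout are illustrative names modeled on the description above; only the error string and the 10 s budget come from the PR.

```python
import time

TIMEOUT_BUDGET_S = 10.0  # upper bound the ST asserts on

def run_hanging_op():
    """Stand-in for dispatching the hang kernel through the Worker API.

    On real silicon the STARS watchdog reaps the op after ~1 s and the host
    stream sync surfaces ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046); here we
    simulate that with a short sleep and the same RuntimeError message.
    """
    time.sleep(0.05)
    raise RuntimeError("run_prepared failed with code 507046")

def check_bounded_timeout():
    """Return (error_message, elapsed_s); fail if the op 'deadlocks'."""
    start = time.monotonic()
    try:
        run_hanging_op()
    except RuntimeError as exc:
        return str(exc), time.monotonic() - start
    raise AssertionError("expected RuntimeError, got a clean return")

msg, elapsed = check_bounded_timeout()
assert "507046" in msg             # correct ACL error code surfaced
assert elapsed < TIMEOUT_BUDGET_S  # host came back in bounded time
```

In the real test, @pytest.mark.timeout(60) from pytest-timeout wraps the whole test as an outer bound, so even a deadlocked host sync cannot hang the runner.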
Code Review
This pull request introduces a regression test for the AICore op-execution timeout mechanism. It includes a C++ kernel designed to hang indefinitely, orchestration code to dispatch this kernel, and a Python test script that verifies the system correctly identifies the hang and surfaces a RuntimeError within the expected time frame. I have no feedback to provide as there were no review comments.
ChaoWao
approved these changes
May 13, 2026
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request on May 13, 2026
PR hw-native-sys#723 (collapsed ChipWorker init/set_device into a single simpler_init) flipped the order of attach_current_thread and the dlog_setlevel block inside simpler_init on both a2a3 and a5 onboard. CANN snapshots the device-side log session's level at device-context open time (rtSetDevice inside attach_current_thread), so a dlog_setlevel issued after that is a no-op for the device side.

Net effect: when ASCEND_GLOBAL_LOG_LEVEL is not set in the environment, the log_level the user passed to Worker(...) / configure_logging(...) silently fails to reach the device-side filter, and ~/ascend/log/{debug,run}/device-N/*.log files are either missing or pinned at CANN's default (level 3 / ERROR). Pre-hw-native-sys#715/hw-native-sys#723 the order was correct because init and set_device were two separate C entries called in the right sequence; hw-native-sys#723 merged them and the dlog reordering was silent collateral. Sim has no CANN dlog and is unaffected.

The fix: hoist the existing dlog_setlevel block above attach_current_thread in both onboard simpler_init's. HostLogger is already seeded by libsimpler_log.so's simpler_log_init() (which runs earlier in ChipWorker::init), so HostLogger::get_instance().level() already holds the user's choice at this point — no new plumbing. A comment on the hoisted block explains the rtSetDevice ordering constraint so this doesn't silently regress again. The header doc (pto_runtime_c_api.h) reorders the three responsibilities, and docs (logging.md, dynamic-linking.md, chip-level-arch.md) update their call-flow diagrams to match the new order.

Hardware verification on Ascend910 / a2a3 onboard (ASCEND_GLOBAL_LOG_LEVEL unset, configure_logging("debug")):
- before: ~/ascend/log/run/device-2/device-845511_*.log: logLevel=3 (no DEBUG entries, debug/ dir empty)
- after: ~/ascend/log/run/device-2/device-856602_*.log: logLevel=0 (76 KB of DEBUG entries)

With ASCEND_GLOBAL_LOG_LEVEL=1 set, the device shows logLevel=1 regardless of configure_logging — the env-var path is unchanged. The existing onboard ST (tests/st/aicore_op_timeout, PR hw-native-sys#762) still passes after rebuild.
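The ordering constraint can be illustrated with a toy model. FakeDeviceLog is an illustration of the snapshot-at-open behavior described above, not the CANN dlog API; only the level values (3 = ERROR default, 0 = DEBUG) come from the commit message.

```python
class FakeDeviceLog:
    """Toy model: the device-side filter level is snapshotted when the
    device context opens (rtSetDevice inside attach_current_thread)."""

    DEFAULT_LEVEL = 3  # CANN default: ERROR

    def __init__(self):
        self.pending_level = self.DEFAULT_LEVEL
        self.device_level = None  # unset until the context opens

    def dlog_setlevel(self, level):
        # Only reaches the device side if called before open_device().
        self.pending_level = level

    def open_device(self):
        self.device_level = self.pending_level  # snapshot happens here

# Broken order (post-#723): setlevel after open is a no-op device-side.
broken = FakeDeviceLog()
broken.open_device()
broken.dlog_setlevel(0)          # user asked for DEBUG...
assert broken.device_level == 3  # ...but the device stays pinned at ERROR

# Fixed order: hoist dlog_setlevel above attach_current_thread.
fixed = FakeDeviceLog()
fixed.dlog_setlevel(0)
fixed.open_device()
assert fixed.device_level == 0   # DEBUG reaches the device-side filter
```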
ChaoWao added a commit that referenced this pull request on May 13, 2026
#763) Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Negative ST that exercises the 3-layer AICore op-execution timeout chain added in #718. The original PR shipped no coverage because the behavior is hardware-only — this adds a regression test that hangs an AICore op on real silicon and asserts the host returns a RuntimeError in bounded time instead of deadlocking.

- kernels/aic/kernel_hang.cpp — AIC kernel that touches args[0] (volatile sink defeats DCE) then spins forever; STARS reaps it after ~1 s.
- kernels/orchestration/aicore_op_timeout_orch.cpp — allocates a 1-element INT32 output and dispatches one AIC task via rt_submit_aic_task.
- test_aicore_op_timeout.py — drives the raw Worker API (mirrors tests/st/explicit_fatal/), asserts RuntimeError('run_prepared failed with code 507046') (= ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s, and is wrapped in @pytest.mark.timeout(60) as a hard belt-and-suspenders bound.

Scope

a2a3 onboard only. The simulator has no STARS watchdog, so a while(true) kernel would wedge it. a5: mirror this test once Fix: add AICore task execution timeout mechanism #718's mechanism is ported.
Verification
Hardware: Ascend910 / a2a3 onboard, tensormap_and_ringbuffer, device 2.
check-headers — all green, no SKIP=/--no-verify.
pytest-timeout is already pulled in by the .[test] extra on the hardware runner; no new dependency.
Test plan
pytest tests/st/aicore_op_timeout --platform a2a3 --device 2 -v passes on Ascend910 in ~8 s outer / ~6 s inner.
The inner assertion (elapsed < 10) proves the timeout chain fired and didn't deadlock.
Ascend686ai runner in CI.