Add: ST exercising AICore op-execution timeout#762

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:feat/aicore-op-timeout-st
May 13, 2026

@hw-native-sys-bot
Collaborator

Summary

Negative ST that exercises the 3-layer AICore op-execution timeout
chain added in #718. The original PR shipped no coverage because
the behavior is hardware-only — this adds a regression test that
hangs an AICore op on real silicon and asserts the host returns a
RuntimeError in bounded time instead of deadlocking.

  • kernels/aic/kernel_hang.cpp — AIC kernel that touches args[0]
    (volatile sink defeats DCE) then spins forever; STARS reaps it
    after ~1 s.
  • kernels/orchestration/aicore_op_timeout_orch.cpp — allocates a
    1-element INT32 output and dispatches one AIC task via
    rt_submit_aic_task.
  • test_aicore_op_timeout.py — drives the raw Worker API
    (mirrors tests/st/explicit_fatal/), asserts
    RuntimeError('run_prepared failed with code 507046')
    (= ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s, and is wrapped
    in @pytest.mark.timeout(60) as a hard belt-and-suspenders bound.
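
The bounded-time assertion in the last bullet can be sketched as follows. This is a minimal illustration, not the test shipped in this PR: `assert_raises_within` is a helper invented here, and `fake_run_prepared` is a hypothetical stand-in for the real Worker call, which would block until the host-side stream sync times out.

```python
import re
import time


def assert_raises_within(fn, exc_type, pattern, limit_s):
    """Call fn(); require it to raise exc_type matching pattern in under limit_s seconds."""
    start = time.monotonic()
    try:
        fn()
    except exc_type as exc:
        elapsed = time.monotonic() - start
        assert re.search(pattern, str(exc)), f"unexpected message: {exc!r}"
        assert elapsed < limit_s, f"error surfaced too late: {elapsed:.2f}s"
        return elapsed, exc
    raise AssertionError(f"{exc_type.__name__} was not raised")


def fake_run_prepared():
    # Hypothetical stand-in for the real call: raises the error text
    # quoted in this PR without touching any hardware.
    raise RuntimeError("run_prepared failed with code 507046")


elapsed, exc = assert_raises_within(
    fake_run_prepared, RuntimeError, r"code 507046", limit_s=10.0
)
```

The inner limit catches a late error with a readable message; the outer `@pytest.mark.timeout(60)` only matters if the call never returns at all.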

Scope

a2a3 onboard only.

Verification

Hardware: Ascend910 / a2a3 onboard, tensormap_and_ringbuffer,
device 2.

test_aicore_op_timeout_surfaces_as_runtime_error PASSED
elapsed=6.26s
exception=RuntimeError('run_prepared failed with code 507046')
  • Pre-commit hooks: clang-format, cpplint, ruff, pyright,
    check-headers — all green, no SKIP= / --no-verify.
  • pytest-timeout is already pulled in by the .[test] extra on
    the hardware runner; no new dependency.

Test plan

  • pytest tests/st/aicore_op_timeout --platform a2a3 --device 2 -v
    passes on Ascend910 in ~8 s outer / ~6 s inner.
  • Test asserts elapsed < 10 s, so a pass proves the timeout chain
    fired instead of deadlocking.
  • Pre-commit hooks pass on the new files.
  • Re-run on the official Ascend686ai runner in CI.

PR hw-native-sys#718 added a 3-layer timeout chain (STARS op watchdog,
AICPU deinit watchdog, host aclrtSynchronizeStreamWithTimeout)
but shipped no coverage because the behavior is hardware-only.
This adds a negative ST that hangs an AICore op on real silicon
and asserts the host comes back with a RuntimeError in bounded
time instead of deadlocking.
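
A toy model of the outermost layer only (the host-side sync with a timeout) may help show what "comes back in bounded time" means. It is a sketch under stated assumptions: `run_with_host_timeout` and `hung_task` are hypothetical names, a Python thread stands in for the device, and the STARS and AICPU watchdogs are not modeled.

```python
import threading

ACL_ERROR_RT_STREAM_SYNC_TIMEOUT = 507046  # error code quoted in this PR


def run_with_host_timeout(task, host_timeout_s):
    """Run task on a worker thread; raise if it does not finish in time."""
    done = threading.Event()

    def worker():
        task()
        done.set()

    # Daemon thread: a task that never returns cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(host_timeout_s):
        raise RuntimeError(
            f"run_prepared failed with code {ACL_ERROR_RT_STREAM_SYNC_TIMEOUT}"
        )


def hung_task():
    # The event is never set, so this blocks forever, like the spinning kernel.
    threading.Event().wait()


try:
    run_with_host_timeout(hung_task, host_timeout_s=0.1)
except RuntimeError as exc:
    message = str(exc)
```

Without the timeout on `done.wait`, the caller would block for as long as the task does, which is the deadlock this ST guards against.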

- AIC kernel kernel_hang.cpp: touches args[0] then spins forever
  (volatile sink defeats DCE); STARS reaps it after ~1 s
- Orchestration aicore_op_timeout_orch.cpp: allocates a 1-element
  INT32 output and dispatches one AIC task via rt_submit_aic_task
- test_aicore_op_timeout.py: drives the raw Worker API
  (mirrors tests/st/explicit_fatal/), asserts
  RuntimeError('run_prepared failed with code 507046')
  (ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) within 10 s

Scope: a2a3 onboard only. The simulator has no STARS watchdog,
so a while(true) kernel would wedge it; a5 has no equivalent
timeout chain yet — mirror this test there once it does.

Verified on Ascend910 / a2a3 onboard: observed elapsed ~6.3 s,
captured exception RuntimeError('run_prepared failed with
code 507046'). pytest-timeout (@pytest.mark.timeout(60)) is
already pulled in by the .[test] extra on the hardware runner.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a regression test for the AICore op-execution timeout mechanism. It includes a C++ kernel designed to hang indefinitely, orchestration code to dispatch this kernel, and a Python test script that verifies the system correctly identifies the hang and surfaces a RuntimeError within the expected time frame. I have no feedback to provide as there were no review comments.

@ChaoWao ChaoWao merged commit b1cee3a into hw-native-sys:main May 13, 2026
14 checks passed
@ChaoWao ChaoWao changed the title Add: ST exercising AICore op-execution timeout (#718) Add: ST exercising AICore op-execution timeout May 13, 2026
@ChaoWao ChaoWao deleted the feat/aicore-op-timeout-st branch May 13, 2026 01:27
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 13, 2026
PR hw-native-sys#723 (collapsed ChipWorker init/set_device into a single
simpler_init) flipped the order of attach_current_thread and the
dlog_setlevel block inside simpler_init on both a2a3 and a5
onboard. CANN snapshots the device-side log session's level at
device-context open time (rtSetDevice inside attach_current_thread),
so a dlog_setlevel issued after that is a no-op for the device
side. Net effect: when ASCEND_GLOBAL_LOG_LEVEL is not set in the
environment, the log_level the user passed to Worker(...) /
configure_logging(...) silently fails to reach the device-side
filter, and ~/ascend/log/{debug,run}/device-N/*.log files are
either missing or pinned at CANN's default (level 3 / ERROR).

Pre-hw-native-sys#715/hw-native-sys#723 the order was correct because init and set_device
were two separate C entries called in the right sequence; hw-native-sys#723
merged them and the dlog ordering was silent collateral. Sim has
no CANN dlog and is unaffected.

The fix: hoist the existing dlog_setlevel block above
attach_current_thread in both onboard simpler_init implementations. HostLogger
is already seeded by libsimpler_log.so's simpler_log_init() (runs
earlier in ChipWorker::init), so HostLogger::get_instance().level()
is already the user's choice at this point — no new plumbing.
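
The ordering constraint can be illustrated with a toy model. This is a sketch, not CANN's implementation: `ToyDeviceLog` is a hypothetical class whose method names merely echo the calls discussed above, and the env-var precedence is modeled from the behavior reported in this message.

```python
CANN_DEFAULT_LEVEL = 3  # ERROR, per the description above
DEBUG = 0


class ToyDeviceLog:
    """Toy model: the device-side level is frozen when the context opens."""

    def __init__(self, env=None):
        self._env = env if env is not None else {}
        self._host_level = None
        self.device_level = None  # unset until the context is opened

    def dlog_setlevel(self, level):
        # Only the host-side value changes; an already-open device
        # session keeps whatever level it snapshotted.
        self._host_level = level

    def attach_current_thread(self):
        # Snapshot point (stands in for rtSetDevice opening the context).
        env_level = self._env.get("ASCEND_GLOBAL_LOG_LEVEL")
        if env_level is not None:
            self.device_level = int(env_level)
        elif self._host_level is not None:
            self.device_level = self._host_level
        else:
            self.device_level = CANN_DEFAULT_LEVEL


broken = ToyDeviceLog()
broken.attach_current_thread()
broken.dlog_setlevel(DEBUG)  # too late: the snapshot was already taken

fixed = ToyDeviceLog()
fixed.dlog_setlevel(DEBUG)  # hoisted above the attach
fixed.attach_current_thread()

env_pinned = ToyDeviceLog(env={"ASCEND_GLOBAL_LOG_LEVEL": "1"})
env_pinned.dlog_setlevel(DEBUG)
env_pinned.attach_current_thread()
```

In this model `broken` ends at level 3, `fixed` at level 0, and `env_pinned` at level 1, the same three outcomes reported for the hardware runs in this message.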

Comment on the hoisted block explains the rtSetDevice ordering
constraint so this doesn't silently regress again. Header doc
(pto_runtime_c_api.h) reorders the three responsibilities and
docs (logging.md, dynamic-linking.md, chip-level-arch.md) update
their call-flow diagrams to match the new order.

Hardware verification on Ascend910 / a2a3 onboard
(ASCEND_GLOBAL_LOG_LEVEL unset, configure_logging("debug")):

  before  ~/ascend/log/run/device-2/device-845511_*.log:
              logLevel=3        (no DEBUG entries, debug/ dir empty)
  after   ~/ascend/log/run/device-2/device-856602_*.log:
              logLevel=0        (76 KB of DEBUG entries)

With ASCEND_GLOBAL_LOG_LEVEL=1 set, device shows logLevel=1
regardless of configure_logging — env-var path unchanged.
Existing onboard ST (tests/st/aicore_op_timeout, PR hw-native-sys#762) still
passes after rebuild.
ChaoWao added a commit that referenced this pull request May 13, 2026
#763)

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>