Skip to content

Fix PTO2 dep pool fatal exits#730

Open
zhaozhaozz wants to merge 2 commits into
hw-native-sys:mainfrom
zhaozhaozz:fix/pto2-dep-pool-fatal-exit
Open

Fix PTO2 dep pool fatal exits#730
zhaozhaozz wants to merge 2 commits into
hw-native-sys:mainfrom
zhaozhaozz:fix/pto2-dep-pool-fatal-exit

Conversation

@zhaozhaozz
Copy link
Copy Markdown

Summary

This PR handles the first layer of #729: when PTO2 detects a fanin/dep pool fatal condition, runtime should report the original fatal status and exit cleanly instead of getting stuck in an unrecoverable spin or secondary failure path.

Changes:

  • Make fanin/dep pool ensure_space() return false and latch PTO2_ERROR_DEP_POOL_OVERFLOW instead of calling exit(1).
  • Stop orchestrator submission when fanin spill allocation cannot make progress.
  • Teach scheduler threads to observe orchestrator/scheduler fatal state before orchestration completes, perform one emergency shutdown, and return the latched runtime status.
  • Preserve AICore deinit timeout handling and propagate shutdown failures instead of hiding them.
  • Add C++ unit coverage for deadlocked ensure_space() returning false and latching the error.

This does not attempt to fix the second-layer root cause from #729: over-broad whole-tensor dependencies after disjoint view writes. That still needs codegen/runtime region precision or backpressure work.

Validation

  • git diff --check
  • source .venv/bin/activate && python simpler_setup/build_runtimes.py --platforms a2a3sim a5sim
  • cmake -S tests/ut/cpp -B temp/ut_cpp_build && cmake --build temp/ut_cpp_build --target test_fanin_pool test_dep_list_pool --parallel 8 && ctest --test-dir temp/ut_cpp_build -R 'test_(fanin|dep_list)_pool' --output-on-failure
  • source .venv/bin/activate && python tests/lint/clang_tidy.py --diff-base origin/main
  • From pypto-lib, after reinstalling this runtime branch into that local venv: timeout 180s env PTOAS_ROOT=/home/bot/code/PTO/ptoas-bin PYTHONPATH=.:models/deepseek/v4 python models/deepseek/v4/repro_fanin_spill_pool_deadlock.py -p a2a3sim
    • Expected fatal was reported as orch_error_code=4, Python raised RuntimeError: run_runtime failed with code -4, and the process returned before timeout.

Notes

pre-commit run --all-files is currently blocked locally by unrelated existing repository issues: missing copyright headers in files outside this diff and markdownlint table style issues in existing docs. Running pre-commit --files on this diff also hit a local clang-tidy compile-database cache race; the direct clang_tidy.py --diff-base origin/main gate passes.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements robust error handling and timeout detection for AICore deinitialization and scheduler execution. Key changes include transitioning from exit(1) to atomic error latching in memory pools, updating platform deinitialization to return status codes, and enhancing the scheduler to handle timeouts and fatal errors gracefully. Review feedback highlights a potential off-by-one error in the error bitmap indexing and suggests using unsigned literals for bitwise shifts to prevent undefined behavior.

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp Outdated
@zhaozhaozz zhaozhaozz force-pushed the fix/pto2-dep-pool-fatal-exit branch from 6fa3524 to 23f2fac Compare May 10, 2026 03:44
@jvjhfhg
Copy link
Copy Markdown
Collaborator

jvjhfhg commented May 11, 2026

Hi @zhaozhaozz, thanks for contribution! A few minor alignment nits before merging:

  1. a5 unit tests are missing the new case: please mirror EnsureSpaceDeadlockReturnsFalseAndLatchesError into tests/ut/cpp/a5/test_dep_list_pool.cpp and tests/ut/cpp/a5/test_fanin_pool.cpp — the a2a3/a5 test cases were fully aligned before this PR.

  2. a5 deinit timeout should be time-based: AICORE_EXIT_ACK_SPIN_LIMIT = 100000 is an iteration count and isn't comparable across CPU frequencies; please add a matching PLATFORM_DEINIT_TIMEOUT_TICKS constant in a5's platform_config.h and reuse the same get_sys_cnt_aicpu() pattern as a2a3.

  3. Align the shutdown log wording: change a5's "Core %d shutdown timed out" to a2a3's "Core %d deinit timed out", and also bring over the // Timeout means AICore is unresponsive. Log and continue deiniting remaining cores. comment that a2a3 has.

  4. Add a one-liner on latch_scheduler_error: clarify that sched_error_code/sched_error_thread is last-writer-wins (CAS) while sched_error_bitmap is cumulative (OR), so future readers don't mistake the asymmetry for a bug.

  5. Unify the completed_ pattern on the scheduler fatal paths: in handle_orchestrator_exit / check_idle_fatal_error, the sched_err branch uses an unconditional completed_.store(true) while the orch_err branch uses completed_.exchange(true) to gate emergency_shutdown — please unify on the latter.

- Mirror dep pool deadlock tests into a5 coverage
- Make a5 AICore deinit timeout use system counter ticks
- Align scheduler fatal shutdown gating and diagnostics comments
@zhaozhaozz
Copy link
Copy Markdown
Author

@jvjhfhg Thanks for the review. Fixed in 99339a8:

  1. Mirrored EnsureSpaceDeadlockReturnsFalseAndLatchesError into the a5 dep-list and fanin pool unit tests.
  2. Replaced the a5 fixed spin-count AICore exit-ack timeout with PLATFORM_DEINIT_TIMEOUT_TICKS based on get_sys_cnt_aicpu().
  3. Aligned the a5 shutdown timeout comment/log wording with a2a3 (deinit timed out).
  4. Added a latch_scheduler_error comment clarifying that the first error code/thread pair wins while the bitmap accumulates all reporting threads.
  5. Updated both a2a3 and a5 scheduler sched_err fatal paths to gate emergency_shutdown() with completed_.exchange(true, ...), matching the existing orchestrator fatal path.

I also checked the touched a2a3/a5 paired sections after the changes: the two new a5 UT files now match their a2a3 counterparts, and the scheduler fatal/shutdown snippets are aligned. Validation passed locally with git diff --check, the focused C++ UTs, build_runtimes.py --platforms a2a3sim a5sim, and tests/lint/clang_tidy.py --diff-base origin/main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants