Fix PTO2 dep pool fatal exits #730
Code Review
This pull request implements robust error handling and timeout detection for AICore deinitialization and scheduler execution. Key changes include transitioning from exit(1) to atomic error latching in memory pools, updating platform deinitialization to return status codes, and enhancing the scheduler to handle timeouts and fatal errors gracefully. Review feedback highlights a potential off-by-one error in the error bitmap indexing and suggests using unsigned literals for bitwise shifts to prevent undefined behavior.
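The two review points above (bitmap index bounds and unsigned shift literals) can be illustrated with a minimal sketch. The names `kMaxCores`, `mark_core_error`, and `core_has_error` are hypothetical and do not come from the PR; the 32-bit bitmap width is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bitmap width; the real runtime's core count may differ.
constexpr unsigned kMaxCores = 32;

// Set the error bit for core_id and return the updated bitmap.
inline uint32_t mark_core_error(uint32_t bitmap, unsigned core_id) {
    // Valid ids are 0..kMaxCores-1. Using `<=` here would be the kind of
    // off-by-one that lets core_id == 32 reach the shift below.
    assert(core_id < kMaxCores);
    // Unsigned literal: a signed `1 << 31` would shift into the sign bit,
    // which the review flags as a source of undefined behavior.
    return bitmap | (UINT32_C(1) << core_id);
}

// Check whether the error bit for core_id is set.
inline bool core_has_error(uint32_t bitmap, unsigned core_id) {
    assert(core_id < kMaxCores);
    return ((bitmap >> core_id) & 1u) != 0u;
}
```

Keeping the bounds check and the shift in one helper means every caller gets the same indexing rule, which is one way to avoid the mismatch the review describes.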
6fa3524 to 23f2fac
Hi @zhaozhaozz, thanks for the contribution! A few minor alignment nits before merging:
- Mirror dep pool deadlock tests into a5 coverage
- Make a5 AICore deinit timeout use system counter ticks
- Align scheduler fatal shutdown gating and diagnostics comments
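A tick-based deinit timeout of the kind the commit message mentions might look roughly like the sketch below. It assumes a monotonically increasing 64-bit counter; `DeinitStatus`, `wait_for_deinit`, and the injected `read_ticks` / `is_done` callables are hypothetical names used for illustration, not the runtime's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical status codes returned instead of aborting the process.
enum class DeinitStatus { kOk, kTimeout };

// Poll is_done() until it reports completion or timeout_ticks counter
// ticks have elapsed. read_ticks stands in for a system counter read.
inline DeinitStatus wait_for_deinit(const std::function<uint64_t()>& read_ticks,
                                    const std::function<bool()>& is_done,
                                    uint64_t timeout_ticks) {
    assert(read_ticks && is_done);  // both callables must be set
    const uint64_t start = read_ticks();
    while (!is_done()) {
        // Unsigned subtraction keeps the elapsed value correct even if
        // the counter wraps around between the two reads.
        if (read_ticks() - start >= timeout_ticks) {
            return DeinitStatus::kTimeout;
        }
    }
    return DeinitStatus::kOk;
}
```

Injecting the counter read makes the loop testable with a fake clock, and returning a status lets the caller decide how to report the failure.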
@jvjhfhg Thanks for the review. Fixed in
I also checked the touched a2a3/a5 paired sections after the changes: the two new a5 UT files now match their a2a3 counterparts, and the scheduler fatal/shutdown snippets are aligned. Validation passed locally with
Summary
This PR handles the first layer of #729: when PTO2 detects a fanin/dep pool fatal condition, the runtime should report the original fatal status and exit cleanly instead of getting stuck in an unrecoverable spin or a secondary failure path.
Changes:
- `ensure_space()` returns `false` and latches `PTO2_ERROR_DEP_POOL_OVERFLOW` instead of calling `exit(1)`.
- `ensure_space()` returning false and latching the error.

This does not attempt to fix the second-layer root cause from #729: over-broad whole-tensor dependencies after disjoint view writes. That still needs codegen/runtime region precision or backpressure work.
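The return-false-and-latch behaviour described in the changes can be sketched as follows. This is a simplified illustration, not the PR's implementation: `DepPoolSketch`, its fields, `alloc()`, and the numeric error values are hypothetical; only the names `ensure_space()` and `PTO2_ERROR_DEP_POOL_OVERFLOW` come from the PR text.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder values; the real error codes are defined by the runtime.
enum : int32_t { PTO2_ERROR_NONE = 0, PTO2_ERROR_DEP_POOL_OVERFLOW = 1 };

// Hypothetical fixed-capacity pool that reports overflow via a shared latch.
struct DepPoolSketch {
    std::size_t capacity;
    std::size_t used;
    std::atomic<int32_t>* error_latch;  // shared with whoever reports status

    DepPoolSketch(std::size_t cap, std::atomic<int32_t>* latch)
        : capacity(cap), used(0), error_latch(latch) {
        assert(error_latch != nullptr);
    }

    // Return false and latch the error instead of calling exit(1).
    bool ensure_space(std::size_t needed) {
        if (needed <= capacity - used) {
            return true;
        }
        int32_t expected = PTO2_ERROR_NONE;
        // Compare-exchange so the first latched error wins and a later
        // secondary failure cannot overwrite the original status.
        error_latch->compare_exchange_strong(expected,
                                             PTO2_ERROR_DEP_POOL_OVERFLOW,
                                             std::memory_order_release,
                                             std::memory_order_relaxed);
        return false;
    }

    // Reserve n entries, or fail without touching the pool.
    bool alloc(std::size_t n) {
        if (!ensure_space(n)) {
            return false;
        }
        used += n;
        return true;
    }
};
```

A caller would check the return value, stop issuing work, and let the scheduler read the latch to report the original status.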
Validation
- `git diff --check`
- `source .venv/bin/activate && python simpler_setup/build_runtimes.py --platforms a2a3sim a5sim`
- `cmake -S tests/ut/cpp -B temp/ut_cpp_build && cmake --build temp/ut_cpp_build --target test_fanin_pool test_dep_list_pool --parallel 8 && ctest --test-dir temp/ut_cpp_build -R 'test_(fanin|dep_list)_pool' --output-on-failure`
- `source .venv/bin/activate && python tests/lint/clang_tidy.py --diff-base origin/main`
- pypto-lib, after reinstalling this runtime branch into that local venv:
  - `timeout 180s env PTOAS_ROOT=/home/bot/code/PTO/ptoas-bin PYTHONPATH=.:models/deepseek/v4 python models/deepseek/v4/repro_fanin_spill_pool_deadlock.py -p a2a3sim`
  - `orch_error_code=4`, Python raised `RuntimeError: run_runtime failed with code -4`, and the process returned before timeout.

Notes
`pre-commit run --all-files` is currently blocked locally by unrelated existing repository issues: missing copyright headers in files outside this diff and markdownlint table style issues in existing docs. Running `pre-commit --files` on this diff also hit a local clang-tidy compile-database cache race; the direct `clang_tidy.py --diff-base origin/main` gate passes.