Add: dep_gen replay → deps.json + swimlane integration + fanout ⊆ deps gate #737
Merged
ChaoWao merged 1 commit on May 11, 2026
Code Review
This pull request introduces dep_gen, a diagnostic feature for capturing and replaying task submission traces to reconstruct complete dependency graphs. The implementation includes shared-memory structures, host-side collection, and offline replay logic. Review feedback focused on critical concurrency issues, recommending the use of memory barriers and atomic operations to ensure data visibility between the device and host. Suggestions were also made to improve safety by clamping input values from shared memory and to maintain consistency between runtime and platform constants using static assertions.
75de404 to
b08f0a1
Compare
…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. It skips the replay on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. The explicit path entry in tests/ut/cpp/CMakeLists.txt follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds the max observed local_id up to the next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and the HB violation count.

3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge also appears in deps.json.
   The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.

4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)**: turns the replay product into a self-contained pan/zoom HTML page (Graphviz SVG + 80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color per node type so AIC (cube, blue box), AIV (vector, orange ellipse), mix (green diamond: a single submit_task spanning both core types via MixedKernels), and alloc (gray dashed note: tasks from `alloc_tensors` that produced output tensors but never dispatched a kernel) stay readable even without color. Auto-loads the colocated `l2_perf_records.json` and `name_map_*.json` sidecars for label enrichment; isolated tasks (no inbound/outbound edges) still show up. Use `--engine sfdp` for graphs past ~500 nodes.

5. **In-memory capture → replay (drops `submit_trace.bin`)**: the host collector now accumulates `DepGenRecord` entries directly in a `std::vector<DepGenRecord>` instead of streaming them to `submit_trace.bin` on disk. The replay function takes a pointer + count from `DepGenCollector::records()` and skips the file round-trip entirely. deps.json is the only on-disk dep_gen artifact now; the `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer takes a path, and the replay C ABI is now `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`. Also clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture call-site (defensive: the Arg builder already caps at MAX_TENSOR_ARGS, but this prevents a future builder bypass from overflowing the stack buffers).

The weak fallback in non-dep_gen runtimes (host_build_graph) drops from LOG_WARN to LOG_DEBUG since that path is unreachable for end users; it exists only to keep the .so loadable.

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes hw-native-sys#599.
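To make the width-mismatch failure concrete, here is a minimal Python sketch of the byte-level effect. The numeric tag values (`INPUT = 0`, `OUTPUT = 1`) and both function names are assumptions chosen for illustration; the actual enum values and the C++ writer are not shown in this PR text.

```python
import struct

# Hypothetical tag values; the real TensorArgType enum is not shown here.
INPUT, OUTPUT = 0, 1


def reinterpret_as_u8(tags):
    """Buggy path: pack tags as little-endian int32, then read the first
    len(tags) bytes as if the buffer were a uint8 array."""
    raw = struct.pack(f"<{len(tags)}i", *tags)
    return tuple(raw[: len(tags)])


def narrow_to_u8(tags):
    """Fixed path: narrow each tag to one byte before handing it on."""
    return tuple(t & 0xFF for t in tags)


tags = (INPUT, INPUT, OUTPUT)
print(reinterpret_as_u8(tags))  # the OUTPUT tag is lost
print(narrow_to_u8(tags))       # the tags survive intact
```

Under these assumed values the buggy path only ever sees the low byte of the first tag followed by its padding bytes, so the OUTPUT tag never reaches the record, which matches the (0, 0, 0) symptom described above.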
b08f0a1 to
16040fe
Compare
4 tasks
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
Follow-up to hw-native-sys#737 addressing post-merge review findings:

1. **Dedup replay fanin per-successor**
   ``dep_gen_replay_emit_deps_json`` now mirrors the runtime's ``PTO2FaninBuilder::append_fanin_or_fail`` semantics: an ``std::unordered_set<uint64_t>`` tracks predecessor task ids seen so far for the current successor, and both STEP 1 (``explicit_deps``) and STEP 3 (creator retention + tensormap lookup) push through a single ``emit_unique`` lambda. Previously an ``explicit_dep`` that the tensormap also surfaced (via ``owner_task_id`` or an overlap hit) emitted two edges, which double-counted ``deps.json`` and made ``swimlane_converter.py`` draw duplicate flow events.

2. **Document the OUTPUT-slot safety contract at the capture site**
   ``dep_gen_replay.cpp`` sets ``tref_buf[i].ptr`` for every captured tensor slot including OUTPUT; the on-disk blob for OUTPUT is zeroed by the AICPU writer. Added an inline comment pointing at ``pto_dep_compute.h``'s per-tag dispatch (which is what makes the never-dereferenced-on-OUTPUT contract hold) so the next reader of the arg_types width-fix area doesn't have to re-derive it.

3. **``make_deps_json_path`` helper**
   Both onboard + sim device_runner used to build ``deps.json`` with ``output_prefix_ + "/deps.json"`` inline, out of step with the ``make_<feature>_path()`` convention shared by PMU and (previously) submit_trace. Added ``make_deps_json_path`` in ``dep_gen_collector.h``; both call sites now go through it, and the helper also handles ``create_directories`` so the path is safe even when the output dir hasn't been touched by anything else yet.

4. **``_task_id(ring, local)`` helper in the validation test**
   The 6-edge expected-set in ``test_dep_gen_capture.py`` was open-coded ``1 << 32`` arithmetic at every call. One helper, layout stated once.

5. **``docs/dep_gen.md``**
   First user-facing doc for the feature. Covers motivation (links hw-native-sys#599), enable flags, ``deps.json`` format, ``deps_to_graph.py`` usage + node shape/color legend (AIC cube box, AIV vector ellipse, mix diamond, alloc dashed note), the ``fanout ⊆ deps`` validation gate, and the architecture touchpoints.

6. **CI smoke step for dep_gen** (``.github/workflows/ci.yml``)
   The default ``pytest tests/st`` invocation does not pass any DFX flag, so the dep_gen capture path never executed in CI. Added a second pytest step in ``st-sim-a2a3`` that re-runs only ``test_dep_gen_capture`` with ``--enable-dep-gen --enable-l2-swimlane``, which forces the full capture → replay → deps.json → fanout ⊆ deps gate path to execute. The same pattern is the model for future PMU / tensor_dump / swimlane smoke steps once those grow dedicated tests.

7. **UT roundtrip for ``enable_dep_gen``** (``tests/ut/py/test_chip_worker.py``)
   Adds the missing nanobind setter / getter / repr roundtrip for the ``enable_dep_gen`` ``CallConfig`` field, alongside the existing round-trips for ``enable_l2_swimlane`` / ``enable_dump_tensor`` / ``enable_pmu``. Catches binding-ABI regressions at the UT layer (e.g. the ``bool`` vs ``int32`` Python-wrapper bug that broke 9 st jobs earlier in the PR series; a 10 s UT would have caught it).

8. **Pytest-friendly test hook via ``test_run`` override** instead of a framework-level ``_post_validate`` callback.
   ``test_dep_gen_capture`` overrides the inherited ``SceneTestCase.test_run`` to call ``super()``, then walks its cases and asserts the dep_gen artifacts. Keeps the framework unchanged (no implicit ``hasattr`` check on every SceneTestCase), localises the test-specific behavior, and the ``--enable-dep-gen`` flag gate prevents the assertion from firing in default invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
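Items 1 and 4 can be sketched together in Python. This is an illustration under stated assumptions, not the C++ replay code: ``task_id`` assumes the ring index occupies the upper 32 bits and the local id the lower 32 bits (inferred from the ``1 << 32`` arithmetic mentioned above), and ``collect_fanin`` is a hypothetical stand-in for the replay's per-successor loop.

```python
def task_id(ring: int, local: int) -> int:
    """Assumed layout: ring in the upper 32 bits, local id in the lower 32."""
    return (ring << 32) | (local & 0xFFFFFFFF)


def collect_fanin(explicit_deps, tensormap_hits):
    """Return the predecessors of one successor, each emitted at most once.

    Both sources go through the same emit_unique helper, so a predecessor
    listed explicitly and also found by the tensormap yields a single edge.
    """
    seen = set()
    preds = []

    def emit_unique(pred):
        if pred not in seen:
            seen.add(pred)
            preds.append(pred)

    for pred in explicit_deps:    # STEP 1: explicit dependencies
        emit_unique(pred)
    for pred in tensormap_hits:   # STEP 3: tensormap lookup results
        emit_unique(pred)
    return preds


t0, t1 = task_id(0, 0), task_id(0, 1)
print(collect_fanin([t0], [t0, t1]))  # t0 appears once, then t1
```

Keeping the seen-set scoped to one successor means the same predecessor can still appear as an edge source for other successors, which is what an edge list needs.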
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
Follow-up to hw-native-sys#737 addressing post-merge review findings. Items 1 to 8 are identical to the commit message above; this revision adds:

9. **Unified DFX smoke folder** (``tests/st/a2a3/tensormap_and_ringbuffer/dfx/``)
   Moved ``dep_gen_capture/`` → ``dfx/`` and added three sibling smokes for the other 3 DFX features so all four profilers have a CI gate:
   - ``test_dep_gen.py`` (renamed from test_dep_gen_capture.py, class ``TestDepGen``)
   - ``test_l2_swimlane.py``: asserts ``l2_perf_records.json`` shape
   - ``test_pmu.py``: asserts ``pmu.csv`` header + row count
   - ``test_tensor_dump.py``: asserts ``tensor_dump/`` manifest + bin

   Each test uses ``vector_example`` as a fixed 5-task workload; each opens only its own flag and asserts only its own artifact, so default ``pytest tests/st`` invocations (no flag) just run the case for golden compare without DFX paths. ci.yml gets four matching smoke steps under ``st-sim-a2a3``, each a fresh pytest invocation: running ≥2 DFX tests that open ``--enable-l2-swimlane`` in one session trips ``L2PerfCollector::initialize``'s dup-init guard because the L2 worker pool reuses the DeviceRunner across tests. The runtime-side fix (make init re-init-safe across runs) is a separate concern.

10. **Sibling helper consistency** (``make_pmu_csv_path``)
    Converted ``make_pmu_csv_path`` in ``pmu_collector.h`` to use ``std::filesystem::path`` operator/ instead of bare string concat, matching the new ``make_deps_json_path`` convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stacked on #736 (dep_gen capture). Three closely-coupled changes that turn the captured `submit_trace.bin` into a user-visible artifact:

1. Host replay (`src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}`): reads `submit_trace.bin`, runs each record through a host-resident `PTO2TensorMap` using the same `compute_task_fanin` / `register_task_outputs` primitives the device orchestrator uses, and emits `deps.json` (edge list keyed by raw `PTO2TaskId`). Wired into `device_runner.cpp` (onboard + sim) post-reconcile; skipped on dropped records to avoid producing a partial graph users could mistake for complete.
2. swimlane_converter integration (`simpler_setup/tools/swimlane_converter.py`): when `deps.json` sits next to `l2_perf_records.json`, prefer those edges over `task["fanout"]`. Each flow event is checked for a happens-before violation (`pred.end_time > succ.start_time`) and emitted under a distinct `hb_violation` name so Perfetto colors it apart from clean dependencies.
3. Validation gate (`tests/st/a2a3/.../test_dep_gen_capture.py`): asserts `deps.json` exists with the 6 expected edges from `example_orchestration.cpp`, and when `l2_perf_records.json` is also present (`--enable-l2-swimlane` on), every fanout edge also appears in `deps.json`. The standalone main auto-adds `--enable-l2-swimlane` so a single command exercises the full gate.

Also fixes a width-mismatch bug in #736's capture path: `TensorArgType` is `int32_t` but the AICPU writer reinterpreted the tag array as `uint8_t[]`. On little-endian this silently kept only every fourth tag byte, turning `(INPUT, INPUT, OUTPUT)` into `(0, 0, 0)` and synthesizing a phantom self-edge `t0→t0` in replay. Fixed at the call site by narrowing each tag to `uint8_t` explicitly before passing it to the writer, which keeps the on-disk `uint8_t[16]` arg_types layout intact.

Note on stacking: this PR includes the #736 commit at its base. It will rebase to a single-commit PR once #736 merges.

To share `PTO2TensorMap` between aicpu and host targets, `pto_tensormap.cpp` moves from `runtime/` to `runtime/shared/` and its stale `#include "pto_orchestrator.h"` is dropped (no orch member is used). aicpu still picks it up via the recursive glob.

Test plan

Fixes #599 ([Bug] Swimlane profiling drops fanout edges for producers completing before consumer wiring): swimlane fanout drops dependency edges for fast producers.
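For readers who want the gate and the happens-before check in one place, a rough Python sketch follows. Only `task["fanout"]`, `end_time`, `start_time`, and the `hb_violation` name come from the description above; the `task_id` key, the `"dependency"` name for clean edges, the function names, and the in-memory edge representation are assumptions, and the real `deps.json` schema is not reproduced here.

```python
def fanout_edges(tasks):
    """Collect (pred, succ) pairs from each task's fanout list."""
    return {(t["task_id"], succ) for t in tasks for succ in t.get("fanout", [])}


def edges_missing_from_deps(tasks, deps_edges):
    """Gate: fanout must be a subset of deps, so this should be empty."""
    return fanout_edges(tasks) - set(deps_edges)


def flow_event_name(pred, succ):
    """Flag an edge whose predecessor finished after its successor started."""
    if pred["end_time"] > succ["start_time"]:
        return "hb_violation"
    return "dependency"  # hypothetical name for a clean edge


tasks = [
    {"task_id": 0, "fanout": [1], "start_time": 0, "end_time": 10},
    {"task_id": 1, "fanout": [], "start_time": 12, "end_time": 20},
    {"task_id": 2, "fanout": [], "start_time": 5, "end_time": 9},
]
deps = {(0, 1), (0, 2)}  # replayed edges; (0, 2) is absent from fanout

print(edges_missing_from_deps(tasks, deps))   # empty set: the gate passes
print(flow_event_name(tasks[0], tasks[1]))    # clean edge
print(flow_event_name(tasks[0], tasks[2]))    # flagged edge
```

The sample data mirrors the bug being fixed: the replayed edge set can contain edges that the runtime fanout never recorded, so the gate checks containment in one direction only.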