Add: dep_gen replay → deps.json + swimlane integration + fanout ⊆ deps gate #737

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/dep-gen-replay-and-validate-a2a3 on May 11, 2026

ChaoWao (Collaborator) commented May 11, 2026

Summary

Stacked on #736 (dep_gen capture). Three closely-coupled changes that turn the captured submit_trace.bin into a user-visible artifact:

  • PR3: Host replay (src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}). Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, and emits deps.json (edge list keyed by raw PTO2TaskId). Wired into device_runner.cpp (onboard + sim) post-reconcile; replay is skipped when the trace has dropped records, to avoid producing a partial graph that users could mistake for a complete one.
  • PR4: swimlane_converter integration (simpler_setup/tools/swimlane_converter.py). When deps.json sits next to l2_perf_records.json, the converter prefers those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time); violations are emitted under a distinct hb_violation name so Perfetto colors them apart from clean dependencies.
  • PR5: Validation gate (tests/st/a2a3/.../test_dep_gen_capture.py). Asserts that deps.json exists with the 6 expected edges from example_orchestration.cpp and, when l2_perf_records.json is also present (--enable-l2-swimlane on), that every fanout edge also appears in deps.json (fanout ⊆ deps). The standalone main auto-adds --enable-l2-swimlane so a single command exercises the full gate.
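
The happens-before check from the PR4 bullet can be sketched as follows. This is a minimal illustration, not the code in swimlane_converter.py: the `Task` record, the `classify_edges` function, and the `"dependency"` name for clean edges are hypothetical. Only the `pred.end_time > succ.start_time` condition and the `hb_violation` name come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical timing record; the field names are assumptions."""
    start_time: float
    end_time: float


def classify_edges(tasks, edges):
    """Label each (pred, succ) edge as clean or as a happens-before violation.

    An edge violates happens-before when the predecessor finishes after
    the successor has already started.
    """
    labelled = []
    for pred_id, succ_id in edges:
        pred, succ = tasks[pred_id], tasks[succ_id]
        name = "hb_violation" if pred.end_time > succ.start_time else "dependency"
        labelled.append((pred_id, succ_id, name))
    return labelled


# Task 2 starts at t=4.0 while its predecessor (task 0) runs until t=5.0.
tasks = {0: Task(0.0, 5.0), 1: Task(6.0, 9.0), 2: Task(4.0, 8.0)}
labelled = classify_edges(tasks, [(0, 1), (0, 2)])
print(labelled)
```

Giving violations their own name rather than dropping them keeps the suspicious edge visible in the trace, which matches the stated intent of coloring it apart from clean dependencies.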

Also fixes a width-mismatch bug in #736's capture path: TensorArgType is int32_t, but the AICPU writer reinterpreted the tag array as uint8_t[]. On little-endian, only every fourth byte of that view carries a tag value, so reading consecutive bytes turned (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesized a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.
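
The width mismatch is easy to reproduce outside the runtime. The sketch below packs three tags as little-endian 32-bit integers and then reads them back one byte per tag, as the buggy writer did. The numeric values `INPUT = 0` and `OUTPUT = 1` are assumptions for illustration; the real TensorArgType enumerators are not given in this description.

```python
import struct

# Assumed tag values, for illustration only.
INPUT, OUTPUT = 0, 1

tags = [INPUT, INPUT, OUTPUT]

# The tag array as it sits in memory: one little-endian int32_t per tag.
raw = struct.pack("<3i", *tags)

# Buggy view: treat the int32_t array as uint8_t[] and read one byte per tag.
# The first three bytes all belong to the first tag.
buggy = list(raw[: len(tags)])

# Fix: narrow each tag to a single byte before handing it to the writer.
fixed = list(bytes(tag & 0xFF for tag in tags))

print("buggy:", buggy)
print("fixed:", fixed)
```

With these assumed values the buggy view loses the OUTPUT tag entirely, while the narrowed array preserves one byte per tag.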

Note on stacking: this PR includes the #736 commit at its base. It will rebase to a single-commit PR once #736 merges.

To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ and its stale #include "pto_orchestrator.h" is dropped (no orch member is used). aicpu still picks it up via the recursive glob.

Test plan

  • `python simpler_setup/build_runtimes.py --platform a2a3sim` — clean
  • `pre-commit run --files ` — clang-format, clang-tidy, cpplint, ruff, pyright all green
  • `python tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py --enable-dep-gen -p a2a3sim` — PASSED + `[dep_gen_capture] post-run verification PASSED` (deps.json has the expected 6 edges, fanout ⊆ deps gate clean for vector_example)
  • CI: full a2a3 + a2a3sim ut/st matrix

Fixes #599 ([Bug] Swimlane profiling drops fanout edges for producers completing before consumer wiring): swimlane fanout drops dependency edges for fast producers.
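
The fanout ⊆ deps gate exercised in the test plan reduces to a set difference. The sketch below is a hedged illustration, not the code in test_dep_gen_capture.py: the record layout (dicts with `task_id` and `fanout` keys) and both function names are assumptions, and small integers stand in for raw PTO2TaskId values.

```python
def fanout_edges(tasks):
    """Flatten per-task fanout lists into a set of (pred, succ) pairs.

    `tasks` is assumed to be a list of dicts with "task_id" and "fanout"
    keys; the real l2_perf_records.json layout may differ.
    """
    return {(t["task_id"], succ) for t in tasks for succ in t.get("fanout", [])}


def missing_from_deps(tasks, deps_edges):
    """Return fanout edges absent from deps (empty when the gate passes)."""
    return fanout_edges(tasks) - set(deps_edges)


# Six edges of a 5-task graph: t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4.
deps_edges = {(0, 1), (0, 2), (1, 3), (2, 3), (0, 4), (3, 4)}

# Fanout that lost the t0→t4 edge still passes: it is a subset of deps.
lossy = [
    {"task_id": 0, "fanout": [1, 2]},
    {"task_id": 1, "fanout": [3]},
    {"task_id": 2, "fanout": [3]},
    {"task_id": 3, "fanout": [4]},
    {"task_id": 4, "fanout": []},
]
print(missing_from_deps(lossy, deps_edges))

# An edge that replay never produced is reported.
bogus = lossy + [{"task_id": 4, "fanout": [0]}]
print(missing_from_deps(bogus, deps_edges))
```

The gate is deliberately one-directional: fanout may miss edges (the bug tracked in #599), but it must never contain an edge that replay did not derive.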

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces dep_gen, a diagnostic feature for capturing and replaying task submission traces to reconstruct complete dependency graphs. The implementation includes shared-memory structures, host-side collection, and offline replay logic. Review feedback focused on critical concurrency issues, recommending the use of memory barriers and atomic operations to ensure data visibility between the device and host. Suggestions were also made to improve safety by clamping input values from shared memory and to maintain consistency between runtime and platform constants using static assertions.

Comment thread src/a2a3/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment thread src/a2a3/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment thread src/a2a3/platform/src/host/dep_gen_collector.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
@ChaoWao ChaoWao force-pushed the feat/dep-gen-replay-and-validate-a2a3 branch 5 times, most recently from 75de404 to b08f0a1 Compare May 11, 2026 11:29
…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that
together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3) — new dep_gen_replay.{h,cpp} under
   runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each
   record through a host-resident PTO2TensorMap using the same
   compute_task_fanin / register_task_outputs primitives the device
   orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).

   - device_runner (onboard + sim) calls the replay post-reconcile when
     the dep_gen trace is clean. Skips on drops to avoid producing a
     partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets,
     pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale
     include of pto_orchestrator.h is dropped — no orch member is used).
     aicpu still picks it up via the recursive glob; host now does too.
     tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak,
     visibility("hidden") fallback stub for dep_gen_replay_to_deps_json
     so host_runtime.so still links cleanly when the host_build_graph
     runtime (which has no replay implementation) is loaded. The strong
     symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins
     within its own .so; host_build_graph falls through to the stub.
     Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max
     observed local_id up to next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4) — when deps.json sits next
   to l2_perf_records.json, prefer those edges over task["fanout"]. Each
   flow event is checked for a happens-before violation
   (pred.end_time > succ.start_time) and emitted under a distinct
   "hb_violation" name so Perfetto colors it apart from clean
   dependencies. Verbose output reports the chosen edge source and HB
   violation count.

3. Validation gate (PR5) — test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from
     example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on),
     every fanout edge must also appear in deps.json (fanout ⊆ deps). The
     standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen
     is passed so a single command runs the full gate.

4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)** — turns the
   replay product into a self-contained pan/zoom HTML page (Graphviz SVG +
   80-line vanilla-JS shim; no CDN, so it works offline). Distinct shape + color
   per node type so AIC (cube, blue box), AIV (vector, orange ellipse),
   mix (green diamond — single submit_task spanning both core types via
   MixedKernels), and alloc (gray dashed note — tasks from `alloc_tensors`
   that produced output tensors but never dispatched a kernel) stay readable
   even without color. Auto-loads the colocated `l2_perf_records.json` and
   `name_map_*.json` sidecars for label enrichment; isolated tasks (no
   inbound/outbound edges) still show up. `--engine sfdp` for graphs past
   ~500 nodes.

5. **In-memory capture → replay (drops `submit_trace.bin`)** — the host
   collector now accumulates `DepGenRecord` entries directly in a
   `std::vector<DepGenRecord>` instead of streaming them to
   `submit_trace.bin` on disk. The replay function takes a pointer +
   count from `DepGenCollector::records()` and skips the file round-trip
   entirely. deps.json is the only on-disk dep_gen artifact now; the
   `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer
   takes a path, and the replay C ABI is now
   `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`. Also
   clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture
   call-site (defensive — the Arg builder already caps at MAX_TENSOR_ARGS
   but this prevents a future builder bypass from overflowing the stack
   buffers). The weak fallback in non-dep_gen runtimes (host_build_graph)
   drops from LOG_WARN to LOG_DEBUG since that path is unreachable for
   end users — it exists only to keep the .so loadable.
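
The per-ring window auto-sizing mentioned under item 1 can be sketched as below. This is an illustration under stated assumptions, not the replay code: the function names are hypothetical, the window is assumed to need room for local ids 0 through the observed maximum (hence the `+ 1`), and slots are assumed to be indexed by masking the local id with `size - 1`.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (and at least 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def window_sizes(task_ids):
    """Map ring -> window size from observed (ring, local_id) pairs.

    Each window is sized to the next power of two at or above
    max_local_id + 1, so every observed local id gets its own slot.
    """
    max_local = {}
    for ring, local_id in task_ids:
        max_local[ring] = max(max_local.get(ring, 0), local_id)
    return {ring: next_pow2(m + 1) for ring, m in max_local.items()}


def slot(local_id: int, size: int) -> int:
    """Assumed slot indexing: mask the local id by a power-of-two size."""
    return local_id & (size - 1)


observed = [(0, 0), (0, 4), (1, 7), (1, 2), (2, 8)]
sizes = window_sizes(observed)
print(sizes)

# With the window sized this way, ids 0..4 on ring 0 map to distinct slots.
ring0_slots = [slot(i, sizes[0]) for i in range(5)]
print(ring0_slots)
```

Sizing from the trace rather than from a fixed constant means a replayed graph never wraps around inside its own window, which is the aliasing the commit message rules out.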

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is
int32_t but the AICPU writer reinterprets the tag array as uint8_t[].
On little-endian this silently kept only every fourth tag byte, turning
(INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom
self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag
to uint8_t explicitly before passing it to the writer — keeps the
on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes hw-native-sys#599.
@ChaoWao ChaoWao force-pushed the feat/dep-gen-replay-and-validate-a2a3 branch from b08f0a1 to 16040fe Compare May 11, 2026 11:51
@ChaoWao ChaoWao merged commit 446a016 into hw-native-sys:main May 11, 2026
14 checks passed
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
Follow-up to hw-native-sys#737 addressing post-merge review findings:

1. **Dedup replay fanin per-successor**
   ``dep_gen_replay_emit_deps_json`` now mirrors the runtime's
   ``PTO2FaninBuilder::append_fanin_or_fail`` semantics: an
   ``std::unordered_set<uint64_t>`` tracks predecessor task ids seen so
   far for the current successor, and both STEP 1 (``explicit_deps``)
   and STEP 3 (creator retention + tensormap lookup) push through a
   single ``emit_unique`` lambda. Previously an ``explicit_dep`` that
   the tensormap also surfaced (via ``owner_task_id`` or an overlap
   hit) emitted two edges, which double-counted ``deps.json`` and made
   ``swimlane_converter.py`` draw duplicate flow events.

2. **Document the OUTPUT-slot safety contract at the capture site**
   ``dep_gen_replay.cpp`` sets ``tref_buf[i].ptr`` for every captured
   tensor slot including OUTPUT — the on-disk blob for OUTPUT is zeroed
   by the AICPU writer. Added an inline comment pointing at
   ``pto_dep_compute.h``'s per-tag dispatch (which is what makes the
   never-dereferenced-on-OUTPUT contract hold) so the next reader of
   the arg_types width-fix area doesn't have to re-derive it.

3. **``make_deps_json_path`` helper**
   Both onboard + sim device_runner used to build ``deps.json`` with
   ``output_prefix_ + "/deps.json"`` inline — out of step with the
   ``make_<feature>_path()`` convention shared by PMU and (previously)
   submit_trace. Added ``make_deps_json_path`` in
   ``dep_gen_collector.h``; both call sites now go through it, and
   the helper also handles ``create_directories`` so the path is safe
   even when the output dir hasn't been touched by anything else yet.

4. **``_task_id(ring, local)`` helper in the validation test**
   The 6-edge expected-set in ``test_dep_gen_capture.py`` was open-
   coded ``1 << 32`` arithmetic at every call. One helper, layout
   stated once.

5. **``docs/dep_gen.md``**
   First user-facing doc for the feature. Covers motivation (links
   hw-native-sys#599), enable flags, ``deps.json`` format, ``deps_to_graph.py``
   usage + node shape/color legend (AIC cube box, AIV vector ellipse,
   mix diamond, alloc dashed note), the ``fanout ⊆ deps`` validation
   gate, and the architecture touchpoints.

6. **CI smoke step for dep_gen** (``.github/workflows/ci.yml``)
   The default ``pytest tests/st`` invocation does not pass any DFX flag, so
   the dep_gen capture path never executed in CI. Added a second pytest step
   in ``st-sim-a2a3`` that re-runs only ``test_dep_gen_capture`` with
   ``--enable-dep-gen --enable-l2-swimlane``, which forces the full
   capture → replay → deps.json → fanout ⊆ deps gate path to execute.
   Same pattern is the model for future PMU / tensor_dump / swimlane smoke
   steps once those grow dedicated tests.

7. **UT roundtrip for ``enable_dep_gen``** (``tests/ut/py/test_chip_worker.py``)
   Adds the missing nanobind setter / getter / repr roundtrip for the
   ``enable_dep_gen`` ``CallConfig`` field, alongside the existing
   round-trips for ``enable_l2_swimlane`` / ``enable_dump_tensor`` /
   ``enable_pmu``. Catches binding-ABI regressions at the UT layer
   (e.g. the ``bool`` vs ``int32`` Python-wrapper bug that broke 9 st
   jobs earlier in the PR series; a 10 s UT run would have caught it).

8. **Pytest-friendly test hook via ``test_run`` override** instead of a
   framework-level ``_post_validate`` callback. ``test_dep_gen_capture``
   overrides the inherited ``SceneTestCase.test_run`` to call ``super()``
   then walk its cases and assert the dep_gen artifacts. Keeps the
   framework unchanged (no implicit ``hasattr`` check on every
   SceneTestCase), localises the test-specific behavior, and the
   ``--enable-dep-gen`` flag gate prevents the assertion from firing in
   default invocations.
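
Items 1 and 4 above lend themselves to a short sketch. It is a Python illustration of the idea, not the C++ replay or the test helper themselves: `replay_fanin` and its parameters are hypothetical, and the id layout (ring in the upper 32 bits, local id in the lower 32) is an assumption inferred from the ``1 << 32`` arithmetic mentioned in item 4.

```python
def task_id(ring: int, local: int) -> int:
    """Pack (ring, local) into one 64-bit id (assumed layout)."""
    return (ring << 32) | (local & 0xFFFFFFFF)


def replay_fanin(succ, explicit_deps, tensormap_hits):
    """Collect unique (pred, succ) edges for one successor.

    Both sources go through a single emit_unique closure, so a
    predecessor reported by both yields exactly one edge.
    """
    seen = set()
    edges = []

    def emit_unique(pred):
        if pred not in seen:
            seen.add(pred)
            edges.append((pred, succ))

    for pred in explicit_deps:    # explicit dependencies first
        emit_unique(pred)
    for pred in tensormap_hits:   # then predecessors found via tensor lookup
        emit_unique(pred)
    return edges


t0, t1, t2 = task_id(0, 0), task_id(0, 1), task_id(0, 2)

# t0 is both an explicit dep and a tensormap hit; it must appear once.
edges = replay_fanin(t2, explicit_deps=[t0], tensormap_hits=[t0, t1])
print(edges)
print(task_id(1, 0) == 1 << 32)
```

Resetting the `seen` set per successor keeps the deduplication local: the same predecessor can still feed many different successors.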

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
Follow-up to hw-native-sys#737 addressing post-merge review findings:

9. **Unified DFX smoke folder** (``tests/st/a2a3/tensormap_and_ringbuffer/dfx/``)
   Moved ``dep_gen_capture/`` → ``dfx/`` and added three sibling smokes for
   the other 3 DFX features so all four profilers have a CI gate:
   - ``test_dep_gen.py`` (renamed from test_dep_gen_capture.py, class
     ``TestDepGen``)
   - ``test_l2_swimlane.py`` — asserts ``l2_perf_records.json`` shape
   - ``test_pmu.py`` — asserts ``pmu.csv`` header + row count
   - ``test_tensor_dump.py`` — asserts ``tensor_dump/`` manifest + bin
   Each test uses ``vector_example`` as a fixed 5-task workload; each opens
   only its own flag and asserts only its own artifact, so default
   ``pytest tests/st`` invocations (no flag) just run the case for golden
   compare without DFX paths. ci.yml gets four matching smoke steps under
   ``st-sim-a2a3``, each a fresh pytest invocation — running ≥2 DFX tests
   that open ``--enable-l2-swimlane`` in one session trips
   ``L2PerfCollector::initialize``'s dup-init guard because the L2 worker
   pool reuses the DeviceRunner across tests. The runtime-side fix (make
   init re-init-safe across runs) is a separate concern.

10. **Sibling helper consistency** (``make_pmu_csv_path``)
    Converted ``make_pmu_csv_path`` in ``pmu_collector.h`` to use
    ``std::filesystem::path`` operator/ instead of bare string concat,
    matching the new ``make_deps_json_path`` convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>