Skip to content

Fix: dep_gen replay fanin dedup + docs#740

Open
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:feat/dep-gen-replay-fixes-and-docs
Open

Fix: dep_gen replay fanin dedup + docs#740
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:feat/dep-gen-replay-fixes-and-docs

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented May 11, 2026

Summary

Post-merge follow-up to #737. Fixes one real correctness bug surfaced in review, plus assorted polish + the first user-facing doc for dep_gen.

  • Replay dedup (must-fix). dep_gen_replay_emit_deps_json was emitting two edges when a single predecessor was reachable via both explicit_deps and the tensormap (creator retention or overlap hit). The runtime's PTO2FaninBuilder::append_fanin_or_fail dedups per-task; replay now mirrors that with a per-successor std::unordered_set<uint64_t> and a single emit_unique lambda funneling both STEP 1 and STEP 3.
  • OUTPUT-slot safety comment. The capture-replay loop sets tref_buf[i].ptr even for OUTPUT slots (whose blob is zeroed on disk). Safety relies on compute_task_fanin / register_task_outputs bailing out per-tag before dereferencing — added a comment pointing at that contract so future arg-types-area edits keep it intact.
  • make_deps_json_path helper. Both device_runner.cpps built deps.json with output_prefix_ + "/deps.json" inline; now they call make_deps_json_path(prefix), matching the existing make_pmu_csv_path() convention and folding in create_directories for safety.
  • _task_id(ring, local) test helper. Replaces open-coded 1 << 32 arithmetic at the call sites.
  • docs/dep_gen.md. First user-facing doc for the feature — links [Bug] Swimlane profiling drops fanout edges for producers completing before consumer wiring #599, covers enable flags, output format, deps_to_graph.py legend, the fanout ⊆ deps validation gate, and the architecture touchpoints. Currently a2a3 only (a5 port planned separately).

Test plan

  • Local a2a3sim build clean
  • pytest tests/st/.../dep_gen_capture --enable-dep-gen --enable-l2-swimlane passes (gate enforced)
  • deps.json confirmed identical to pre-fix for vector_example (no edge had both an explicit_dep and a tensormap path → fix is a strict no-op for current tests; dedup will start mattering on workloads that mix explicit_dep + implicit overlap)
  • pre-commit run green across touched files + new docs

Why these are a follow-up (not in #737)

The dedup is a real correctness bug but only surfaced post-review; #737 had already merged when the dedup oversight was identified. The path helper / test helper / docs are paperwork — bundling them keeps the dep_gen story moving forward without sitting on the merged branch.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces dep_gen, a tool for capturing a complete per-submit dependency graph to resolve race conditions inherent in the existing fanout[] profiling. Key changes include the addition of comprehensive documentation, the implementation of a host-side replay mechanism with per-successor deduplication to prevent redundant edges, and the introduction of a helper function for generating output paths. Feedback identifies an opportunity to improve the robustness of path handling by using std::filesystem::path instead of manual string concatenation.

Comment thread src/a2a3/platform/include/host/dep_gen_collector.h
@ChaoWao ChaoWao force-pushed the feat/dep-gen-replay-fixes-and-docs branch 2 times, most recently from 7a6c910 to bb0770b Compare May 11, 2026 12:34
Follow-up to hw-native-sys#737 addressing post-merge review findings:

1. **Dedup replay fanin per-successor**
   ``dep_gen_replay_emit_deps_json`` now mirrors the runtime's
   ``PTO2FaninBuilder::append_fanin_or_fail`` semantics: an
   ``std::unordered_set<uint64_t>`` tracks predecessor task ids seen so
   far for the current successor, and both STEP 1 (``explicit_deps``)
   and STEP 3 (creator retention + tensormap lookup) push through a
   single ``emit_unique`` lambda. Previously an ``explicit_dep`` that
   the tensormap also surfaced (via ``owner_task_id`` or an overlap
   hit) emitted two edges, which double-counted ``deps.json`` and made
   ``swimlane_converter.py`` draw duplicate flow events.

2. **Document the OUTPUT-slot safety contract at the capture site**
   ``dep_gen_replay.cpp`` sets ``tref_buf[i].ptr`` for every captured
   tensor slot including OUTPUT — the on-disk blob for OUTPUT is zeroed
   by the AICPU writer. Added an inline comment pointing at
   ``pto_dep_compute.h``'s per-tag dispatch (which is what makes the
   never-dereferenced-on-OUTPUT contract hold) so the next reader of
   the arg_types width-fix area doesn't have to re-derive it.

3. **``make_deps_json_path`` helper**
   Both onboard + sim device_runner used to build ``deps.json`` with
   ``output_prefix_ + "/deps.json"`` inline — out of step with the
   ``make_<feature>_path()`` convention shared by PMU and (previously)
   submit_trace. Added ``make_deps_json_path`` in
   ``dep_gen_collector.h``; both call sites now go through it, and
   the helper also handles ``create_directories`` so the path is safe
   even when the output dir hasn't been touched by anything else yet.

4. **``_task_id(ring, local)`` helper in the validation test**
   The 6-edge expected-set in ``test_dep_gen_capture.py`` was open-
   coded ``1 << 32`` arithmetic at every call. One helper, layout
   stated once.

5. **``docs/dep_gen.md``**
   First user-facing doc for the feature. Covers motivation (links
   hw-native-sys#599), enable flags, ``deps.json`` format, ``deps_to_graph.py``
   usage + node shape/color legend (AIC cube box, AIV vector ellipse,
   mix diamond, alloc dashed note), the ``fanout ⊆ deps`` validation
   gate, and the architecture touchpoints.
6. **CI smoke step for dep_gen** (``.github/workflows/ci.yml``)
   The default ``pytest tests/st`` invocation does not pass any DFX flag, so
   the dep_gen capture path never executed in CI. Added a second pytest step
   in ``st-sim-a2a3`` that re-runs only ``test_dep_gen_capture`` with
   ``--enable-dep-gen --enable-l2-swimlane``, which forces the full
   capture → replay → deps.json → fanout ⊆ deps gate path to execute.
   Same pattern is the model for future PMU / tensor_dump / swimlane smoke
   steps once those grow dedicated tests.

7. **UT roundtrip for ``enable_dep_gen``** (``tests/ut/py/test_chip_worker.py``)
   Adds the missing nanobind setter / getter / repr roundtrip for the
   ``enable_dep_gen`` ``CallConfig`` field, alongside the existing
   round-trips for ``enable_l2_swimlane`` / ``enable_dump_tensor`` /
   ``enable_pmu``. Catches binding-ABI regressions at the UT layer
   (e.g. the ``bool`` vs ``int32`` Python-wrapper bug that broke 9 st
   jobs earlier in the PR series — 10 s ut would have caught it).

8. **Pytest-friendly test hook via ``test_run`` override** instead of a
   framework-level ``_post_validate`` callback. ``test_dep_gen_capture``
   overrides the inherited ``SceneTestCase.test_run`` to call ``super()``
   then walk its cases and assert the dep_gen artifacts. Keeps the
   framework unchanged (no implicit ``hasattr`` check on every
   SceneTestCase), localises the test-specific behavior, and the
   ``--enable-dep-gen`` flag gate prevents the assertion from firing in
   default invocations.
9. **Unified DFX smoke folder** (``tests/st/a2a3/tensormap_and_ringbuffer/dfx/``)
   Moved ``dep_gen_capture/`` → ``dfx/`` and added three sibling smokes for
   the other 3 DFX features so all four profilers have a CI gate:
   - ``test_dep_gen.py`` (renamed from test_dep_gen_capture.py, class
     ``TestDepGen``)
   - ``test_l2_swimlane.py`` — asserts ``l2_perf_records.json`` shape
   - ``test_pmu.py`` — asserts ``pmu.csv`` header + row count
   - ``test_tensor_dump.py`` — asserts ``tensor_dump/`` manifest + bin
   Each test uses ``vector_example`` as a fixed 5-task workload; each opens
   only its own flag and asserts only its own artifact, so default
   ``pytest tests/st`` invocations (no flag) just run the case for golden
   compare without DFX paths. ci.yml gets four matching smoke steps under
   ``st-sim-a2a3``, each a fresh pytest invocation — running ≥2 DFX tests
   that open ``--enable-l2-swimlane`` in one session trips
   ``L2PerfCollector::initialize``'s dup-init guard because the L2 worker
   pool reuses the DeviceRunner across tests. The runtime-side fix (make
   init re-init-safe across runs) is a separate concern.

10. **Sibling helper consistency** (``make_pmu_csv_path``)
    Converted ``make_pmu_csv_path`` in ``pmu_collector.h`` to use
    ``std::filesystem::path`` operator/ instead of bare string concat,
    matching the new ``make_deps_json_path`` convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the feat/dep-gen-replay-fixes-and-docs branch from bb0770b to 40d0f02 Compare May 11, 2026 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant