Fix: dep_gen replay fanin dedup + docs#740
Open
ChaoWao wants to merge 1 commit into
Open
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces dep_gen, a tool for capturing a complete per-submit dependency graph to resolve race conditions inherent in the existing fanout[] profiling. Key changes include the addition of comprehensive documentation, the implementation of a host-side replay mechanism with per-successor deduplication to prevent redundant edges, and the introduction of a helper function for generating output paths. Feedback identifies an opportunity to improve the robustness of path handling by using std::filesystem::path instead of manual string concatenation.
7a6c910 to
bb0770b
Compare
Follow-up to hw-native-sys#737 addressing post-merge review findings: 1. **Dedup replay fanin per-successor** ``dep_gen_replay_emit_deps_json`` now mirrors the runtime's ``PTO2FaninBuilder::append_fanin_or_fail`` semantics: an ``std::unordered_set<uint64_t>`` tracks predecessor task ids seen so far for the current successor, and both STEP 1 (``explicit_deps``) and STEP 3 (creator retention + tensormap lookup) push through a single ``emit_unique`` lambda. Previously an ``explicit_dep`` that the tensormap also surfaced (via ``owner_task_id`` or an overlap hit) emitted two edges, which double-counted ``deps.json`` and made ``swimlane_converter.py`` draw duplicate flow events. 2. **Document the OUTPUT-slot safety contract at the capture site** ``dep_gen_replay.cpp`` sets ``tref_buf[i].ptr`` for every captured tensor slot including OUTPUT — the on-disk blob for OUTPUT is zeroed by the AICPU writer. Added an inline comment pointing at ``pto_dep_compute.h``'s per-tag dispatch (which is what makes the never-dereferenced-on-OUTPUT contract hold) so the next reader of the arg_types width-fix area doesn't have to re-derive it. 3. **``make_deps_json_path`` helper** Both onboard + sim device_runner used to build ``deps.json`` with ``output_prefix_ + "/deps.json"`` inline — out of step with the ``make_<feature>_path()`` convention shared by PMU and (previously) submit_trace. Added ``make_deps_json_path`` in ``dep_gen_collector.h``; both call sites now go through it, and the helper also handles ``create_directories`` so the path is safe even when the output dir hasn't been touched by anything else yet. 4. **``_task_id(ring, local)`` helper in the validation test** The 6-edge expected-set in ``test_dep_gen_capture.py`` was open- coded ``1 << 32`` arithmetic at every call. One helper, layout stated once. 5. **``docs/dep_gen.md``** First user-facing doc for the feature. Covers motivation (links hw-native-sys#599), enable flags, ``deps.json`` format, ``deps_to_graph.py`` usage + node shape/color legend (AIC cube box, AIV vector ellipse, mix diamond, alloc dashed note), the ``fanout ⊆ deps`` validation gate, and the architecture touchpoints. 6. **CI smoke step for dep_gen** (``.github/workflows/ci.yml``) The default ``pytest tests/st`` invocation does not pass any DFX flag, so the dep_gen capture path never executed in CI. Added a second pytest step in ``st-sim-a2a3`` that re-runs only ``test_dep_gen_capture`` with ``--enable-dep-gen --enable-l2-swimlane``, which forces the full capture → replay → deps.json → fanout ⊆ deps gate path to execute. Same pattern is the model for future PMU / tensor_dump / swimlane smoke steps once those grow dedicated tests. 7. **UT roundtrip for ``enable_dep_gen``** (``tests/ut/py/test_chip_worker.py``) Adds the missing nanobind setter / getter / repr roundtrip for the ``enable_dep_gen`` ``CallConfig`` field, alongside the existing round-trips for ``enable_l2_swimlane`` / ``enable_dump_tensor`` / ``enable_pmu``. Catches binding-ABI regressions at the UT layer (e.g. the ``bool`` vs ``int32`` Python-wrapper bug that broke 9 st jobs earlier in the PR series — 10 s ut would have caught it). 8. **Pytest-friendly test hook via ``test_run`` override** instead of a framework-level ``_post_validate`` callback. ``test_dep_gen_capture`` overrides the inherited ``SceneTestCase.test_run`` to call ``super()`` then walk its cases and assert the dep_gen artifacts. Keeps the framework unchanged (no implicit ``hasattr`` check on every SceneTestCase), localises the test-specific behavior, and the ``--enable-dep-gen`` flag gate prevents the assertion from firing in default invocations. 9. **Unified DFX smoke folder** (``tests/st/a2a3/tensormap_and_ringbuffer/dfx/``) Moved ``dep_gen_capture/`` → ``dfx/`` and added three sibling smokes for the other 3 DFX features so all four profilers have a CI gate: - ``test_dep_gen.py`` (renamed from test_dep_gen_capture.py, class ``TestDepGen``) - ``test_l2_swimlane.py`` — asserts ``l2_perf_records.json`` shape - ``test_pmu.py`` — asserts ``pmu.csv`` header + row count - ``test_tensor_dump.py`` — asserts ``tensor_dump/`` manifest + bin Each test uses ``vector_example`` as a fixed 5-task workload; each opens only its own flag and asserts only its own artifact, so default ``pytest tests/st`` invocations (no flag) just run the case for golden compare without DFX paths. ci.yml gets four matching smoke steps under ``st-sim-a2a3``, each a fresh pytest invocation — running ≥2 DFX tests that open ``--enable-l2-swimlane`` in one session trips ``L2PerfCollector::initialize``'s dup-init guard because the L2 worker pool reuses the DeviceRunner across tests. The runtime-side fix (make init re-init-safe across runs) is a separate concern. 10. **Sibling helper consistency** (``make_pmu_csv_path``) Converted ``make_pmu_csv_path`` in ``pmu_collector.h`` to use ``std::filesystem::path`` operator/ instead of bare string concat, matching the new ``make_deps_json_path`` convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bb0770b to
40d0f02
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Post-merge follow-up to #737. Fixes one real correctness bug surfaced in review, plus assorted polish + the first user-facing doc for
dep_gen.dep_gen_replay_emit_deps_jsonwas emitting two edges when a single predecessor was reachable via bothexplicit_depsand the tensormap (creator retention or overlap hit). The runtime'sPTO2FaninBuilder::append_fanin_or_faildedups per-task; replay now mirrors that with a per-successorstd::unordered_set<uint64_t>and a singleemit_uniquelambda funneling both STEP 1 and STEP 3.tref_buf[i].ptreven for OUTPUT slots (whose blob is zeroed on disk). Safety relies oncompute_task_fanin/register_task_outputsbailing out per-tag before dereferencing — added a comment pointing at that contract so future arg-types-area edits keep it intact.make_deps_json_pathhelper. Bothdevice_runner.cpps builtdeps.jsonwithoutput_prefix_ + "/deps.json"inline; now they callmake_deps_json_path(prefix), matching the existingmake_pmu_csv_path()convention and folding increate_directoriesfor safety._task_id(ring, local)test helper. Replaces open-coded1 << 32arithmetic at the call sites.docs/dep_gen.md. First user-facing doc for the feature — links [Bug] Swimlane profiling drops fanout edges for producers completing before consumer wiring #599, covers enable flags, output format,deps_to_graph.pylegend, thefanout ⊆ depsvalidation gate, and the architecture touchpoints. Currently a2a3 only (a5 port planned separately).Test plan
pytest tests/st/.../dep_gen_capture --enable-dep-gen --enable-l2-swimlanepasses (gate enforced)deps.jsonconfirmed identical to pre-fix for vector_example (no edge had both an explicit_dep and a tensormap path → fix is a strict no-op for current tests; dedup will start mattering on workloads that mix explicit_dep + implicit overlap)pre-commit rungreen across touched files + new docsWhy these are a follow-up (not in #737)
The dedup is a real correctness bug but only surfaced post-review; #737 had already merged when the dedup oversight was identified. The path helper / test helper / docs are paperwork — bundling them keeps the dep_gen story moving forward without sitting on the merged branch.
🤖 Generated with Claude Code