Add: dep_gen capture (SubmitTrace) for a2a3 tensormap_and_ringbuffer #736

Merged: ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/dep-gen-capture-a2a3 on May 11, 2026

@ChaoWao (Collaborator) commented May 11, 2026

Summary

Capture every Orchestrator::submit_task call into a streaming
SubmitTrace ring on the a2a3 tensormap_and_ringbuffer runtime, drained
by a host collector to submit_trace.bin. Phase 1 of replacing #500 — the
host-side offline replay that reconstructs deps.json from these records
ships in a follow-up PR; this PR is intentionally scoped to the capture
path so it can land independently and be reviewed in isolation.

Motivation: today's swimlane L2PerfRecord::fanout[] is filled at the
producer's completion-record commit moment, so a fast producer that
finishes before a later consumer is submitted has its record sealed
without the new edge — those dep edges are silently lost. Capturing the
orch's submit-time inputs and replaying offline reconstructs the complete
logical graph because the replay can run with no eviction.
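
The lost-edge race in the motivation can be illustrated with a toy model. This is only a sketch: the event tuples, task names, and both helper functions are assumptions made for illustration and do not mirror the runtime's real structures.

```python
# Toy model of the race described above. The event tuples and task names
# are illustrative assumptions, not the runtime's real data structures.

def edges_with_sealed_fanout(events):
    """Edges recorded when a producer's fanout list freezes at completion."""
    sealed = set()   # producers whose completion record is already committed
    edges = set()
    for kind, task, producers in events:
        if kind == "complete":
            sealed.add(task)
        else:  # "submit"
            # A sealed record can no longer receive a new consumer edge.
            edges.update((p, task) for p in producers if p not in sealed)
    return edges


def edges_from_replay(events):
    """Edges recovered by replaying only the submit-time inputs, in order."""
    return {(p, task) for kind, task, producers in events
            if kind == "submit" for p in producers}


events = [
    ("submit", "t0", []),
    ("complete", "t0", []),    # fast producer finishes before its consumer exists
    ("submit", "t1", ["t0"]),  # consumer submitted after the seal
]
live_edges = edges_with_sealed_fanout(events)
replayed_edges = edges_from_replay(events)
print(live_edges)      # set()
print(replayed_edges)  # {('t0', 't1')}
```

The replay sees the t0→t1 edge because it depends only on submit order, not on when t0 happened to finish.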

What lands in this PR

  • Device structs
    src/a2a3/platform/include/common/dep_gen.h:
    DepGenRecord (2240 B per submit: task_id, flags, arg_types[16],
    explicit_deps[16], tensors[16][128] as opaque blobs). Plus the
    PMU/L2Perf-style SPSC buffer family (FreeQueue / BufferState /
    ReadyQueueEntry / DataHeader). Single-instance (orch is one AICPU
    thread).
  • AICPU writer
    src/a2a3/platform/{include/aicpu,src/aicpu}/dep_gen_collector_aicpu.{h,cpp}:
    all-primitive interface (raw uint64 task_id, void* per-Tensor blob,
    uint64* explicit_deps) so the platform header stays runtime-agnostic.
    submit_task entry calls dep_gen_aicpu_record_submit gated on
    is_dep_gen_enabled(). Orch callsite static_asserts
    sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and
    PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks
    let host builds link without the AICPU strong symbols.
  • Host collector
    src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}:
    DepGenModule trait + DepGenCollector : ProfilerBase<...>, so the
    mgmt thread / buffer-pool manager / poll loop come from the unified
    profiling framework (#714, "Refactor(a2a3): decouple profiling from
    runtime, own it in platform"). on_buffer_collected appends DepGenRecord
    values to submit_trace.bin; reconcile_counters cross-checks
    collected + dropped == device_total and returns clean/dirty so the
    future replay step can skip deps.json on incomplete traces.
  • Device runner wiring (sim + onboard): DepGenCollector field,
    set_dep_gen_enabled() setter, init_dep_gen() helper,
    kernel_args.dep_gen_data_base plumbed through to AICPU, RAII cleanup,
    end-of-run flush + reconcile, make_dep_gen_path() helper.
  • AICPU lifecycle: set_platform_dep_gen_base /
    set_dep_gen_enabled / dep_gen_aicpu_set_orch_thread_idx /
    dep_gen_aicpu_init / dep_gen_aicpu_flush /
    dep_gen_aicpu_finalize hooked alongside the existing PMU calls.
  • CallConfig + Python: enable_dep_gen int32 flag, nanobind Python
    property (bool), conftest.py and scene_test.py --enable-dep-gen
    CLI option threaded through run_class_cases / _run_and_validate /
    _build_config. --rounds > 1 disables capture (matches the
    --enable-l2-swimlane pattern).
  • a2a3sim test
    tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py:
    re-uses the vector_example orchestration (5 submit_task calls), runs
    with --enable-dep-gen, then asserts submit_trace.bin size equals
    5 * sizeof(DepGenRecord) and spot-checks the first record's
    tensor_count == 3.
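
The counter cross-check in the host collector bullet can be sketched as follows. This is a minimal Python stand-in, not the C++ implementation. Treating any drop as "dirty" is an assumption inferred from the stated goal of skipping deps.json on incomplete traces.

```python
def reconcile_counters(collected: int, dropped: int, device_total: int) -> bool:
    """Return True for a clean trace, False for a dirty one (sketch only)."""
    if collected + dropped != device_total:
        return False          # some records are unaccounted for
    return dropped == 0       # accounted for, but drops still leave gaps


clean = reconcile_counters(collected=5, dropped=0, device_total=5)
lossy = reconcile_counters(collected=4, dropped=1, device_total=5)
mismatch = reconcile_counters(collected=4, dropped=0, device_total=5)
print(clean, lossy, mismatch)  # True False False
```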

What does NOT land in this PR

These follow in subsequent PRs once the capture path is reviewed.

Testing

  • pre-commit run (clang-format, clang-tidy, cpplint, ruff,
    pyright) all pass
  • a2a3sim spmd_sync_start baseline (no --enable-dep-gen)
    PASSED — wiring does not perturb the default path
  • a2a3sim dep_gen_capture (with --enable-dep-gen) PASSED,
    post-run verification PASSED (submit_trace.bin has exactly
    5 records totalling 11200 bytes)
  • Linux CI green
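
The size check in the capture test comes down to simple arithmetic. The sketch below rebuilds it from the sizes quoted in this PR. The `record_count` helper and the temporary file are hypothetical, and the layout of the bytes not covered by task_id, explicit_deps, and tensors is left unspecified because the PR does not spell it out.

```python
import os
import tempfile

RECORD_SIZE = 2240          # bytes per DepGenRecord, as stated in the PR
MAX_ARGS = 16               # arg_types[16], explicit_deps[16], tensors[16][...]
TENSOR_BLOB_SIZE = 128      # tensors[16][128]

# Bytes explained by the fields whose widths the PR spells out.
known_bytes = (
    8                               # task_id (raw uint64)
    + MAX_ARGS * 8                  # explicit_deps as uint64 values
    + MAX_ARGS * TENSOR_BLOB_SIZE   # opaque per-tensor blobs
)
remaining_bytes = RECORD_SIZE - known_bytes  # flags, arg_types, counts, padding


def record_count(path: str) -> int:
    """Number of whole records in a trace file (hypothetical helper)."""
    size = os.path.getsize(path)
    if size % RECORD_SIZE != 0:
        raise ValueError(f"{size} bytes is not a multiple of {RECORD_SIZE}")
    return size // RECORD_SIZE


with tempfile.TemporaryDirectory() as tmp:
    trace = os.path.join(tmp, "submit_trace.bin")
    with open(trace, "wb") as f:
        f.write(bytes(5 * RECORD_SIZE))  # stand-in for five captured submits
    n_records = record_count(trace)
    n_bytes = os.path.getsize(trace)

print(n_records, n_bytes)  # 5 11200
```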

Notes

  • profiler_base.h had to gain a one-line
    // NOLINTNEXTLINE(bugprone-crtp-constructor-accessibility)
    comment: the new dep_gen_collector.cpp is the first translation
    unit in this checkout to pull the header into pre-commit's clang-tidy
    scope. Existing PmuCollector / L2PerfCollector use the same public
    ctor pattern; making it protected would force every derived class
    to declare friend ProfilerBase solely for ctor visibility.

@ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from b267dac to f50bb7d on May 11, 2026 04:27

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements the dep_gen (SubmitTrace) feature to capture orchestrator task submissions for offline replay and dependency graph reconstruction. The changes include new shared-memory protocols, AICPU-side recording, host-side collection, and integration into the Python and C++ runtime layers. Feedback was provided to improve the cache alignment of the tensors array in the DepGenRecord structure to meet performance and architectural standards.

Comment thread on src/a2a3/platform/include/common/dep_gen.h (outdated)
@ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch 2 times, most recently from 562c243 to 3582e6e, on May 11, 2026 04:50
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that
together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3) — new dep_gen_replay.{h,cpp} under
   runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each
   record through a host-resident PTO2TensorMap using the same
   compute_task_fanin / register_task_outputs primitives the device
   orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).

   - device_runner (onboard + sim) calls the replay post-reconcile when
     the dep_gen trace is clean. Skips on drops to avoid producing a
     partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets,
     pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale
     include of pto_orchestrator.h is dropped — no orch member is used).
     aicpu still picks it up via the recursive glob; host now does too.
     tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak,
     visibility("hidden") fallback stub for dep_gen_replay_to_deps_json
     so host_runtime.so still links cleanly when the host_build_graph
     runtime (which has no replay implementation) is loaded. The strong
     symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins
     within its own .so; host_build_graph falls through to the stub.
     Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max
     observed local_id up to next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4) — when deps.json sits next
   to l2_perf_records.json, prefer those edges over task["fanout"]. Each
   flow event is checked for a happens-before violation
   (pred.end_time > succ.start_time) and emitted under a distinct
   "hb_violation" name so Perfetto colors it apart from clean
   dependencies. Verbose output reports the chosen edge source and HB
   violation count.

3. Validation gate (PR5) — test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from
     example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on),
     every fanout edge is a subset of deps.json. The standalone main
     auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a
     single command runs the full gate.
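
The replay in item 1 can be approximated by a last-writer map. This is a simplified stand-in: the real replay uses PTO2TensorMap with compute_task_fanin / register_task_outputs, whereas the sketch keys tensors by made-up names. The five-task trace below is also hypothetical, chosen only because it yields the six edges the validation gate expects.

```python
def replay(records):
    """records: iterable of (task_id, inputs, outputs, explicit_deps).

    Returns the dependency edges as (producer, consumer) pairs.
    """
    last_writer = {}   # tensor key -> id of the task that last produced it
    edges = []
    for task_id, inputs, outputs, explicit_deps in records:
        fanin = []
        # Fan-in from tensors: whoever last wrote each input.
        for tensor in inputs:
            producer = last_writer.get(tensor)
            if producer is not None and producer not in fanin:
                fanin.append(producer)
        # Explicit dependencies are added on top, without duplicates.
        for producer in explicit_deps:
            if producer not in fanin:
                fanin.append(producer)
        edges.extend((producer, task_id) for producer in fanin)
        # Register outputs only after fan-in, so a task never depends on itself.
        for tensor in outputs:
            last_writer[tensor] = task_id
    return edges


# Hypothetical trace: tensor names A..E are invented for this example.
trace = [
    (0, [], ["A"], []),
    (1, ["A"], ["B"], []),
    (2, ["A"], ["C"], []),
    (3, ["B", "C"], ["D"], []),
    (4, ["A", "D"], ["E"], []),
]
deps = replay(trace)
print(sorted(deps))
# [(0, 1), (0, 2), (0, 4), (1, 3), (2, 3), (3, 4)]
```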

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is
int32_t but the AICPU writer reinterprets the tag array as uint8_t[].
On little-endian this silently kept only every fourth tag byte, turning
(INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom
self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag
to uint8_t explicitly before passing it to the writer — keeps the
on-disk uint8_t[16] arg_types layout intact.
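
The width mismatch is easy to reproduce outside the runtime. The sketch below assumes INPUT = 0 and OUTPUT = 1; those enum values are not stated here and are chosen only because they match the (0, 0, 0) result described above.

```python
import struct

INPUT, OUTPUT = 0, 1                 # assumed enum values, for illustration
tags = [INPUT, INPUT, OUTPUT]

# The caller holds int32_t tags; on a little-endian machine they look like this.
raw = struct.pack("<3i", *tags)

# Buggy path: the writer walks the same memory one byte per tag.
buggy = tuple(raw[:len(tags)])

# Fixed path: narrow each tag to one byte before handing it to the writer.
fixed = tuple(t & 0xFF for t in tags)

print(buggy)  # (0, 0, 0)
print(fixed)  # (0, 0, 1)
```

With every tag read as INPUT, the task's output is never registered as an output, which is consistent with the phantom self-edge reported in replay.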

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from 3582e6e to 143d6ab on May 11, 2026 08:03
…er + host collector)

Capture the inputs to every Orchestrator::submit_task call into a
streaming ring buffer that the host drains to submit_trace.bin. This is
phase 1 of replacing PR hw-native-sys#500 — the host-side offline replay that
reconstructs deps.json from these records ships in a follow-up PR.
The capture path sidesteps the race window in L2PerfRecord::fanout[],
where an early-finishing producer's record gets sealed before later-
submitted consumers can register themselves.

Architecture mirrors PMU / L2Perf / TensorDump on a2a3:

- src/a2a3/platform/include/common/dep_gen.h
    DepGenRecord (2240 B, single submit_task capture: task_id + flags +
    arg_types[16] + explicit_deps[16] + tensors[16][128] opaque blobs).
    DepGenBuffer / DepGenFreeQueue / DepGenBufferState /
    DepGenReadyQueueEntry / DepGenDataHeader: SPSC streaming buffer
    family identical in shape to PmuBuffer / PmuFreeQueue etc.
    Single-instance (orch is one thread).

- src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.{h,cpp}
    AICPU writer with all-primitive interface (task_id raw, void* per-
    tensor blob, uint64* explicit_deps) so the platform layer stays
    runtime-agnostic. submit_task entry calls dep_gen_aicpu_record_submit
    gated on is_dep_gen_enabled(); the orch callsite static_asserts
    sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and
    PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks
    on the runtime side let host_build_graph and future replay builds
    link without the AICPU strong symbols.

- src/a2a3/platform/include/host/dep_gen_collector.{h,cpp}
    DepGenModule trait + DepGenCollector inheriting
    ProfilerBase<DepGenCollector, DepGenModule>, so the mgmt thread,
    buffer-pool manager, and poll loop come from the unified profiling
    framework. on_buffer_collected appends DepGenRecord values to
    submit_trace.bin; reconcile_counters cross-checks
    collected + dropped == device_total and reports clean/dirty so
    the future replay step can skip deps.json on incomplete traces.

- Device runner wiring (sim + onboard):
    DepGenCollector field, set_dep_gen_enabled() setter, init_dep_gen()
    helper, perf_cleanup RAII guard, kernel_args.dep_gen_data_base
    plumbed through to AICPU via set_platform_dep_gen_base, stop +
    reconcile + finalize at end of run, make_dep_gen_path() helper.

- AICPU executor lifecycle:
    set_platform_dep_gen_base / set_dep_gen_enabled /
    dep_gen_aicpu_set_orch_thread_idx / dep_gen_aicpu_init /
    dep_gen_aicpu_flush / dep_gen_aicpu_finalize hooked in alongside
    the existing PMU lifecycle calls.

- CallConfig + Python:
    `enable_dep_gen` int32 flag on ChipCallConfig, nanobind Python
    property (bool), conftest.py and scene_test.py
    `--enable-dep-gen` CLI option threaded through run_class_cases /
    _run_and_validate / _build_config. round > 1 disables capture
    (same pattern as enable_l2_swimlane).

- ProfilerBase suppression:
    `bugprone-crtp-constructor-accessibility` was newly tripped by the
    new dep_gen_collector.cpp pulling profiler_base.h into clang-tidy's
    scope. Suppressed with NOLINTNEXTLINE + comment explaining the
    intentional public ctor (all derived collectors call it).

- a2a3sim test:
    tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/
        test_dep_gen_capture.py
    re-uses the vector_example orchestration (5 submit_task calls),
    runs with --enable-dep-gen, then asserts submit_trace.bin size
    equals 5 * sizeof(DepGenRecord) and spot-checks the first record's
    tensor_count == 3.
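
The streaming buffer family described above can be modeled in a few lines. This is a conceptual sketch only: the class and method names are hypothetical, Python lists stand in for shared-memory buffers, and the real code's memory layout and synchronization are not represented.

```python
from collections import deque


class CaptureRing:
    """Single-producer / single-consumer buffer rotation (conceptual model)."""

    def __init__(self, num_buffers, capacity):
        self.capacity = capacity
        self.free = deque([[] for _ in range(num_buffers)])  # empty buffers
        self.ready = deque()                                 # full buffers for the host
        self.current = self.free.popleft()
        self.total = 0     # every submit seen by the writer
        self.dropped = 0   # submits lost because no buffer was free

    def record(self, rec):
        """Writer side: append one record, rotating buffers when full."""
        self.total += 1
        if self.current is None:
            if not self.free:
                self.dropped += 1
                return
            self.current = self.free.popleft()
        self.current.append(rec)
        if len(self.current) == self.capacity:
            self.ready.append(self.current)
            self.current = self.free.popleft() if self.free else None

    def flush(self):
        """Writer side: hand over a partially filled buffer at end of run."""
        if self.current:
            self.ready.append(self.current)
            self.current = None

    def drain(self, sink):
        """Host side: copy ready buffers out and recycle them."""
        collected = 0
        while self.ready:
            buf = self.ready.popleft()
            sink.extend(buf)
            collected += len(buf)
            buf.clear()
            self.free.append(buf)
        return collected


ring = CaptureRing(num_buffers=2, capacity=2)
for i in range(5):          # the host never drains, so the fifth submit is dropped
    ring.record(i)
ring.flush()
sink = []
collected = ring.drain(sink)
print(collected, ring.dropped, ring.total)  # 4 1 5
```

The final counters satisfy collected + dropped == device_total, which is the invariant the host-side reconcile step checks.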

Verified:
- a2a3sim spmd_sync_start baseline (no flag): PASSED — wiring did not
  perturb the default path.
- a2a3sim dep_gen_capture (with --enable-dep-gen): PASSED, post-run
  verification PASSED (submit_trace.bin has exactly 5 records).
@ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from 143d6ab to 7a3dce0 on May 11, 2026 08:21
@ChaoWao merged commit 77ef83c into hw-native-sys:main on May 11, 2026
27 of 28 checks passed
@ChaoWao deleted the feat/dep-gen-capture-a2a3 branch on May 11, 2026 10:46
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that
together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3) — new dep_gen_replay.{h,cpp} under
   runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each
   record through a host-resident PTO2TensorMap using the same
   compute_task_fanin / register_task_outputs primitives the device
   orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).

   - device_runner (onboard + sim) calls the replay post-reconcile when
     the dep_gen trace is clean. Skips on drops to avoid producing a
     partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets,
     pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale
     include of pto_orchestrator.h is dropped — no orch member is used).
     aicpu still picks it up via the recursive glob; host now does too.
     tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak,
     visibility("hidden") fallback stub for dep_gen_replay_to_deps_json
     so host_runtime.so still links cleanly when the host_build_graph
     runtime (which has no replay implementation) is loaded. The strong
     symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins
     within its own .so; host_build_graph falls through to the stub.
     Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max
     observed local_id up to next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4) — when deps.json sits next
   to l2_perf_records.json, prefer those edges over task["fanout"]. Each
   flow event is checked for a happens-before violation
   (pred.end_time > succ.start_time) and emitted under a distinct
   "hb_violation" name so Perfetto colors it apart from clean
   dependencies. Verbose output reports the chosen edge source and HB
   violation count.

3. Validation gate (PR5) — test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from
     example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on),
     every fanout edge is a subset of deps.json. The standalone main
     auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a
     single command runs the full gate.

4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)** — turns the
   replay product into a self-contained pan/zoom HTML page (Graphviz SVG +
   80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color
   per node type so AIC (cube, blue box), AIV (vector, orange ellipse),
   mix (green diamond — single submit_task spanning both core types via
   MixedKernels), and alloc (gray dashed note — tasks from `alloc_tensors`
   that produced output tensors but never dispatched a kernel) stay readable
   even without color. Auto-loads the colocated `l2_perf_records.json` and
   `name_map_*.json` sidecars for label enrichment; isolated tasks (no
   inbound/outbound edges) still show up. `--engine sfdp` for graphs past
   ~500 nodes.
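
The happens-before check from item 2 can be sketched as a per-edge comparison. Only the "hb_violation" name and the pred.end_time > succ.start_time condition come from the text above. The "dependency" name for clean edges, the dictionary shape, and the timestamps are assumptions for illustration.

```python
def classify_flow_events(edges, spans):
    """edges: iterable of (pred, succ); spans: {task: (start_time, end_time)}."""
    events = []
    for pred, succ in edges:
        pred_end = spans[pred][1]
        succ_start = spans[succ][0]
        # A consumer that started before its producer ended breaks happens-before.
        name = "hb_violation" if pred_end > succ_start else "dependency"
        events.append({"pred": pred, "succ": succ, "name": name})
    return events


spans = {"t0": (0, 10), "t1": (12, 20), "t2": (8, 15)}  # made-up timestamps
flows = classify_flow_events([("t0", "t1"), ("t0", "t2")], spans)
violations = sum(1 for e in flows if e["name"] == "hb_violation")
print([e["name"] for e in flows], violations)
# ['dependency', 'hb_violation'] 1
```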

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is
int32_t but the AICPU writer reinterprets the tag array as uint8_t[].
On little-endian this silently kept only every fourth tag byte, turning
(INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom
self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag
to uint8_t explicitly before passing it to the writer — keeps the
on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that
together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3) — new dep_gen_replay.{h,cpp} under
   runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each
   record through a host-resident PTO2TensorMap using the same
   compute_task_fanin / register_task_outputs primitives the device
   orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).

   - device_runner (onboard + sim) calls the replay post-reconcile when
     the dep_gen trace is clean. Skips on drops to avoid producing a
     partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets,
     pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale
     include of pto_orchestrator.h is dropped — no orch member is used).
     aicpu still picks it up via the recursive glob; host now does too.
     tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak,
     visibility("hidden") fallback stub for dep_gen_replay_to_deps_json
     so host_runtime.so still links cleanly when the host_build_graph
     runtime (which has no replay implementation) is loaded. The strong
     symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins
     within its own .so; host_build_graph falls through to the stub.
     Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max
     observed local_id up to next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4) — when deps.json sits next
   to l2_perf_records.json, prefer those edges over task["fanout"]. Each
   flow event is checked for a happens-before violation
   (pred.end_time > succ.start_time) and emitted under a distinct
   "hb_violation" name so Perfetto colors it apart from clean
   dependencies. Verbose output reports the chosen edge source and HB
   violation count.

3. Validation gate (PR5) — test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from
     example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on),
     every fanout edge is a subset of deps.json. The standalone main
     auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a
     single command runs the full gate.

4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)** — turns the
   replay product into a self-contained pan/zoom HTML page (Graphviz SVG +
   80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color
   per node type so AIC (cube, blue box), AIV (vector, orange ellipse),
   mix (green diamond — single submit_task spanning both core types via
   MixedKernels), and alloc (gray dashed note — tasks from `alloc_tensors`
   that produced output tensors but never dispatched a kernel) stay readable
   even without color. Auto-loads the colocated `l2_perf_records.json` and
   `name_map_*.json` sidecars for label enrichment; isolated tasks (no
   inbound/outbound edges) still show up. `--engine sfdp` for graphs past
   ~500 nodes.

5. **In-memory capture → replay (drops `submit_trace.bin`)** — the host
   collector now accumulates `DepGenRecord` entries directly in a
   `std::vector<DepGenRecord>` instead of streaming them to
   `submit_trace.bin` on disk. The replay function takes a pointer +
   count from `DepGenCollector::records()` and skips the file round-trip
   entirely. deps.json is the only on-disk dep_gen artifact now; the
   `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer
   takes a path, and the replay C ABI is now
   `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`. Also
   clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture
   call-site (defensive — the Arg builder already caps at MAX_TENSOR_ARGS
   but this prevents a future builder bypass from overflowing the stack
   buffers). The weak fallback in non-dep_gen runtimes (host_build_graph)
   drops from LOG_WARN to LOG_DEBUG since that path is unreachable for
   end users — it exists only to keep the .so loadable.
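
The window auto-sizing mentioned in item 1 can be sketched as a power-of-two round-up. Both helpers are hypothetical. Sizing from max local_id + 1 (rather than max local_id itself) and indexing slots with a bit mask are assumptions about how aliasing is avoided.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def window_size(local_ids) -> int:
    """Window large enough that every observed local_id gets its own slot."""
    return next_pow2(max(local_ids) + 1)


local_ids = [0, 1, 2, 3, 4]
window = window_size(local_ids)
slots = [i & (window - 1) for i in local_ids]   # mask works because window is pow2
print(window, slots)  # 8 [0, 1, 2, 3, 4]
```

With a window of 4, local_id 4 would map to slot 0 and collide with local_id 0; rounding up to 8 keeps the slots distinct.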

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is
int32_t, but the AICPU writer reinterprets the tag array as uint8_t[].
On little-endian targets each tag then spans four bytes of that view (one
tag byte followed by three zero bytes), so the first N bytes the writer read
were not the N tags. This turned (INPUT, INPUT, OUTPUT) into (0, 0, 0) and
synthesized a phantom self-edge t0→t0 in replay. Fixed at the call site by
narrowing each tag to uint8_t explicitly before passing it to the writer,
which keeps the on-disk uint8_t[16] arg_types layout intact.
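The width mismatch can be reproduced outside the runtime with Python's `struct` module. This is a sketch under stated assumptions: the tag values `INPUT = 0` and `OUTPUT = 1` and all three helper names are hypothetical, since the real `TensorArgType` enumerators and writer signature are not shown here.

```python
import struct

# Hypothetical tag values; the real TensorArgType enumerators are not shown
# in the source.
INPUT, OUTPUT = 0, 1


def pack_tags_int32(tags):
    """Lay tags out as little-endian int32_t, like the in-memory tag array."""
    return struct.pack(f"<{len(tags)}i", *tags)


def read_as_uint8(buf, count):
    """Buggy read: view the int32_t buffer as uint8_t[] and take `count` bytes."""
    return list(buf[:count])


def narrow_tags(tags):
    """Fixed path: narrow each tag to one byte before handing it to the writer."""
    return bytes(tag & 0xFF for tag in tags)


if __name__ == "__main__":
    tags = [INPUT, INPUT, OUTPUT]
    # Byte-wise view of three int32_t values only covers the first tag's bytes.
    print(read_as_uint8(pack_tags_int32(tags), len(tags)))  # [0, 0, 0]
    # Narrowing first gives one byte per tag, so all three tags survive.
    print(list(narrow_tags(tags)))  # [0, 0, 1]
```

With these assumed values the buggy read loses the OUTPUT tag entirely, which is consistent with the (0, 0, 0) result described above.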

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes hw-native-sys#599.
ChaoWao added a commit that referenced this pull request May 11, 2026
…s gate (#737)

Stacked on #736 (dep_gen capture). Three closely-coupled changes that
together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3) — new dep_gen_replay.{h,cpp} under
   runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each
   record through a host-resident PTO2TensorMap using the same
   compute_task_fanin / register_task_outputs primitives the device
   orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).

   - device_runner (onboard + sim) calls the replay post-reconcile when
     the dep_gen trace is clean. Skips on drops to avoid producing a
     partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets,
     pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale
     include of pto_orchestrator.h is dropped — no orch member is used).
     aicpu still picks it up via the recursive glob; host now does too.
     tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak,
     visibility("hidden") fallback stub for dep_gen_replay_to_deps_json
     so host_runtime.so still links cleanly when the host_build_graph
     runtime (which has no replay implementation) is loaded. The strong
     symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins
     within its own .so; host_build_graph falls through to the stub.
     Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max
     observed local_id up to next pow2) so slot indexing never aliases.

2. swimlane_converter integration (PR4) — when deps.json sits next
   to l2_perf_records.json, prefer those edges over task["fanout"]. Each
   flow event is checked for a happens-before violation
   (pred.end_time > succ.start_time) and emitted under a distinct
   "hb_violation" name so Perfetto colors it apart from clean
   dependencies. Verbose output reports the chosen edge source and HB
   violation count.

3. Validation gate (PR5) — test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from
     example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on),
     every fanout edge is present in deps.json. The standalone main
     auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a
     single command runs the full gate.
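The two edge checks from items 2 and 3 (the happens-before test and the fanout-subset assertion) can be sketched as follows. The function names, the `(pred, succ)` tuple shape, and the `start_time` / `end_time` task record layout are assumptions for illustration; the real `swimlane_converter` and `test_dep_gen_capture.py` may structure this differently.

```python
def classify_edges(edges, tasks):
    """Split (pred, succ) edges into clean and happens-before-violating lists.

    `tasks` maps task id -> {"start_time": ..., "end_time": ...}; this record
    shape is an assumption for illustration.
    """
    clean, violations = [], []
    for pred, succ in edges:
        # An edge violates happens-before when the predecessor finished
        # after its successor had already started.
        if tasks[pred]["end_time"] > tasks[succ]["start_time"]:
            violations.append((pred, succ))
        else:
            clean.append((pred, succ))
    return clean, violations


def missing_fanout_edges(fanout_edges, deps_edges):
    """Return fanout edges absent from deps.json (empty set means the gate passes)."""
    return set(fanout_edges) - set(deps_edges)


if __name__ == "__main__":
    deps = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 4), (3, 4)]
    # Made-up timings: task 2 starts before task 0 has ended.
    tasks = {
        0: {"start_time": 0, "end_time": 10},
        1: {"start_time": 10, "end_time": 20},
        2: {"start_time": 5, "end_time": 15},
        3: {"start_time": 20, "end_time": 30},
        4: {"start_time": 30, "end_time": 40},
    }
    clean, violations = classify_edges(deps, tasks)
    print("hb violations:", violations)
    print("fanout edges missing from deps:", missing_fanout_edges([(0, 1), (1, 3)], deps))
```

Keeping the violating edges in a separate list mirrors the idea of emitting them under a distinct name, so they can be colored apart from clean dependencies.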
