Add: dep_gen capture (SubmitTrace) for a2a3 tensormap_and_ringbuffer by ChaoWao · Pull Request #736 · hw-native-sys/simpler

ChaoWao · 2026-05-11T04:22:23Z

Summary

Capture every Orchestrator::submit_task call into a streaming
SubmitTrace ring on the a2a3 tensormap_and_ringbuffer runtime, drained
by a host collector to submit_trace.bin. Phase 1 of replacing #500 — the
host-side offline replay that reconstructs deps.json from these records
ships in a follow-up PR; this PR is intentionally scoped to the capture
path so it can land independently and be reviewed in isolation.

Motivation: today's swimlane L2PerfRecord::fanout[] is filled at the
producer's completion-record commit moment, so a fast producer that
finishes before a later consumer is submitted has its record sealed
without the new edge — those dep edges are silently lost. Capturing the
orch's submit-time inputs and replaying offline reconstructs the complete
logical graph because the replay can run with no eviction.
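
The race described above can be illustrated with a toy model. This is a minimal sketch, not the runtime's code: the event tuples and the helper names `online_fanout` and `replay_edges` are hypothetical and only mimic "seal fanout at completion" versus "rebuild from submit-time inputs".

```python
from collections import defaultdict

# Hypothetical event log: ("submit", task, producers) or ("complete", task).
events = [
    ("submit", "t0", []),
    ("complete", "t0"),        # fast producer finishes first...
    ("submit", "t1", ["t0"]),  # ...before its consumer is submitted
]


def online_fanout(events):
    """Mimic fanout[] being sealed when the producer's record is committed."""
    fanout = defaultdict(list)
    sealed = set()
    for ev in events:
        if ev[0] == "complete":
            sealed.add(ev[1])
            continue
        _, task, producers = ev
        for producer in producers:
            if producer not in sealed:  # a sealed record can no longer gain edges
                fanout[producer].append(task)
    return dict(fanout)


def replay_edges(events):
    """Rebuild edges from submit-time inputs only; completion order is ignored."""
    return sorted(
        (producer, ev[1])
        for ev in events
        if ev[0] == "submit"
        for producer in ev[2]
    )


print(online_fanout(events))  # {}
print(replay_edges(events))   # [('t0', 't1')]
```

In this model the online view loses the t0→t1 edge, while the replay over submit records recovers it.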

What lands in this PR

- Device structs (src/a2a3/platform/include/common/dep_gen.h): DepGenRecord (2240 B per submit: task_id, flags, arg_types[16], explicit_deps[16], tensors[16][128] as opaque blobs), plus the PMU/L2Perf-style SPSC buffer family (FreeQueue / BufferState / ReadyQueueEntry / DataHeader). Single-instance (orch is one AICPU thread).
- AICPU writer (src/a2a3/platform/{include/aicpu,src/aicpu}/dep_gen_collector_aicpu.{h,cpp}): all-primitive interface (raw uint64 task_id, void* per-Tensor blob, uint64* explicit_deps) so the platform header stays runtime-agnostic. The submit_task entry calls dep_gen_aicpu_record_submit, gated on is_dep_gen_enabled(). The orch callsite static_asserts sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks let host builds link without the AICPU strong symbols.
- Host collector (src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}): DepGenModule trait + DepGenCollector : ProfilerBase<...>, so the mgmt thread / buffer-pool manager / poll loop come from the unified profiling framework (#714, "Refactor(a2a3): decouple profiling from runtime, own it in platform"). on_buffer_collected appends DepGenRecord values to submit_trace.bin; reconcile_counters cross-checks collected + dropped == device_total and returns clean/dirty so the future replay step can skip deps.json on incomplete traces.
- Device runner wiring (sim + onboard): DepGenCollector field, set_dep_gen_enabled() setter, init_dep_gen() helper, kernel_args.dep_gen_data_base plumbed through to AICPU, RAII cleanup, end-of-run flush + reconcile, make_dep_gen_path() helper.
- AICPU lifecycle: set_platform_dep_gen_base / set_dep_gen_enabled / dep_gen_aicpu_set_orch_thread_idx / dep_gen_aicpu_init / dep_gen_aicpu_flush / dep_gen_aicpu_finalize hooked alongside the existing PMU calls.
- CallConfig + Python: enable_dep_gen int32 flag, nanobind Python property (bool), conftest.py and scene_test.py --enable-dep-gen CLI option threaded through run_class_cases / _run_and_validate / _build_config. --rounds > 1 disables capture (matches the --enable-l2-swimlane pattern).
- a2a3sim test (tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py): re-uses the vector_example orchestration (5 submit_task calls), runs with --enable-dep-gen, then asserts submit_trace.bin size equals 5 * sizeof(DepGenRecord) and spot-checks the first record's tensor_count == 3.
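
The counter cross-check in the host collector item above can be sketched as follows. This is an illustration under assumptions: `reconcile_sketch` and the `Reconcile` type are hypothetical names, and treating "clean" as "consistent and nothing dropped" is an inference from the description, not confirmed behavior of reconcile_counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconcile:
    consistent: bool  # collected + dropped == device_total
    clean: bool       # assumed meaning: consistent and no record was dropped


def reconcile_sketch(collected: int, dropped: int, device_total: int) -> Reconcile:
    """Cross-check host-side counters against the device's submit count."""
    consistent = (collected + dropped) == device_total
    return Reconcile(consistent=consistent, clean=consistent and dropped == 0)


print(reconcile_sketch(5, 0, 5))  # complete trace: replay may proceed
print(reconcile_sketch(4, 1, 5))  # counters agree, but one record was dropped
print(reconcile_sketch(4, 0, 5))  # a record is unaccounted for
```

Only the first case would let a later replay step emit deps.json; the other two mark the trace as dirty.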

What does NOT land in this PR

- Host-side offline replay → deps.json
- swimlane_converter.py integration that prefers deps.json over the in-record fanout[]
- a5 port (the a5 profiling framework is the pre-#714 layout and needs separate wiring)

These follow in subsequent PRs once the capture path is reviewed.

Testing

- pre-commit run (clang-format, clang-tidy, cpplint, ruff, pyright): all pass
- a2a3sim spmd_sync_start baseline (no --enable-dep-gen): PASSED; wiring does not perturb the default path
- a2a3sim dep_gen_capture (with --enable-dep-gen): PASSED, post-run verification PASSED (submit_trace.bin has exactly 5 records totalling 11200 bytes)
- Linux CI green
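
The post-run size check reported above (5 records, 11200 bytes) follows from the fixed 2240-byte record size. Here is a minimal sketch of such a check, assuming only the record size stated in this PR; `split_records` is a hypothetical helper, the records are treated as opaque bytes because field offsets are not given here, and the trace file is a zero-filled stand-in rather than a real capture.

```python
import os
import tempfile

RECORD_SIZE = 2240    # sizeof(DepGenRecord), as stated in this PR
EXPECTED_SUBMITS = 5  # submit_task calls in the vector_example orchestration


def split_records(path):
    """Split a trace file into fixed-size opaque records; reject partial tails."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % RECORD_SIZE != 0:
        raise ValueError(f"trace size {len(data)} is not a multiple of {RECORD_SIZE}")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


# Stand-in trace: five zero-filled records, since no real capture is available here.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "submit_trace.bin")
    with open(path, "wb") as f:
        f.write(bytes(RECORD_SIZE * EXPECTED_SUBMITS))
    records = split_records(path)
    trace_bytes = os.path.getsize(path)

print(len(records), trace_bytes)  # 5 11200
```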

Notes

profiler_base.h had to gain a one-line
// NOLINTNEXTLINE(bugprone-crtp-constructor-accessibility)
comment: the new dep_gen_collector.cpp is the first translation
unit in this checkout to pull the header into pre-commit's clang-tidy
scope. Existing PmuCollector / L2PerfCollector use the same public
ctor pattern; making it protected would force every derived class
to declare friend ProfilerBase solely for ctor visibility.

gemini-code-assist

Code Review

This pull request implements the dep_gen (SubmitTrace) feature to capture orchestrator task submissions for offline replay and dependency graph reconstruction. The changes include new shared-memory protocols, AICPU-side recording, host-side collection, and integration into the Python and C++ runtime layers. Feedback was provided to improve the cache alignment of the tensors array in the DepGenRecord structure to meet performance and architectural standards.

…er + host collector)

Capture the inputs to every Orchestrator::submit_task call into a streaming ring buffer that the host drains to submit_trace.bin. This is phase 1 of replacing PR hw-native-sys#500; the host-side offline replay that reconstructs deps.json from these records ships in a follow-up PR. The capture path sidesteps the race window in L2PerfRecord::fanout[], where an early-finishing producer's record gets sealed before later-submitted consumers can register themselves.

Architecture mirrors PMU / L2Perf / TensorDump on a2a3:

- src/a2a3/platform/include/common/dep_gen.h: DepGenRecord (2240 B, single submit_task capture: task_id + flags + arg_types[16] + explicit_deps[16] + tensors[16][128] opaque blobs). DepGenBuffer / DepGenFreeQueue / DepGenBufferState / DepGenReadyQueueEntry / DepGenDataHeader: SPSC streaming buffer family identical in shape to PmuBuffer / PmuFreeQueue etc. Single-instance (orch is one thread).
- src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.{h,cpp}: AICPU writer with all-primitive interface (task_id raw, void* per-tensor blob, uint64* explicit_deps) so the platform layer stays runtime-agnostic. submit_task entry calls dep_gen_aicpu_record_submit gated on is_dep_gen_enabled(); the orch callsite static_asserts sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks on the runtime side let host_build_graph and future replay builds link without the AICPU strong symbols.
- src/a2a3/platform/include/host/dep_gen_collector.{h,cpp}: DepGenModule trait + DepGenCollector inheriting ProfilerBase<DepGenCollector, DepGenModule>, so the mgmt thread, buffer-pool manager, and poll loop come from the unified profiling framework. on_buffer_collected appends DepGenRecord values to submit_trace.bin; reconcile_counters cross-checks collected + dropped == device_total and reports clean/dirty so the future replay step can skip deps.json on incomplete traces.
- Device runner wiring (sim + onboard): DepGenCollector field, set_dep_gen_enabled() setter, init_dep_gen() helper, perf_cleanup RAII guard, kernel_args.dep_gen_data_base plumbed through to AICPU via set_platform_dep_gen_base, stop + reconcile + finalize at end of run, make_dep_gen_path() helper.
- AICPU executor lifecycle: set_platform_dep_gen_base / set_dep_gen_enabled / dep_gen_aicpu_set_orch_thread_idx / dep_gen_aicpu_init / dep_gen_aicpu_flush / dep_gen_aicpu_finalize hooked in alongside the existing PMU lifecycle calls.
- CallConfig + Python: `enable_dep_gen` int32 flag on ChipCallConfig, nanobind Python property (bool), conftest.py and scene_test.py `--enable-dep-gen` CLI option threaded through run_class_cases / _run_and_validate / _build_config. round > 1 disables capture (same pattern as enable_l2_swimlane).
- ProfilerBase suppression: `bugprone-crtp-constructor-accessibility` was newly tripped by the new dep_gen_collector.cpp pulling profiler_base.h into clang-tidy's scope. Suppressed with NOLINTNEXTLINE + comment explaining the intentional public ctor (all derived collectors call it).
- a2a3sim test: tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py re-uses the vector_example orchestration (5 submit_task calls), runs with --enable-dep-gen, then asserts submit_trace.bin size equals 5 * sizeof(DepGenRecord) and spot-checks the first record's tensor_count == 3.

Verified:

- a2a3sim spmd_sync_start baseline (no flag): PASSED; wiring did not perturb the default path.
- a2a3sim dep_gen_capture (with --enable-dep-gen): PASSED, post-run verification PASSED (submit_trace.bin has exactly 5 records).

…s gate

Stacked on hw-native-sys#736 (dep_gen capture). Three closely-coupled changes that together turn the captured submit_trace.bin into a user-visible artifact:

1. Host replay (PR3): new dep_gen_replay.{h,cpp} under runtime/tensormap_and_ringbuffer/host/. Reads submit_trace.bin, runs each record through a host-resident PTO2TensorMap using the same compute_task_fanin / register_task_outputs primitives the device orchestrator uses, emits deps.json (edge list keyed by raw PTO2TaskId).
   - device_runner (onboard + sim) calls the replay post-reconcile when the dep_gen trace is clean. Skips on drops to avoid producing a partial graph users might mistake for complete.
   - To share PTO2TensorMap between aicpu and host targets, pto_tensormap.cpp moves from runtime/ to runtime/shared/ (its stale include of pto_orchestrator.h is dropped; no orch member is used). aicpu still picks it up via the recursive glob; host now does too. tests/ut/cpp/CMakeLists.txt's explicit path entry follows the move.
   - device_runner.cpp (onboard + sim) provides a weak, visibility("hidden") fallback stub for dep_gen_replay_to_deps_json so host_runtime.so still links cleanly when the host_build_graph runtime (which has no replay implementation) is loaded. The strong symbol from tensormap_and_ringbuffer/host/dep_gen_replay.cpp wins within its own .so; host_build_graph falls through to the stub. Mirrors the existing dep_gen_aicpu_record_submit pattern.
   - Auto-sizes per-ring task windows from the trace (rounds max observed local_id up to next pow2) so slot indexing never aliases.
2. swimlane_converter integration (PR4): when deps.json sits next to l2_perf_records.json, prefer those edges over task["fanout"]. Each flow event is checked for a happens-before violation (pred.end_time > succ.start_time) and emitted under a distinct "hb_violation" name so Perfetto colors it apart from clean dependencies. Verbose output reports the chosen edge source and HB violation count.
3. Validation gate (PR5): test_dep_gen_capture.py now also asserts:
   - deps.json exists and contains the 6 expected edges from example_orchestration.cpp (t0→t1, t0→t2, t1→t3, t2→t3, t0→t4, t3→t4).
   - When l2_perf_records.json is also present (--enable-l2-swimlane on), every fanout edge is a subset of deps.json. The standalone main auto-adds --enable-l2-swimlane when --enable-dep-gen is passed so a single command runs the full gate.
4. **deps.json viewer (`simpler_setup/tools/deps_to_graph.py`)**: turns the replay product into a self-contained pan/zoom HTML page (Graphviz SVG + 80-line vanilla-JS shim, no CDN/offline-capable). Distinct shape + color per node type so AIC (cube, blue box), AIV (vector, orange ellipse), mix (green diamond: single submit_task spanning both core types via MixedKernels), and alloc (gray dashed note: tasks from `alloc_tensors` that produced output tensors but never dispatched a kernel) stay readable even without color. Auto-loads the colocated `l2_perf_records.json` and `name_map_*.json` sidecars for label enrichment; isolated tasks (no inbound/outbound edges) still show up. `--engine sfdp` for graphs past ~500 nodes.
5. **In-memory capture → replay (drops `submit_trace.bin`)**: the host collector now accumulates `DepGenRecord` entries directly in a `std::vector<DepGenRecord>` instead of streaming them to `submit_trace.bin` on disk. The replay function takes a pointer + count from `DepGenCollector::records()` and skips the file round-trip entirely. deps.json is the only on-disk dep_gen artifact now; the `make_dep_gen_path` helper is gone, `DepGenCollector::init` no longer takes a path, and the replay C ABI is now `dep_gen_replay_emit_deps_json(records, n, deps_json_path, …)`. Also clamps `args.tensor_count()` to `MAX_TENSOR_ARGS` at the capture call-site (defensive: the Arg builder already caps at MAX_TENSOR_ARGS but this prevents a future builder bypass from overflowing the stack buffers). The weak fallback in non-dep_gen runtimes (host_build_graph) drops from LOG_WARN to LOG_DEBUG since that path is unreachable for end users; it exists only to keep the .so loadable.

Also fixes a width-mismatch bug in hw-native-sys#736's capture path: TensorArgType is int32_t but the AICPU writer reinterprets the tag array as uint8_t[]. On little-endian this silently kept only every fourth tag byte, turning (INPUT, INPUT, OUTPUT) into (0, 0, 0) and synthesizing a phantom self-edge t0→t0 in replay. Fixed at the call site by narrowing each tag to uint8_t explicitly before passing it to the writer, which keeps the on-disk uint8_t[16] arg_types layout intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes hw-native-sys#599.
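
The happens-before check described for the swimlane_converter integration can be sketched like this. It is an illustration only: `classify_edges`, the `(start_time, end_time)` tuple layout, and the `"dependency"` name for clean edges are assumptions; only the comparison `pred.end_time > succ.start_time` and the `"hb_violation"` name come from the commit message.

```python
def classify_edges(tasks, edges):
    """Label each dependency edge as clean or as a happens-before violation.

    tasks: {task_id: (start_time, end_time)}  (hypothetical layout)
    edges: iterable of (pred_id, succ_id)
    """
    labelled = []
    for pred, succ in edges:
        pred_end = tasks[pred][1]
        succ_start = tasks[succ][0]
        # A successor that starts before its predecessor ends violates happens-before.
        name = "hb_violation" if pred_end > succ_start else "dependency"
        labelled.append((pred, succ, name))
    return labelled


tasks = {0: (0.0, 5.0), 1: (6.0, 9.0), 2: (4.0, 8.0)}
edges = [(0, 1), (0, 2)]
labelled = classify_edges(tasks, edges)
violations = sum(1 for _, _, name in labelled if name == "hb_violation")

print(labelled)    # [(0, 1, 'dependency'), (0, 2, 'hb_violation')]
print(violations)  # 1
```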

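
The width-mismatch bug described in the commit message above can be reproduced in a few lines. This is a sketch under assumptions: the numeric values `INPUT = 0` and `OUTPUT = 1` are hypothetical stand-ins for the TensorArgType enumerators, chosen so the outcome matches the (0, 0, 0) reported in the message.

```python
import struct

INPUT, OUTPUT = 0, 1  # hypothetical enumerator values, not taken from the source

tags = [INPUT, INPUT, OUTPUT]

# int32_t tags as they sit in memory on a little-endian machine.
raw = struct.pack("<3i", *tags)

# Bug: reading the same memory as uint8_t[] sees only the bytes of the first tag.
buggy = list(raw[:len(tags)])

# Fix: narrow each tag to one byte before handing the array to the writer.
fixed = [t & 0xFF for t in tags]

print(buggy)  # [0, 0, 0]
print(fixed)  # [0, 0, 1]
```

With the buggy view, the OUTPUT tag is never seen, so a replay would treat the task's output tensor as an input.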

ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from b267dac to f50bb7d Compare May 11, 2026 04:27

gemini-code-assist Bot reviewed May 11, 2026


Comment thread src/a2a3/platform/include/common/dep_gen.h Outdated

ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch 2 times, most recently from 562c243 to 3582e6e Compare May 11, 2026 04:50

ChaoWao mentioned this pull request May 11, 2026

Add: dep_gen replay → deps.json + swimlane integration + fanout ⊆ deps gate #737

Merged

4 tasks

ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from 3582e6e to 143d6ab Compare May 11, 2026 08:03

ChaoWao force-pushed the feat/dep-gen-capture-a2a3 branch from 143d6ab to 7a3dce0 Compare May 11, 2026 08:21

ChaoWao merged commit 77ef83c into hw-native-sys:main May 11, 2026
27 of 28 checks passed

ChaoWao deleted the feat/dep-gen-capture-a2a3 branch May 11, 2026 10:46

ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/dep-gen-capture-a2a3
