
Add: dep_gen deps.json v2 — tensor-annotated edges + differential replay#769

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:worktree-spicy-petting-elephant on May 13, 2026

Conversation

@ChaoWao (Collaborator) commented May 13, 2026

Summary

Resolves #666.

deps.json now carries per-edge tensor metadata (offset/shape/dtype) and per-task input/output slot info instead of just task→task IDs. Zero runtime changes; all the work lives in the host-side replay and downstream tools.

v2 schema (replaces v1, no fallback):

  • tasks[] — per task: task_id, scope, args[] with {idx, type, tensor_id, dtype, shape, offset} per slot. OUTPUT slots omit tensor info (zero blob at submit time).
  • tensors[] — one entry per unique (buffer_addr, version); stable FNV-1a tensor_id; raw_shapes for the underlying buffer.
  • edges[] — {pred, succ, arg, source} per edge, plus tensor_id / consumer_shape / consumer_offset (non-explicit edges) and producer_shape / producer_offset (source=tensormap).
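The stable tensor_id above is an FNV-1a 64-bit hash of (buffer_addr, version). A minimal Python sketch of that scheme follows — the byte layout of the key is an assumption here; the canonical definition lives in the host-side C++ replay:

```python
# 64-bit FNV-1a constants (standard values).
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def tensor_id(buffer_addr: int, version: int) -> int:
    # ASSUMPTION: the key is the two u64s concatenated little-endian;
    # the real layout is defined by the C++ replay, not this sketch.
    key = buffer_addr.to_bytes(8, "little") + version.to_bytes(8, "little")
    return fnv1a_64(key)
```

Note that 64-bit FNV outputs routinely exceed 2^53−1, which is why the schema serializes these IDs as strings (see the commit message below).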

Self-checking replay — no shotgun-modification risk on compute_task_fanin:

  • Per record, two parallel PTO2TensorMap instances run in lockstep.
  • Oracle pass drives the canonical compute_task_fanin template (unchanged, zero runtime touch) and collects the producer-id set the runtime would have wired.
  • Annotated pass inlines the same STEP A / STEP B logic against a second map, but its callback fires with the full PTO2TensorMapEntry& + consumer Tensor* + arg index, capturing per-edge metadata.
  • After both passes, the producer-id sets are compared. Divergence → LOG_ERROR + return non-zero, deps.json not written. Anyone who later changes compute_task_fanin semantics will trip the gate immediately and know to mirror the change in the annotated pass.
  • INOUT+COVERED remove_entry is mirrored so both maps stay bit-equivalent for the next record.
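The divergence gate amounts to a per-record set comparison. A hedged Python sketch of the idea — the real implementation is C++ inside the replay, and the function name and log format here are illustrative:

```python
def differential_gate(oracle: set, annotated: set) -> int:
    """Compare the producer-id sets from the two passes.

    Returns 0 when they agree; logs the symmetric difference and
    returns non-zero otherwise, in which case deps.json must not
    be written.
    """
    if oracle == annotated:
        return 0
    print(
        "LOG_ERROR: differential replay mismatch: "
        f"oracle-only={sorted(oracle - annotated)}, "
        f"annotated-only={sorted(annotated - oracle)}"
    )
    return 1
```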

Viewer (deps_to_graph.py --show-tensor-info):

Replaces per-edge labels with HTML-table task nodes — input rows (blue) on top, an identity header in the middle (core-type colored), and output rows (orange) on the bottom. Each arg cell is a 4-line block: arg<i> <TYPE> <Tname>:<dtype> + raw: / shape: / offset:. Edges route pred:out_<idx>:e → succ:in_<arg>:w by matching tensor_id. OUTPUT slots backfill their tensor_id from downstream creator edges when unambiguous (marked ? in the row).
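An HTML-table node with that row layout can be expressed as a Graphviz label string. The sketch below is illustrative only — the port names in_<arg> / out_<idx> follow the edge routing described, but the colors and cell text are assumptions, not the viewer's actual code:

```python
def task_node_label(task_id, header_color, inputs, outputs):
    """Build a Graphviz HTML-like label: input rows on top,
    an identity header in the middle, output rows on the bottom.

    inputs/outputs: lists of (slot_index, cell_text) pairs.
    """
    rows = []
    for idx, text in inputs:
        rows.append(f'<tr><td port="in_{idx}" bgcolor="lightblue">{text}</td></tr>')
    rows.append(f'<tr><td bgcolor="{header_color}"><b>task {task_id}</b></td></tr>')
    for idx, text in outputs:
        rows.append(f'<tr><td port="out_{idx}" bgcolor="orange">{text}</td></tr>')
    return ('<<table border="0" cellborder="1" cellspacing="0">'
            + "".join(rows) + "</table>>")
```

Edges then attach to ports in DOT, e.g. `pred:out_0:e -> succ:in_1:w`.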

Test plan

  • pip install --no-build-isolation -e . builds clean on macOS arm64
  • python test_dep_gen.py -p a2a3sim --enable-dep-gen (which auto-adds --enable-l2-swimlane) PASSED on a2a3sim — 5 tasks, 7 tensors, 6 annotated edges, fanout ⊆ deps gate green
  • Differential check did not fire (oracle ≡ annot for the vector_example workload)
  • deps_to_graph.py smoke-tested in both plain and --show-tensor-info modes
  • swimlane_converter.load_deps_json correctly projects v2 → {pred → [succ]}
  • Hardware run on a2a3 (CI)
  • Verify on a larger workload (paged_attention) that overlap=other / partial-tensormap edges still pass the differential check
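The "fanout ⊆ deps" gate above reduces to projecting v2 edges down to task pairs and checking containment. A sketch under assumed field names (pred/succ per the schema; the version check mirrors the stated rejection of non-v2 files):

```python
def project_v2_to_pairs(deps: dict) -> dict:
    """Project v2 edges[] down to a v1-style {pred: {succ, ...}} map.

    pred/succ may be string-encoded uint64, so coerce via int().
    """
    if deps.get("version") != 2:
        raise ValueError("non-v2 deps.json rejected")
    pairs = {}
    for edge in deps["edges"]:
        pairs.setdefault(int(edge["pred"]), set()).add(int(edge["succ"]))
    return pairs


def fanout_subset_of_deps(fanout_pairs, dep_pairs) -> bool:
    """Gate: every runtime fanout edge must also appear in deps.json."""
    return all(succ in dep_pairs.get(pred, set()) for pred, succ in fanout_pairs)
```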

Issue: #666

Copy link
Copy Markdown

@gemini-code-assist (Bot) left a comment


Code Review

This pull request upgrades the dependency graph generation to version 2, introducing tensor-level annotations such as shapes, offsets, and dtypes. The core replay logic now employs a dual-pass differential check between a canonical oracle and an annotated mirror to guarantee correctness. Corresponding updates were made to the visualization tools, including a new --show-tensor-info mode, and the test suite. Feedback suggests serializing 64-bit integers as strings in the JSON output to ensure interoperability with JavaScript parsers.

Comment thread on src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp (outdated)
Add: dep_gen deps.json v2 — tensor-annotated edges + differential replay

Replay always emits the v2 schema with task IDs, the underlying tensors
they touch, and the (offset, shape, dtype) slice each edge represents.
v1 (task-pair-only) is gone — the runtime never writes it, downstream
tools (swimlane_converter, deps_to_graph, scene_test gate) reject any
non-2 version.

Self-checking dual-pass replay
- Per record, the host runs TWO parallel PTO2TensorMap instances. Oracle
  drives the canonical compute_task_fanin template (unchanged runtime).
  Annotated mirrors STEP A + STEP B inline against a second map, with a
  wider callback that captures the matched PTO2TensorMapEntry + consumer
  Tensor + arg index per emit.
- After both passes finish the record, the producer-id sets are
  compared. Divergence -> LOG_ERROR with symmetric difference + return
  non-zero, deps.json NOT written. Guarantee against silent shotgun
  modifications: anyone who changes compute_task_fanin semantics trips
  the gate immediately.
- INOUT+COVERED remove_entry is mirrored exactly so both maps stay
  bit-equivalent for the next record.

v2 schema
- tasks[]: task_id, scope (auto/manual), args[] with per-slot
  {idx, type, tensor_id, dtype, shape, offset}. OUTPUT slots omit
  tensor info (zero blob at submit time); viewer backfills tensor_id
  from downstream creator-source edges when unambiguous.
- tensors[]: stable FNV-1a 64-bit hash of (buffer_addr, version) as
  tensor_id; raw_shapes describes the underlying buffer; per-slot
  shape/offset describes the slice.
- edges[]: pred, succ, arg, source (explicit/creator/tensormap),
  overlap (tensormap only), tensor_id + consumer slice + producer slice
  (tensormap only). Distinct args/sources keep their own edges;
  task-pair projection still satisfies fanout subset deps.
- All uint64 fields (task_id, tensor_id, pred, succ, buffer_addr) are
  serialized as JSON STRINGS. tensor_id (FNV hash) and buffer_addr
  routinely exceed Number.MAX_SAFE_INTEGER (2^53-1), which would
  silently corrupt them in JavaScript-based JSON parsers. Python
  consumers pass them through int(v) which handles either form.
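The precision hazard and the int(v) round-trip can be demonstrated in a few lines (the tensor_id value below is an arbitrary 64-bit example, not a real hash from this repo):

```python
import json

tid = 0xAF63DC4C8601EC8C           # an arbitrary 64-bit value
assert tid > 2**53 - 1             # above Number.MAX_SAFE_INTEGER

# Writer side: emit the uint64 as a JSON string so a JS JSON.parse
# cannot silently round it to the nearest representable double.
doc = json.dumps({"tensor_id": str(tid)})

# Reader side: int(v) handles both the string form and a plain int.
assert int(json.loads(doc)["tensor_id"]) == tid
assert int(12345) == 12345
```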

Viewer: deps_to_graph.py
- --show-tensor-info renders each task as an HTML-table node:
  input rows (top, blue) | identity header (core-type colored) |
  output rows (bottom, orange). Each arg cell is a 4-line block:
  "arg<i> <TYPE> <Tname>:<dtype>" + raw/shape/offset.
- Edges route producer:out_<idx>:e -> consumer:in_<arg>:w by matching
  tensor_id; explicit edges render dashed grey; tensormap edges with
  overlap != covered carry a small red label.
- Default mode (no flag) unchanged: bare shape nodes + bare arrows.
- uint64 fields (task_id, tensor_id) coerced int<-str once at ingestion
  via _normalize_tensor_id alias of _normalize_task_id.

Other consumers
- swimlane_converter.py: reads v2 dict edges, projects to (pred, succ)
  set for Perfetto flow events; warns + falls back to fanout[] on any
  non-v2 file. normalize_pto2_task_id_int already handles string-encoded
  uint64.
- test_dep_gen.py: asserts v2 schema, projects edges to task-pair set
  for fanout subset deps check, validates tasks[] / tensors[] /
  per-edge annotation completeness.

perf_to_mermaid removed
- The Mermaid timing-graph tool is superseded by deps_to_graph (the
  structural counterpart, with --show-tensor-info for per-edge slice
  info) and swimlane_converter (Perfetto flow events sourced from
  deps.json since hw-native-sys#737). simpler_setup/tools/perf_to_mermaid.py and
  all docs/__init__/README references are dropped in this commit.

Docs updated (docs/dfx/dep_gen.md, simpler_setup/tools/README.md,
docs/profiling-name-map.md): the schema (with the JS precision
rationale), the differential-validation contract, the new viewer flag,
and the deps_to_graph replacement for perf_to_mermaid.

Issue: hw-native-sys#666
@ChaoWao force-pushed the worktree-spicy-petting-elephant branch from de31792 to 54f88d1 on May 13, 2026 08:15
@ChaoWao ChaoWao merged commit f1e3f0d into hw-native-sys:main May 13, 2026
14 checks passed
@ChaoWao ChaoWao deleted the worktree-spicy-petting-elephant branch May 13, 2026 08:36
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 13, 2026
Each DFX capture pipeline (dep_gen / l2_swimlane / tensor_dump) ships
with a consumer script under simpler_setup/tools/. The scene test for
that pipeline now invokes the consumer against the artifact it just
produced, asserting exit code 0. If a future schema change breaks the
tool, the failure attributes to the same CI step that captured the
  artifact rather than surfacing later as silent tooling rot.

Smoke is exit-code-only — HTML / PDF / diagram content is NOT validated.
The contract is "does the tool still parse this schema", not "is the
rendered output correct".

Wiring
- simpler_setup/tools/_smoke.py: run_tool + has_binary helpers shared
  by all DFX tests.
- tests/.../dfx/dep_gen/test_dep_gen.py: deps_to_graph smoked in both
  default and --show-tensor-info modes (guarded by has_binary("dot")
  so dev machines without graphviz skip cleanly).
- tests/.../dfx/l2_swimlane/test_l2_swimlane.py: swimlane_converter
  smoked against l2_perf_records.json.
- tests/.../dfx/tensor_dump/test_tensor_dump.py: dump_viewer smoked
  against the captured tensor_dump/ directory.
- pmu has no consumer tool; no smoke added (raw csv is the artifact).
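The helper pair could plausibly look like the following — only the names run_tool and has_binary come from this commit; the signatures and behavior here are assumptions:

```python
import shutil
import subprocess
import sys


def has_binary(name: str) -> bool:
    """True if `name` resolves on PATH (e.g. graphviz's `dot`)."""
    return shutil.which(name) is not None


def run_tool(argv: list) -> int:
    """Exit-code-only smoke: run the consumer and return its exit code.

    Rendered output content is deliberately not inspected; the contract
    is only that the tool still parses the schema.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        print(proc.stderr, file=sys.stderr)
    return proc.returncode
```

A test can then guard on has_binary("dot") to skip cleanly, and otherwise assert run_tool([...]) == 0.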

sched_overhead_analysis is intentionally NOT smoked — it requires a
real device log and would false-positive on sim. Reserve for a future
hardware DFX smoke.

CI: graphviz installed on the github-hosted sim runners (both Linux
and macOS) so deps_to_graph can render. The self-hosted onboard a2a3
runner warns if graphviz is missing instead of failing — the runner
admin should install it for full coverage, otherwise the has_binary
guard skips that one smoke.

Issue: follow-up to hw-native-sys#769
ChaoWao added a commit that referenced this pull request May 13, 2026


Development

Successfully merging this pull request may close these issues.

[Feature] Dump runtime task dependency DAG with per-edge tensor (offset/shape) metadata
