
Add: dep_gen deps.json v2 — tensor-annotated edges + differential replay#769

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:worktree-spicy-petting-elephant on May 13, 2026

Conversation

@ChaoWao (Collaborator) commented May 13, 2026

Summary

Resolves #666.

deps.json now carries per-edge tensor metadata (offset/shape/dtype) and per-task input/output slot info instead of just task→task IDs. Zero runtime changes; all the work lives in the host-side replay and downstream tools.

v2 schema (replaces v1, no fallback):

  • tasks[] — per task: task_id, scope, args[] with {idx, type, tensor_id, dtype, shape, offset} per slot. OUTPUT slots omit tensor info (zero blob at submit time).
  • tensors[] — one entry per unique (buffer_addr, version); stable FNV-1a tensor_id; raw_shapes for the underlying buffer.
  • edges[] — {pred, succ, arg, source} per edge, plus tensor_id / consumer_shape / consumer_offset (non-explicit edges) and producer_shape / producer_offset (source=tensormap).
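The stable tensor_id above is an FNV-1a 64-bit hash of (buffer_addr, version). A minimal Python sketch of that scheme follows — the byte layout of the key is an assumption here; the canonical definition lives in the host-side C++ replay:

```python
# 64-bit FNV-1a constants (standard values).
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def tensor_id(buffer_addr: int, version: int) -> int:
    # ASSUMPTION: the key is the two u64s concatenated little-endian;
    # the real layout is defined by the C++ replay, not this sketch.
    key = buffer_addr.to_bytes(8, "little") + version.to_bytes(8, "little")
    return fnv1a_64(key)
```

Note that 64-bit FNV outputs routinely exceed 2^53−1, which is why the schema serializes these IDs as strings (see the commit message below).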

Self-checking replay — no shotgun-modification risk on compute_task_fanin:

  • Per record, two parallel PTO2TensorMap instances run in lockstep.
  • Oracle pass drives the canonical compute_task_fanin template (unchanged, zero runtime touch) and collects the producer-id set the runtime would have wired.
  • Annotated pass inlines the same STEP A / STEP B logic against a second map, but its callback fires with the full PTO2TensorMapEntry& + consumer Tensor* + arg index, capturing per-edge metadata.
  • After both passes, the producer-id sets are compared. Divergence → LOG_ERROR + return non-zero, deps.json not written. Anyone who later changes compute_task_fanin semantics will trip the gate immediately and know to mirror the change in the annotated pass.
  • INOUT+COVERED remove_entry is mirrored so both maps stay bit-equivalent for the next record.
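The divergence gate amounts to a per-record set comparison. A hedged Python sketch of the idea — the real implementation is C++ inside the replay, and the function name and log format here are illustrative:

```python
def differential_gate(oracle: set, annotated: set) -> int:
    """Compare the producer-id sets from the two passes.

    Returns 0 when they agree; logs the symmetric difference and
    returns non-zero otherwise, in which case deps.json must not
    be written.
    """
    if oracle == annotated:
        return 0
    print(
        "LOG_ERROR: differential replay mismatch: "
        f"oracle-only={sorted(oracle - annotated)}, "
        f"annotated-only={sorted(annotated - oracle)}"
    )
    return 1
```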

Viewer (deps_to_graph.py --show-tensor-info):

Replaces per-edge labels with HTML-table task nodes — input rows (blue) on top, an identity header in the middle (core-type colored), and output rows (orange) on the bottom. Each arg cell is a 4-line block: arg<i> <TYPE> <Tname>:<dtype> + raw: / shape: / offset:. Edges route pred:out_<idx>:e → succ:in_<arg>:w by matching tensor_id. OUTPUT slots backfill their tensor_id from downstream creator edges when unambiguous (marked ? in the row).
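An HTML-table node with that row layout can be expressed as a Graphviz label string. The sketch below is illustrative only — the port names in_<arg> / out_<idx> follow the edge routing described, but the colors and cell text are assumptions, not the viewer's actual code:

```python
def task_node_label(task_id, header_color, inputs, outputs):
    """Build a Graphviz HTML-like label: input rows on top,
    an identity header in the middle, output rows on the bottom.

    inputs/outputs: lists of (slot_index, cell_text) pairs.
    """
    rows = []
    for idx, text in inputs:
        rows.append(f'<tr><td port="in_{idx}" bgcolor="lightblue">{text}</td></tr>')
    rows.append(f'<tr><td bgcolor="{header_color}"><b>task {task_id}</b></td></tr>')
    for idx, text in outputs:
        rows.append(f'<tr><td port="out_{idx}" bgcolor="orange">{text}</td></tr>')
    return ('<<table border="0" cellborder="1" cellspacing="0">'
            + "".join(rows) + "</table>>")
```

Edges then attach to ports in DOT, e.g. `pred:out_0:e -> succ:in_1:w`.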

Test plan

  • pip install --no-build-isolation -e . builds clean on macOS arm64
  • python test_dep_gen.py -p a2a3sim --enable-dep-gen (which auto-adds --enable-l2-swimlane) PASSED on a2a3sim — 5 tasks, 7 tensors, 6 annotated edges, fanout ⊆ deps gate green
  • Differential check did not fire (oracle ≡ annot for the vector_example workload)
  • deps_to_graph.py smoke-tested in both plain and --show-tensor-info modes
  • swimlane_converter.load_deps_json correctly projects v2 → {pred → [succ]}
  • Hardware run on a2a3 (CI)
  • Verify on a larger workload (paged_attention) that overlap=other / partial-tensormap edges still pass the differential check
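The "fanout ⊆ deps" gate above reduces to projecting v2 edges down to task pairs and checking containment. A sketch under assumed field names (pred/succ per the schema; the version check mirrors the stated rejection of non-v2 files):

```python
def project_v2_to_pairs(deps: dict) -> dict:
    """Project v2 edges[] down to a v1-style {pred: {succ, ...}} map.

    pred/succ may be string-encoded uint64, so coerce via int().
    """
    if deps.get("version") != 2:
        raise ValueError("non-v2 deps.json rejected")
    pairs = {}
    for edge in deps["edges"]:
        pairs.setdefault(int(edge["pred"]), set()).add(int(edge["succ"]))
    return pairs


def fanout_subset_of_deps(fanout_pairs, dep_pairs) -> bool:
    """Gate: every runtime fanout edge must also appear in deps.json."""
    return all(succ in dep_pairs.get(pred, set()) for pred, succ in fanout_pairs)
```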

Issue: #666

Copy link
Copy Markdown

@gemini-code-assist (Bot) left a comment


Code Review

This pull request upgrades the dependency graph generation to version 2, introducing tensor-level annotations such as shapes, offsets, and dtypes. The core replay logic now employs a dual-pass differential check between a canonical oracle and an annotated mirror to guarantee correctness. Corresponding updates were made to the visualization tools, including a new --show-tensor-info mode, and the test suite. Feedback suggests serializing 64-bit integers as strings in the JSON output to ensure interoperability with JavaScript parsers.

Comment thread on src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp (outdated)
Add: dep_gen deps.json v2 — tensor-annotated edges + differential replay

Replay always emits the v2 schema with task IDs, the underlying tensors
they touch, and the (offset, shape, dtype) slice each edge represents.
v1 (task-pair-only) is gone — the runtime never writes it, downstream
tools (swimlane_converter, deps_to_graph, scene_test gate) reject any
non-2 version.

Self-checking dual-pass replay
- Per record, the host runs TWO parallel PTO2TensorMap instances. Oracle
  drives the canonical compute_task_fanin template (unchanged runtime).
  Annotated mirrors STEP A + STEP B inline against a second map, with a
  wider callback that captures the matched PTO2TensorMapEntry + consumer
  Tensor + arg index per emit.
- After both passes finish the record, the producer-id sets are
  compared. Divergence -> LOG_ERROR with symmetric difference + return
  non-zero, deps.json NOT written. Guarantee against silent shotgun
  modifications: anyone who changes compute_task_fanin semantics trips
  the gate immediately.
- INOUT+COVERED remove_entry is mirrored exactly so both maps stay
  bit-equivalent for the next record.

v2 schema
- tasks[]: task_id, scope (auto/manual), args[] with per-slot
  {idx, type, tensor_id, dtype, shape, offset}. OUTPUT slots omit
  tensor info (zero blob at submit time); viewer backfills tensor_id
  from downstream creator-source edges when unambiguous.
- tensors[]: stable FNV-1a 64-bit hash of (buffer_addr, version) as
  tensor_id; raw_shapes describes the underlying buffer; per-slot
  shape/offset describes the slice.
- edges[]: pred, succ, arg, source (explicit/creator/tensormap),
  overlap (tensormap only), tensor_id + consumer slice + producer slice
  (tensormap only). Distinct args/sources keep their own edges;
  task-pair projection still satisfies fanout subset deps.
- All uint64 fields (task_id, tensor_id, pred, succ, buffer_addr) are
  serialized as JSON STRINGS. tensor_id (FNV hash) and buffer_addr
  routinely exceed Number.MAX_SAFE_INTEGER (2^53-1), which would
  silently corrupt them in JavaScript-based JSON parsers. Python
  consumers pass them through int(v) which handles either form.
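The precision hazard and the int(v) round-trip can be demonstrated in a few lines (the tensor_id value below is an arbitrary 64-bit example, not a real hash from this repo):

```python
import json

tid = 0xAF63DC4C8601EC8C           # an arbitrary 64-bit value
assert tid > 2**53 - 1             # above Number.MAX_SAFE_INTEGER

# Writer side: emit the uint64 as a JSON string so a JS JSON.parse
# cannot silently round it to the nearest representable double.
doc = json.dumps({"tensor_id": str(tid)})

# Reader side: int(v) handles both the string form and a plain int.
assert int(json.loads(doc)["tensor_id"]) == tid
assert int(12345) == 12345
```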

Viewer: deps_to_graph.py
- --show-tensor-info renders each task as an HTML-table node:
  input rows (top, blue) | identity header (core-type colored) |
  output rows (bottom, orange). Each arg cell is a 4-line block:
  "arg<i> <TYPE> <Tname>:<dtype>" + raw/shape/offset.
- Edges route producer:out_<idx>:e -> consumer:in_<arg>:w by matching
  tensor_id; explicit edges render dashed grey; tensormap edges with
  overlap != covered carry a small red label.
- Default mode (no flag) unchanged: bare shape nodes + bare arrows.
- uint64 fields (task_id, tensor_id) coerced int<-str once at ingestion
  via _normalize_tensor_id alias of _normalize_task_id.

Other consumers
- swimlane_converter.py: reads v2 dict edges, projects to (pred, succ)
  set for Perfetto flow events; warns + falls back to fanout[] on any
  non-v2 file. normalize_pto2_task_id_int already handles string-encoded
  uint64.
- test_dep_gen.py: asserts v2 schema, projects edges to task-pair set
  for fanout subset deps check, validates tasks[] / tensors[] /
  per-edge annotation completeness.

perf_to_mermaid removed
- The Mermaid timing-graph tool is superseded by deps_to_graph (the
  structural counterpart, with --show-tensor-info for per-edge slice
  info) and swimlane_converter (Perfetto flow events sourced from
  deps.json since hw-native-sys#737). simpler_setup/tools/perf_to_mermaid.py and
  all docs/__init__/README references are dropped in this commit.

Docs updated (docs/dfx/dep_gen.md, simpler_setup/tools/README.md,
docs/profiling-name-map.md): the schema (with the JS precision
rationale), the differential-validation contract, the new viewer flag,
and the deps_to_graph replacement for perf_to_mermaid.

Issue: hw-native-sys#666
@ChaoWao force-pushed the worktree-spicy-petting-elephant branch from de31792 to 54f88d1 on May 13, 2026 08:15
@ChaoWao ChaoWao merged commit f1e3f0d into hw-native-sys:main May 13, 2026
14 checks passed
@ChaoWao ChaoWao deleted the worktree-spicy-petting-elephant branch May 13, 2026 08:36
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 13, 2026
Each DFX capture pipeline (dep_gen / l2_swimlane / tensor_dump) ships
with a consumer script under simpler_setup/tools/. The scene test for
that pipeline now invokes the consumer against the artifact it just
produced, asserting exit code 0. If a future schema change breaks the
tool, the failure attributes to the same CI step that captured the
  artifact rather than surfacing later as silent tooling rot.

Smoke is exit-code-only — HTML / PDF / diagram content is NOT validated.
The contract is "does the tool still parse this schema", not "is the
rendered output correct".

Wiring
- simpler_setup/tools/_smoke.py: run_tool + has_binary helpers shared
  by all DFX tests.
- tests/.../dfx/dep_gen/test_dep_gen.py: deps_to_graph smoked in both
  default and --show-tensor-info modes (guarded by has_binary("dot")
  so dev machines without graphviz skip cleanly).
- tests/.../dfx/l2_swimlane/test_l2_swimlane.py: swimlane_converter
  smoked against l2_perf_records.json.
- tests/.../dfx/tensor_dump/test_tensor_dump.py: dump_viewer smoked
  against the captured tensor_dump/ directory.
- pmu has no consumer tool; no smoke added (raw csv is the artifact).
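The helper pair could plausibly look like the following — only the names run_tool and has_binary come from this commit; the signatures and behavior here are assumptions:

```python
import shutil
import subprocess
import sys


def has_binary(name: str) -> bool:
    """True if `name` resolves on PATH (e.g. graphviz's `dot`)."""
    return shutil.which(name) is not None


def run_tool(argv: list) -> int:
    """Exit-code-only smoke: run the consumer and return its exit code.

    Rendered output content is deliberately not inspected; the contract
    is only that the tool still parses the schema.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        print(proc.stderr, file=sys.stderr)
    return proc.returncode
```

A test can then guard on has_binary("dot") to skip cleanly, and otherwise assert run_tool([...]) == 0.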

sched_overhead_analysis is intentionally NOT smoked — it requires a
real device log and would false-positive on sim. Reserve for a future
hardware DFX smoke.

CI: graphviz installed on the github-hosted sim runners (both Linux
and macOS) so deps_to_graph can render. The self-hosted onboard a2a3
runner warns if graphviz is missing instead of failing — the runner
admin should install it for full coverage, otherwise the has_binary
guard skips that one smoke.

Issue: follow-up to hw-native-sys#769
ChaoWao added a commit that referenced this pull request May 13, 2026


Development

Successfully merging this pull request may close these issues.

[Feature] Dump runtime task dependency DAG with per-edge tensor (offset/shape) metadata
