Refactor: introduce tiered profiling levels for a2a3 tensormap_and_ringbuffer swimlane export #500
indigo1973 wants to merge 1 commit into
Conversation
Code Review
This pull request introduces a more granular performance-profiling system by replacing the boolean `enable_profiling` flag with an integer `perf_level` across the codebase. This change allows multiple profiling modes (0 = off, 1 = AICore-only, 2 = task + fanout, 3 = full) and updates the runtime, device runner, and performance collector to handle these levels. It also versions the swimlane JSON export based on the profiling level and makes the swimlane converter more robust by handling optional fanout data. I have no feedback to provide, as there are no review comments to evaluate.
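As an illustration of the tiered flag described above, the CLI behavior ("bare flag defaults to 3, explicit integer selects a level") could be sketched with argparse as follows. Option and attribute names here are assumptions drawn from the summary, not the project's actual code:

```python
import argparse

# Sketch of a tiered --enable-profiling option: a bare flag selects full
# profiling (level 3), an explicit integer selects a specific level, and
# omitting the flag leaves profiling off.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-profiling",
    dest="perf_level",
    type=int,
    nargs="?",   # the integer value is optional
    const=3,     # bare --enable-profiling means level 3 (full)
    default=0,   # flag absent means profiling off
)

print(parser.parse_args([]).perf_level)                           # 0
print(parser.parse_args(["--enable-profiling"]).perf_level)       # 3
print(parser.parse_args(["--enable-profiling", "2"]).perf_level)  # 2
```

The `nargs="?"` / `const` pair is what distinguishes "flag present without a value" from "flag absent", which a plain `store_true` action cannot express.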
/gemini review
Code Review
This pull request replaces the boolean profiling flag with a multi-level integer perf_level (0-3) to provide granular control over performance data collection, ranging from AICore-only to full profiling with phase data. The changes span Python CLI tools, C++ runtime structures, and shared memory allocation logic in the PerformanceCollector. Review feedback highlights opportunities to simplify redundant logical checks in the performance-critical scheduler path where perf_level is validated alongside derived boolean flags.
if (task_recording_enabled && runtime->perf_level >= 2) {
    core_exec_state.pending_dispatch_timestamp = get_sys_cnt_aicpu();
}
The condition task_recording_enabled && runtime->perf_level >= 2 is redundant. Since task_recording_enabled is true when runtime->perf_level > 0, this check is equivalent to just runtime->perf_level >= 2. Simplifying this avoids redundant checks on a performance-critical path.
if (runtime->perf_level >= 2) {
    core_exec_state.pending_dispatch_timestamp = get_sys_cnt_aicpu();
}

References
- On performance-critical paths, avoid redundant validation checks for variables that are guaranteed to be valid by the logic at their point of entry.
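The equivalence the reviewer relies on can be checked exhaustively in a few lines. This sketch only restates the argument, with the flag derivation assumed from the commit message (`task_recording_enabled` is defined as `perf_level > 0`):

```python
# task_recording_enabled is derived as perf_level > 0, so ANDing it with a
# stricter threshold adds no information: the compound check is equal to
# the threshold check alone for every reachable level.
for perf_level in range(0, 4):
    task_recording_enabled = perf_level > 0
    assert (task_recording_enabled and perf_level >= 2) == (perf_level >= 2)
print("redundant for all levels 0-3")
```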
if (task_recording_enabled && runtime->perf_level >= 2) {
    core_exec_state.running_dispatch_timestamp = get_sys_cnt_aicpu();
}
Similar to the previous check, the condition task_recording_enabled && runtime->perf_level >= 2 is redundant and can be simplified to runtime->perf_level >= 2 to improve performance and readability on this hot path.
if (runtime->perf_level >= 2) {
    core_exec_state.running_dispatch_timestamp = get_sys_cnt_aicpu();
}

References
- Avoid redundant data validation checks on performance-critical paths if the data has already been validated or is logically guaranteed.
@indigo1973 Rebase isn't viable, and the design wants refinement before landing. Rebase is blocked by 30+ commits of drift.

Key blocker:

What to keep

What to change
1. Pack all DFX toggles into
2. Rename
3. Separate PR:

Path forward
Close this PR and replace it with three independent PRs co-authored with you:
…er + host collector)

Capture the inputs to every Orchestrator::submit_task call into a streaming ring buffer that the host drains to submit_trace.bin. This is phase 1 of replacing PR hw-native-sys#500 — the host-side offline replay that reconstructs deps.json from these records ships in a follow-up PR.

The capture path sidesteps the race window in L2PerfRecord::fanout[], where an early-finishing producer's record gets sealed before later-submitted consumers can register themselves.

Architecture mirrors PMU / L2Perf / TensorDump on a2a3:

- src/a2a3/platform/include/common/dep_gen.h
  DepGenRecord (2240 B, single submit_task capture: task_id + flags + arg_types[16] + explicit_deps[16] + tensors[16][128] opaque blobs). DepGenBuffer / DepGenFreeQueue / DepGenBufferState / DepGenReadyQueueEntry / DepGenDataHeader: SPSC streaming buffer family identical in shape to PmuBuffer / PmuFreeQueue etc. Single-instance (orch is one thread).
- src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.{h,cpp}
  AICPU writer with all-primitive interface (task_id raw, void* per-tensor blob, uint64* explicit_deps) so the platform layer stays runtime-agnostic. submit_task entry calls dep_gen_aicpu_record_submit gated on is_dep_gen_enabled(); the orch callsite static_asserts sizeof(Tensor) == DEP_GEN_TENSOR_SIZE and PTO2_MAX_EXPLICIT_DEPS == DEP_GEN_MAX_EXPLICIT_DEPS. Weak fallbacks on the runtime side let host_build_graph and future replay builds link without the AICPU strong symbols.
- src/a2a3/platform/include/host/dep_gen_collector.{h,cpp}
  DepGenModule trait + DepGenCollector inheriting ProfilerBase<DepGenCollector, DepGenModule>, so the mgmt thread, buffer-pool manager, and poll loop come from the unified profiling framework. on_buffer_collected appends DepGenRecord values to submit_trace.bin; reconcile_counters cross-checks collected + dropped == device_total and reports clean/dirty so the future replay step can skip deps.json on incomplete traces.
- Device runner wiring (sim + onboard): DepGenCollector field, set_dep_gen_enabled() setter, init_dep_gen() helper, perf_cleanup RAII guard, kernel_args.dep_gen_data_base plumbed through to AICPU via set_platform_dep_gen_base, stop + reconcile + finalize at end of run, make_dep_gen_path() helper.
- AICPU executor lifecycle: set_platform_dep_gen_base / set_dep_gen_enabled / dep_gen_aicpu_set_orch_thread_idx / dep_gen_aicpu_init / dep_gen_aicpu_flush / dep_gen_aicpu_finalize hooked in alongside the existing PMU lifecycle calls.
- CallConfig + Python: `enable_dep_gen` int32 flag on ChipCallConfig, nanobind Python property (bool), conftest.py and scene_test.py `--enable-dep-gen` CLI option threaded through run_class_cases / _run_and_validate / _build_config. round > 1 disables capture (same pattern as enable_l2_swimlane).
- ProfilerBase suppression: `bugprone-crtp-constructor-accessibility` was newly tripped by the new dep_gen_collector.cpp pulling profiler_base.h into clang-tidy's scope. Suppressed with NOLINTNEXTLINE plus a comment explaining the intentional public ctor (all derived collectors call it).
- a2a3sim test: tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py re-uses the vector_example orchestration (5 submit_task calls), runs with --enable-dep-gen, then asserts submit_trace.bin size equals 5 * sizeof(DepGenRecord) and spot-checks the first record's tensor_count == 3.

Verified:
- a2a3sim spmd_sync_start baseline (no flag): PASSED — wiring did not perturb the default path.
- a2a3sim dep_gen_capture (with --enable-dep-gen): PASSED, post-run verification PASSED (submit_trace.bin has exactly 5 records).
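A host-side consumer of submit_trace.bin only needs the fixed record size to validate a capture. This hypothetical Python sketch checks the invariant the test asserts; the 2240-byte size comes from the commit message above, but the field layout is not spelled out there, so the sketch deliberately avoids parsing individual fields:

```python
DEP_GEN_RECORD_SIZE = 2240  # sizeof(DepGenRecord) per the commit message

def count_records(blob: bytes) -> int:
    """Return the number of fixed-size DepGenRecord entries in a trace,
    rejecting blobs that are not a whole number of records."""
    if len(blob) % DEP_GEN_RECORD_SIZE != 0:
        raise ValueError(f"truncated trace: {len(blob)} bytes")
    return len(blob) // DEP_GEN_RECORD_SIZE

# vector_example performs 5 submit_task calls, so a clean trace holds 5 records.
print(count_records(b"\x00" * (5 * DEP_GEN_RECORD_SIZE)))  # 5
```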
Replace the boolean enable_profiling flag with an integer perf_level
throughout the profiling pipeline (CLI → Python bindings → C++ runtime →
AICPU executor → PerformanceCollector → JSON export).
Profiling levels:
0 = off
1 = AICore task start/end timing only (JSON version 0)
2 = + dispatch timestamps, finish timestamps, fanout edges (JSON version 1)
3 = + AICPU scheduler/orchestrator phase buffers (JSON version 2)
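The level-to-version mapping above, together with the "presence of phase data" condition mentioned under Key changes, could be sketched as a small helper. The function name and exact fallback behavior are assumptions, not the collector's actual code:

```python
def select_export_version(perf_level: int, has_phase_data: bool) -> int:
    # Level 1 -> version 0; level 2 -> version 1; level 3 -> version 2,
    # but only when phase buffers were actually collected.
    if perf_level <= 0:
        raise ValueError("profiling is off; nothing to export")
    if perf_level >= 3 and has_phase_data:
        return 2
    if perf_level >= 2:
        return 1
    return 0

print(select_export_version(1, False))  # 0
print(select_export_version(2, False))  # 1
print(select_export_version(3, True))   # 2
```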
Key changes:
- ChipCallConfig: bool enable_profiling → int perf_level
- CLI --enable-profiling: store_true → optional int (bare flag defaults to 3)
- nanobind property: backward-compatible bool→3 coercion for legacy callers
- AICPU executor: split into task_recording_enabled (>0) vs
  phase_recording_enabled (>=3) to skip phase overhead at lower levels
- PerformanceCollector: skip phase buffer allocation when perf_level < 3;
  version selection based on perf_level and presence of phase data
- swimlane_converter.py: accept version 0, tolerate missing fanout field
- Fix scene_test.py: `val and cond` truncated int to bool; use ternary
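The bool→3 coercion and the scene_test.py ternary fix listed above can both be illustrated with a short sketch; the helper name is hypothetical and the behavior is inferred from the bullet points:

```python
def coerce_perf_level(value) -> int:
    # Legacy callers passed enable_profiling=True/False; True now means
    # full profiling (3). Integers 0-3 pass through unchanged.
    # Note: bool must be checked first, since bool is a subclass of int.
    if isinstance(value, bool):
        return 3 if value else 0
    level = int(value)
    if not 0 <= level <= 3:
        raise ValueError(f"perf_level must be 0-3, got {level}")
    return level

# The scene_test.py bug: `level and enabled` returns the boolean operand
# when `level` is truthy, silently truncating the int. A ternary keeps it.
level, enabled = 2, True
assert (level and enabled) is True      # buggy: int lost
assert (level if enabled else 0) == 2   # fixed: int preserved

print(coerce_perf_level(True), coerce_perf_level(False), coerce_perf_level(2))
```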