Refactor: PTO2 stall diagnostic log format#741

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:refactor/pto2-stall-diagnostic-log-format

Conversation

Collaborator

@ChaoZheng109 ChaoZheng109 commented May 11, 2026

Summary

Reformat the PTO2 scheduler stall diagnostic so multi-thread output is greppable. No changes to scheduler hot-path behavior — task_state semantics on the non-profiling path are preserved exactly.

Mirrored across src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/ and src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/.

Motivation

Every scheduler thread that goes idle hits STALL_LOG_INTERVAL together, but only T0 emitted anything, and the lines themselves had no per-line context. Once device_log interleaved their output across threads it was impossible to tell which diagnostic round a line belonged to.

Changes

1. Log format — uniform self-contained lines

Every line now starts with [STALL thread=N idle_iterations=K] CATEGORY .... A grep for "idle_iterations=K" groups one round's output across all threads.

  • SUMMARY (emitted by T0 only): completed / total + scan_ready / scan_waiting / scan_running
  • TASK (emitted by T0 only): one per non-completed task; state=RUNNING includes running_on=[owner_thread=... cores=[...]], state=WAIT includes missing_deps=N
  • CLUSTER (emitted by every thread): one per owned cluster; busy slots show kernel, task, cond_reg_state, with an ANOMALY suffix when the COND register reports fin but the slot is still marked busy in software

2. state= derived from ground truth, not task_state

task_state's intermediate values (READY / RUNNING) are intentionally not written on the non-profiling hot path — only PENDING / COMPLETED / CONSUMED are. Reading it would miscategorize.

The new diagnostic scans core_exec_states_ once per task:

  • state=RUNNING ⇔ some core has the slot as running_slot_state
  • state=READY ⇔ fanin_refcount >= fanin_count and no core is running it
  • state=WAIT ⇔ otherwise

The same scan produces the running_on=[...] cross-reference for free. Zero new atomics on the scheduler hot path.
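The scan above reduces to a pure classification function. All type and field names here (CoreExec, running_slot, TaskView) are illustrative assumptions; only the decision order mirrors the three rules in this section:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class TaskState { RUNNING, READY, WAIT };

// Illustrative stand-ins for a core_exec_states_ entry and a task's
// dependency counters; the real scheduler field names differ.
struct CoreExec {
    int running_slot = -1;  // slot this core is executing, -1 when idle
};

struct TaskView {
    int slot = -1;
    uint32_t fanin_refcount = 0;
    uint32_t fanin_count = 0;
};

// One pass over the core states per task: RUNNING beats READY beats WAIT,
// matching the rules above. No atomics are touched on this path.
template <std::size_t N>
TaskState classify(const TaskView& t, const std::array<CoreExec, N>& cores) {
    for (const auto& c : cores) {
        if (c.running_slot == t.slot) {
            return TaskState::RUNNING;  // ground truth: a core owns the slot
        }
    }
    if (t.fanin_refcount >= t.fanin_count) {
        return TaskState::READY;  // deps satisfied, nobody running it
    }
    return TaskState::WAIT;  // missing_deps = fanin_count - fanin_refcount
}
```

The same loop can record which cores matched in order to build the running_on=[...] cross-reference without a second pass.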

3. Per-thread diagnostic emission + interval tuning

  • scheduler_dispatch.cpp: log_stall_diagnostics is invoked from every thread (was T0-only), passing total_tasks_ instead of the thread-local task_count, so per-thread CLUSTER lines reach the log.
  • MAX_IDLE_ITERATIONS comparison switched from > to >= to match the STALL_LOG_INTERVAL boundary.
  • STALL_LOG_INTERVAL 50000 → 400000 (one round ~every 10s of idle instead of every 1.25s — matches how long a real hang takes to triage, stops drowning the normal idle period in spam).
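A minimal sketch of how the two constants interact at the boundary; the predicate names are hypothetical, and only the > → >= change, the 400000 interval, and the 800000 timeout (visible in the TIMEOUT_EXIT line below) come from the diff:

```cpp
#include <cassert>

// Interval per scheduler_types.h; the timeout value matches the
// after_idle_iterations=800000 death banner in the sample log.
constexpr long STALL_LOG_INTERVAL  = 400000;
constexpr long MAX_IDLE_ITERATIONS = 800000;

// Hypothetical predicates for the idle loop.
bool should_log(long idle) {
    return idle > 0 && idle % STALL_LOG_INTERVAL == 0;
}

// Was '>': with '>' the idle == MAX_IDLE_ITERATIONS iteration would emit
// a diagnostic round but not exit until one iteration later.
bool should_exit(long idle) {
    return idle >= MAX_IDLE_ITERATIONS;
}
```

With >=, the final diagnostic round and the TIMEOUT_EXIT banner fire on the same iteration, which is why both carry idle_iterations=800000 in the sample output.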

Log Output: Before vs After

Before (one diagnostic round)

PTO2 stall: no progress for 50000 iterations, completed=3 total=5 (last progress at 0)
  STUCK-READY  ring=1 task_id=4294967298 kernel_id=-1 refcount=3 fanin=3 state=0
  STUCK-WAIT   ring=1 task_id=4294967299 kernel_id=-1 refcount=2 fanin=3 state=0
  scan result: stuck_ready=1 stuck_waiting=1 in_flight=0
  thread=0 running_cores=0 (AIC=0 AIV=0) core_num=3
    cluster[0] aic=0(idle) aiv0=3(idle) aiv1=4(idle)

Problems:

  • Lines from other threads were silently dropped — only T0 emitted anything.
  • state=0 was reported even for tasks that were actually running on a core, because the old classifier read task_state (which non-profiling code never updates past PENDING).
  • Lines from different rounds shared identical prefixes (STUCK-READY, STUCK-WAIT, scan result) — once device_log interleaved scheduler output across threads, you could not tell which line belonged to which round.
  • kernel_id=-1 only showed one slot of the three; if AIV1 was the actually-stuck core, you couldn't tell from the line.
  • No back-reference from a stuck task to the core(s) running it.

After (one diagnostic round)

[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967300 state=RUNNING fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[owner_thread=2 cores=[core=47(aiv0)]]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967305 state=RUNNING fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[owner_thread=1 cores=[core=45(aiv0)]]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967585 state=READY   fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967605 state=WAIT    fanin_refcount=3/4 kernels=[aic:-1 aiv0:3 aiv1:-1] missing_deps=1
[STALL thread=0 idle_iterations=400000] SUMMARY completed=716/1280 last_progress_iteration=711 scan_ready=160 scan_waiting=308 scan_running=96
[STALL thread=2 idle_iterations=400000] CLUSTER cluster_id=2 aic=core2(idle) aiv0=core28(busy kernel=3 task=4294967595 cond_reg_state=ack) aiv1=core29(busy kernel=3 task=4294967415 cond_reg_state=ack)
[STALL thread=2 idle_iterations=400000] CLUSTER cluster_id=5 aic=core5(idle) aiv0=core34(busy kernel=3 task=4294967350 cond_reg_state=ack) aiv1=core35(busy kernel=3 task=4294967410 cond_reg_state=ack)
[ERROR] [STALL thread=0 idle_iterations=800000] TIMEOUT_EXIT after_idle_iterations=800000

Wins:

  • Every line is self-contained — interleaved output from T0/T1/T2 is still readable. grep "idle_iterations=400000" returns exactly one diagnostic round, across all threads.
  • state=RUNNING lines carry running_on=[...] so the stuck task points directly at the core(s) that own it (note tasks 4294967300 and 4294967305 land on different owner_threads — the cross-ref tells you which scheduler thread to look at).
  • state= is derived from core_exec_states_ + refcount, so it stays accurate without forcing the hot path to maintain extra state.
  • state=WAIT lines report missing_deps=N directly — no more refcount/fanin math by hand.
  • kernels=[aic:_ aiv0:_ aiv1:_] shows all three subslots, so the AIV-only stuck case is obvious.
  • CLUSTER lines from every thread now reach the log, with cond_reg_state=ack|fin per busy core (and an ANOMALY suffix when HW says fin but SW still thinks the slot is busy).
  • TIMEOUT_EXIT uses the same prefix, so a grep for STALL finds the death banner too.
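The ANOMALY condition on CLUSTER lines reduces to a small predicate. This is a sketch under assumed names (CondReg, is_anomaly); the actual COND register decoding lives in the scheduler:

```cpp
#include <cassert>

// Assumed two-state decode of the COND register for a busy slot.
enum class CondReg { kAck, kFin };

// ANOMALY: hardware already reports fin, but software still marks the
// slot busy, hinting at a completion lost on the scheduler side.
bool is_anomaly(bool sw_slot_busy, CondReg hw_state) {
    return sw_slot_busy && hw_state == CondReg::kFin;
}
```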

Test plan

  • Local cpp ut: 22/22 passing (including TaskStateTest.NonProfilingReadyPathStaysPending, which asserts the preserved non-profiling contract)
  • a5sim CI passes
  • a2a3sim CI passes


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances the scheduler's diagnostic capabilities and task state management. Key changes include a significant refactoring of the stall diagnostic logging to provide more granular information about task states (RUNNING, READY, WAIT), kernel IDs, and core assignments. It also introduces explicit task state transitions to PTO2_TASK_READY and PTO2_TASK_RUNNING during the dispatch process and increases the stall log interval to reduce log volume. Additionally, helper functions were added to improve core status formatting and thread ownership lookups on the diagnostic path. I have no feedback to provide as there were no review comments to assess.

The stall log was hard to read at multi-thread idle: every scheduler
thread spinning at the same idle rate hit STALL_LOG_INTERVAL together,
but only T0 emitted anything and the lines themselves had no per-line
context. Once device_log interleaved their output you could not tell
which round a line belonged to, and you had to cross-reference task IDs
against core state by hand.

scheduler_cold_path.cpp now emits a uniform self-contained format on
every line:

    [STALL thread=N idle_iterations=K] CATEGORY ...

Categories: SUMMARY (T0 only, completed/total + scan totals), TASK
(T0 only, one per non-completed task — RUNNING lines include a
running_on=[owner_thread=... cores=[...]] cross-reference, WAIT lines
include missing_deps=N), CLUSTER (every thread, one per owned cluster,
busy slots show kernel + task_id + cond_reg_state with an ANOMALY
suffix when the COND register reports fin but the slot is still marked
busy in software). grep 'idle_iterations=K' now groups one round's
output across threads.

state= is now derived from ground truth rather than slot_state.task_state:
task_state's intermediate values (READY / RUNNING) are intentionally not
written on the non-profiling hot path (only PENDING / COMPLETED / CONSUMED
are), so reading it would miscategorize. Instead we scan core_exec_states_
once per task — a slot is RUNNING iff some core has it as
running_slot_state, READY iff fanin_refcount >= fanin_count and no core
is running it, WAIT otherwise. The same scan produces the running_on=[...]
cross-reference for free. Zero new atomics on the scheduler hot path.

scheduler_dispatch.cpp: log_stall_diagnostics is now invoked from every
thread (was T0 only), passing total_tasks_ instead of the thread-local
task_count, so per-thread CLUSTER lines reach the log.
MAX_IDLE_ITERATIONS comparison switched from > to >= to match the
STALL_LOG_INTERVAL boundary.

scheduler_types.h: STALL_LOG_INTERVAL 50000 -> 400000 (one diagnostic
round ~every 10s of idle instead of every 1.25s, which matches how
long a real hang takes to triage and stops drowning normal idle
periods in log spam).

Mirrored across a2a3 and a5 tensormap_and_ringbuffer scheduler trees.
@ChaoZheng109 ChaoZheng109 force-pushed the refactor/pto2-stall-diagnostic-log-format branch from af4f81d to 0c3e3c9 Compare May 11, 2026 13:18
@ChaoZheng109 ChaoZheng109 changed the title Refactor: PTO2 stall diagnostic log format + task_state correctness Refactor: PTO2 stall diagnostic log format May 11, 2026