Refactor: PTO2 stall diagnostic log format#741

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:refactor/pto2-stall-diagnostic-log-format

Conversation

Collaborator

@ChaoZheng109 ChaoZheng109 commented May 11, 2026

Summary

Reformat the PTO2 scheduler stall diagnostic so multi-thread output is greppable. No changes to scheduler hot-path behavior — task_state semantics on the non-profiling path are preserved exactly.

Mirrored across src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/ and src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/.

Motivation

Every scheduler thread that goes idle hits STALL_LOG_INTERVAL together, but only T0 emitted anything, and the lines themselves had no per-line context. Once device_log interleaved their output across threads it was impossible to tell which diagnostic round a line belonged to.

Changes

1. Log format — uniform self-contained lines

Every line now starts with [STALL thread=N idle_iterations=K] CATEGORY .... A grep for "idle_iterations=K" groups one round's output across all threads.

  • SUMMARY (emitted by T0 only): completed / total + scan_ready / scan_waiting / scan_running
  • TASK (emitted by T0 only): one per non-completed task; state=RUNNING includes running_on=[owner_thread=... cores=[...]], state=WAIT includes missing_deps=N
  • CLUSTER (emitted by every thread): one per owned cluster; busy slots show kernel, task, cond_reg_state, with an ANOMALY suffix when the COND register reports fin but the slot is still marked busy in software

2. state= derived from ground truth, not task_state

task_state's intermediate values (READY / RUNNING) are intentionally not written on the non-profiling hot path — only PENDING / COMPLETED / CONSUMED are. Reading it would miscategorize.

The new diagnostic scans core_exec_states_ once per task:

  • state=RUNNING ⇔ some core has the slot as running_slot_state
  • state=READY ⇔ fanin_refcount >= fanin_count and no core is running it
  • state=WAIT ⇔ otherwise

The same scan produces the running_on=[...] cross-reference for free. Zero new atomics on the scheduler hot path.
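The scan above reduces to a pure classification function. All type and field names here (CoreExec, running_slot, TaskView) are illustrative assumptions; only the decision order mirrors the three rules in this section:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class TaskState { RUNNING, READY, WAIT };

// Illustrative stand-ins for a core_exec_states_ entry and a task's
// dependency counters; the real scheduler field names differ.
struct CoreExec {
    int running_slot = -1;  // slot this core is executing, -1 when idle
};

struct TaskView {
    int slot = -1;
    uint32_t fanin_refcount = 0;
    uint32_t fanin_count = 0;
};

// One pass over the core states per task: RUNNING beats READY beats WAIT,
// matching the rules above. No atomics are touched on this path.
template <std::size_t N>
TaskState classify(const TaskView& t, const std::array<CoreExec, N>& cores) {
    for (const auto& c : cores) {
        if (c.running_slot == t.slot) {
            return TaskState::RUNNING;  // ground truth: a core owns the slot
        }
    }
    if (t.fanin_refcount >= t.fanin_count) {
        return TaskState::READY;  // deps satisfied, nobody running it
    }
    return TaskState::WAIT;  // missing_deps = fanin_count - fanin_refcount
}
```

The same loop can record which cores matched in order to build the running_on=[...] cross-reference without a second pass.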

3. Per-thread diagnostic emission + interval tuning

  • scheduler_dispatch.cpp: log_stall_diagnostics is invoked from every thread (was T0-only), passing total_tasks_ instead of the thread-local task_count, so per-thread CLUSTER lines reach the log.
  • MAX_IDLE_ITERATIONS comparison switched from > to >= to match the STALL_LOG_INTERVAL boundary.
  • STALL_LOG_INTERVAL 50000 → 400000 (one round ~every 10s of idle instead of every 1.25s — matches how long a real hang takes to triage, stops drowning the normal idle period in spam).
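A minimal sketch of how the two constants interact at the boundary; the predicate names are hypothetical, and only the > → >= change, the 400000 interval, and the 800000 timeout (visible in the TIMEOUT_EXIT line below) come from the diff:

```cpp
#include <cassert>

// Interval per scheduler_types.h; the timeout value matches the
// after_idle_iterations=800000 death banner in the sample log.
constexpr long STALL_LOG_INTERVAL  = 400000;
constexpr long MAX_IDLE_ITERATIONS = 800000;

// Hypothetical predicates for the idle loop.
bool should_log(long idle) {
    return idle > 0 && idle % STALL_LOG_INTERVAL == 0;
}

// Was '>': with '>' the idle == MAX_IDLE_ITERATIONS iteration would emit
// a diagnostic round but not exit until one iteration later.
bool should_exit(long idle) {
    return idle >= MAX_IDLE_ITERATIONS;
}
```

With >=, the final diagnostic round and the TIMEOUT_EXIT banner fire on the same iteration, which is why both carry idle_iterations=800000 in the sample output.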

Log Output: Before vs After

Before (one diagnostic round)

PTO2 stall: no progress for 50000 iterations, completed=3 total=5 (last progress at 0)
  STUCK-READY  ring=1 task_id=4294967298 kernel_id=-1 refcount=3 fanin=3 state=0
  STUCK-WAIT   ring=1 task_id=4294967299 kernel_id=-1 refcount=2 fanin=3 state=0
  scan result: stuck_ready=1 stuck_waiting=1 in_flight=0
  thread=0 running_cores=0 (AIC=0 AIV=0) core_num=3
    cluster[0] aic=0(idle) aiv0=3(idle) aiv1=4(idle)

Problems:

  • Lines from other threads were silently dropped — only T0 emitted anything.
  • state=0 was reported even for tasks that were actually running on a core, because the old classifier read task_state (which non-profiling code never updates past PENDING).
  • Lines from different rounds shared identical prefixes (STUCK-READY, STUCK-WAIT, scan result) — once device_log interleaved scheduler output across threads, you could not tell which line belonged to which round.
  • kernel_id=-1 only showed one slot of the three; if AIV1 was the actually-stuck core, you couldn't tell from the line.
  • No back-reference from a stuck task to the core(s) running it.

After (one diagnostic round)

[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967300 state=RUNNING fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[owner_thread=2 cores=[core=47(aiv0)]]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967305 state=RUNNING fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[owner_thread=1 cores=[core=45(aiv0)]]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967585 state=READY   fanin_refcount=4/4 kernels=[aic:-1 aiv0:3 aiv1:-1]
[STALL thread=0 idle_iterations=400000] TASK ring=1 task_id=4294967605 state=WAIT    fanin_refcount=3/4 kernels=[aic:-1 aiv0:3 aiv1:-1] missing_deps=1
[STALL thread=0 idle_iterations=400000] SUMMARY completed=716/1280 last_progress_iteration=711 scan_ready=160 scan_waiting=308 scan_running=96
[STALL thread=2 idle_iterations=400000] CLUSTER cluster_id=2 aic=core2(idle) aiv0=core28(busy kernel=3 task=4294967595 cond_reg_state=ack) aiv1=core29(busy kernel=3 task=4294967415 cond_reg_state=ack)
[STALL thread=2 idle_iterations=400000] CLUSTER cluster_id=5 aic=core5(idle) aiv0=core34(busy kernel=3 task=4294967350 cond_reg_state=ack) aiv1=core35(busy kernel=3 task=4294967410 cond_reg_state=ack)
[ERROR] [STALL thread=0 idle_iterations=800000] TIMEOUT_EXIT after_idle_iterations=800000

Wins:

  • Every line is self-contained — interleaved output from T0/T1/T2 is still readable. grep "idle_iterations=400000" returns exactly one diagnostic round, across all threads.
  • state=RUNNING lines carry running_on=[...] so the stuck task points directly at the core(s) that own it (note tasks 4294967300 and 4294967305 land on different owner_threads — the cross-ref tells you which scheduler thread to look at).
  • state= is derived from core_exec_states_ + refcount, so it stays accurate without forcing the hot path to maintain extra state.
  • state=WAIT lines report missing_deps=N directly — no more refcount/fanin math by hand.
  • kernels=[aic:_ aiv0:_ aiv1:_] shows all three subslots, so the AIV-only stuck case is obvious.
  • CLUSTER lines from every thread now reach the log, with cond_reg_state=ack|fin per busy core (and an ANOMALY suffix when HW says fin but SW still thinks the slot is busy).
  • TIMEOUT_EXIT uses the same prefix, so a grep for STALL finds the death banner too.
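The ANOMALY condition on CLUSTER lines reduces to a small predicate. This is a sketch under assumed names (CondReg, is_anomaly); the actual COND register decoding lives in the scheduler:

```cpp
#include <cassert>

// Assumed two-state decode of the COND register for a busy slot.
enum class CondReg { kAck, kFin };

// ANOMALY: hardware already reports fin, but software still marks the
// slot busy, hinting at a completion lost on the scheduler side.
bool is_anomaly(bool sw_slot_busy, CondReg hw_state) {
    return sw_slot_busy && hw_state == CondReg::kFin;
}
```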

Test plan

  • Local cpp ut: 22/22 passing (including TaskStateTest.NonProfilingReadyPathStaysPending, which asserts the preserved non-profiling contract)
  • a5sim CI passes
  • a2a3sim CI passes


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances the scheduler's diagnostic capabilities and task state management. Key changes include a significant refactoring of the stall diagnostic logging to provide more granular information about task states (RUNNING, READY, WAIT), kernel IDs, and core assignments. It also introduces explicit task state transitions to PTO2_TASK_READY and PTO2_TASK_RUNNING during the dispatch process and increases the stall log interval to reduce log volume. Additionally, helper functions were added to improve core status formatting and thread ownership lookups on the diagnostic path. I have no feedback to provide as there were no review comments to assess.

The stall log was hard to read at multi-thread idle: every scheduler
thread spinning at the same idle rate hit STALL_LOG_INTERVAL together,
but only T0 emitted anything and the lines themselves had no per-line
context. Once device_log interleaved their output you could not tell
which round a line belonged to, and you had to cross-reference task IDs
against core state by hand.

scheduler_cold_path.cpp now emits a uniform self-contained format on
every line:

    [STALL thread=N idle_iterations=K] CATEGORY ...

Categories: SUMMARY (T0 only, completed/total + scan totals), TASK
(T0 only, one per non-completed task — RUNNING lines include a
running_on=[owner_thread=... cores=[...]] cross-reference, WAIT lines
include missing_deps=N), CLUSTER (every thread, one per owned cluster,
busy slots show kernel + task_id + cond_reg_state with an ANOMALY
suffix when the COND register reports fin but the slot is still marked
busy in software). grep 'idle_iterations=K' now groups one round's
output across threads.

state= is now derived from ground truth rather than slot_state.task_state:
task_state's intermediate values (READY / RUNNING) are intentionally not
written on the non-profiling hot path (only PENDING / COMPLETED / CONSUMED
are), so reading it would miscategorize. Instead we scan core_exec_states_
once per task — a slot is RUNNING iff some core has it as
running_slot_state, READY iff fanin_refcount >= fanin_count and no core
is running it, WAIT otherwise. The same scan produces the running_on=[...]
cross-reference for free. Zero new atomics on the scheduler hot path.

scheduler_dispatch.cpp: log_stall_diagnostics is now invoked from every
thread (was T0 only), passing total_tasks_ instead of the thread-local
task_count, so per-thread CLUSTER lines reach the log.
MAX_IDLE_ITERATIONS comparison switched from > to >= to match the
STALL_LOG_INTERVAL boundary.

scheduler_types.h: STALL_LOG_INTERVAL 50000 -> 400000 (one diagnostic
round ~every 10s of idle instead of every 1.25s, which matches how
long a real hang takes to triage and stops drowning normal idle
periods in log spam).

Mirrored across a2a3 and a5 tensormap_and_ringbuffer scheduler trees.
@ChaoZheng109 ChaoZheng109 force-pushed the refactor/pto2-stall-diagnostic-log-format branch from af4f81d to 0c3e3c9 Compare May 11, 2026 13:18
@ChaoZheng109 ChaoZheng109 changed the title Refactor: PTO2 stall diagnostic log format + task_state correctness Refactor: PTO2 stall diagnostic log format May 11, 2026