Refactor: PTO2 stall diagnostic log format #741
Open
ChaoZheng109 wants to merge 1 commit into
Code Review
This pull request enhances the scheduler's diagnostic capabilities and task state management. Key changes include a significant refactoring of the stall diagnostic logging to provide more granular information about task states (RUNNING, READY, WAIT), kernel IDs, and core assignments. It also introduces explicit task state transitions to PTO2_TASK_READY and PTO2_TASK_RUNNING during the dispatch process and increases the stall log interval to reduce log volume. Additionally, helper functions were added to improve core status formatting and thread ownership lookups on the diagnostic path.
The stall log was hard to read when multiple threads were idle: every scheduler
thread spinning at the same idle rate hit STALL_LOG_INTERVAL together,
but only T0 emitted anything, and the lines themselves had no per-line
context. Once device_log interleaved their output you could not tell
which round a line belonged to, and you had to cross-reference task IDs
against core state by hand.
scheduler_cold_path.cpp now emits a uniform self-contained format on
every line:
[STALL thread=N idle_iterations=K] CATEGORY ...
Categories:
- SUMMARY (T0 only): completed/total + scan totals.
- TASK (T0 only, one per non-completed task): RUNNING lines include a
  running_on=[owner_thread=... cores=[...]] cross-reference; WAIT lines
  include missing_deps=N.
- CLUSTER (every thread, one per owned cluster): busy slots show
  kernel + task_id + cond_reg_state, with an ANOMALY suffix when the COND
  register reports fin but the slot is still marked busy in software.

grep 'idle_iterations=N' now groups one round's output across threads.
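As a concrete illustration of the line format, here is a hypothetical sketch of a prefix formatter; the helper names and shapes below are invented, and the real helpers in scheduler_cold_path.cpp may look different.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical sketch of the uniform line prefix; names are illustrative only.
enum class StallCategory { Summary, Task, Cluster };

static const char* category_name(StallCategory c) {
    switch (c) {
        case StallCategory::Summary: return "SUMMARY";
        case StallCategory::Task:    return "TASK";
        case StallCategory::Cluster: return "CLUSTER";
    }
    return "?";
}

// Every diagnostic line carries the thread id and the idle-iteration count,
// so `grep 'idle_iterations=N'` groups one round's output across all threads.
std::string stall_prefix(uint32_t thread_id, uint64_t idle_iterations,
                         StallCategory category) {
    char buf[96];
    std::snprintf(buf, sizeof(buf), "[STALL thread=%u idle_iterations=%llu] %s",
                  thread_id, (unsigned long long)idle_iterations,
                  category_name(category));
    return std::string(buf);
}
```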
state= is now derived from ground truth rather than slot_state.task_state:
task_state's intermediate values (READY / RUNNING) are intentionally not
written on the non-profiling hot path (only PENDING / COMPLETED / CONSUMED
are), so reading it would miscategorize. Instead we scan core_exec_states_
once per task — a slot is RUNNING iff some core has it as
running_slot_state, READY iff fanin_refcount >= fanin_count and no core
is running it, WAIT otherwise. The same scan produces the running_on=[...]
cross-reference for free. Zero new atomics on the scheduler hot path.
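The classification rule above can be sketched as a single pass over per-core state. The types below (CoreExecState, TaskSlot, the field names) are hypothetical stand-ins for the real scheduler structures; only the rule itself (RUNNING iff some core runs the slot, READY iff the fanin refcount is satisfied, WAIT otherwise) comes from the description.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical minimal stand-ins for core_exec_states_ and the task slot.
enum class TaskState { Wait, Ready, Running };

struct CoreExecState {
    int running_slot_state;  // slot index this core is executing, -1 if idle
};

struct TaskSlot {
    uint32_t fanin_refcount;  // dependencies satisfied so far
    uint32_t fanin_count;     // total dependencies
};

// One scan over the per-core exec states both classifies the slot and
// collects the cores that own it (the running_on=[...] cross-reference).
TaskState classify_slot(int slot, const TaskSlot& t,
                        const std::vector<CoreExecState>& cores,
                        std::vector<int>* running_on) {
    bool running = false;
    for (int c = 0; c < (int)cores.size(); ++c) {
        if (cores[c].running_slot_state == slot) {
            running = true;
            if (running_on) running_on->push_back(c);
        }
    }
    if (running) return TaskState::Running;  // some core is executing it
    if (t.fanin_refcount >= t.fanin_count)
        return TaskState::Ready;             // deps met, but no core runs it
    return TaskState::Wait;                  // deps still missing
}
```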
scheduler_dispatch.cpp: log_stall_diagnostics is now invoked from every
thread (was T0 only), passing total_tasks_ instead of the thread-local
task_count, so per-thread CLUSTER lines reach the log.
MAX_IDLE_ITERATIONS comparison switched from > to >= to match the
STALL_LOG_INTERVAL boundary.
scheduler_types.h: STALL_LOG_INTERVAL 50000 -> 400000 (one diagnostic
round ~every 10s of idle instead of every 1.25s, which matches how
long a real hang takes to triage and stops drowning normal idle
periods in log spam).
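The timing arithmetic above can be checked mechanically: the stated numbers imply roughly 25 us per idle iteration (1.25 s / 50000). The sketch below is hypothetical; the MAX_IDLE_ITERATIONS value is invented for illustration, and the gate functions are one plausible shape of the checks this change touches, not the actual code.

```cpp
#include <cstdint>

// STALL_LOG_INTERVAL matches the new value in scheduler_types.h; the per-
// iteration cost is back-derived from the stated timings, not measured.
constexpr uint64_t STALL_LOG_INTERVAL  = 400000;
constexpr uint64_t MAX_IDLE_ITERATIONS = 4000000;  // hypothetical value
constexpr double   US_PER_IDLE_ITER    = 1.25e6 / 50000.0;  // = 25 us

// 400000 iterations * 25 us = 10 s of idle per diagnostic round.
static_assert(US_PER_IDLE_ITER == 25.0, "derived from 1.25 s / 50000");
static_assert(STALL_LOG_INTERVAL * 25 == 10'000'000, "one round per ~10 s");

// One plausible shape of the idle-loop gates: the modulo fires on every
// interval boundary, and the give-up check uses >= so it triggers on the
// boundary itself instead of one iteration later (the > vs >= fix).
bool should_log_stall(uint64_t idle_iterations) {
    return idle_iterations > 0 && idle_iterations % STALL_LOG_INTERVAL == 0;
}
bool should_give_up(uint64_t idle_iterations) {
    return idle_iterations >= MAX_IDLE_ITERATIONS;  // was '>'
}
```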
Mirrored across a2a3 and a5 tensormap_and_ringbuffer scheduler trees.
Summary
Reformat the PTO2 scheduler stall diagnostic so multi-thread output is greppable. No changes to scheduler hot-path behavior — task_state semantics on the non-profiling path are preserved exactly. Mirrored across src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/ and src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/.

Motivation
Every scheduler thread that goes idle hits STALL_LOG_INTERVAL together, but only T0 emitted anything, and the lines themselves had no per-line context. Once device_log interleaved their output across threads it was impossible to tell which diagnostic round a line belonged to.

Changes
1. Log format — uniform self-contained lines
Every line now starts with [STALL thread=N idle_iterations=K] CATEGORY .... grep "idle_iterations=N" groups one round across all threads.

- SUMMARY
- TASK: state=RUNNING includes running_on=[owner_thread=... cores=[...]]; state=WAIT includes missing_deps=N
- CLUSTER: kernel, task, cond_reg_state; ANOMALY suffix when the COND register reports fin but the slot is still marked busy in software

2. state= derived from ground truth, not task_state

task_state's intermediate values (READY / RUNNING) are intentionally not written on the non-profiling hot path — only PENDING / COMPLETED / CONSUMED are. Reading it would miscategorize. The new diagnostic scans core_exec_states_ once per task:

- state=RUNNING ⇔ some core has the slot as running_slot_state
- state=READY ⇔ fanin_refcount >= fanin_count and no core is running it
- state=WAIT ⇔ otherwise

The same scan produces the running_on=[...] cross-reference for free. Zero new atomics on the scheduler hot path.

3. Per-thread diagnostic emission + interval tuning
scheduler_dispatch.cpp: log_stall_diagnostics is now invoked from every thread (was T0-only), passing total_tasks_ instead of the thread-local task_count, so per-thread CLUSTER lines reach the log. MAX_IDLE_ITERATIONS comparison switched from > to >= to match the STALL_LOG_INTERVAL boundary. STALL_LOG_INTERVAL 50000 → 400000 (one round ~every 10s of idle instead of every 1.25s — matches how long a real hang takes to triage, stops drowning the normal idle period in spam).

Log Output: Before vs After
Before (one diagnostic round)
Problems:

- state=0 was reported even for tasks that were actually running on a core, because the old classifier read task_state (which non-profiling code never updates past PENDING).
- Lines (STUCK-READY, STUCK-WAIT, scan result) carried no thread or round context — once device_log interleaved scheduler output across threads, you could not tell which line belonged to which round.
- kernel_id=-1 only showed one slot of the three; if AIV1 was the actually-stuck core, you couldn't tell from the line.

After (one diagnostic round)
Wins:

- grep "idle_iterations=400000" returns exactly one diagnostic round, across all threads.
- state=RUNNING lines carry running_on=[...] so the stuck task points directly at the core(s) that own it (note tasks 4294967300 and 4294967305 land on different owner_threads — the cross-ref tells you which scheduler thread to look at).
- state= is derived from core_exec_states_ + refcount, so it stays accurate without forcing the hot path to maintain extra state.
- state=WAIT lines report missing_deps=N directly — no more refcount/fanin math by hand.
- kernels=[aic:_ aiv0:_ aiv1:_] shows all three subslots, so the AIV-only stuck case is obvious.
- cond_reg_state=ack|fin per busy core (and an ANOMALY suffix when HW says fin but SW still thinks the slot is busy).
- grep "STALL" finds the death banner too.

Test plan
- TaskStateTest.NonProfilingReadyPathStaysPending (asserts the preserved non-profiling contract)
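The test plan names TaskStateTest.NonProfilingReadyPathStaysPending; the contract it asserts can be sketched with plain asserts and invented stand-in types. Slot and on_task_ready below are hypothetical; only the stay-at-PENDING contract on the non-profiling path comes from the description.

```cpp
// Hypothetical stand-ins for the real task-state machinery; the enum values
// mirror the states named in the description (PENDING / READY / RUNNING /
// COMPLETED / CONSUMED), not any actual header.
enum class PTO2TaskState { PENDING, READY, RUNNING, COMPLETED, CONSUMED };

struct Slot {
    PTO2TaskState task_state = PTO2TaskState::PENDING;
};

// The contract under test: on the non-profiling ready path, dependency
// tracking happens via refcounts only and task_state is deliberately left
// at PENDING; only the profiling path records the intermediate READY state.
void on_task_ready(Slot& s, bool profiling) {
    if (profiling) {
        s.task_state = PTO2TaskState::READY;  // profiling-only bookkeeping
    }
    // non-profiling: no write, so diagnostics must not key off task_state
}
```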