fix(runtime): unify task state lifecycle with idempotent READY CAS and RUNNING promotion #748

Open
Crystal-wzy wants to merge 2 commits into hw-native-sys:main from Crystal-wzy:main

@Crystal-wzy Crystal-wzy commented May 12, 2026

Summary

  • Add enqueue_ready_once() to centralize PENDING → READY CAS and
    ready-queue push into a single idempotent operation. Previously the
    non-profiling path never transitioned task_state to READY, leaving
    queue state and task state inconsistent; multiple fanin observers could
    also push duplicate READY notifications. The new helper guarantees
    exactly-once enqueue via CAS, with both profiling and non-profiling
    paths sharing the same logic.
  • Add push_ready_queue() / push_ready_queue_batch() helpers that wrap
    the spin-retry loop, replacing raw ready_queues[…].push() calls across
    orchestration, completion, dispatch, and local-buffer overflow paths for
    consistent retry behavior.
  • Add mark_task_running_on_first_dispatch(): CAS READY → RUNNING on
    the first block dispatch (including the drain-dispatch path), completing
    the PENDING → READY → RUNNING → done state machine that was previously
    missing the RUNNING step.
  • Harden completion scanning: skip cores whose running_reg_task_id is
    AICPU_TASK_INVALID; when a running FIN is observed while a pending
    dispatch exists, wait for the pending ACK/FIN before promoting, ensuring
    the payload slot is hardware-latched.
  • Improve idle-iteration detection in resolve_and_dispatch: reset the
    idle counter when the global completed_tasks_ advances or when another
    scheduler thread still has running cores, preventing premature stall
    timeouts in multi-threaded scheduling.
  • Trim stall diagnostics: remove per-cluster idle/busy dump that was
    redundant with per-core logging; move CoreTracker variable after the
    ring scan for narrower scope.
  • Add SPIN_WAIT_HINT() to the ready-queue enqueue_pos CAS retry loop to
    reduce contention on hardware.
  • Update test_task_state.cpp: NonProfilingReadyPath now asserts
    PTO2_TASK_READY instead of PTO2_TASK_PENDING, matching the unified
    state machine.
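
The PENDING → READY → RUNNING transitions described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `TaskState`, `Task`, and `ReadyQueue` are hypothetical stand-ins (the real scheduler uses `PTO2_TASK_*` constants and `PTO2SchedulerState`), and only the names `enqueue_ready_once` and `mark_task_running_on_first_dispatch` come from the summary. The point is that a single compare-and-swap picks exactly one winner among several fanin observers, so the push happens once.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical state enum; the PR itself uses PTO2_TASK_* constants.
enum class TaskState : std::uint8_t { Pending, Ready, Running, Done };

// Hypothetical task record holding only the atomic lifecycle state.
struct Task {
    std::atomic<TaskState> state{TaskState::Pending};
};

// Stand-in ready queue for illustration; NOT thread-safe, unlike the real one.
struct ReadyQueue {
    std::vector<Task*> items;
    bool push(Task* t) {
        items.push_back(t);
        return true;
    }
};

// CAS PENDING -> READY, then push. Only the caller that wins the CAS pushes,
// so repeated or concurrent calls enqueue the task at most once.
bool enqueue_ready_once(Task& task, ReadyQueue& queue) {
    TaskState expected = TaskState::Pending;
    if (!task.state.compare_exchange_strong(expected, TaskState::Ready,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
        return false;  // someone else already moved the task past PENDING
    }
    bool pushed = queue.push(&task);
    assert(pushed);  // the stand-in queue never rejects a push
    (void)pushed;
    return true;
}

// CAS READY -> RUNNING. Returns true only for the first dispatch of the task;
// later block dispatches see RUNNING already set and return false.
bool mark_task_running_on_first_dispatch(Task& task) {
    TaskState expected = TaskState::Ready;
    return task.state.compare_exchange_strong(expected, TaskState::Running,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire);
}
```

In this sketch a second call to either helper is a harmless no-op that returns false, which is what makes the operations idempotent.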

Testing

  • CPU sim for models/qwen3/14b/qwen3_14b_decode.py passes after the
    READY-state fix.
  • git diff --check passes for the runtime worktree.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces task cookie tracking in simulation and enhances the PTO2 scheduler's handling of MIX resource shapes and task state transitions. Key updates include the addition of sim_set_current_task_cookie, new queue management methods in PTO2SchedulerState, and improved synchronization logic for MIX tasks during completion and dispatch. Additionally, the non-profiling ready path now explicitly transitions tasks to the READY state. Feedback was provided regarding a potential performance regression in the push_ready_queue_batch implementation, which uses multiple single-item pushes instead of a single batch operation.
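
To make the batch-push concern concrete, here is a rough sketch of why one batch reservation can be cheaper than several single-item pushes. Everything in it is an assumption for illustration: `SketchQueue`, `reserve`, and `cas_count` are hypothetical and do not reflect the actual ready-queue implementation, and only the producer side is modeled (no consumer, no wrap-around).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical bounded queue modeling only producer-side slot reservation.
template <std::size_t Capacity>
struct SketchQueue {
    std::array<int, Capacity> slots{};
    std::atomic<std::size_t> enqueue_pos{0};
    std::atomic<std::size_t> cas_count{0};  // instrumentation for this sketch only

    // Reserve n consecutive slots with one CAS loop.
    // Returns false when fewer than n slots remain.
    bool reserve(std::size_t n, std::size_t& start) {
        std::size_t pos = enqueue_pos.load(std::memory_order_relaxed);
        do {
            if (pos + n > Capacity) return false;
            cas_count.fetch_add(1, std::memory_order_relaxed);
        } while (!enqueue_pos.compare_exchange_strong(
            pos, pos + n, std::memory_order_acq_rel, std::memory_order_relaxed));
        start = pos;
        return true;
    }

    // Single-item push: one reservation (one CAS attempt or more) per item.
    bool push(int value) {
        std::size_t start = 0;
        if (!reserve(1, start)) return false;
        slots[start] = value;
        return true;
    }

    // Batch push: one reservation for the whole batch, then plain stores.
    bool push_batch(const int* values, std::size_t n) {
        assert(values != nullptr || n == 0);
        std::size_t start = 0;
        if (!reserve(n, start)) return false;
        for (std::size_t i = 0; i < n; ++i) slots[start + i] = values[i];
        return true;
    }
};
```

Under contention each CAS attempt can fail and retry, so a helper that loops over `push` pays that cost once per item, while a true batch operation pays it once per batch.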

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated
@Crystal-wzy Crystal-wzy force-pushed the main branch 4 times, most recently from 8975412 to 6dd079f Compare May 12, 2026 07:51
@Crystal-wzy Crystal-wzy changed the title from "fix(backend): Fix MIX task scheduling races and add sim task cookie" to "fix(runtime): unify task state lifecycle with idempotent READY CAS and RUNNING promotion" May 12, 2026
…tly-once enqueue

## Summary
- Extract `enqueue_ready_once()` in `PTO2SchedulerState`, which atomically
  CASes task_state PENDING→READY before pushing to the ready queue,
  guaranteeing exactly-once enqueue and eliminating potential duplicate
  enqueue races (both profiling and non-profiling overloads)
- Replace all inline ready-queue push sites in `wire_slot()` and
  `release_fanin_and_check_ready()` with `enqueue_ready_once()` calls
- Add READY→RUNNING state transition via `mark_task_running_on_first_dispatch()`
  helper, called on first block dispatch in both `dispatch_shape()` and
  `drain_worker_dispatch()`, completing the PENDING→READY→RUNNING state machine
- Add global `completed_tasks_` progress check in `resolve_and_dispatch()` idle
  loop to reset idle iterations when other threads make progress, preventing
  premature idle timeout
- Update `NonProfilingReadyPathMarksReady` unit test to verify task_state
  transitions to READY after `release_fanin_and_check_ready()` (previously
  expected PENDING by design)
- All changes applied symmetrically to both a2a3 and a5 scheduler paths
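
The idle-loop progress check in the list above might look roughly like the sketch below. It is an assumption-laden illustration: `IdleTracker`, `on_idle_iteration`, and `stall_limit` are hypothetical names, and the real `resolve_and_dispatch()` loop is not shown in this PR description. The idea is that a thread only counts an iteration as idle when the global completed-task counter has not moved and no other scheduler thread still has running cores.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-thread idle bookkeeping for a scheduler loop.
struct IdleTracker {
    std::uint64_t last_seen_completed = 0;  // snapshot of the global counter
    std::uint32_t idle_iterations = 0;      // consecutive iterations without progress

    // Call once per iteration in which this thread found no work.
    // Returns true when the stall limit is reached.
    bool on_idle_iteration(const std::atomic<std::uint64_t>& completed_tasks,
                           bool other_thread_has_running_cores,
                           std::uint32_t stall_limit) {
        assert(stall_limit > 0);
        const std::uint64_t now = completed_tasks.load(std::memory_order_acquire);
        if (now != last_seen_completed || other_thread_has_running_cores) {
            // Progress elsewhere in the system: this is not a stall, start over.
            last_seen_completed = now;
            idle_iterations = 0;
            return false;
        }
        return ++idle_iterations >= stall_limit;
    }
};
```

With a check like this, a thread that is starved while its peers are busy does not trip the stall timeout; only a system-wide lack of progress does.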

## Testing
- [x] All unit tests pass
- [x] Pre-commit hooks pass