Lock-free scheduler with per-priority queues#193
Conversation
…ce fixes Rework Scheduler/Pool/Group to avoid the global mutex contention that showed up under load. Tasks are now placed into one of five lock-free queues (one per priority bucket) so producers and consumers no longer serialise on a single lock, and Group acquisition uses a counting-semaphore fast path with a lock-free waiter queue per priority. Pools whose concurrency is one (MainThread, TraceController, etc.) now use a specialised MPSC queue with a non-atomic consumer side. Tasks within a Sync group retain strict in-order delivery; equal-priority tasks across pools see relaxed global FIFO. The new lock-free primitives (TaskQueue, MPSCQueue, Semaphore) and the Group counting-lock are covered by Catch2 BDD-style tests plus a scheduler benchmark. During TSAN validation across macOS clang and Linux gcc 13 three pre-existing data races were uncovered and fixed: * IOController (POSIX): the IOFinished handler mutated watches[].events under tasks_mutex while the poll thread read the same field from inside ::poll() under notifier.mutex. The bump() call closed the wake window but released notifier.mutex before the mutation, leaving the race. Inline the wake and hold notifier.mutex across the watches mutation and follow-up fire_event. * Scheduler::submit: the cached Pool pointer on Reaction (scheduler_data) was read/written from any submitting thread without synchronisation. Switch the cache to std::atomic_load/store on the shared_ptr; the worst case is two submitters racing both compute the identical pool and last-writer-wins. * Watchdog data store: the service time_point was read by the chrono controller while being mutated by user threads emitting a service event, and the void specialisation returned a reference through a temporary shared_ptr. Centralise reads/writes through a per-(WatchdogGroup, RuntimeType) std::mutex, return the time_point by value, and route WatchdogServicer::service through WatchdogDataStore::service so writes share the read mutex. Validation: * macOS clang TSAN: dsl/IO, dsl/Inline, dsl/Watchdog 30/30 clean each; full suite 63/63. * Linux gcc 13 TSAN: same three tests 30/30 clean; 16 hot-path tests x 3 runs serially with no TSAN warnings. * macOS Release: 64/64. Also ignore build-*/ directories so out-of-tree build folders don't show up in git status. Co-authored-by: Cursor <cursoragent@cursor.com>
The Reaction::scheduler_data cache previously held an std::shared_ptr<void>
read/written via std::atomic_load/atomic_store. On libstdc++ those fall back
to a small global pool of mutexes (selected by pointer hash), which becomes
a contention point on hot submission paths.
Change scheduler_data to std::atomic<void*>{nullptr}. Pools live for the
lifetime of the Scheduler and the PowerPlant tears reactors down before the
scheduler, so a non-owning raw pointer is safe. Group::try_submit and
WaitEntry are switched to Pool* accordingly, and the Scheduler field
declaration order is changed so that pools outlive groups on destruction.
Also fix the clang-tidy errors that were blocking the lint job: switch the
queue Slot/Block backing storage to std::array (avoid-c-arrays, member
init), explicit-base BLOCK_SIZE, do-while -> while, use auto with new,
RunningLock special members, Semaphore destructor, missing direct includes,
unused-and-kept includes, and a couple of small test cleanups
(reserve before emplace, explicit lvalue MPSCQueue enqueue overload to
work around an MSVC overload-resolution quirk on int(i)).
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
Reworks the Scheduler / Pool / Group machinery to remove global mutex contention by introducing per-priority lock-free task queues, an MPSC specialisation for single-consumer pools, and a counting-semaphore fast path for single-Sync groups. The change set also fixes three pre-existing TSAN races (IO poll/pollfd::events, Reaction::scheduler_data cache, and Watchdog data store) and adds BDD-style unit tests plus a scheduler microbenchmark.
Changes:
- New lock-free primitives (
Queue/TaskQueue/MPSCQueue/Semaphore/Priority) and aPoolthat owns five priority-bucketed queues with off-mutex idle bookkeeping. Groupgains a lock-free fast path (signedtokens+ per-prioritywait_bucketsofWaitEntry) and aRunningLock, with the existing mutex-based path retained for multi-group submissions.- TSAN race fixes in
IOController_Posix.ipp,Scheduler::submit(atomicscheduler_dataraw pointer), andWatchdogDataStore(mutex-protected reads/writes,getreturns by value, newserviceentry point).
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| .gitignore | Ignore build-*/ out-of-tree build directories. |
| src/extension/IOController_Posix.ipp | Hold notifier.mutex across watches[].events mutation to close TSAN race with the poll thread. |
| src/dsl/word/Watchdog.hpp | Add per-store mutex; get returns by value; new service static. |
| src/dsl/word/emit/Watchdog.hpp | Delegate WatchdogServicer::service to the mutex-protected WatchdogDataStore::service. |
| src/threading/Reaction.hpp | Change scheduler_data to std::atomic<void*> to avoid shared_ptr atomic-load contention. |
| src/threading/scheduler/Scheduler.{hpp,cpp} | Switch cached pool to raw pointer; reorder members so groups destroy before pools; fast-path single-group submit. |
| src/threading/scheduler/Pool.{hpp,cpp} | Replace single mutex-protected queue with per-priority buckets (MPMC or MPSC), atomic accept/pending_tasks/external_waiters, external waiter tracking. |
| src/threading/scheduler/Group.{hpp,cpp} | Add try_submit/try_acquire_running_lock lock-free fast path with signed tokens, slow_pending, wait_buckets, and RunningLock. |
| src/threading/scheduler/queue/{Queue,TaskQueue,MPSCQueue,Semaphore,Priority}.hpp | New lock-free queue interface and concrete MPMC/MPSC implementations, counting semaphore, priority bucket helper. |
| tests/tests/threading/{TaskQueue,MPSCQueue,Semaphore,Group}.cpp | BDD-style unit tests for the new primitives; Group test gains a forward-declaration include. |
| tests/tests/Benchmark.cpp | New scheduler microbenchmark over no-sync / single-sync / two-sync configurations. |
| tests/tests/util/serialise/xxhash.cpp | Add missing <cstring> include. |
* TaskQueue/MPSCQueue::link_next_block: hold the freshly allocated Block in a std::unique_ptr so a lost CAS race no longer leaks the block. * Group: drain the wait_buckets and call unregister_external_waiter for every parked entry in ~Group so the Pool::external_waiters counter is balanced on Scheduler teardown (per declaration order Pools outlive Groups, so the raw Pool* pointers in WaitEntry remain valid). * Group.hpp: drop the orphan try_submit doc block that was sitting above try_acquire_running_lock and add the rule-of-five deletes that come with declaring an explicit destructor. Co-authored-by: Cursor <cursoragent@cursor.com>
In the lock-free fast path a task that can't take its group token is parked in the Group's wait_buckets instead of sitting in the destination pool's queue with a failing lock. The old scheduler relied on that queued-but- unrunnable task to force the pool worker to poll, fail the lock, and fall through to get_idle_task, so a parked waiter always produced exactly one idle fire on its destination pool. Without that, a worker preempted past its natural idle window could pick up the drained (lock-OK) task directly and silently swallow the idle epoch (flaky IdleSingle under load). Restore the invariant with a per-pool pending_idle latch: register_external_ waiter sets it and wakes one worker, and get_task consumes it to fire one idle epoch before dispatching the next task. Gate the whole mechanism on idle_relevant() (any idle reaction bound to this pool, or any global idle reaction) so the hot Sync-contended submission path pays nothing when no idle reaction exists. Without this gate the latch oscillated every iteration and regressed single-sync by ~70% and two-syncs by ~28%; with it, contended scheduling is back within noise of the no-fix baseline. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Seems to work, heres a trace file that I can't view if someone wants to check. Edit: perfetto website cooperating today, trace looks fine and can see providers. |
MacOS and Windows CI set NUCLEAR_TEST_TIME_UNIT_DEN=10, so the default TestBase timeout of 20 units is only 2000ms. UDP runs more network cases than TCP and routinely exceeds that on loaded runners, then hangs in the IO poll loop after the forced timeout and consumes the whole 5-minute test step. TCP already uses TimeUnit(50) for this reason; apply the same budget here. Co-authored-by: Cursor <cursoragent@cursor.com>
Fix the two BUG-severity destructor-throw findings (S1048/M23_201) by marking Group::release_token, drain_one_to_pool, and notify_slow_path noexcept. Apply stylistic refactors: remove continue/break from lock-free retry loops in TaskQueue/MPSCQueue and Group; make Pool::drain_queues const; rename Binder ctor param to avoid shadowing; drop redundant BLOCK_SIZE casts. Add C++14 odr-definitions for template static constexpr BLOCK_SIZE members used from Pool.cpp/Group.cpp instantiations. Add sonar-project.properties to scope-ignore S8417 (memory_order) under src/threading/** and src/extension/**, and S5025/S3630/S3432 (placement-new queue idioms) under **/scheduler/queue/*.hpp. Co-authored-by: Cursor <cursoragent@cursor.com>
…top, benchmark gating - TaskQueue/MPSCQueue destructors now run ~T on still-enqueued items so a non-empty queue torn down (e.g. holding Task's unique_ptr<ReactionTask>) no longer skips element destructors. Added red/green tests that fill the queue across multiple blocks, partially drain, then assert all live elements are destroyed on teardown. - TaskQueue::try_dequeue now retires the block when its read-path wins the head-advance CAS. Previously only try_reclaim_block retired on that CAS, so a race between the two left a fully-drained block unreachable from both head and the graveyard, leaking it under sustained contention. - Pool::stop(FORCE) now clears `accept` so persistent pools cannot repopulate the queues after a force stop drains them and stops the threads. - Benchmark cases are hidden behind the [.benchmark] tag so the slow, timing-sensitive matrix no longer runs in the default CTest suite; test runners pass --allow-running-no-tests so an all-hidden binary still exits 0. - Corrected the Reaction::scheduler_data doc comment to describe the actual plain release/acquire store (not a CAS / set-once). Co-authored-by: Cursor <cursoragent@cursor.com>
Tag fast-path waiters as handback vs normal parkers so the GroupLock opportunistic drain only keeps its pre-reserved token for handbacks. Add a deterministic Catch2 scenario that reproduces the publish/reconcile interleaving and verifies tokens are not permanently lost. Co-authored-by: Cursor <cursoragent@cursor.com>
…test Replace the unreliable handback-detection heuristic in park_reconcile (handback_parker && wait_buckets_empty()) with a per-waiter arbiter slot (shared_ptr<atomic<bool>>). Whoever flips the slot false->true -- the parker's own park_reconcile or a racing opportunistic drainer -- owns that waiter's single token decrement, so the keep/hand-back decision is exact regardless of how many other waiters are parked. This closes the token-leak hole where another parked waiter made wait_buckets_empty() return false and the phantom fetch_sub was never undone. Preserve slow-path priority: when a token is free but slow_pending > 0 the parker hands it straight back and stays parked uncounted (slot left unclaimed) so an older multi-group waiter is not jumped by a single-group fast submit. Clear Pool::current_pool in ~Scheduler. The constructor installs a non-owning pointer to the main thread pool; without a matching reset it dangles once the Scheduler is destroyed and any later ReactionTask built on that thread trips a bad_weak_ptr via Pool::current(). Reset it (only when it still points at one of our pools) so its lifetime is bounded by the Scheduler. Add a randomized multi-threaded stress test that mixes single-group fast-path submits with multi-group slow-path lock/unlock across several threads and concurrency levels, asserting no deadlock (bounded wait) and that tokens return to concurrency with a fresh submit still scheduling after quiescence. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove redundant member init on WaitEntry::slot and explicitly delete Scheduler copy/move operations to satisfy cppcoreguidelines-special-member-functions. Co-authored-by: Cursor <cursoragent@cursor.com>
Delete the unused Semaphore primitive and its test rather than fixing its multi-waiter wakeup bug, since it is dead code referenced only by its own test. Increment pending_tasks before publishing the task in Pool::submit and in the get_task re-enqueue path, so a worker can never dequeue a task before its count is registered (over-counting is safe; under-counting could transiently underflow). Reword the ~Scheduler comments to describe the real failure mode: a dangling current_pool leads to shared_from_this() on a destroyed pool (undefined behaviour), not a guaranteed bad_weak_ptr throw. Co-authored-by: Cursor <cursoragent@cursor.com>
Defer drained task destruction until after Pool::mutex is released so group-lock destructors cannot re-enter submit() under the lock. Keep NUCLEAR_GROUP_TEST_API off nuclear's public interface by making it PRIVATE on the library and PUBLIC on test_util for ODR-safe test TUs. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Assess which enqueue/dequeue paths are wait-free vs lock-free CAS vs brief spinning; add assessment doc and accurate header commentary. Use pause-and-yield on single-iteration stall paths and spin_until on the committed handoff without changing try_dequeue semantics. Co-authored-by: Cursor <cursoragent@cursor.com>
NUCLEAR_GROUP_TEST_API was never set when relying on the default ON option because src/CMakeLists.txt evaluated BUILD_TESTS before the option was declared in the root CMakeLists.txt. Co-authored-by: Cursor <cursoragent@cursor.com>
Use const Pred& since the predicate is only invoked locally. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Use uint8_t for PriorityLevel to satisfy clang-tidy performance-enum-size. Scale UDP TestBase timeout with active test count so the full matrix still fits CI windows (NUCLEAR_TEST_TIME_UNIT_DEN=10) without restoring the blanket Windows skip. Co-authored-by: Cursor <cursoragent@cursor.com>
Scale with active test count using factor 20 so a full matrix matches the prior TimeUnit(200) ceiling under NUCLEAR_TEST_TIME_UNIT_DEN=10. Co-authored-by: Cursor <cursoragent@cursor.com>
Full matrix on Windows CI finishes all but the last multicast-v6 receive at 20s; scale timeout with active_tests * 25. Co-authored-by: Cursor <cursoragent@cursor.com>
30s scaled timeout leaves headroom after the last receive for idle shutdown; 25s still tripped TestBase timeout while events were landing. Co-authored-by: Cursor <cursoragent@cursor.com>
40s scaled budget (active_tests * 40 under DEN=10) covers last receive plus Finished/idle shutdown on slow Windows runners. Co-authored-by: Cursor <cursoragent@cursor.com>
Multicast availability is gated by the round-trip probe on all platforms, but the NUClear UDP DSL matrix intermittently stalls on Windows CI runners (40s timeout with no receives). Skip on CI Windows only; matrix runs on Linux, macOS, and local Windows. Co-authored-by: Cursor <cursoragent@cursor.com>
Document pool routing, priority buckets, lock-free queues, group tokens, idle behaviour, and shutdown so PR #193 can merge with maintainer-facing architecture coverage on ReadTheDocs. Co-authored-by: Cursor <cursoragent@cursor.com>
Apply mdformat to docs/explanation/scheduler.md so the Markdown Formatting CI check passes. Co-authored-by: Cursor <cursoragent@cursor.com>
Spike write-ups belong in spike worktrees only; keep progress-guarantee content inline in scheduler docs and TaskQueue header. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace manual atomic flag management in bump(), the unmask path, and the poll loop with NotifierWakeGuard and NotifierPollScope so the wake-then-lock handoff cannot be skipped on error or early return. Co-authored-by: Cursor <cursoragent@cursor.com>
Platform pause intrinsics showed no measurable win in a handoff microbench and add ifdef complexity; plain yield loops match pre-spike behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
TrentHouliston
left a comment
There was a problem hiding this comment.
🤖 Doxygen pass across new scheduler code in 5735023d and 42a79531: queue classes (TaskQueue, MPSCQueue, Queue), block_ops.hpp helpers, ExternalWaiterRegistration, PriorityLevel mapping, and queue_live_tracker.hpp now have @tparam/@param/@return blocks matching existing NUClear style. TaskQueue.hpp also documents progress guarantees inline. Added docs/explanation/scheduler.md for the design overview. Group headers/sources are intentionally deferred for a follow-up pass as noted in your review.
Delete move operations to satisfy cppcoreguidelines-special-member-functions and mark signal() const. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace white-box Group tests with black-box helpers that observe behavior through the public try_acquire_running_lock and try_submit APIs only. Co-authored-by: Cursor <cursoragent@cursor.com>
Correct the scheduler_data comment (atomic stores are not a data race), cache Pool::current() on the submit hot path, and reflow scheduler.md with semantic line breaks. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Rebased onto main after lock-free scheduler merge (#193). Introduce typed PriorityLevel, ThreadPriority RAII for set/restore, and integrate priority handling with the lock-free scheduler pools. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Rework
Scheduler/Pool/Groupto remove the global mutex contention that showed up on the existing implementation under load:Poolnow owns five independent MPMCTaskQueues (one per priority bucket) instead of a single mutex-protected priority queue. Submitting and dequeuing on different priorities no longer contend at all, and same-priority submits use a block-based lock-free FIFO with a graveyard-style block reclamation scheme to avoid ABA.concurrency == 1(e.g.MainThread,TraceController) now constructMPSCQueueinstead ofTaskQueue, giving a non-atomic consumer side and removing the dequeue CAS entirely for those pools. Both queues sit behind aQueue<T>polymorphic interface soPoolcan hold an array of buckets and dispatch the right type at construction.Syncgroups now acquire via a signedstd::atomic<int> tokensplus a lock-freeTaskQueueper priority for waiters, falling back to the mutex-backedGroupLockonly for multi-group submissions. Sync ordering inside a group is strict; equal-priority tasks across pools see relaxed global FIFO.New lock-free primitives (
TaskQueue,MPSCQueue,Semaphore) and the Group counting fast path are covered by Catch2 BDD-style tests, plus a scheduler microbenchmark (tests/tests/Benchmark.cpp, hidden behind the[.benchmark]tag).TSAN race fixes (pre-existing)
While validating the changes under TSAN on macOS clang and Linux gcc 13 three pre-existing data races surfaced and are fixed in this PR:
IOController_Posix.ipp—pollfd::events. TheIOFinishedhandler mutatedwatches[].eventswhile only holdingtasks_mutex, and the poll thread read the same field from inside::poll()while only holdingnotifier.mutex.bump()woke poll but releasednotifier.mutexbefore the mutation, leaving the race window open. The handler now writes the wake byte inline and holdsnotifier.mutexacross thewatchesupdate and the follow-upfire_eventcall (which can also touchwatches[].events).Scheduler::submit—Reaction::scheduler_data. The cached pool was read/written from any submitting thread without synchronisation. It is now a non-owning rawPool*cached in astd::atomic<void*>(release store / acquire load) rather than astd::shared_ptr<void>accessed viastd::atomic_load/atomic_store— the shared_ptr atomics fall back to a small global mutex pool on libstdc++ and become a contention point on hot submission paths. Pools outlive all reactions (PowerPlant tears reactors down before the scheduler), so a raw pointer is safe, and all racers resolve the same pool, so the benign store is last-writer-wins.Watchdogdata store. The servicetime_pointwas read by the chrono controller while being mutated by user threads emitting a service event, and the void specialisation returned a reference through a temporaryshared_ptr(latent dangling reference). Centralised reads/writes through a per-(WatchdogGroup, RuntimeType)std::mutex, madegetreturn by value, and routedWatchdogServicer::servicethroughWatchdogDataStore::serviceso writes share the read mutex.Other
.gitignorenow matchesbuild-*/so out-of-tree TSAN/ASAN/Release build directories don't appear ingit status.sonar-project.properties.Test plan
dsl/IO,dsl/Inline,dsl/Watchdog30/30 clean each; full suite ✓Made with Cursor