fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch #304
Merged
Phase 10b hand-audit found 2 of 36 lock-across-await candidate sites
where the read guard was held across long-running inner awaits. Neither
was a deadlock (no lock-acquisition cycle), but both caused unnecessary
contention with concurrent writers on the same lock.
## Fix 1 — crates/uffs-daemon/src/index/stats.rs:78
`IndexManager::status()` held `self.status.read().await` across three
inner awaits (`has_drives`, `total_records`, `snapshot`) and the
allocator-mem snapshot call. The guard was only used to clone the
`DaemonStatus` into the response at the very end. This blocked any
concurrent `set_ready` / `set_loading_progress` / `refresh` writer
on `self.status` for the full duration of the status RPC (tens of ms
on many-drive boxes).
Fix: snapshot the `DaemonStatus` upfront via `.read().await.clone()`
and use the owned value in the response. `DaemonStatus` is already
`Clone` (small enum: bare unit variants, plus `Loading{usize,usize}`,
plus `Refreshing{Vec<DriveLetter>}`, which is typically empty in steady
state). The snapshot takes microseconds.
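The difference between the two guard lifetimes can be sketched as follows. This is a minimal illustration, not the code from `stats.rs`: it uses `std::sync::RwLock` as a synchronous stand-in for the async lock so that it runs with the standard library alone, the `DaemonStatus` variants and field names are simplified assumptions, and `try_write()` stands in for a concurrent `set_ready`-style writer arriving while the slow inner work is in progress.

```rust
use std::sync::RwLock;

// Hypothetical, simplified model of the status enum (field names are assumptions).
#[derive(Clone, Debug, PartialEq)]
enum DaemonStatus {
    Ready,
    Loading { done: usize, total: usize },
}

struct IndexManager {
    status: RwLock<DaemonStatus>,
}

impl IndexManager {
    /// Old shape: the read guard stays alive while the slow work runs.
    /// Returns true if a writer arriving mid-call would have been blocked.
    fn writer_blocked_when_guard_held(&self) -> bool {
        let guard = self.status.read().unwrap();
        // Stand-in for the slow inner awaits: probe whether a writer can get in.
        let blocked = self.status.try_write().is_err();
        let _response_status = guard.clone(); // guard only needed here, at the end
        blocked
    }

    /// New shape: clone upfront; the temporary guard is dropped at the end of
    /// the `let` statement, before any slow work starts.
    fn writer_blocked_after_snapshot(&self) -> bool {
        let snapshot = self.status.read().unwrap().clone();
        let blocked = self.status.try_write().is_err();
        let _response_status = snapshot;
        blocked
    }
}
```

With the guard held, the probe reports the writer as blocked; with the upfront snapshot it does not. The trade-off is that the response carries the status as of the start of the call rather than the end.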
## Fix 2 — crates/uffs-mcp/src/handler/mod.rs:256 + roots.rs:43
`UffsMcpServer::dispatch_tool` held `self.roots.read().await` across
the entire tool dispatch chain — `tools::search::run(..., &roots_state).await`
can run for seconds against the daemon RPC for large queries. Any
concurrent `on_roots_list_changed` notification (which takes
`self.roots.write().await`) blocked for the duration of every
in-flight tool call across the whole MCP session.
Fix:
* Derive `Clone` on `RootsState` (cheap: typically <10 short-string
`RootScope` entries).
* Snapshot via `.read().await.clone()` and pass the owned value
by reference to the tool runners. Tool signatures unchanged
(they already took `&RootsState`).
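A rough sketch of the dispatch shape after this change, under stated assumptions: `Server`, `run_tool`, and the `on_mid_call` hook are hypothetical names for illustration, `RootsState` and `RootScope` are reduced to bare minimums, and `std::sync::RwLock` again stands in for the async lock. The hook simulates a roots-changed notification arriving while a tool call is in flight.

```rust
use std::sync::RwLock;

// Hypothetical minimal shapes; the real types carry more than this.
#[derive(Clone, Debug, PartialEq)]
struct RootScope(String);

#[derive(Clone, Debug, PartialEq)]
struct RootsState {
    scopes: Vec<RootScope>,
}

// Tool runners keep taking `&RootsState`, so their signatures do not change.
fn run_tool(roots: &RootsState) -> usize {
    roots.scopes.len()
}

struct Server {
    roots: RwLock<RootsState>,
}

impl Server {
    fn dispatch_tool(&self, on_mid_call: impl FnOnce(&Self)) -> usize {
        // Snapshot: the read guard is released at the end of this statement.
        let roots_state = self.roots.read().unwrap().clone();
        // Stand-in for a notification arriving while the tool is running.
        on_mid_call(self);
        // The owned snapshot is passed by reference.
        run_tool(&roots_state)
    }

    fn on_roots_list_changed(&self, new_roots: Vec<RootScope>) {
        // Takes the write lock; no longer waits on in-flight tool calls.
        self.roots.write().unwrap().scopes = new_roots;
    }
}
```

In this sketch the in-flight call keeps working against the roots it saw at dispatch time, while the writer updates the shared state immediately for subsequent calls.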
## Verification
* `cargo check -p uffs-daemon -p uffs-mcp` — clean.
* `cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings`
— 0 warnings.
* `cargo test -p uffs-daemon --lib` — 298 passed / 0 failed.
* `cargo test -p uffs-mcp` — 12 passed / 0 failed + 1 doctest passed.
## Audit verdict for the remaining 34 candidate sites
All 34 are textbook-clean and require no changes. See the per-site
verdict table in docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md
(local-only; gitignored). The dominant patterns are:
* Block-scoped extract-then-await: `let x: Vec<_> = { let g = lock.read().await; … .collect() };`
(used in dispatch.rs Phase-1, drives.rs, forget_drive.rs, hibernate_shards Phase-1,
cascade_demote_one_step Phase-1, etc).
* Explicit `drop(guard);` immediately before any further `.await`
(used in every write-guard swap path: add_drive, replace_drive,
promote/demote, journal apply, etc).
* Single-statement guard (no inner await possible): `*lock.write().await = …;`
or `lock.read().await.is_empty()`.
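The three clean patterns listed above can be sketched side by side. This is an illustrative analogue rather than code from the audited sites: the function names are hypothetical, `std::sync::RwLock` stands in for the async lock, and each function ends with a `try_write()` probe to show that the guard is already gone at the point where a further `.await` would occur.

```rust
use std::sync::RwLock;

/// Block-scoped extract-then-await: the guard `g` dies at the closing brace.
fn block_scoped(lock: &RwLock<Vec<u32>>) -> (Vec<u32>, bool) {
    let evens: Vec<u32> = {
        let g = lock.read().unwrap();
        g.iter().copied().filter(|n| n % 2 == 0).collect()
    };
    // A later await would go here; the lock is free.
    (evens, lock.try_write().is_ok())
}

/// Explicit `drop(guard);` immediately before any further await.
fn explicit_drop(lock: &RwLock<Vec<u32>>, value: u32) -> bool {
    let mut g = lock.write().unwrap();
    g.push(value);
    drop(g);
    // A later await would go here; the lock is free.
    lock.try_write().is_ok()
}

/// Single-statement guard: the temporary guard is dropped at the semicolon.
fn single_statement(lock: &RwLock<Vec<u32>>) -> (bool, bool) {
    let empty = lock.read().unwrap().is_empty();
    (empty, lock.try_write().is_ok())
}
```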
## Rule-1 adherence
Zero `#[allow(...)]` introductions. No suppression hacks, no
disabled lints, no skipped tests. Both fixes are minimal extract-
then-await snapshots with full rustdoc justification at each
modified site.
Refs #302.
What
Phase 10b hand-audit fixes for the 2 of 36 lock-across-await candidate sites that held a guard across long-running inner awaits. Neither was a deadlock (no lock-acquisition cycle), but both caused unnecessary contention with concurrent writers on the same lock.
Why
Phase 10b of the workspace refactor playbook (#302) requires that every lock-across-await site identified by `scripts/dev/concurrency_audit.sh` (added in #303) is either proven safe by hand-audit or fixed. Hand-audit of all 36 candidate sites found:
* 2 sites that held a guard across long-running inner awaits (fixed in this PR).
* 34 clean sites: block-scoped extract-then-await, explicit `drop(guard);` before next `.await`, single-statement guards with no inner await.
Fix 1: `uffs-daemon/src/index/stats.rs:78` (status RPC)
`IndexManager::status()` held `self.status.read().await` across three inner awaits (`has_drives`, `total_records`, `snapshot`) and the allocator-mem snapshot call. The guard was only used to clone the `DaemonStatus` into the response at the very end. This blocked any concurrent `set_ready` / `set_loading_progress` / `refresh` writer on `self.status` for the full duration of the status RPC (tens of ms on many-drive boxes).
Fix: snapshot via `let status_snapshot = self.status.read().await.clone();` upfront. `DaemonStatus` is already `Clone` (small enum). The snapshot takes microseconds.
Fix 2: `uffs-mcp/src/handler/mod.rs:256` + `uffs-mcp/src/roots.rs:43` (MCP dispatch)
`UffsMcpServer::dispatch_tool` held `self.roots.read().await` across the entire tool dispatch chain; `tools::search::run(..., &roots_state).await` can run for seconds against the daemon RPC for large queries. Any concurrent `on_roots_list_changed` notification (which takes `self.roots.write().await`) blocked for the duration of every in-flight tool call across the whole MCP session.
Fix:
* Derive `Clone` on `RootsState` (cheap: typically <10 short-string `RootScope` entries).
* Snapshot via `let roots_state = self.roots.read().await.clone();` and pass the owned value by reference to the tool runners. Tool signatures unchanged (they already took `&RootsState`).
Rule-1 adherence
* Zero `#[allow(...)]` introductions.
* Full rustdoc justification at each modified site (`# Concurrency` section explaining the invariant + the failure mode the snapshot prevents).
Verification
Local:
* `cargo check -p uffs-daemon -p uffs-mcp`: clean.
* `cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings`: 0 warnings.
* `cargo test -p uffs-daemon --lib`: 298 passed / 0 failed.
* `cargo test -p uffs-mcp`: 12 passed / 0 failed + 1 doctest passed.
* `scripts/dev/lint-fast` (pre-commit): all 7 stages passed in 25s.
* `scripts/dev/lint-pre-push`: all 17 stages passed in 92s.
CI: see check suite below.
Audit verdict for the remaining 34 sites
Per-site verdict tables for all 36 candidate sites are documented in `docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md` (local-only; `docs/dev/baseline/` is gitignored). This PR closes the lock-across-await hazard ledger for Phase 10b.
Tracking
Refs #302 (Phase 10 umbrella). Follow-on PRs for sub-phases 10c (task ownership), 10d (backpressure), 10e (timeouts), 10f (blocking IO), 10g (policy doc), 10h (CONTRIBUTING) will come in separate diffs.