fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch by githubrobbi · Pull Request #304 · skyllc-ai/UltraFastFileSearch

githubrobbi · 2026-05-20T01:44:22Z

What

Phase 10b hand-audit fixes for the 2 of 36 lock-across-await candidate sites that held a guard across long-running inner awaits. Neither was a deadlock (no lock-acquisition cycle), but both caused unnecessary contention with concurrent writers on the same lock.

Why

Phase 10b of the workspace refactor playbook (#302) requires that every lock-across-await site identified by scripts/dev/concurrency_audit.sh (added in #303) is either proven safe by hand-audit or fixed. Hand-audit of all 36 candidate sites found:

34 sites: already textbook-clean. Dominant patterns: block-scoped extract-then-await returning owned values, explicit drop(guard); before next .await, single-statement guards with no inner await.
2 sites: fix required. Both fixed in this PR via the extract-then-clone-then-await snapshot pattern.

Fix 1 — `uffs-daemon/src/index/stats.rs:78` (status RPC)

IndexManager::status() held self.status.read().await across three inner awaits (has_drives, total_records, snapshot) and the allocator-mem snapshot call. The guard was only used to clone the DaemonStatus into the response at the very end. This blocked any concurrent set_ready / set_loading_progress / refresh writer on self.status for the full duration of the status RPC (tens of ms on many-drive boxes).

Fix: snapshot via let status_snapshot = self.status.read().await.clone(); upfront. DaemonStatus is already Clone (small enum). Snapshot is microseconds.

Fix 2 — `uffs-mcp/src/handler/mod.rs:256` + `uffs-mcp/src/roots.rs:43` (MCP dispatch)

UffsMcpServer::dispatch_tool held self.roots.read().await across the entire tool dispatch chain — tools::search::run(..., &roots_state).await can run for seconds against the daemon RPC for large queries. Any concurrent on_roots_list_changed notification (which takes self.roots.write().await) blocked for the duration of every in-flight tool call across the whole MCP session.

Fix:

Derive Clone on RootsState (cheap: typically <10 short-string RootScope entries).
Snapshot via let roots_state = self.roots.read().await.clone(); and pass the owned value by reference to the tool runners. Tool signatures unchanged (they already took &RootsState).

Rule-1 adherence

Zero #[allow(...)] introductions.
No suppression hacks, no disabled lints, no skipped tests.
Both fixes are minimal extract-then-await snapshots.
Full rustdoc justification at each modified site (# Concurrency section explaining the invariant + the failure mode the snapshot prevents).
Behavior preserved: status response shape unchanged; tool dispatch shape unchanged.

Verification

Local:

cargo check -p uffs-daemon -p uffs-mcp — clean.
cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings — 0 warnings.
cargo test -p uffs-daemon --lib — 298 passed / 0 failed.
cargo test -p uffs-mcp — 12 passed / 0 failed + 1 doctest passed.
scripts/dev/lint-fast (pre-commit) — all 7 stages passed in 25s.
scripts/dev/lint-pre-push — all 17 stages passed in 92s.

CI: see check suite below.

Audit verdict for the remaining 34 sites

Per-site verdict tables for all 36 candidate sites are documented in docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md (local-only; docs/dev/baseline/ is gitignored). This PR closes the lock-across-await hazard ledger for Phase 10b.

Tracking

Refs #302 (Phase 10 umbrella). Follow-on PRs for sub-phases 10c (task ownership), 10d (backpressure), 10e (timeouts), 10f (blocking IO), 10g (policy doc), 10h (CONTRIBUTING) will come in separate diffs.

…CP dispatch Phase 10b hand-audit found 2 of 36 lock-across-await candidate sites where the read guard was held across long-running inner awaits. Neither was a deadlock (no lock-acquisition cycle), but both caused unnecessary contention with concurrent writers on the same lock. ## Fix 1 — crates/uffs-daemon/src/index/stats.rs:78 `IndexManager::status()` held `self.status.read().await` across three inner awaits (`has_drives`, `total_records`, `snapshot`) and the allocator-mem snapshot call. The guard was only used to clone the `DaemonStatus` into the response at the very end. This blocked any concurrent `set_ready` / `set_loading_progress` / `refresh` writer on `self.status` for the full duration of the status RPC (tens of ms on many-drive boxes). Fix: snapshot the `DaemonStatus` upfront via `.read().await.clone()` and use the owned value in the response. `DaemonStatus` is already `Clone` (small enum: bare unit variants, plus Loading{usize,usize}, plus Refreshing{Vec<DriveLetter>} which is typically empty in steady state). Snapshot is microseconds. ## Fix 2 — crates/uffs-mcp/src/handler/mod.rs:256 + roots.rs:43 `UffsMcpServer::dispatch_tool` held `self.roots.read().await` across the entire tool dispatch chain — `tools::search::run(..., &roots_state).await` can run for seconds against the daemon RPC for large queries. Any concurrent `on_roots_list_changed` notification (which takes `self.roots.write().await`) blocked for the duration of every in-flight tool call across the whole MCP session. Fix: * Derive `Clone` on `RootsState` (cheap: typically <10 short-string `RootScope` entries). * Snapshot via `.read().await.clone()` and pass the owned value by reference to the tool runners. Tool signatures unchanged (they already took `&RootsState`). ## Verification * `cargo check -p uffs-daemon -p uffs-mcp` — clean. * `cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings` — 0 warnings. * `cargo test -p uffs-daemon --lib` — 298 passed / 0 failed. * `cargo test -p uffs-mcp` — 12 passed / 0 failed + 1 doctest passed. ## Audit verdict for the remaining 34 candidate sites All 34 are textbook-clean and require no changes. See the per-site verdict table in docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md (local-only; gitignored). The dominant patterns are: * Block-scoped extract-then-await: `let x: Vec<_> = { let g = lock.read().await; … .collect() };` (used in dispatch.rs Phase-1, drives.rs, forget_drive.rs, hibernate_shards Phase-1, cascade_demote_one_step Phase-1, etc). * Explicit `drop(guard);` immediately before any further `.await` (used in every write-guard swap path: add_drive, replace_drive, promote/demote, journal apply, etc). * Single-statement guard (no inner await possible): `*lock.write().await = …;` or `lock.read().await.is_empty()`. ## Rule-1 adherence Zero `#[allow(...)]` introductions. No suppression hacks, no disabled lints, no skipped tests. Both fixes are minimal extract- then-await snapshots with full rustdoc justification at each modified site. Refs #302.

githubrobbi merged commit 4e3ee54 into main May 20, 2026
21 checks passed

githubrobbi deleted the fix/phase-10b-lock-across-await branch May 20, 2026 02:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch#304

fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch#304
githubrobbi merged 1 commit into
mainfrom
fix/phase-10b-lock-across-await

githubrobbi commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

githubrobbi commented May 20, 2026

What

Why

Fix 1 — uffs-daemon/src/index/stats.rs:78 (status RPC)

Fix 2 — uffs-mcp/src/handler/mod.rs:256 + uffs-mcp/src/roots.rs:43 (MCP dispatch)

Rule-1 adherence

Verification

Audit verdict for the remaining 34 sites

Tracking

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix 1 — `uffs-daemon/src/index/stats.rs:78` (status RPC)

Fix 2 — `uffs-mcp/src/handler/mod.rs:256` + `uffs-mcp/src/roots.rs:43` (MCP dispatch)