Skip to content

fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch#304

Merged
githubrobbi merged 1 commit into
mainfrom
fix/phase-10b-lock-across-await
May 20, 2026
Merged

fix(concurrency): snapshot lock state before awaits in status RPC + MCP dispatch#304
githubrobbi merged 1 commit into
mainfrom
fix/phase-10b-lock-across-await

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

What

Phase 10b hand-audit fixes for the 2 of 36 lock-across-await candidate sites that held a guard across long-running inner awaits. Neither was a deadlock (no lock-acquisition cycle), but both caused unnecessary contention with concurrent writers on the same lock.

Why

Phase 10b of the workspace refactor playbook (#302) requires that every lock-across-await site identified by scripts/dev/concurrency_audit.sh (added in #303) is either proven safe by hand-audit or fixed. Hand-audit of all 36 candidate sites found:

  • 34 sites: already textbook-clean. Dominant patterns: block-scoped extract-then-await returning owned values, explicit drop(guard); before next .await, single-statement guards with no inner await.
  • 2 sites: fix required. Both fixed in this PR via the extract-then-clone-then-await snapshot pattern.

Fix 1 — uffs-daemon/src/index/stats.rs:78 (status RPC)

IndexManager::status() held self.status.read().await across three inner awaits (has_drives, total_records, snapshot) and the allocator-mem snapshot call. The guard was only used to clone the DaemonStatus into the response at the very end. This blocked any concurrent set_ready / set_loading_progress / refresh writer on self.status for the full duration of the status RPC (tens of ms on many-drive boxes).

Fix: snapshot via let status_snapshot = self.status.read().await.clone(); upfront. DaemonStatus is already Clone (small enum). Snapshot is microseconds.

Fix 2 — uffs-mcp/src/handler/mod.rs:256 + uffs-mcp/src/roots.rs:43 (MCP dispatch)

UffsMcpServer::dispatch_tool held self.roots.read().await across the entire tool dispatch chain — tools::search::run(..., &roots_state).await can run for seconds against the daemon RPC for large queries. Any concurrent on_roots_list_changed notification (which takes self.roots.write().await) blocked for the duration of every in-flight tool call across the whole MCP session.

Fix:

  • Derive Clone on RootsState (cheap: typically <10 short-string RootScope entries).
  • Snapshot via let roots_state = self.roots.read().await.clone(); and pass the owned value by reference to the tool runners. Tool signatures unchanged (they already took &RootsState).

Rule-1 adherence

  • Zero #[allow(...)] introductions.
  • No suppression hacks, no disabled lints, no skipped tests.
  • Both fixes are minimal extract-then-await snapshots.
  • Full rustdoc justification at each modified site (# Concurrency section explaining the invariant + the failure mode the snapshot prevents).
  • Behavior preserved: status response shape unchanged; tool dispatch shape unchanged.

Verification

Local:

  • cargo check -p uffs-daemon -p uffs-mcp — clean.
  • cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings — 0 warnings.
  • cargo test -p uffs-daemon --lib298 passed / 0 failed.
  • cargo test -p uffs-mcp12 passed / 0 failed + 1 doctest passed.
  • scripts/dev/lint-fast (pre-commit) — all 7 stages passed in 25s.
  • scripts/dev/lint-pre-push — all 17 stages passed in 92s.

CI: see check suite below.

Audit verdict for the remaining 34 sites

Per-site verdict tables for all 36 candidate sites are documented in docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md (local-only; docs/dev/baseline/ is gitignored). This PR closes the lock-across-await hazard ledger for Phase 10b.

Tracking

Refs #302 (Phase 10 umbrella). Follow-on PRs for sub-phases 10c (task ownership), 10d (backpressure), 10e (timeouts), 10f (blocking IO), 10g (policy doc), 10h (CONTRIBUTING) will come in separate diffs.

…CP dispatch

Phase 10b hand-audit found 2 of 36 lock-across-await candidate sites
where the read guard was held across long-running inner awaits.  Neither
was a deadlock (no lock-acquisition cycle), but both caused unnecessary
contention with concurrent writers on the same lock.

## Fix 1 — crates/uffs-daemon/src/index/stats.rs:78

`IndexManager::status()` held `self.status.read().await` across three
inner awaits (`has_drives`, `total_records`, `snapshot`) and the
allocator-mem snapshot call.  The guard was only used to clone the
`DaemonStatus` into the response at the very end.  This blocked any
concurrent `set_ready` / `set_loading_progress` / `refresh` writer
on `self.status` for the full duration of the status RPC (tens of ms
on many-drive boxes).

Fix: snapshot the `DaemonStatus` upfront via `.read().await.clone()`
and use the owned value in the response.  `DaemonStatus` is already
`Clone` (small enum: bare unit variants, plus Loading{usize,usize},
plus Refreshing{Vec<DriveLetter>} which is typically empty in steady
state).  Snapshot is microseconds.

## Fix 2 — crates/uffs-mcp/src/handler/mod.rs:256 + roots.rs:43

`UffsMcpServer::dispatch_tool` held `self.roots.read().await` across
the entire tool dispatch chain — `tools::search::run(..., &roots_state).await`
can run for seconds against the daemon RPC for large queries.  Any
concurrent `on_roots_list_changed` notification (which takes
`self.roots.write().await`) blocked for the duration of every
in-flight tool call across the whole MCP session.

Fix:
  * Derive `Clone` on `RootsState` (cheap: typically <10 short-string
    `RootScope` entries).
  * Snapshot via `.read().await.clone()` and pass the owned value
    by reference to the tool runners.  Tool signatures unchanged
    (they already took `&RootsState`).

## Verification

  * `cargo check -p uffs-daemon -p uffs-mcp` — clean.
  * `cargo clippy -p uffs-daemon -p uffs-mcp --all-targets -- -W clippy::await_holding_lock -D warnings`
    — 0 warnings.
  * `cargo test -p uffs-daemon --lib` — 298 passed / 0 failed.
  * `cargo test -p uffs-mcp` — 12 passed / 0 failed + 1 doctest passed.

## Audit verdict for the remaining 34 candidate sites

All 34 are textbook-clean and require no changes.  See the per-site
verdict table in docs/dev/baseline/2026-05-19/phase_10_lock_across_await_audit.md
(local-only; gitignored).  The dominant patterns are:

  * Block-scoped extract-then-await: `let x: Vec<_> = { let g = lock.read().await; … .collect() };`
    (used in dispatch.rs Phase-1, drives.rs, forget_drive.rs, hibernate_shards Phase-1,
    cascade_demote_one_step Phase-1, etc).
  * Explicit `drop(guard);` immediately before any further `.await`
    (used in every write-guard swap path: add_drive, replace_drive,
    promote/demote, journal apply, etc).
  * Single-statement guard (no inner await possible): `*lock.write().await = …;`
    or `lock.read().await.is_empty()`.

## Rule-1 adherence

Zero `#[allow(...)]` introductions.  No suppression hacks, no
disabled lints, no skipped tests.  Both fixes are minimal extract-
then-await snapshots with full rustdoc justification at each
modified site.

Refs #302.
@githubrobbi githubrobbi merged commit 4e3ee54 into main May 20, 2026
21 checks passed
@githubrobbi githubrobbi deleted the fix/phase-10b-lock-across-await branch May 20, 2026 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant