Skip to content

fix(session-ingest): shrink cli_sessions_v2 status row lock#4322

Open
RSO wants to merge 1 commit into
mainfrom
fix/session-ingest-status-lock-duration
Open

fix(session-ingest): shrink cli_sessions_v2 status row lock#4322
RSO wants to merge 1 commit into
mainfrom
fix/session-ingest-status-lock-duration

Conversation

@RSO

@RSO RSO commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Reduces per-session row-lock hold time on cli_sessions_v2, the second major lock contributor identified in production Supabase lock-wait logs (the select "status" from "cli_sessions_v2" ... for update query). Over a representative 7-day window this exact query accounted for 1,440 acquired lock waits, dominated by ShareLock-on-transaction waits (max 110s) that the earlier ExclusiveLock-only investigation missed.

The previous applyMetadataChanges flow opened a JS transaction, took SELECT ... FOR UPDATE early, then held that row lock across a metadata UPDATE, a parent-session lookup + update, and a final full-row SELECT before commit. This change shrinks the locked critical section:

  • Status transitions now run as a single statement: a WITH locked AS (SELECT ... FOR UPDATE) CTE plus UPDATE ... RETURNING, returning both the locked previous status and the updated row. No multi-round-trip JS transaction holds the lock.
  • No-op status writes (same status re-sent) no longer stamp status_updated_at/updated_at, no longer hold the lock through a write, and emit no status notification.
  • Non-status metadata and parent-session updates moved out of the status lock path into their own idempotent UPDATE ... RETURNING statements.
  • The notification payload now comes from RETURNING instead of a final in-transaction SELECT.

Locked-status notification semantics are preserved: session.status.updated still reports the persisted previous status captured under the row lock, not the queued intake snapshot.

Verification

  • Reviewed production Supabase lock-wait logs (Axiom) to confirm the dominant wait mode and that long durations are lock waits on the PK index, not missing-index scans.
  • Manually traced the queue-consumer code path to confirm the lock is no longer held across parent lookup, parent update, or the final payload read.

Visual Changes

N/A

Reviewer Notes

  • The status transition is now raw SQL (db.execute(sql\...`)) because Drizzle's query builder can't express "return the locked previous value plus the updated row" in one statement. previous_statusis read via a CTE subquery inRETURNINGrather than relying onFROM`-alias visibility, for cross-version safety.
  • Behavior change worth confirming: a no-op status re-send no longer bumps updated_at/status_updated_at and emits no notification. This is intended (it was previously incidental lock-holding work), but flagging in case any consumer relied on the timestamp bump.
  • Non-status metadata, parent updates, and the status transition are now separate statements rather than one transaction. They remain individually idempotent and per-session; ordering across them is unchanged.

Reduce per-session row lock hold time on cli_sessions_v2 by removing the
multi-step JS transaction around status metadata. Status transitions now
run as a single SELECT ... FOR UPDATE + UPDATE ... RETURNING CTE, no-op
status writes no longer stamp timestamps or lock the row, and parent and
non-status metadata updates no longer execute while the status lock is
held.
)
.limit(1);
let changedNonStatus = false;
let notificationRow = await updateSessionMetadata(db, kiloUserId, sessionId, updates);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Splitting these writes across separate committed statements breaks the queue retry guarantee

applyMetadataChanges() used to persist metadata, parent updates, and status changes in one transaction, so a later failure rolled the whole batch back. With the three independent statements here, updateSessionMetadata() or updateSessionParent() can commit and then updateSessionStatus() can still fail. The queue will retry the message, but the Durable Object only re-emits metadata changes when its stored value differs, so the replay can come back with an empty changes list and Postgres will stay permanently out of sync on whichever fields were not written before the failure. This path still needs all-or-nothing persistence, or a retry path that can reconstruct and reapply the full merged change set.


Reply with @kilocode-bot fix it to have Kilo Code address this issue.

@kilo-code-bot

kilo-code-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0

Fix these issues in Kilo Cloud

Issue Details (click to expand)

WARNING

File Line Issue
services/session-ingest/src/queue-consumer.ts 598 Splitting the persistence path into separate committed statements means a later failure can leave Postgres permanently out of sync with the Durable Object replay path.
Files Reviewed (3 files)
  • services/session-ingest/src/queue-consumer.ts - 1 issue
  • services/session-ingest/src/queue-consumer.test.ts - 0 issues
  • services/session-ingest/src/types/user-connection-protocol.ts - 0 issues

Reviewed by gpt-5.4-20260305 · Input: 90.7K · Output: 13.2K · Cached: 624.6K

Review guidance: REVIEW.md from base branch main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant