fix(session): retry resumed turns that fail against an expired Cursor agent #52

Open
justin-carper wants to merge 1 commit into main from flint-humidity
@justin-carper

Problem

After a session sits idle and then resumes, turns intermittently fail with:

Cursor run ended with status "error"

Timing is inconsistent because the trigger is server-side Cursor agent expiry, not any local clock. Cursor's API publishes no agent-retention TTL (only 429 backoff), and the server drops agents well before our local 7-day reuse window.

Root cause

The create-fallback in acquireAgent guarded resumeAgent() but not the send() that follows it.

| Path | Behavior |
| --- | --- |
| resumeAgent() throws | Caught → falls through to fresh createAgent (full replay). Graceful. |
| resumeAgent() succeeds, later send() fails | Not caught. Error propagated and failed the turn. |

When a resumed agent is server-side-stale, Agent.resume(agentId) still succeeds locally; the failure only surfaces when the run completes with status === "error" inside the stream. The pooled record had already been re-pointed at the dead agentId, so the next turn resumed the same dead agent again.
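The gap can be illustrated with a minimal sketch. This is not the repository's code: `Agent`, `Backend`, `RunStatus`, and `runTurnUnguarded` are hypothetical stand-ins, and the real `acquireAgent` and SDK signatures are assumed rather than known. It only shows why a resume that succeeds locally escapes the fallback.

```typescript
// Hypothetical stand-ins for illustration; not the Cursor SDK or this repo's real types.
type RunStatus = "finished" | "error";

interface Agent {
  id: string;
  send(prompt: string): Promise<RunStatus>;
}

interface Backend {
  resumeAgent(agentId: string): Promise<Agent>;
  createAgent(): Promise<Agent>;
}

// The create-fallback: only the resume call sits inside the try block.
async function acquireAgent(
  backend: Backend,
  pooledAgentId?: string,
): Promise<{ agent: Agent; resumed: boolean }> {
  if (pooledAgentId !== undefined) {
    try {
      return { agent: await backend.resumeAgent(pooledAgentId), resumed: true };
    } catch {
      // Resume threw: fall through to a fresh agent (full replay).
    }
  }
  return { agent: await backend.createAgent(), resumed: false };
}

// Pre-fix turn: a stale agent resumes fine, then the run ends with "error"
// and nothing catches it, so the turn fails.
async function runTurnUnguarded(
  backend: Backend,
  pooledAgentId: string | undefined,
  prompt: string,
): Promise<string> {
  const { agent } = await acquireAgent(backend, pooledAgentId);
  const status = await agent.send(prompt);
  if (status === "error") {
    throw new Error('Cursor run ended with status "error"');
  }
  return agent.id;
}
```

With a backend whose resumeAgent() throws, the sketch lands on a fresh agent. With one whose resumeAgent() succeeds but whose send() reports "error", the failure escapes to the caller.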

Fix

agentRun now wraps the resumed-turn stream. When a resumed turn throws before emitting any event and the turn has not been aborted, it transparently:

  1. Re-creates a fresh agent (same pool key/record, no resumeAgentId).
  2. Replays the full transcript (no context loss).
  3. Re-pools under the same session, overwriting the dead agentId.

Guarded to a single attempt. Never retries a fresh-create turn, an already-emitting stream, or a user abort. If re-acquire itself fails, the original resume failure is chained as error.cause for diagnosability.
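The guard conditions above can be sketched as an async-generator wrapper. This is an illustration under assumptions, not the actual agentRun: `runWithStaleAgentRetry`, `RetryOptions`, `first`, and `reacquireFresh` are hypothetical names, and the re-create, replay, and re-pool steps are assumed to live inside `reacquireFresh`.

```typescript
// Hypothetical shapes; the real agentRun signature is not shown in this PR description.
interface RetryOptions<E> {
  resumed: boolean;                  // true when this turn resumed an existing agent
  isAborted: () => boolean;          // e.g. () => abortSignal.aborted
  first: () => AsyncIterable<E>;     // the original (possibly resumed) run
  // Assumed to re-create a fresh agent, replay the full transcript,
  // and re-pool under the same session before returning the new stream.
  reacquireFresh: () => Promise<AsyncIterable<E>>;
}

async function* runWithStaleAgentRetry<E>(opts: RetryOptions<E>): AsyncGenerator<E> {
  let emitted = false;
  try {
    for await (const event of opts.first()) {
      emitted = true; // once anything is emitted, a retry would double-emit
      yield event;
    }
    return;
  } catch (resumeError) {
    // Retry only a resumed turn that failed before emitting and was not aborted.
    if (!opts.resumed || emitted || opts.isAborted()) throw resumeError;

    let fresh: AsyncIterable<E>;
    try {
      fresh = await opts.reacquireFresh();
    } catch (reacquireError) {
      // Chain the original resume failure as `cause` for diagnosability.
      const err =
        reacquireError instanceof Error ? reacquireError : new Error(String(reacquireError));
      throw Object.assign(err, { cause: resumeError });
    }
    // Single attempt: the fresh stream is not wrapped, so its errors propagate.
    yield* fresh;
  }
}
```

Setting `emitted` before the `yield` means a failure after the first event is rethrown rather than retried, which matches the no-double-emit rule.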

Tests

New test/language-model.test.ts (7 cases that drive doStream end-to-end with a mocked SDK backend):

  • Resumed + error + no emit → re-create + full replay, pool re-pointed
  • Resumed + error after emit → no retry (no double-emit)
  • Fresh-create error → no retry
  • Aborted → no retry
  • Retry send also fails → propagates (single attempt)
  • Non-pooled explicit-agentId resume → retries, closes both agents, pools neither
  • Re-acquire throws → original failure chained as cause

Full suite: 228 passing; tsc --noEmit is clean; npm run build succeeds.

Notes

  • No change to session-pool.ts or agent-events.ts — retry is composed from existing primitives.
  • Local TTLs (7d session, 24h/30d model cache) left unchanged; they were not the cause.

A pooled Cursor agent can pass resume() yet fail the subsequent send when
Cursor's server has already expired it — surfacing as `Cursor run ended
with status "error"` after a session sits idle. acquireAgent only wrapped
resumeAgent() in its create-fallback, so a successful-resume-then-failed-send
went uncaught and failed the turn (server retention is shorter than our local
7-day reuse window and is undocumented).

agentRun now wraps the resumed-turn stream: on a resumed turn that throws
before emitting any event (and is not aborted), it re-creates a fresh agent,
replays the full transcript, and re-pools under the same session, overwriting
the dead agentId. Guarded to a single attempt; never retries a fresh-create
turn, an already-emitting stream, or a user abort.