Skip to content

epic: persistent worker sessions with critic ping-pong resumption (kills #78) #95

@hadamrd

Description

@hadamrd

Why

Today's worker is fire-and-forget: spawns SDK session → opens PR → exits → critic runs in a separate session → on reject a fresh worker is dispatched. This burns prompt cache, re-bootstraps file/context knowledge every iteration, and is the root cause of the dry-exit failure mode (handoff = drop). Issue #78's "multi-iteration ping-pong" tries to fix this but still uses the one-shot model — it should be superseded by this epic.

Design (architected, not bolted on)

1. Worker lifecycle as an explicit state machine

DISPATCHED → RUNNING → AWAITING_CRITIC ⇄ REVISING → MERGED
                            │                │
                            ↓                ↓
                        ABANDONED        ABANDONED
                        (budget/iter cap, sev1 stuck, operator kill)

State transitions are persisted to SQLite (worker_sessions table) on every edge. Process can die at any point; state survives.

2. Session persistence via Claude Agent SDK session_id

The SDK exposes session resumption: pass the prior session_id and the next user message gets appended to that transcript. We thread session_id through WorkerOutcomeworker_sessions.session_id column → next-tick resume call.

Prompt cache stays warm across critic round-trips (5-min TTL is the constraint — drives the polling cadence below).

3. Worktree lifecycle bound to session lifecycle

  • RUNNING, AWAITING_CRITIC, REVISING → worktree preserved at /tmp/wt-<issue>-<session_id_short>/
  • MERGED → reap (existing logic)
  • ABANDONED → preserve with worktree_preserved event for operator inspection

Extend existing reap policy in tick.py (it already preserves on failed/no_pr/timeout).

4. Parallel-slot accounting

Paused workers in AWAITING_CRITIC do not count against parallel: N. Only RUNNING + REVISING (i.e. actively burning tokens) count. This is the key efficiency win — 3 active workers + N paused-awaiting-critic is fine.

5. Critic ping-pong protocol

On RUNNING → AWAITING_CRITIC (worker opens PR + exits cleanly):

  1. Critic SDK session runs against PR diff + linked issue + transcript pointer
  2. Critic emits typed CriticReport { verdict: APPROVE | REQUEST_CHANGES | BLOCK, comments: [...] }
  3. APPROVE → auto-merge → reap
  4. REQUEST_CHANGES → enqueue REVISING tick: resume worker session with critic comments as next user message, capped iterations
  5. BLOCK (sev1) → ABANDONED, operator label, worktree preserved

6. Termination conditions (must be explicit, not implicit)

  • max_critic_iterations: 3 (config) — past this, ABANDONED with loop:needs-review
  • Per-worker hard budget (turns or wallclock) carried across iterations, not reset
  • Critic verdict BLOCK = immediate abandon
  • Operator loop:abandon label = abandon next tick

7. Crash recovery

If the loop runner dies mid-session: on restart, scan worker_sessions for non-terminal states. Each gets:

  • Worktree path probed (exists? clean? on expected branch?)
  • SDK session_id probed (resumable?)
  • If both OK → re-enter RUNNING or AWAITING_CRITIC based on PR state
  • If worktree/session lost → ABANDONED with crash_recovery_failed event

8. Migration

Acceptance

  1. A worker that gets REQUEST_CHANGES from critic resumes in the same SDK session — verifiable via session_id in events log + prompt cache hit ratio
  2. Worktree survives critic round-trip (no re-clone, no re-checkout)
  3. Paused workers don't block dispatch of new ones (parallel slot freed)
  4. Crash mid-session → recovery picks up correctly (integration test with SIGKILL of runner)
  5. max_critic_iterations cap enforced (test: critic always REQUEST_CHANGES → worker abandoned at N=3)
  6. Cost telemetry shows iteration 2+ is ≥30% cheaper than iteration 1 (prompt cache hit)

Non-goals

  • Worker-to-worker collaboration (out of scope)
  • Critic acting as a separate "pair" (still one-shot per round)
  • Distributed runner (single-host SQLite remains source of truth)

File pointers

  • src/forge_loop/runner/tick.py — state machine + dispatch loop
  • src/forge_loop/_worker_sdk.py — add session_id resume param
  • src/forge_loop/store/ (new) — worker_sessions SQLite schema + DAO
  • src/forge_loop/critic/ (refactor) — typed CriticReport
  • src/forge_loop/config.pymax_critic_iterations, LOOP_PERSISTENT_WORKER flag

Metadata

Metadata

Assignees

No one assigned

    Labels

    epicMulti-PR umbrella tracking a major themepriority:p1Important, near-term

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions