Why
Today's worker is fire-and-forget: spawns SDK session → opens PR → exits → critic runs in a separate session → on reject a fresh worker is dispatched. This burns prompt cache, re-bootstraps file/context knowledge every iteration, and is the root cause of the dry-exit failure mode (handoff = drop). Issue #78's "multi-iteration ping-pong" tries to fix this but still uses the one-shot model — it should be superseded by this epic.
Design (architected, not bolted on)
1. Worker lifecycle as an explicit state machine
DISPATCHED → RUNNING → AWAITING_CRITIC ⇄ REVISING → MERGED
                              │              │
                              ↓              ↓
                          ABANDONED      ABANDONED
               (budget/iter cap, sev1 stuck, operator kill)
State transitions are persisted to SQLite (worker_sessions table) on every edge. Process can die at any point; state survives.
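A minimal sketch of what "persisted on every edge" could look like. The column names other than session_id, the ALLOWED edge table, and the transition() helper are assumptions for illustration, not the final schema. In particular, this sketch lets ABANDONED be reached from every non-terminal state and reaches MERGED from AWAITING_CRITIC (the APPROVE path in section 5).

```python
import sqlite3
from enum import Enum


class WorkerState(str, Enum):
    DISPATCHED = "DISPATCHED"
    RUNNING = "RUNNING"
    AWAITING_CRITIC = "AWAITING_CRITIC"
    REVISING = "REVISING"
    MERGED = "MERGED"
    ABANDONED = "ABANDONED"


# Assumed edge table; MERGED and ABANDONED are terminal and have no outgoing edges.
ALLOWED = {
    WorkerState.DISPATCHED: {WorkerState.RUNNING, WorkerState.ABANDONED},
    WorkerState.RUNNING: {WorkerState.AWAITING_CRITIC, WorkerState.ABANDONED},
    WorkerState.AWAITING_CRITIC: {
        WorkerState.REVISING,
        WorkerState.MERGED,
        WorkerState.ABANDONED,
    },
    WorkerState.REVISING: {WorkerState.AWAITING_CRITIC, WorkerState.ABANDONED},
}

# Hypothetical schema: only the table name and session_id come from this issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS worker_sessions (
    issue      INTEGER PRIMARY KEY,
    state      TEXT NOT NULL,
    session_id TEXT,
    iterations INTEGER NOT NULL DEFAULT 0
)
"""


def transition(conn: sqlite3.Connection, issue: int, new_state: WorkerState) -> None:
    """Validate the edge and write it in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT state FROM worker_sessions WHERE issue = ?", (issue,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no worker session for issue {issue}")
        current = WorkerState(row[0])
        if new_state not in ALLOWED.get(current, set()):
            raise ValueError(f"illegal edge {current.value} -> {new_state.value}")
        conn.execute(
            "UPDATE worker_sessions SET state = ? WHERE issue = ?",
            (new_state.value, issue),
        )
```

Because the check and the write share one transaction, a runner that dies between edges leaves the last committed state in the table.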
2. Session persistence via Claude Agent SDK session_id
The SDK exposes session resumption: pass the prior session_id and the next user message gets appended to that transcript. We thread session_id through WorkerOutcome → worker_sessions.session_id column → next-tick resume call.
Prompt cache stays warm across critic round-trips (5-min TTL is the constraint — drives the polling cadence below).
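A rough sketch of threading session_id from WorkerOutcome into the next-tick resume, with the 5-min TTL as a warm/cold hint. It deliberately does not call the SDK: ResumeRequest, build_resume, and the finished_at field are hypothetical names, and the actual resume call would live in _worker_sdk.py.

```python
from dataclasses import dataclass
from typing import Sequence

PROMPT_CACHE_TTL_S = 300  # the 5-min TTL named above


@dataclass(frozen=True)
class WorkerOutcome:
    issue: int
    session_id: str      # SDK session to resume on the next tick
    finished_at: float   # epoch seconds of the worker's last activity (assumed field)


@dataclass(frozen=True)
class ResumeRequest:
    session_id: str
    user_message: str
    cache_likely_warm: bool


def build_resume(
    outcome: WorkerOutcome, critic_comments: Sequence[str], now: float
) -> ResumeRequest:
    """Turn critic comments into the next user message for the same session."""
    lines = "\n".join(f"- {c}" for c in critic_comments)
    message = "Critic requested changes:\n" + lines
    # Only a hint: the cache may already be cold if the critic took longer than the TTL.
    warm = (now - outcome.finished_at) < PROMPT_CACHE_TTL_S
    return ResumeRequest(outcome.session_id, message, warm)
```

The cache_likely_warm flag is only useful for telemetry and for tuning the polling cadence; the resume itself should work either way.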
3. Worktree lifecycle bound to session lifecycle
- RUNNING, AWAITING_CRITIC, REVISING → worktree preserved at /tmp/wt-<issue>-<session_id_short>/
- MERGED → reap (existing logic)
- ABANDONED → preserve with worktree_preserved event for operator inspection
Extend existing reap policy in tick.py (it already preserves on failed/no_pr/timeout).
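The state-to-worktree mapping above could be expressed roughly like this. The 8-character length for session_id_short and the action strings are assumptions; the real policy lives in tick.py.

```python
from pathlib import PurePosixPath

LIVE_STATES = frozenset({"RUNNING", "AWAITING_CRITIC", "REVISING"})


def worktree_path(issue: int, session_id: str, short_len: int = 8) -> PurePosixPath:
    """Build /tmp/wt-<issue>-<session_id_short>; short_len=8 is an assumed default."""
    return PurePosixPath("/tmp") / f"wt-{issue}-{session_id[:short_len]}"


def worktree_action(state: str) -> str:
    """Map a worker state to what the reaper should do with its worktree."""
    if state in LIVE_STATES:
        return "preserve"
    if state == "MERGED":
        return "reap"
    if state == "ABANDONED":
        return "preserve_with_event"  # caller emits worktree_preserved
    # States the policy above does not mention (e.g. DISPATCHED) are rejected here.
    raise ValueError(f"no worktree policy for state {state!r}")
```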
4. Parallel-slot accounting
Paused workers in AWAITING_CRITIC do not count against parallel: N. Only RUNNING + REVISING (i.e. actively burning tokens) count. This is the key efficiency win — 3 active workers + N paused-awaiting-critic is fine.
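As a sketch, the slot rule reduces to counting only the two token-burning states. free_slots is a hypothetical helper name.

```python
from typing import Iterable

ACTIVE_STATES = frozenset({"RUNNING", "REVISING"})


def free_slots(states: Iterable[str], parallel: int) -> int:
    """Slots left for dispatch; AWAITING_CRITIC workers are not counted."""
    active = sum(1 for s in states if s in ACTIVE_STATES)
    return max(0, parallel - active)
```

With parallel: 3, one RUNNING worker plus any number of AWAITING_CRITIC workers still leaves two slots free for dispatch.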
5. Critic ping-pong protocol
On RUNNING → AWAITING_CRITIC (worker opens PR + exits cleanly):
- Critic SDK session runs against PR diff + linked issue + transcript pointer
- Critic emits typed CriticReport { verdict: APPROVE | REQUEST_CHANGES | BLOCK, comments: [...] }
- APPROVE → auto-merge → reap
- REQUEST_CHANGES → enqueue REVISING tick: resume worker session with critic comments as next user message, capped iterations
- BLOCK (sev1) → ABANDONED, operator label, worktree preserved
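One possible shape for the typed report and the verdict routing above. The field types and the next_state helper are assumptions, as is the counting convention (iterations_used is the number of revision rounds already spent).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Verdict(Enum):
    APPROVE = "APPROVE"
    REQUEST_CHANGES = "REQUEST_CHANGES"
    BLOCK = "BLOCK"


@dataclass(frozen=True)
class CriticReport:
    verdict: Verdict
    comments: Tuple[str, ...] = ()


def next_state(
    report: CriticReport, iterations_used: int, max_critic_iterations: int = 3
) -> str:
    """Route a worker out of AWAITING_CRITIC based on the critic verdict."""
    if report.verdict is Verdict.APPROVE:
        return "MERGED"  # auto-merge, then reap
    if report.verdict is Verdict.BLOCK:
        return "ABANDONED"  # sev1: operator label, worktree preserved
    # REQUEST_CHANGES: revise again unless the iteration cap is already spent.
    if iterations_used >= max_critic_iterations:
        return "ABANDONED"
    return "REVISING"
```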
6. Termination conditions (must be explicit, not implicit)
- max_critic_iterations: 3 (config); past this, ABANDONED with loop:needs-review
- Per-worker hard budget (turns or wallclock) carried across iterations, not reset
- Critic verdict BLOCK = immediate abandon
- Operator loop:abandon label = abandon next tick
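The four conditions above can be checked in one place per tick. This is a sketch only: WorkerBudget, abandon_reason, and the reason strings are hypothetical, and the budget is modelled in turns (wallclock would work the same way).

```python
from dataclasses import dataclass
from typing import Collection, Optional


@dataclass(frozen=True)
class WorkerBudget:
    max_turns: int
    turns_used: int = 0  # cumulative across iterations, never reset

    def charge(self, turns: int) -> "WorkerBudget":
        """Return a new budget with the extra turns added to the running total."""
        return WorkerBudget(self.max_turns, self.turns_used + turns)


def abandon_reason(
    *,
    labels: Collection[str],
    verdict: Optional[str],
    iterations_used: int,
    budget: WorkerBudget,
    max_critic_iterations: int = 3,
) -> Optional[str]:
    """Return why the worker must be abandoned this tick, or None to continue."""
    if "loop:abandon" in labels:
        return "operator_abandon"
    if verdict == "BLOCK":
        return "critic_block"
    if budget.turns_used >= budget.max_turns:
        return "budget_exhausted"
    if verdict == "REQUEST_CHANGES" and iterations_used >= max_critic_iterations:
        return "iteration_cap"  # caller applies loop:needs-review
    return None
```

Keeping the budget immutable and carrying the charged copy forward makes it hard to reset by accident between iterations.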
7. Crash recovery
If the loop runner dies mid-session: on restart, scan worker_sessions for non-terminal states. Each gets:
- Worktree path probed (exists? clean? on expected branch?)
- SDK session_id probed (resumable?)
- If both OK → re-enter RUNNING or AWAITING_CRITIC based on PR state
- If worktree/session lost → ABANDONED with crash_recovery_failed event
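The recovery decision can be kept as a pure function over probe results, which makes the SIGKILL integration test easier to reason about. RecoveryProbe and recover are hypothetical names, and the mapping "open PR means AWAITING_CRITIC, no PR means RUNNING" is an assumption about what "based on PR state" means.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RecoveryProbe:
    worktree_ok: bool        # exists, clean, on the expected branch
    session_resumable: bool  # SDK accepted the stored session_id
    pr_open: bool            # a PR already exists for this worker


def recover(probe: RecoveryProbe) -> Tuple[str, Optional[str]]:
    """Return (state to re-enter, event to emit or None)."""
    if not (probe.worktree_ok and probe.session_resumable):
        return "ABANDONED", "crash_recovery_failed"
    # Assumed rule: a worker that already opened its PR was waiting on the critic.
    return ("AWAITING_CRITIC" if probe.pr_open else "RUNNING"), None
```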
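8. Migration
- CriticReport, not gh-cli scraping
- WorkerStateTransition event needs typed schema
- LOOP_PERSISTENT_WORKER=0 for one release, then deleted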
Acceptance
- A worker that gets REQUEST_CHANGES from critic resumes in the same SDK session, verifiable via session_id in events log + prompt cache hit ratio
- Worktree survives critic round-trip (no re-clone, no re-checkout)
- Paused workers don't block dispatch of new ones (parallel slot freed)
- Crash mid-session → recovery picks up correctly (integration test with SIGKILL of runner)
- max_critic_iterations cap enforced (test: critic always REQUEST_CHANGES → worker abandoned at N=3)
- Cost telemetry shows iteration 2+ is ≥30% cheaper than iteration 1 (prompt cache hit)
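For the cost criterion, the check itself is small arithmetic. A sketch, assuming per-iteration cost is available as a single number (tokens or dollars) and using a hypothetical helper name:

```python
def meets_saving_target(
    first_cost: float, later_cost: float, min_saving: float = 0.30
) -> bool:
    """True if a later iteration is at least min_saving cheaper than iteration 1."""
    if first_cost <= 0:
        raise ValueError("first_cost must be positive")
    return (first_cost - later_cost) / first_cost >= min_saving
```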
Non-goals
- Worker-to-worker collaboration (out of scope)
- Critic acting as a separate "pair" (still one-shot per round)
- Distributed runner (single-host SQLite remains source of truth)
File pointers
- src/forge_loop/runner/tick.py: state machine + dispatch loop
- src/forge_loop/_worker_sdk.py: add session_id resume param
- src/forge_loop/store/ (new): worker_sessions SQLite schema + DAO
- src/forge_loop/critic/ (refactor): typed CriticReport
- src/forge_loop/config.py: max_critic_iterations, LOOP_PERSISTENT_WORKER flag