epic: persistent worker sessions with critic ping-pong resumption (kills #78)

## Why

Today's worker is fire-and-forget: spawns SDK session → opens PR → exits → critic runs in a separate session → on reject a *fresh* worker is dispatched. This burns prompt cache, re-bootstraps file/context knowledge every iteration, and is the root cause of the dry-exit failure mode (handoff = drop). Issue #78's "multi-iteration ping-pong" tries to fix this but still uses the one-shot model — it should be superseded by this epic.

## Design (architected, not bolted on)

### 1. Worker lifecycle as an explicit state machine

```
DISPATCHED → RUNNING → AWAITING_CRITIC ⇄ REVISING → MERGED
                            │                │
                            ↓                ↓
                        ABANDONED        ABANDONED
                        (budget/iter cap, sev1 stuck, operator kill)
```

State transitions are persisted to SQLite (`worker_sessions` table) on every edge. Process can die at any point; state survives.

### 2. Session persistence via Claude Agent SDK `session_id`

The SDK exposes session resumption: pass the prior `session_id` and the next user message gets appended to that transcript. We thread `session_id` through `WorkerOutcome` → `worker_sessions.session_id` column → next-tick resume call.

Prompt cache stays warm across critic round-trips (5-min TTL is the constraint — drives the polling cadence below).

### 3. Worktree lifecycle bound to session lifecycle

- `RUNNING`, `AWAITING_CRITIC`, `REVISING` → worktree preserved at `/tmp/wt-<issue>-<session_id_short>/`
- `MERGED` → reap (existing logic)
- `ABANDONED` → preserve with `worktree_preserved` event for operator inspection

Extend existing reap policy in `tick.py` (it already preserves on failed/no_pr/timeout).

### 4. Parallel-slot accounting

Paused workers in `AWAITING_CRITIC` **do not** count against `parallel: N`. Only `RUNNING` + `REVISING` (i.e. actively burning tokens) count. This is the key efficiency win — 3 active workers + N paused-awaiting-critic is fine.

### 5. Critic ping-pong protocol

On `RUNNING → AWAITING_CRITIC` (worker opens PR + exits cleanly):
1. Critic SDK session runs against PR diff + linked issue + transcript pointer
2. Critic emits typed `CriticReport { verdict: APPROVE | REQUEST_CHANGES | BLOCK, comments: [...] }`
3. APPROVE → auto-merge → reap
4. REQUEST_CHANGES → enqueue `REVISING` tick: resume worker session with critic comments as next user message, capped iterations
5. BLOCK (sev1) → ABANDONED, operator label, worktree preserved

### 6. Termination conditions (must be explicit, not implicit)

- `max_critic_iterations: 3` (config) — past this, ABANDONED with `loop:needs-review`
- Per-worker hard budget (turns or wallclock) carried across iterations, not reset
- Critic verdict BLOCK = immediate abandon
- Operator `loop:abandon` label = abandon next tick

### 7. Crash recovery

If the loop runner dies mid-session: on restart, scan `worker_sessions` for non-terminal states. Each gets:
- Worktree path probed (exists? clean? on expected branch?)
- SDK `session_id` probed (resumable?)
- If both OK → re-enter `RUNNING` or `AWAITING_CRITIC` based on PR state
- If worktree/session lost → ABANDONED with `crash_recovery_failed` event

### 8. Migration

- Kill #78 (multi-iteration ping-pong) — subsumed
- Prereq: #85 (PO/critic SDK migration) so critic emits typed `CriticReport` not gh-cli scraping
- Prereq: #88 (typed events) — `WorkerStateTransition` event needs typed schema
- Old one-shot path: gated off behind `LOOP_PERSISTENT_WORKER=0` for one release, then deleted

## Acceptance

1. A worker that gets `REQUEST_CHANGES` from critic resumes in the same SDK session — verifiable via session_id in events log + prompt cache hit ratio
2. Worktree survives critic round-trip (no re-clone, no re-checkout)
3. Paused workers don't block dispatch of new ones (parallel slot freed)
4. Crash mid-session → recovery picks up correctly (integration test with SIGKILL of runner)
5. `max_critic_iterations` cap enforced (test: critic always REQUEST_CHANGES → worker abandoned at N=3)
6. Cost telemetry shows iteration 2+ is ≥30% cheaper than iteration 1 (prompt cache hit)

## Non-goals

- Worker-to-worker collaboration (out of scope)
- Critic acting as a separate "pair" (still one-shot per round)
- Distributed runner (single-host SQLite remains source of truth)

## File pointers

- `src/forge_loop/runner/tick.py` — state machine + dispatch loop
- `src/forge_loop/_worker_sdk.py` — add `session_id` resume param
- `src/forge_loop/store/` (new) — `worker_sessions` SQLite schema + DAO
- `src/forge_loop/critic/` (refactor) — typed `CriticReport`
- `src/forge_loop/config.py` — `max_critic_iterations`, `LOOP_PERSISTENT_WORKER` flag


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

epic: persistent worker sessions with critic ping-pong resumption (kills #78) #95

Why

Design (architected, not bolted on)

1. Worker lifecycle as an explicit state machine

2. Session persistence via Claude Agent SDK `session_id`

3. Worktree lifecycle bound to session lifecycle

4. Parallel-slot accounting

5. Critic ping-pong protocol

6. Termination conditions (must be explicit, not implicit)

7. Crash recovery

8. Migration

Acceptance

Non-goals

File pointers

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

epic: persistent worker sessions with critic ping-pong resumption (kills #78) #95

Description

Why

Design (architected, not bolted on)

1. Worker lifecycle as an explicit state machine

2. Session persistence via Claude Agent SDK session_id

3. Worktree lifecycle bound to session lifecycle

4. Parallel-slot accounting

5. Critic ping-pong protocol

6. Termination conditions (must be explicit, not implicit)

7. Crash recovery

8. Migration

Acceptance

Non-goals

File pointers

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

2. Session persistence via Claude Agent SDK `session_id`