ateapi/syncer: release actor when host pod is deleted by dims · Pull Request #75 · agent-substrate/substrate

Davanum Srinivas (dims) · 2026-05-24T18:08:54Z

Today, if a worker pod that hosts an active actor is forcibly destroyed, substrate does not migrate the actor. The actor's AteomPodName still points at the dead pod and the router times out forwarding to a dead IP.

WorkerPoolSyncer's pod-informer DeleteFunc and soft-delete branch now check whether the worker being removed is bound to an actor, and if so, reset that actor to STATUS_SUSPENDED before the worker row goes away.

The new helper releaseActorOnDeadWorker reads the actor, only acts if the actor still claims this worker (a concurrent SuspendActor or DeleteActor that already advanced state is respected), clears the pod-binding fields and InProgressSnapshot, preserves LastSnapshot, and writes via version-checked UpdateActor. On ErrPersistenceRetry we drop the attempt and let the next informer event retry — no separate lock, no in-handler retry budget.
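The release steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Actor` struct, the in-memory `Store`, the `Status` constants, and the helper's signature are all hypothetical stand-ins, and only the field and function names mentioned in the description are taken from the source.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for illustration; the real substrate types differ.
type Status int

const (
	StatusRunning Status = iota
	StatusSuspended
)

type Actor struct {
	ID                 string
	Status             Status
	Version            int64
	AteomPodNamespace  string
	AteomPodName       string
	AteomPodIp         string
	InProgressSnapshot string
	LastSnapshot       string
}

var ErrPersistenceRetry = errors.New("persistence: version conflict")

// Store is an assumed in-memory store with a version-checked update.
type Store struct{ actors map[string]Actor }

func (s *Store) GetActor(id string) (Actor, bool) {
	a, ok := s.actors[id]
	return a, ok
}

// UpdateActor writes only if the caller read the currently stored version.
func (s *Store) UpdateActor(a Actor) error {
	cur, ok := s.actors[a.ID]
	if !ok || cur.Version != a.Version {
		return ErrPersistenceRetry
	}
	a.Version++
	s.actors[a.ID] = a
	return nil
}

// releaseActorOnDeadWorker resets the actor to suspended if, and only if,
// it still claims the worker pod that is being removed.
func releaseActorOnDeadWorker(s *Store, actorID, podNamespace, podName string) error {
	a, ok := s.GetActor(actorID)
	if !ok {
		return nil // actor already deleted; nothing to release
	}
	if a.AteomPodNamespace != podNamespace || a.AteomPodName != podName {
		return nil // a concurrent suspend/delete already advanced state
	}
	a.Status = StatusSuspended
	a.AteomPodNamespace, a.AteomPodName, a.AteomPodIp = "", "", ""
	a.InProgressSnapshot = "" // the in-flight checkpoint is dropped
	// LastSnapshot is deliberately left untouched.
	return s.UpdateActor(a) // may return ErrPersistenceRetry on contention
}

func main() {
	s := &Store{actors: map[string]Actor{
		"a1": {ID: "a1", Status: StatusRunning, Version: 3,
			AteomPodNamespace: "ns", AteomPodName: "worker-0", AteomPodIp: "10.0.0.5",
			InProgressSnapshot: "snap-wip", LastSnapshot: "snap-41"},
	}}
	err := releaseActorOnDeadWorker(s, "a1", "ns", "worker-0")
	a, _ := s.GetActor("a1")
	fmt.Println(err, a.Status == StatusSuspended, a.AteomPodName == "", a.LastSnapshot)
}
```

The ownership check before the write is what lets a concurrent SuspendActor or DeleteActor win without any additional lock; the version check covers the window between the read and the write.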

STATUS_SUSPENDED is reused rather than introducing a new state. The post-orphan invariant is identical and reusing keeps the proto, printer, dialer, and router contracts unchanged. The next request through atenet triggers an implicit resume; findFreeWorker picks any free worker in the pool; LastSnapshot is restored.
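As a rough sketch of the implicit-resume path described above, assuming a much simpler model than the real one: `Worker`, `Actor`, `resumeActor`, and the `RestoredFrom` field are hypothetical, and the signature of `findFreeWorker` here is an assumption (only its name appears in the PR).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified types for illustration only.
type Worker struct {
	Name    string
	IP      string
	ActorID string // empty means the worker is free
}

type Actor struct {
	ID           string
	Suspended    bool
	AteomPodName string
	AteomPodIp   string
	LastSnapshot string
	RestoredFrom string // records which snapshot the resume used
}

var errNoFreeWorker = errors.New("no free worker in pool")

// findFreeWorker returns the first unbound worker in the pool.
func findFreeWorker(pool []Worker) (*Worker, error) {
	for i := range pool {
		if pool[i].ActorID == "" {
			return &pool[i], nil
		}
	}
	return nil, errNoFreeWorker
}

// resumeActor binds a suspended actor to any free worker and restores it
// from LastSnapshot. A running actor is left as-is.
func resumeActor(a *Actor, pool []Worker) error {
	if !a.Suspended {
		return nil
	}
	w, err := findFreeWorker(pool)
	if err != nil {
		return err
	}
	w.ActorID = a.ID
	a.AteomPodName, a.AteomPodIp = w.Name, w.IP
	a.RestoredFrom = a.LastSnapshot
	a.Suspended = false
	return nil
}

func main() {
	pool := []Worker{
		{Name: "worker-1", IP: "10.0.0.6", ActorID: "other"},
		{Name: "worker-2", IP: "10.0.0.7"},
	}
	a := &Actor{ID: "a1", Suspended: true, LastSnapshot: "snap-41"}
	err := resumeActor(a, pool)
	fmt.Println(err, a.AteomPodName, a.RestoredFrom)
}
```

Because a released actor looks exactly like any other suspended actor, this path needs no knowledge that the previous host pod was lost.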

Test plan: TestSyncer_DeleteBoundWorker_ClearsActor covers RUNNING actor → pod delete → SUSPENDED with cleared bind fields, InProgressSnapshot dropped, LastSnapshot preserved. All existing tests in cmd/ateapi/internal/controlapi/ still pass.

Six-beat OpenShell-on-Substrate scenario: cold ask, suspend, idle, follow-up with memory preserved, exfil deny, and pod-kill migration. Verified end-to-end on a kind cluster running substrate `main` plus `agent-substrate/substrate#75` (`ateapi/syncer: release actor when host pod is deleted`), which closes the gap behind Beat 6.

The example reuses `tests/integration/build-image.sh` for the supervisor image; the helpdesk-specific files (Python agent, OPA policy data, OpenShell route config, substrate ActorTemplate, thin derivative Dockerfile, six-beat driver script) live under `examples/helpdesk/`. `README.md` is self-contained: prereqs, quick-start, expected output, troubleshooting, cleanup.

`routes.yaml` ships as a template carrying a `<your-ollama-cloud-key>` placeholder; operators stage `routes.local.yaml` with a real Ollama Cloud free-tier key. `*.local.yaml` is gitignored at the example root.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

Both the top-level README and docs/poc-intro.md predate the M3 wiring and the point at which the driver became load-bearing in a real gateway. Update both to:

- Note M3.14 + M3.16 commits on dims/OpenShell@chore/gvisor-degraded-netns as the gateway-side wiring that makes the crate load-bearing.
- Surface the driver-driven 10-beat helpdesk demo at examples/helpdesk/ as the canonical integration showcase.
- Add agent-substrate/substrate#75 (actor migration on pod loss) to the companion-change list in poc-intro.md.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

The base README only mentioned the helpdesk example in passing inside the "What's in the box" table, and never linked to docs/poc-intro.md at all. Both are the actual entry points for newcomers: the joint architecture overview and the demo walkthrough.

Add a "Read first" block right under the title that names both with one-line summaries, so a teammate landing on the repo can pick the right entry point without scanning past the prerequisites.

Also add agent-substrate/substrate#75 (actor migration on pod loss) to the Companion Changes table; it was missing despite being a prerequisite for the helpdesk demo's pod-kill migration beat.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

@juli4n

The previous comment claimed contention would "rely on the next informer event to retry", but the caller unconditionally proceeds to DeleteWorker after this returns, so any retry path is short-circuited by GetWorker returning ErrNotFound on subsequent events.

Replace the misleading claim with an honest "best-effort" description and a link to agent-substrate#23 for the long-term finalizer-based fix.

Per code review feedback from @juli4n on PR agent-substrate#75.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
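The short-circuit described in this commit message can be reproduced with a toy model. Everything below is a hypothetical sketch: the `Store`, the `handlePodDelete` handler, and the injected `release` callback are illustrative names, not the syncer's actual code. Only the call order (release attempt, then unconditional DeleteWorker) follows the description.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrNotFound         = errors.New("worker not found")
	ErrPersistenceRetry = errors.New("persistence: version conflict")
)

// Hypothetical worker row and store, for illustration only.
type Worker struct {
	Name    string
	ActorID string
}

type Store struct{ workers map[string]Worker }

func (s *Store) GetWorker(name string) (Worker, error) {
	w, ok := s.workers[name]
	if !ok {
		return Worker{}, ErrNotFound
	}
	return w, nil
}

func (s *Store) DeleteWorker(name string) { delete(s.workers, name) }

// handlePodDelete mimics the described flow: attempt the release, then delete
// the worker row regardless of the outcome. It reports whether a release was
// attempted for this event.
func handlePodDelete(s *Store, podName string, release func(actorID string) error) bool {
	w, err := s.GetWorker(podName)
	if err != nil {
		return false // worker row already gone: nothing left to look up
	}
	attempted := false
	if w.ActorID != "" {
		attempted = true
		_ = release(w.ActorID) // best-effort; the error is dropped
	}
	s.DeleteWorker(podName)
	return attempted
}

func main() {
	s := &Store{workers: map[string]Worker{"worker-0": {Name: "worker-0", ActorID: "a1"}}}
	alwaysContended := func(string) error { return ErrPersistenceRetry }

	first := handlePodDelete(s, "worker-0", alwaysContended)
	second := handlePodDelete(s, "worker-0", alwaysContended)
	fmt.Println("first event attempted release:", first)
	fmt.Println("second event attempted release:", second)
}
```

In this model the second event never reaches the release call, because the worker row that links the pod to the actor was removed by the first event. That is why the description was changed to "best-effort".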

Today, if a worker pod that hosts an active actor is forcibly destroyed, substrate does not migrate the actor. Actor.AteomPodName still points at the dead pod and the router times out forwarding to a dead IP.

The WorkerPoolSyncer's pod-informer DeleteFunc and soft-delete branch (syncWorkerToStore with DeletionTimestamp != nil) now check whether the worker being removed is bound to an actor, and if so, reset that actor to STATUS_SUSPENDED before the worker row goes away.

The new helper releaseActorOnDeadWorker reads the actor, only acts if the actor still claims this worker (a concurrent SuspendActor / DeleteActor that already advanced state is respected), clears AteomPod{Namespace,Name,Ip} and InProgressSnapshot, preserves LastSnapshot (the previous *successful* checkpoint), and writes via version-checked UpdateActor. On version contention (ErrPersistenceRetry) we drop the attempt silently.

Best-effort only: the caller always proceeds to DeleteWorker after the release attempt, so any non-contention failure leaves the actor stranded (STATUS_RUNNING, pointer at a now-deleted worker). Recovery then needs a manual SuspendActor. The long-term fix is a finalizer-based controller that holds the pod in Terminating state until the actor is gracefully suspended, tracked in agent-substrate#23.

STATUS_SUSPENDED is reused as the target state rather than introducing a new value: the post-orphan invariant is identical and reusing keeps the proto / printer / dialer / router contracts unchanged. The next request through atenet triggers an implicit resume; findFreeWorker picks any free worker in the pool; LastSnapshot is restored.

Test plan:
- TestSyncer_DeleteBoundWorker_ClearsActor: RUNNING actor → pod delete → SUSPENDED with cleared bind fields, InProgressSnapshot dropped, LastSnapshot preserved.

All existing tests in cmd/ateapi/internal/controlapi/ still pass.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Davanum Srinivas (dims) force-pushed the fix/actor-resume-recovery branch 2 times, most recently from 314c39d to 803304b Compare May 24, 2026 20:53

a4-a4s1 Bot mentioned this pull request May 26, 2026

fix(ateom-gvisor): make eth0 move/restore idempotent + roll back on error #66

Merged

Julian Gutierrez Oschmann (juli4n) reviewed May 26, 2026

Comment thread cmd/ateapi/internal/controlapi/syncer.go Outdated

Davanum Srinivas (dims) force-pushed the fix/actor-resume-recovery branch from 803304b to 6450112 Compare May 26, 2026 18:33

Davanum Srinivas (dims) force-pushed the fix/actor-resume-recovery branch from 6450112 to c6d4031 Compare May 26, 2026 18:33

Benjamin Elder (BenTheElder) assigned Julian Gutierrez Oschmann (juli4n) May 27, 2026

Merge branch 'main' into fix/actor-resume-recovery

7e2cd9c

Julian Gutierrez Oschmann (juli4n) merged commit 1a852c8 into agent-substrate:main May 27, 2026
8 checks passed

Julian Gutierrez Oschmann (juli4n) merged 2 commits into agent-substrate:main from dims:fix/actor-resume-recovery
