fix(ateom-gvisor): make eth0 move/restore idempotent + roll back on error by dims · Pull Request #66 · agent-substrate/substrate

Davanum Srinivas (dims) · 2026-05-23T15:12:47Z

ateom-gvisor moves the pod's eth0 interface into the actor's interior netns before invoking runsc run/runsc restore. The current code assumes eth0 is in the pod netns at the start of every RunWorkload/RestoreWorkload. If a previous call fails mid-flight, eth0 is left stranded in the interior netns and every subsequent call on the same worker fails with eth0: Link not found until the pod is restarted.

This change adds two complementary safeguards:

ensureEth0InPodNetns at the top of both entry points. Moves eth0 back to the pod netns when it is currently in the interior; no-op otherwise. Recovers from a previously-aborted call.
Deferred rollback after the eth0-into-interior move. Returns eth0 to the pod netns if the current call returns an error.

Hammer CreateActor + ResumeActor + SuspendActor cycles against a worker pod. Before this change, every second iteration trips the original failure path and the third leaves the pod unusable. After this change, the failure does not recur.

Now that the three companion changes are filed upstream as
 - NVIDIA/OpenShell#1548 (env-var-gated best-effort bootstrap)
 - agent-substrate/substrate#66 (ateom-gvisor eth0 fix)
 - agent-substrate/substrate#67 (install-ate.sh publishes ateom-gvisor)
rewrite the README + poc-intro.md to point at the PRs rather than at specific commits or fork branches. Easier to follow for any reader who isn't already deep in our local-fork state.

Also fold the operator-handshake follow-up into the §3 component table and §9 "Where to next" list with the PR reference.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Items 1-3 in §9 "Where to next" have all been filed as PRs (NVIDIA/OpenShell#1548, agent-substrate/substrate#66, #67); marking them with strike-through and an "awaiting review" callout so readers don't think they're still TODO.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

a4-a4s1 · 2026-05-26T14:54:18Z

Looking at #66 and #75 side by side — both handle partial-state-on-failure for the actor lifecycle, but pick different shapes:

#66 "fix(ateom-gvisor): make eth0 move/restore idempotent + roll back on error" (ateom-gvisor): inline ensure-then-act, rollback wired per call-site
#75 "ateapi/syncer: release actor when host pod is deleted" (WorkerPoolSyncer informer): event-driven, controller re-reconciles on delete

One Q: is the difference deliberate per-layer (data-plane inline, control-plane reconcile), or worth a stated convention so future PRs pick consistently?

(Why I'm asking: if more ateom-gvisor mutations grow the same ensure+rollback pattern, an extracted helper avoids each call-site re-deriving rollback steps. If periodic-reconcile is the longer-arc preference, this is the right local fix and probably shouldn't generalize.)

Davanum Srinivas (dims) · 2026-05-26T15:27:43Z

Alex Bulankou (@AlexBulankou) is this you asking?

Alex Bulankou (AlexBulankou) · 2026-05-26T21:49:11Z

Davanum Srinivas (@dims) the question was not from me, but from my bot acting on my standing instructions. It identified differences in how your own PR #75 handles failure state and recovery compared to this PR.
It raised the point because my standing request to the bot is to comment on PRs where it finds inconsistencies between technical approaches across multiple in-progress PRs, to keep the codebase consistent.
Let me know if you believe the comment is out of place or misses your approach altogether; that would be helpful context for the bot.

Taahir Ahmed (ahmedtd) · 2026-05-26T22:01:08Z

Davanum Srinivas (@dims) This looks good. Can you rebase?

…rror

ateom-gvisor moves the pod's eth0 interface into the actor's interior netns before invoking runsc run/restore. The current code assumes eth0 is in the pod netns at the start of every RunWorkload/RestoreWorkload. That assumption breaks if a previous call failed mid-flight: eth0 is left stranded in the interior netns, the next call errors with "eth0: Link not found", and the worker pod is stuck until restart.

Add ensureEth0InPodNetns that moves eth0 back to the pod netns if it is in the interior, and is a no-op otherwise. Call it at the top of both entry points to recover from a previously-aborted call, and via a deferred handler after the eth0-into-interior move to roll back on errors in the current call. The success path is unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Davanum Srinivas (dims) · 2026-05-26T22:42:25Z

Davanum Srinivas (@dims) This looks good. Can you rebase?

On it!

Dmitry Berkovich (dberkov) requested a review from Taahir Ahmed (ahmedtd) May 23, 2026 19:07

Davanum Srinivas (dims) force-pushed the fix/ateom-gvisor-eth0-rollback branch from adcb8d2 to 6580a5f Compare May 24, 2026 21:45

Taahir Ahmed (ahmedtd) approved these changes May 26, 2026

Davanum Srinivas (dims) force-pushed the fix/ateom-gvisor-eth0-rollback branch from 6580a5f to 2c77ca8 Compare May 26, 2026 22:42

Taahir Ahmed (ahmedtd) merged commit de2ad8a into agent-substrate:main May 26, 2026
8 checks passed

Benjamin Elder (BenTheElder) added the bug Something isn't working / bugfixes label May 26, 2026

a4-a4s1 Bot mentioned this pull request May 28, 2026

Substrate suspend/resume PoC (Dmitry Berkovich coordination — a4s1 umbrella) AlexBulankou/substrate-poc1#1

Open

8 tasks

Taahir Ahmed (ahmedtd) merged 1 commit into agent-substrate:main from dims:fix/ateom-gvisor-eth0-rollback
