fix(ateom-gvisor): make eth0 move/restore idempotent + roll back on error#66

Merged
Taahir Ahmed (ahmedtd) merged 1 commit into
agent-substrate:mainfrom
dims:fix/ateom-gvisor-eth0-rollback
May 26, 2026

Conversation

@dims
Collaborator

@dims Davanum Srinivas (dims) commented May 23, 2026

ateom-gvisor moves the pod's eth0 interface into the actor's interior netns before invoking runsc run/runsc restore. The current code assumes eth0 is in the pod netns at the start of every RunWorkload/RestoreWorkload. If a previous call fails mid-flight, eth0 is left stranded in the interior netns, and every subsequent call on the same worker fails with "eth0: Link not found" until the pod is restarted.

This change adds two complementary safeguards:

  1. ensureEth0InPodNetns at the top of both entry points. Moves eth0 back to the pod netns when it is currently in the interior; no-op otherwise. Recovers from a previously-aborted call.
  2. Deferred rollback after the eth0-into-interior move. Returns eth0 to the pod netns if the current call returns an error.
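The two safeguards can be sketched as follows. This is a minimal in-memory model, not the code in this PR: `fakeLinks`, `moveEth0`, `runWorkload`, and `errStartFailed` are hypothetical names invented for illustration, and the real implementation operates on actual network namespaces rather than a struct field. Only the name `ensureEth0InPodNetns` comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// netnsID names a network namespace in this in-memory model (hypothetical).
type netnsID string

const (
	podNetns      netnsID = "pod"
	interiorNetns netnsID = "interior"
)

// errStartFailed stands in for a runsc run/restore failure (hypothetical).
var errStartFailed = errors.New("runsc run failed")

// fakeLinks records which netns currently holds eth0 (hypothetical stand-in
// for real netlink state).
type fakeLinks struct {
	eth0In netnsID
}

// moveEth0 fails the same way the stranded case does if eth0 is not in `from`.
func (f *fakeLinks) moveEth0(from, to netnsID) error {
	if f.eth0In != from {
		return fmt.Errorf("eth0: Link not found in %s netns", from)
	}
	f.eth0In = to
	return nil
}

// ensureEth0InPodNetns moves eth0 back to the pod netns when it is in the
// interior netns, and is a no-op otherwise.
func ensureEth0InPodNetns(l *fakeLinks) error {
	if l.eth0In == podNetns {
		return nil
	}
	return l.moveEth0(interiorNetns, podNetns)
}

// runWorkload models an entry point: recover first, move eth0, then roll the
// move back if anything after it returns an error.
func runWorkload(l *fakeLinks, start func() error) (err error) {
	// Safeguard 1: recover from a previously aborted call.
	if e := ensureEth0InPodNetns(l); e != nil {
		return fmt.Errorf("recovering eth0: %w", e)
	}
	if e := l.moveEth0(podNetns, interiorNetns); e != nil {
		return e
	}
	// Safeguard 2: deferred rollback, only on the error path.
	defer func() {
		if err != nil {
			if e := ensureEth0InPodNetns(l); e != nil {
				err = errors.Join(err, fmt.Errorf("rollback: %w", e))
			}
		}
	}()
	return start()
}

func main() {
	// Simulate a worker whose previous call aborted with eth0 stranded.
	links := &fakeLinks{eth0In: interiorNetns}

	err := runWorkload(links, func() error { return errStartFailed })
	fmt.Println("first call error:", err)
	fmt.Println("eth0 now in:", links.eth0In)

	err = runWorkload(links, func() error { return nil })
	fmt.Println("second call error:", err)
	fmt.Println("eth0 now in:", links.eth0In)
}
```

The named return value is what lets the deferred function see the error from `start()`; on success it does nothing, so the success path is unchanged.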

Tested by hammering CreateActor + ResumeActor + SuspendActor cycles against a worker pod. Before this change, every second iteration trips the original failure path and the third leaves the pod unusable. After this change, the failure does not recur.

Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 23, 2026
Now that the three companion changes are filed upstream as
- NVIDIA/OpenShell#1548 (env-var-gated best-effort bootstrap)
- agent-substrate/substrate#66 (ateom-gvisor eth0 fix)
- agent-substrate/substrate#67 (install-ate.sh publishes ateom-gvisor)

rewrite the README + poc-intro.md to point at the PRs rather than at
specific commits or fork branches. Easier to follow for any reader
who isn't already deep in our local-fork state.

Also fold the operator-handshake follow-up into the §3 component table
and §9 "Where to next" list with the PR reference.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 23, 2026
Items 1-3 in §9 "Where to next" have all been filed as PRs (NVIDIA/OpenShell#1548, agent-substrate/substrate#66, #67); marking them with strike-through and an "awaiting review" callout so readers don't think they're still TODO.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
@dims Davanum Srinivas (dims) force-pushed the fix/ateom-gvisor-eth0-rollback branch from adcb8d2 to 6580a5f Compare May 24, 2026 21:45
@a4-a4s1

a4-a4s1 Bot commented May 26, 2026

Looking at #66 and #75 side by side — both handle partial-state-on-failure for the actor lifecycle, but pick different shapes:

One Q: is the difference deliberate per-layer (data-plane inline, control-plane reconcile), or worth a stated convention so future PRs pick consistently?

(Why I'm asking: if more ateom-gvisor mutations grow the same ensure+rollback pattern, an extracted helper avoids each call-site re-deriving rollback steps. If periodic-reconcile is the longer-arc preference, this is the right local fix and probably shouldn't generalize.)
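For illustration only, an extracted helper of the kind the comment describes might look like the sketch below. `withRollback`, its parameters, and `errBodyFailed` are hypothetical names; nothing like this exists in either PR as far as this thread shows.

```go
package main

import (
	"errors"
	"fmt"
)

// errBodyFailed stands in for a failure after the mutation (hypothetical).
var errBodyFailed = errors.New("runsc failed")

// withRollback applies a mutation, runs body, and undoes the mutation only
// when body returns an error. A failed apply is returned without calling undo.
func withRollback(apply, undo, body func() error) (err error) {
	if e := apply(); e != nil {
		return e
	}
	defer func() {
		if err != nil {
			if e := undo(); e != nil {
				// Keep both the original error and the rollback error.
				err = errors.Join(err, fmt.Errorf("rollback: %w", e))
			}
		}
	}()
	return body()
}

func main() {
	moved := false
	err := withRollback(
		func() error { moved = true; return nil },  // e.g. move eth0 into the interior
		func() error { moved = false; return nil }, // e.g. move eth0 back to the pod
		func() error { return errBodyFailed },
	)
	fmt.Println("error:", err, "moved:", moved)
}
```

A helper like this only covers rollback within the current call; recovering from a call that was aborted before its deferred handler ran still needs an idempotent ensure step at the entry point.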

@dims
Collaborator Author

Alex Bulankou (@AlexBulankou) is this you asking?

@AlexBulankou
Collaborator

Davanum Srinivas (@dims) the question was not from me, but from my bot acting on my standing instructions. It identified differences between how your own PR #75 handles failure state and recovery and how this PR does.
It brought this up because my standing request to the bot is to comment on PRs where it finds inconsistencies between technical approaches across multiple in-progress PRs, to keep the codebase consistent.
Let me know if you believe the comment is out of place or misses your approach altogether; that would be helpful context for the bot.

@ahmedtd
Collaborator

Davanum Srinivas (@dims) This looks good. Can you rebase?

…rror

ateom-gvisor moves the pod's eth0 interface into the actor's interior
netns before invoking runsc run/restore. The current code assumes eth0
is in the pod netns at the start of every RunWorkload/RestoreWorkload.
That assumption breaks if a previous call failed mid-flight: eth0 is
left stranded in the interior netns, the next call errors with
"eth0: Link not found", and the worker pod is stuck until restart.

Add ensureEth0InPodNetns that moves eth0 back to the pod netns if it
is in the interior, and is a no-op otherwise. Call it at the top of
both entry points to recover from a previously-aborted call, and via
a deferred handler after the eth0-into-interior move to roll back on
errors in the current call. The success path is unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims
Collaborator Author

> Davanum Srinivas (@dims) This looks good. Can you rebase?

On it!

@dims Davanum Srinivas (dims) force-pushed the fix/ateom-gvisor-eth0-rollback branch from 6580a5f to 2c77ca8 Compare May 26, 2026 22:42
@ahmedtd Taahir Ahmed (ahmedtd) merged commit de2ad8a into agent-substrate:main May 26, 2026
8 checks passed
@BenTheElder Benjamin Elder (BenTheElder) added the bug Something isn't working / bugfixes label May 26, 2026
4 participants