fix(ateom-gvisor): make eth0 move/restore idempotent + roll back on error #66
Now that the three companion changes are filed upstream as

- NVIDIA/OpenShell#1548 (env-var-gated best-effort bootstrap)
- agent-substrate/substrate#66 (ateom-gvisor eth0 fix)
- agent-substrate/substrate#67 (install-ate.sh publishes ateom-gvisor)

rewrite the README + poc-intro.md to point at the PRs rather than at specific commits or fork branches. Easier to follow for any reader who isn't already deep in our local-fork state. Also fold the operator-handshake follow-up into the §3 component table and §9 "Where to next" list with the PR reference.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Items 1-3 in §9 "Where to next" have all been filed as PRs (NVIDIA/OpenShell#1548, agent-substrate/substrate#66, #67); marking them with strike-through and an "awaiting review" callout so readers don't think they're still TODO.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Force-pushed from adcb8d2 to 6580a5f.
Looking at #66 and #75 side by side, both handle partial state on failure in the actor lifecycle, but they pick different shapes.
One question: is the difference deliberate per layer (inline in the data plane, reconcile in the control plane), or is it worth stating a convention so future PRs choose consistently? (Why I'm asking: if more ateom-gvisor mutations grow the same ensure+rollback pattern, an extracted helper would save each call site from re-deriving its rollback steps. If periodic reconcile is the longer-term preference, this is the right local fix and probably shouldn't generalize.)
Alex Bulankou (@AlexBulankou) is this you asking?

Davanum Srinivas (@dims) the question was not from me, but from my bot acting on my standing instructions. It flagged the difference between how your own PR #75 handles failure state and recovery and how this PR does.

Davanum Srinivas (@dims) This looks good. Can you rebase?
ateom-gvisor moves the pod's eth0 interface into the actor's interior netns before invoking runsc run/restore. The current code assumes eth0 is in the pod netns at the start of every RunWorkload/RestoreWorkload. That assumption breaks if a previous call failed mid-flight: eth0 is left stranded in the interior netns, the next call errors with "eth0: Link not found", and the worker pod is stuck until restart.

Add ensureEth0InPodNetns that moves eth0 back to the pod netns if it is in the interior, and is a no-op otherwise. Call it at the top of both entry points to recover from a previously-aborted call, and via a deferred handler after the eth0-into-interior move to roll back on errors in the current call. The success path is unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
On it!

Force-pushed from 6580a5f to 2c77ca8.
`ateom-gvisor` moves the pod's `eth0` interface into the actor's interior netns before invoking `runsc run` / `runsc restore`. The current code assumes `eth0` is in the pod netns at the start of every `RunWorkload` / `RestoreWorkload`. If a previous call fails mid-flight, `eth0` is left stranded in the interior netns and every subsequent call on the same worker fails with `eth0: Link not found` until the pod is restarted.

This change adds two complementary safeguards:

- `ensureEth0InPodNetns` at the top of both entry points. Moves `eth0` back to the pod netns when it is currently in the interior; no-op otherwise. Recovers from a previously-aborted call.
- A deferred handler after the `eth0`-into-interior move. Returns `eth0` to the pod netns if the current call returns an error.

Hammer `CreateActor` + `ResumeActor` + `SuspendActor` cycles against a worker pod. Before this change, every second iteration trips the original failure path and the third leaves the pod unusable. After this change, the failure does not recur.