OCPBUGS-86826: Make vsphere template updates atomic #6117
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@djoshy: This pull request references Jira Issue OCPBUGS-86826, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker. |
|
No actionable comments were generated in the recent review.
Walkthrough
Adds deterministic temp/rollback VM naming, pre-import cleanup of stale temp/rollback VMs, importing OVFs to a deterministic temp VM, atomic rename-based template replacement (rename old→rollback, temp→prod, destroy rollback), and reconciliation recovery that renames rollback back to production when needed.
Changes
Template Replacement Atomicity
sequenceDiagram
participant Reconciler
participant vSphere
Reconciler->>vSphere: compute atomicTempName and rollback name
Reconciler->>vSphere: destroyVMIfPresent stale temp/rollback
Reconciler->>vSphere: import OVF to temp VM name
alt production template exists
Reconciler->>vSphere: swapTemplate (rename Prod→rollback)
Reconciler->>vSphere: rename temp→Prod
Reconciler->>vSphere: destroy rollback
else
Reconciler->>vSphere: rename temp→Prod
end
alt production template missing during reconcile
Reconciler->>vSphere: locate rollback VM by name
Reconciler->>vSphere: rename rollback→Prod
Reconciler->>vSphere: re-fetch Prod reference
end
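The sequence diagram above can be sketched in Go as follows. This is a minimal illustration, not the code from the PR: the `inventory` type is a hypothetical in-memory stand-in for vSphere, the names `tmpl`, `mco-tmp-tmpl`, and `mco-old-tmpl` are assumed examples of the production, temp, and rollback names, and `recoverTemplate` is an assumed name for the reconcile-time recovery step.

```go
package main

import (
	"errors"
	"fmt"
)

// inventory is a hypothetical in-memory stand-in for the vSphere inventory.
// It maps a VM name to the image version it holds; it is not the govmomi API.
type inventory struct {
	vms        map[string]string
	failRename map[string]bool // source names whose rename should fail
}

func (inv *inventory) rename(from, to string) error {
	if inv.failRename[from] {
		return fmt.Errorf("rename %q: simulated network failure", from)
	}
	version, ok := inv.vms[from]
	if !ok {
		return fmt.Errorf("rename %q: not found", from)
	}
	if _, taken := inv.vms[to]; taken {
		return fmt.Errorf("rename %q: target %q already exists", from, to)
	}
	delete(inv.vms, from)
	inv.vms[to] = version
	return nil
}

// swapTemplate moves prod aside to rollback, promotes temp to prod, and only
// then destroys the rollback copy.
func swapTemplate(inv *inventory, prod, temp, rollback string) error {
	if _, ok := inv.vms[prod]; ok {
		if err := inv.rename(prod, rollback); err != nil {
			return fmt.Errorf("moving %q aside: %w", prod, err)
		}
	}
	if err := inv.rename(temp, prod); err != nil {
		return fmt.Errorf("promoting %q: %w", temp, err)
	}
	delete(inv.vms, rollback) // no-op when there was no previous template
	return nil
}

// recoverTemplate restores prod from the rollback copy when prod is missing.
func recoverTemplate(inv *inventory, prod, rollback string) error {
	if _, ok := inv.vms[prod]; ok {
		return nil
	}
	if _, ok := inv.vms[rollback]; !ok {
		return errors.New("neither production nor rollback template exists")
	}
	return inv.rename(rollback, prod)
}

func main() {
	inv := &inventory{
		vms:        map[string]string{"tmpl": "v1", "mco-tmp-tmpl": "v2"},
		failRename: map[string]bool{"mco-tmp-tmpl": true}, // fail the promotion
	}
	fmt.Println("swap:", swapTemplate(inv, "tmpl", "mco-tmp-tmpl", "mco-old-tmpl"))
	fmt.Println("recover:", recoverTemplate(inv, "tmpl", "mco-old-tmpl"))
	fmt.Println("production now holds:", inv.vms["tmpl"])
}
```

In this sketch the old template is destroyed only after the new one carries the production name, so a failure between the two renames leaves a rollback copy that the recovery step can rename back.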
Estimated code review effort: 4 (Complex), ~45 minutes
Pre-merge checks: 15 passed
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
golangci-lint (2.12.2): Command failed |
|
@djoshy: This pull request references Jira Issue OCPBUGS-86826, which is valid. 3 validation(s) were run on this bug. |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/81361080-5e98-11f1-8281-d019b8daff8d-0 |
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/bootimage/vsphere_helpers.go (1)
321-333: Low value. Consider also cleaning up stale `oldTempName` VMs.
Currently, only `tempName` (mco-tmp-*) is cleaned up at the start. If a previous `swapTemplate` completed both renames but the final `Destroy` failed, the old template under `oldTempName` (mco-old-*) would remain orphaned. Adding a cleanup call for `oldTempName` here would prevent accumulation of orphaned VMs from repeated destroy failures.
Suggested improvement:
     if err := cleanupStaleTempVM(ctx, finder, tempName); err != nil {
         return err
     }
    +if err := cleanupStaleTempVM(ctx, finder, oldTempName); err != nil {
    +    return err
    +}
Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/bootimage/vsphere_helpers.go` around lines 321 - 333, The current startup cleanup only calls cleanupStaleTempVM for tempName which leaves orphaned oldTempName VMs if a prior swap left mco-old-* behind; add a call to cleanupStaleTempVM(ctx, finder, oldTempName) alongside the existing cleanup for tempName (handle its error return the same way) before proceeding with swapTemplate/findAllRequiredResources so both tempName and oldTempName are cleaned up; reference tempName, oldTempName, cleanupStaleTempVM and ensure error is returned if cleanupStaleTempVM(ctx, finder, oldTempName) fails.
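The cleanup suggested in this comment could look roughly like the sketch below. It is an illustration under assumptions: `vmStore` and `cleanupStale` are hypothetical stand-ins for the finder and for `cleanupStaleTempVM`, and the VM names are assumed examples of the mco-tmp-* and mco-old-* patterns.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "no VM with that name".
var errNotFound = errors.New("vm not found")

// vmStore is a hypothetical stand-in for the vSphere finder plus destroy call.
type vmStore map[string]bool

func (s vmStore) destroy(name string) error {
	if !s[name] {
		return errNotFound
	}
	delete(s, name)
	return nil
}

// cleanupStale destroys each named leftover VM. A missing VM counts as
// success, so the call is safe to repeat on every reconcile.
func cleanupStale(s vmStore, names ...string) error {
	for _, name := range names {
		if err := s.destroy(name); err != nil && !errors.Is(err, errNotFound) {
			return fmt.Errorf("cleaning up stale VM %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	// A previous swap finished both renames but failed to destroy the old VM.
	s := vmStore{"tmpl": true, "mco-old-tmpl": true}
	if err := cleanupStale(s, "mco-tmp-tmpl", "mco-old-tmpl"); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println("VMs remaining:", len(s))
}
```

The sketch assumes that any recovery from the rollback copy has already run before cleanup, since destroying the rollback VM while the production template is missing would remove the only remaining copy.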
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 556-559: The error returned when finder.VirtualMachine(ctx,
oldTempName) fails currently wraps only the original err and omits oldErr;
update the error construction in the rollback VM lookup (around
finder.VirtualMachine / oldVM / oldErr) to include oldErr details (e.g., add
oldErr to the fmt.Errorf message or wrap it with %w) so the returned error
contains both the template not found context and the actual failure reason from
oldErr when looking up oldTempName alongside the existing err and variables
name/oldTempName.
---
Nitpick comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 321-333: The current startup cleanup only calls cleanupStaleTempVM
for tempName which leaves orphaned oldTempName VMs if a prior swap left
mco-old-* behind; add a call to cleanupStaleTempVM(ctx, finder, oldTempName)
alongside the existing cleanup for tempName (handle its error return the same
way) before proceeding with swapTemplate/findAllRequiredResources so both
tempName and oldTempName are cleaned up; reference tempName, oldTempName,
cleanupStaleTempVM and ensure error is returned if cleanupStaleTempVM(ctx,
finder, oldTempName) fails.
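The inline comment above asks for the rollback lookup error to carry both failures. A minimal sketch of one way to do that follows; `rollbackLookupError` and the two sentinel errors are hypothetical names for illustration, and the message wording is not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the two lookup failures.
var (
	errNotFound  = errors.New("object not found")
	errConnReset = errors.New("connection reset")
)

// rollbackLookupError wraps the original template lookup error with %w so
// callers can still match it, and appends the rollback lookup failure so the
// actual reason is visible in the message.
func rollbackLookupError(name, oldTempName string, err, oldErr error) error {
	return fmt.Errorf("template %q not found (%w); rollback VM %q lookup also failed: %v",
		name, err, oldTempName, oldErr)
}

func main() {
	err := rollbackLookupError("tmpl", "mco-old-tmpl", errNotFound, errConnReset)
	fmt.Println(err)
	fmt.Println("matches original error:", errors.Is(err, errNotFound))
}
```

Using a single `%w` with `%v` for the second error keeps the sketch compatible with Go versions that predate multiple `%w` verbs; on Go 1.20 or later, wrapping both errors is also an option.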
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 518a9907-6871-46fa-85ac-cc35b0a64636
📒 Files selected for processing (1)
pkg/controller/bootimage/vsphere_helpers.go
|
/payload-abort |
80b5b21 to 276fe58
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bb9eb130-5e9a-11f1-9e66-32f081cd610f-0 |
276fe58 to fab78e4
isabella-janssen left a comment
/lgtm
This looks fair to me
|
Scheduling tests matching the |
|
We will only run a regression to make sure that the new changes don't break anything. Manually launched executions: |
|
/test e2e-gcp-op-ocl-part2 |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive |
|
@sergiordlr: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9743f230-64a5-11f1-888c-7c4b8ea79051-0 |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive-techpreview |
|
@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e304ca0-64c3-11f1-83a9-335e639bf0fa-0 |
|
@sergiordlr: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/26579e40-64c4-11f1-96f2-617dd1b8636b-0 |
fab78e4 to 0ebbf63
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: djoshy, isabella-janssen |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive-techpreview
Job logs are looking for v1alpha1 osimagestream CRDs, which are no longer in use. I suspect the payload-job command is not merging against main, so I am rebasing and testing again. |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ff7af940-6b24-11f1-8ddf-24b62c38228e-0 |
|
/retest-required |
|
/verified by @sergiordlr
We will only verify that nothing is broken. Long duration jobs and vsphere disruptive suite test cases passed. |
|
@sergiordlr: This PR has been marked as verified by |
|
/cherry-pick release-4.22 release-4.21 release-4.20 |
|
@djoshy: once the present PR merges, I will cherry-pick it on top of |
|
/retest-required |
|
GCP failures are related to the RHEL10 switch and cannot be affected by vsphere changes in the bootimage controller. Overriding:
/override ci/prow/e2e-gcp-op-part2 |
|
@djoshy: Overrode contexts on behalf of djoshy: ci/prow/e2e-gcp-op-part2 |
|
/retest-required |
|
@djoshy: all tests passed! |
|
@djoshy: Jira Issue Verification Checks: Jira Issue OCPBUGS-86826
Jira Issue OCPBUGS-86826 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. |
|
@djoshy: new pull request created: #6226 |
|
Fix included in release 5.0.0-0.nightly-2026-06-24-062812 |
- What I did
- How to verify it
This is fairly hard to reproduce as it only happens due to network failures. At the very least, we should ensure that vsphere bootimage tests continue to work as expected.
Summary by CodeRabbit
New Features
Bug Fixes