Skip to content

OCPBUGS-92182: Use providerSpec.Template in vSphere machineset reconciliation#6234

Open
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:vsphere-fd-fix
Open

OCPBUGS-92182: Use providerSpec.Template in vSphere machineset reconciliation#6234
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:vsphere-fd-fix

Conversation

@djoshy

@djoshy djoshy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

When a new failure domain is added, the controller always looked up the template by the computed name, ignoring providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As for NotFoundError, and creating the template from OVA when none exists. I broke out the resolution step into a separate function to reduce the cyclomatic complexity as the lint check failed(which is fair 😄).

- How to verify it
A machineset with a non standard vsphere template name(not matching the old name the MCO compute from the infra object) should be successfully reconciled.

Existing vSphere boot image e2es should also continue to pass.

Summary by CodeRabbit

  • Bug Fixes
    • Improved template discovery during boot image setup using a two-step lookup strategy for existing templates.
    • Added automatic recovery for rollback scenarios by restoring a previously renamed template when encountered.
    • When no suitable template is found, the system now creates the required template instead of failing.
    • Added validation to prevent template name length issues before template creation.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 25, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-92182, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

When a new failure domain is added, the controller always looked up the template by the computed name, ignoring providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As for NotFoundError, and creating the template from OVA when none exists. I broke out the resolution step into a separate function to reduce the cyclomatic complexity as the lint check failed(which is fair 😄 )

- How to verify it
A machineset with a non standard vsphere template name(not matching the old name the MCO compute from the infra object) should be successfully reconciled.

Existing vSphere boot image e2es should also continue to pass.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-92182, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

When a new failure domain is added, the controller always looked up the template by the computed name, ignoring providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As for NotFoundError, and creating the template from OVA when none exists. I broke out the resolution step into a separate function to reduce the cyclomatic complexity as the lint check failed(which is fair 😄).

- How to verify it
A machineset with a non standard vsphere template name(not matching the old name the MCO compute from the infra object) should be successfully reconciled.

Existing vSphere boot image e2es should also continue to pass.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b3c051c-1ed2-4d4d-8cb9-c42ff281d853

📥 Commits

Reviewing files that changed from the base of the PR and between e4c4d7d and 58a949d.

📒 Files selected for processing (1)
  • pkg/controller/bootimage/vsphere_helpers.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controller/bootimage/vsphere_helpers.go

Walkthrough

The PR adds a helper that resolves the target vSphere template by checking the configured name, then the computed name, and either renames a rollback VM or creates a new template from the OVA when lookup fails. The caller now uses that helper and returns early after creation.

Changes

vSphere template resolution

Layer / File(s) Summary
Resolve or create template
pkg/controller/bootimage/vsphere_helpers.go
resolveExistingTemplateVM now checks providerSpec.Template before the computed name, renames mco-old-* rollback VMs back to the target name, creates a new template from the downloaded OVA when needed, and validates the 80-character name limit.
Use helper in caller
pkg/controller/bootimage/vsphere_helpers.go
createNewVMTemplate now delegates lookup and recovery to resolveExistingTemplateVM, returns immediately when the helper creates a template, and uses the resolved template name for reconciliation decisions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • cheesesashimi
  • pablintino
🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly reflects the main change: vSphere machineset reconciliation now prefers providerSpec.Template.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR only changes pkg/controller/bootimage/vsphere_helpers.go, and that file contains no Ginkgo test titles or dynamic test-name strings.
Test Structure And Quality ✅ Passed No Ginkgo test code was added or modified; the package only has plain testing unit tests, so this check is not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the PR only changes bootimage helper logic in pkg/controller/bootimage/vsphere_helpers.go, which has no MicroShift-affected APIs.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR only changes vSphere helper logic; no new Ginkgo e2e tests or multi-node assumptions were added.
Topology-Aware Scheduling Compatibility ✅ Passed Only vSphere template-resolution logic changed; no pod scheduling, node selectors, replicas, affinity, or topology-dependent controllers/manifests were introduced.
Ote Binary Stdout Contract ✅ Passed The PR only changes helper code and adds klog logging; no fmt.Print/os.Stdout writes or main/suite stdout writes were introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the PR only changes bootimage vSphere helper logic, so IPv4/disconnected-network compatibility check is not applicable.
No-Weak-Crypto ✅ Passed No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB or custom crypto was added; the new code uses SHA-256 only and no secret/token comparisons.
Container-Privileges ✅ Passed PASS: The PR only changes vSphere bootimage Go logic; no container/K8s manifests or securityContext settings with privileged/host* fields were introduced.
No-Sensitive-Data-In-Logs ✅ Passed No logs emit passwords/tokens/PII; the new messages only include VM/template names and version strings, and credentials are passed to clients, not logged.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from ptalgulk01 and sergiordlr June 25, 2026 16:15
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 262-283: The rollback VM lookup in createTemplateForFailureDomain
treats any oldErr from finder.VirtualMachine as if the VM does not exist, which
can hide real vCenter failures and incorrectly trigger OVA recreation. Update
this branch to only take the “no existing template” path for a
*find.NotFoundError, matching the computed-name lookup behavior, and return
other errors immediately so transient lookup failures are surfaced instead of
falling back to createNewVMTemplateWithNameForFailureDomain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e445c5ba-c20b-48d5-9b12-d660d0468c8a

📥 Commits

Reviewing files that changed from the base of the PR and between 60b01ff and e4c4d7d.

📒 Files selected for processing (1)
  • pkg/controller/bootimage/vsphere_helpers.go

Comment thread pkg/controller/bootimage/vsphere_helpers.go
@djoshy

djoshy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

1 similar comment
@djoshy

djoshy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

@jcpowermac

Copy link
Copy Markdown
Contributor

@djoshy surprised that there are no unit tests

jcpowermac

This comment was marked as duplicate.

jcpowermac

This comment was marked as outdated.

@jcpowermac jcpowermac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Claude (AI Reviewer)

This review was performed by Claude Code (claude-opus-4-6) on behalf of @jcpowermac. I am the reviewer.

The refactor into resolveExistingTemplateVM is clean — extracting the resolution logic reduces cyclomatic complexity and the errors.As fix correctly handles govmomi error types. The create-from-OVA path for new failure domains fills the missing gap.

However there is a critical issue with what happens after the providerSpec.Template lookup succeeds — the caller still returns the computed name, which will patch the MachineSet to point at a VM that doesn't exist. See inline comments for details.

Findings summary:

  1. 🔴 Critical: providerSpec.Template lookup succeeds but caller overwrites with non-existent computed name
  2. 🟡 Silent error swallowing on providerSpec.Template lookup failure
  3. 🟡 Hardcoded "thin" disk type for new templates
  4. ⚪ Typo in doc comment

Comment thread pkg/controller/bootimage/vsphere_helpers.go Outdated
// already has a valid template doesn't fail just because the infra computed name isn't there yet.
if providerSpec.Template != "" && providerSpec.Template != name {
if tmplVM, tmplErr := finder.VirtualMachine(ctx, providerSpec.Template); tmplErr == nil {
return tmplVM, false, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Review — Critical] When this branch succeeds and returns the VM found via providerSpec.Template, the caller in createNewVMTemplate still has name set to the computed {infraID}-rhcos-{failureDomainName}. The caller proceeds to check the RHCOS version and at line 621-623 does:

if providerSpec.Template != name {
    return name, true, nil
}

This patches the MachineSet's template to the computed name — but no VM with that name exists in vSphere. The MachineSet gets updated to reference a non-existent template, which will cause the same failure on the next reconciliation.

The fix needs to either:

  1. Return the resolved name alongside the VM (e.g. add a resolvedName string return), so the caller uses the name that actually matched, or
  2. Have the caller skip the providerSpec.Template != name divergence check when the template was resolved via providerSpec.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

if providerSpec.Template != "" && providerSpec.Template != name {
if tmplVM, tmplErr := finder.VirtualMachine(ctx, providerSpec.Template); tmplErr == nil {
return tmplVM, false, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Review] If finder.VirtualMachine fails with a non-NotFoundError (e.g., permission denied, network timeout), the error is silently swallowed and we fall through to the computed name lookup. This could mask real infrastructure issues and make debugging harder. Consider at least logging:

klog.Warningf("Failed to look up providerSpec.Template %s, falling back to computed name: %v", providerSpec.Template, tmplErr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valid.

vCenter won't always be available (upgrade - maintenance, etc.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with a log line.

return nil, false, fmt.Errorf("failed to download %s: %w", ova.Location, ovaErr)
}
if createErr := createNewVMTemplateWithNameForFailureDomain(ctx, providerSpec, failureDomain, finder, client, tagManager, name, ovaPath, infraID, "thin"); createErr != nil {
return nil, false, createErr

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Review] The existing update path uses getDiskTypeFromExistingVM(vmMo) to preserve the disk type from the current template. Hardcoding "thin" here for new failure domains may not match the cluster's convention (e.g., thick or eagerZeroedThick). Might be fine as a default but worth either a comment explaining why, or reading the disk type from a sibling failure domain's existing template if one was already processed in this loop.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose in this scenario no template exists anywhere. I am unsure if we should select disk type for a user.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should we just error out instead of creating a new VM?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Review] I checked the API types — disk type is not persisted anywhere in the OpenShift API. It is not in VSphereMachineProviderSpec, Infrastructure spec, or the failure domain topology. The install-config diskType is consumed by the installer at template creation time but never written to a cluster object that MCO can read.

So getDiskTypeFromExistingVM() is the only way to determine disk type at runtime — by inspecting an existing VM backing. For a brand new failure domain with no template, the options are:

  1. Read it from a sibling failure domain template — createNewVMTemplate loops over all failure domains, so if one was already resolved earlier in the loop with an existing VM, its disk type could be captured and reused here. This preserves whatever the installer originally set.
  2. Error out rather than guessing — force the user to ensure the template exists before adding the failure domain, since MCO cannot know the intended disk type.

Given that picking the wrong disk type could have real storage/performance implications and MCO should not be making that choice for the user, erroring out (option 2) with a clear message seems safer. Alternatively, option 1 is reasonable if we can assume all failure domains in a cluster use the same disk type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djoshy I am concerned about erroring, I have seen issues in testing when mco goes degraded for the entire cluster, other operators themselves become degraded as a result - I need to investigate this further.

When a new failure domain is added, the controller always looked up the template by the computed name, ignoring providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As for NotFoundError, and creating the template from OVA when none exists.
@djoshy

djoshy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@djoshy surprised that there are no unit tests

It probably wasn't trivial to unit test govmomi calls without additional vendors when it was originally written. E2Es are required for feature promotion, so we probably decided to prioritize those instead. Although at this point, the amount of vSphere carveouts in the boot image controller warrants it 😅 ....but I'd rather not block this bug fix on that.

@vr4manta

Copy link
Copy Markdown
Contributor

/test ?

@vr4manta

Copy link
Copy Markdown
Contributor

/test e2e-vsphere e2e-vsphere-ovn-zones e2e-vsphere-ovn-upi

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-upi 58a949d link false /test e2e-vsphere-ovn-upi

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants