OCPBUGS-86643: Skip control-plane label migration on External topolog…#6205

Open
harsh-thakare wants to merge 1 commit into openshift:main from harsh-thakare:hthakare-OCPBUGS-86643

Conversation

@harsh-thakare harsh-thakare commented Jun 18, 2026

Jira

https://issues.redhat.com/browse/OCPBUGS-86643

Problem

After upgrading HyperShift HostedClusters from 4.20.x to 4.21.x, dedicated
worker nodes incorrectly receive node-role.kubernetes.io/control-plane.
Hub compact nodes that are real control-plane nodes are expected to carry
both master and control-plane labels; guest HostedCluster worker nodes
must not.

Customer environment:

  • HUB upgraded: 4.20.8 → 4.20.22 → 4.21.15
  • HostedClusters upgraded to 4.21.16
  • NodePool objects exist on the HUB (clusters namespace), not inside guest clusters
  • Incorrect control-plane label observed on existing HostedCluster worker nodes

Root cause

MCO reconcileMaster() calls updateMasterNodeControlPlaneLabel() (added for
OCPBUGS-58180) to migrate legacy master-to-control-plane labels. This runs
without excluding External topology clusters (HyperShift guest clusters),
where there is no in-cluster control plane.

Fix

Skip reconcileMaster() when ControllerConfig.spec.infra.status.controlPlaneTopology == External.
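
The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `node` type, the label constants, and `migrateControlPlaneLabel` are hypothetical stand-ins, and the `TopologyMode` values are redeclared locally (assumed to match the OpenShift config API strings) so the example compiles on its own.

```go
package main

import "fmt"

// TopologyMode is redeclared here for illustration; the values are assumed
// to match the control-plane topology strings in the OpenShift config API.
type TopologyMode string

const (
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	ExternalTopologyMode        TopologyMode = "External"
)

const (
	masterLabel       = "node-role.kubernetes.io/master"
	controlPlaneLabel = "node-role.kubernetes.io/control-plane"
)

// node is a hypothetical stand-in for a Kubernetes Node object.
type node struct {
	name   string
	labels map[string]string
}

// migrateControlPlaneLabel is a hypothetical stand-in for the label
// migration. It returns early on External topology, where the cluster has
// no in-cluster control plane, and otherwise adds the control-plane label
// to nodes that carry the legacy master label. It returns the number of
// nodes it changed.
func migrateControlPlaneLabel(topology TopologyMode, nodes []*node) int {
	if topology == ExternalTopologyMode {
		return 0 // HyperShift guest cluster: nothing to migrate
	}
	changed := 0
	for _, n := range nodes {
		if _, isMaster := n.labels[masterLabel]; !isMaster {
			continue
		}
		if _, has := n.labels[controlPlaneLabel]; has {
			continue
		}
		n.labels[controlPlaneLabel] = ""
		changed++
	}
	return changed
}

func main() {
	nodes := []*node{{name: "node-0", labels: map[string]string{masterLabel: ""}}}
	fmt.Println("External:", migrateControlPlaneLabel(ExternalTopologyMode, nodes))
	fmt.Println("HighlyAvailable:", migrateControlPlaneLabel(HighlyAvailableTopologyMode, nodes))
}
```

Putting the topology check first keeps the skip path free of any node reads or writes, which is the property the unit tests in this PR are described as asserting.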

Files changed

  • pkg/controller/node/node_controller.go — guard in reconcileMaster()
  • pkg/controller/node/node_controller_test.go — unit tests

Local testing

Environment: macOS, branch ocpbugs-86643-skip-cp-label-external-topology, commit 73db7395

$ cd machine-config-operator
$ go test ./pkg/controller/node/... -run 'TestReconcileMaster' -count=1 -v
=== RUN   TestReconcileMasterSkipsExternalTopology
=== PAUSE TestReconcileMasterSkipsExternalTopology
=== RUN   TestReconcileMasterAddsControlPlaneLabel
=== PAUSE TestReconcileMasterAddsControlPlaneLabel
=== CONT  TestReconcileMasterSkipsExternalTopology
=== CONT  TestReconcileMasterAddsControlPlaneLabel
--- PASS: TestReconcileMasterAddsControlPlaneLabel (0.20s)
--- PASS: TestReconcileMasterSkipsExternalTopology (0.20s)
PASS
ok  	github.com/openshift/machine-config-operator/pkg/controller/node	1.267s
$ go test ./pkg/controller/node/... -count=1
ok  	github.com/openshift/machine-config-operator/pkg/controller/node	2.630s

Summary by CodeRabbit

  • Bug Fixes

    • Improved node controller efficiency by skipping unnecessary reconciliation steps for external topology cluster configurations.
    • Fixed control plane label management to occur only after topology verification.
  • Tests

    • Added test coverage for node reconciliation behavior across different cluster topology modes.
    • Added test validation for simultaneous off-cluster and on-cluster machine configuration layering.

…y clusters

HostedCluster (HyperShift) worker nodes must not receive the
node-role.kubernetes.io/control-plane label during upgrade. External
topology clusters have no in-cluster control plane; MCO master
reconciliation should not run label migration there.
https://issues.redhat.com/browse/OCPBUGS-86643

Signed-off-by: Harshal Thakare <hthakare@redhat.com>
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Jun 18, 2026
@openshift-ci-robot

Contributor

@harsh-thakare: This pull request references Jira Issue OCPBUGS-86643, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026


Walkthrough

reconcileMaster gains an early-return guard that fetches ControllerConfig and skips control-plane label reconciliation when ControlPlaneTopology is ExternalTopologyMode (HyperShift hosted clusters). Two new unit tests cover both topology branches. Separately, a new extended e2e test verifies that off-cluster and on-cluster layering can coexist on the same MachineConfigPool.

Changes

ExternalTopology guard in reconcileMaster

Layer: ExternalTopology guard implementation and unit tests
File(s): pkg/controller/node/node_controller.go, pkg/controller/node/node_controller_test.go
Summary: reconcileMaster now calls ccLister.Get to retrieve ControllerConfig, returns early (without calling updateMasterNodeControlPlaneLabel) when ControlPlaneTopology is ExternalTopologyMode, and logs an error on fetch failure. Two new unit tests verify no patch/update actions are emitted for ExternalTopologyMode and that a ControlPlaneLabel-containing patch is emitted for HighlyAvailableTopologyMode.

OCB off-cluster + on-cluster layering coexistence e2e test

Layer: OCB coexistence e2e test
File(s): test/extended-priv/mco_ocb.go
Summary: Adds a new Skipped:Disconnected Ginkgo test that orchestrates off-cluster image build/apply, OCL enablement via MachineOSConfig, dual-marker verification, off-cluster MachineConfig deletion, new MachineOSBuild wait, and final assertions that off-cluster content is removed while on-cluster content persists.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • openshift/machine-config-operator#6043: Directly related — the new Ginkgo test in test/extended-priv/mco_ocb.go for off-cluster and on-cluster layering coexistence overlaps with work introduced in that PR in the same file.

Suggested reviewers

  • dkhater-redhat
🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation | Resolution
Test Structure And Quality | ⚠️ Warning | The new Ginkgo test in test/extended-priv/mco_ocb.go at line 281 defers DisableOCL(mosc) without checking its error return value, violating the "Never ignore error returns" Go guideline. The same r... | Line 281: Replace defer DisableOCL(mosc) with defer func() { o.Expect(DisableOCL(mosc)).To(o.Succeed(), "failed to disable OCL for pool %s", mcp.GetName()) }() to properly handle cleanup errors.
Microshift Test Compatibility | ⚠️ Warning | New test uses MachineOSConfig/MachineOSBuild (machineconfiguration.openshift.io) APIs not available on MicroShift and lacks required skip protections. | Add [apigroup:machineconfiguration.openshift.io] tag to test name or wrap test with exutil.IsMicroShiftCluster() check to skip on MicroShift.
✅ Passed checks (13 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title directly addresses the main change: skipping control-plane label migration on External topology clusters (HyperShift), matching the primary objective of the PR.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | The new Ginkgo test in mco_ocb.go has a stable, deterministic name: "[PolarionID:88202][Skipped:Disconnected] Both off-cluster and on-cluster layering can coexist on the same pool [Disruptive]". It...
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | New Ginkgo test (PolarionID:88202) does not make multi-node or HA assumptions. Uses GetCompactCompatiblePool() which intelligently adapts to SNO topology and operates on single node only.
Topology-Aware Scheduling Compatibility | ✅ Passed | This PR adds topology-aware scheduling improvements. The main fix explicitly checks ControlPlaneTopology == ExternalTopologyMode to skip control-plane label application for HyperShift clusters. N...
Ote Binary Stdout Contract | ✅ Passed | PR changes contain no non-JSON stdout writes in process-level code (main, init, TestMain, BeforeSuite, etc). Controller logging uses klog with logtostderr=true, and all test code is within Ginkgo t...
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The new Ginkgo test in test/extended-priv/mco_ocb.go is correctly marked with [Skipped:Disconnected], which ensures it skips on IPv6-only disconnected environments where external connectivity to qu...
No-Weak-Crypto | ✅ Passed | No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB) or custom crypto implementations detected. PR adds topology checks and tests with no cryptographic code.
Container-Privileges | ✅ Passed | The PR modifies only Go source and test files (node_controller.go, node_controller_test.go, and mco_ocb.go). No Kubernetes manifests or container specifications with privilege settings (privileged:...
No-Sensitive-Data-In-Logs | ✅ Passed | All logging statements in the PR are safe: error logs contain standard Kubernetes errors, and test logs only contain object names, file paths, and digest hashes—no passwords, tokens, API keys, PII,...

Warning

⚠️ This pull request shows signs of AI-generated slop (description_diff_mismatch). It has been flagged by CodeRabbit slop detection and should be reviewed carefully.

openshift-ci Bot added the needs-ok-to-test label Jun 18, 2026
@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Contributor

Hi @harsh-thakare. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harsh-thakare
Once this PR has been reviewed and has the lgtm label, please assign ptalgulk01 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/node/node_controller_test.go (1)

1639-1644: ⚡ Quick win

Handle both node patch and update actions in the assertion

Line 1639 only checks core.PatchAction; if node mutation is emitted as update, this test can fail even when behavior is correct. Consider accepting both verbs and validating label presence from either payload/object.

Suggested test adjustment
  patched := false
  for _, action := range filterInformerActions(f.kubeclient.Actions()) {
-   patchAction, ok := action.(core.PatchAction)
-   if !ok || action.GetResource().Resource != "nodes" {
-     continue
-   }
-   if strings.Contains(string(patchAction.GetPatch()), ControlPlaneLabel) {
-     patched = true
-     break
-   }
+   if action.GetResource().Resource != "nodes" {
+     continue
+   }
+   switch a := action.(type) {
+   case core.PatchAction:
+     if strings.Contains(string(a.GetPatch()), ControlPlaneLabel) {
+       patched = true
+     }
+   case core.UpdateAction:
+     if n, ok := a.GetObject().(*corev1.Node); ok {
+       _, patched = n.Labels[ControlPlaneLabel]
+     }
+   }
+   if patched {
+     break
+   }
  }
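
As a standalone illustration of the suggestion above, the following sketch accepts either a patch or an update when looking for the label. The `action`, `patchAction`, and `updateAction` types are hypothetical stand-ins for the fake-client action types used in the real test; only the type-switch pattern is the point here.

```go
package main

import (
	"fmt"
	"strings"
)

const controlPlaneLabel = "node-role.kubernetes.io/control-plane"

// action is a hypothetical stand-in for a recorded fake-client action.
type action interface {
	Resource() string
}

// patchAction models a recorded patch carrying a raw patch payload.
type patchAction struct {
	resource string
	patch    []byte
}

func (p patchAction) Resource() string { return p.resource }

// updateAction models a recorded update carrying the updated object's labels.
type updateAction struct {
	resource string
	labels   map[string]string
}

func (u updateAction) Resource() string { return u.resource }

// labelApplied reports whether any recorded node action, patch or update,
// carries the control-plane label.
func labelApplied(actions []action) bool {
	for _, a := range actions {
		if a.Resource() != "nodes" {
			continue
		}
		switch v := a.(type) {
		case patchAction:
			if strings.Contains(string(v.patch), controlPlaneLabel) {
				return true
			}
		case updateAction:
			if _, ok := v.labels[controlPlaneLabel]; ok {
				return true
			}
		}
	}
	return false
}

func main() {
	recorded := []action{
		updateAction{resource: "nodes", labels: map[string]string{controlPlaneLabel: ""}},
	}
	fmt.Println(labelApplied(recorded))
}
```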
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/node/node_controller_test.go` around lines 1639 - 1644, The
test assertion at line 1639 only checks for core.PatchAction when verifying node
mutations, but if the controller emits an update action instead of a patch
action, the test fails incorrectly. Modify the type assertion to accept both
core.PatchAction and core.UpdateAction (or whichever update action type is
used). For PatchAction, continue checking the patch content via GetPatch(), and
for UpdateAction, check the ControlPlaneLabel in the object field being updated.
Set patched to true when the label is found in either action type's relevant
payload.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 27ce11ea-e4d5-472b-8348-57644378bbb6

📥 Commits

Reviewing files that changed from the base of the PR and between d0c3a5e and ebabb0f.

📒 Files selected for processing (3)
  • pkg/controller/node/node_controller.go
  • pkg/controller/node/node_controller_test.go
  • test/extended-priv/mco_ocb.go

exutil.By("Enable on-cluster layering (OCL) with a containerFile")
mosc, err := CreateMachineOSConfigUsingExternalOrInternalRegistry(oc.AsAdmin(), MachineConfigNamespace, mcp.GetName(), mcp.GetName(), containerFiles)
o.Expect(err).NotTo(o.HaveOccurred(), "Error creating MachineOSConfig for %s", mcp.GetName())
defer DisableOCL(mosc)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the ignored error return and the function signature
rg -n 'defer\s+DisableOCL\(' test/extended-priv/mco_ocb.go
rg -n '^func\s+DisableOCL\(.+\)\s+error' test/extended-priv/mco_ocb.go

Repository: openshift/machine-config-operator

Length of output: 243


🏁 Script executed:

# Get context on which lines are new in the PR by checking the ranges
# The review mentions lines 238-337 as new test code
# Check if lines 176, 198 are before the new block and 382 is after
sed -n '170,390p' test/extended-priv/mco_ocb.go | cat -n | grep -E '(176|198|281|382):|defer.*DisableOCL'

Repository: openshift/machine-config-operator

Length of output: 204


🏁 Script executed:

# Also check how errors are handled with other deferred calls in the file to understand the pattern
rg -B2 -A2 'defer.*\(' test/extended-priv/mco_ocb.go | head -60

Repository: openshift/machine-config-operator

Length of output: 3051


Handle deferred DisableOCL errors instead of dropping them.

Line 281 defers DisableOCL(mosc) but ignores its error return. If cleanup fails, the spec can leak OCL state and destabilize later tests. Assert this deferred cleanup result explicitly.

Proposed fix
-		defer DisableOCL(mosc)
+		defer func() {
+			o.Expect(DisableOCL(mosc)).To(o.Succeed(), "failed to disable OCL for pool %s", mcp.GetName())
+		}()

Go guideline: **/*.go — "Never ignore error returns".
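
Outside a Ginkgo spec, the same idea (not dropping a deferred cleanup error) might look like the sketch below. `disableFeature` and `runStep` are hypothetical names standing in for `DisableOCL` and the test body; the example assumes Go 1.20 or later for `errors.Join`.

```go
package main

import (
	"errors"
	"fmt"
)

// disableFeature is a hypothetical cleanup function that can fail,
// standing in for a call such as DisableOCL.
func disableFeature(fail bool) error {
	if fail {
		return errors.New("cleanup failed")
	}
	return nil
}

// runStep runs its body and then the deferred cleanup. The cleanup error is
// joined into the named return value instead of being silently discarded.
func runStep(failCleanup bool) (err error) {
	defer func() {
		if cerr := disableFeature(failCleanup); cerr != nil {
			err = errors.Join(err, fmt.Errorf("disable feature: %w", cerr))
		}
	}()
	// ... the step's real work would go here ...
	return nil
}

func main() {
	fmt.Println("clean run:", runStep(false))
	fmt.Println("failing cleanup:", runStep(true))
}
```

A bare `defer disableFeature(...)` would compile but discard the returned error, which is the situation the review comment flags.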

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended-priv/mco_ocb.go` at line 281, The deferred call to
DisableOCL(mosc) on line 281 is ignoring the error return value that the
function provides. Replace the simple defer statement with a deferred anonymous
function that explicitly captures and handles the error returned by
DisableOCL(mosc) by either logging it, storing it for assertion, or using an
error handling pattern consistent with the test's cleanup strategy, ensuring
that any cleanup failures are properly tracked rather than silently dropped.

Source: Coding guidelines

The change in test/extended-priv/mco_ocb.go appears unrelated to this PR.
Could you explain why this file is included?

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

3 participants