
OCPBUGS-59958: move cleanUpDuplicatedMC to avoid double reboot on first updated Master node#6201

Merged
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
proietfb:OCPBUGS-59958_double_reboot_MC_cleanup
Jun 29, 2026

Conversation

@proietfb

@proietfb proietfb commented Jun 17, 2026

Member

Closes OCPBUGS-59958

What I did

Moved cleanUpDuplicatedMC to after the loop that creates/updates MachineConfigs.

How to verify it

  1. Deploy a stock cluster (without this fix)
  2. Apply a KubeletConfig with autoSizingReserved: true targeting the master pool and wait for the pool to reach UPDATED: True
  3. Corrupt the GeneratedByControllerVersionAnnotationKey annotation on 97-master-generated-kubelet with an arbitrary value to simulate the pre-upgrade state, then restart the machine-config-controller pod
  4. Without the fix: DELETED followed by ADDED events appear on 97-master-generated-kubelet, and a new rendered-master-* is created
  5. Apply the MCC image containing this fix and repeat steps 3–4
  6. With the fix: only MODIFIED events appear on 97-master-generated-kubelet, with no new rendered-master-*

Note: corrupting the annotation manually is necessary to simulate the version mismatch that occurs naturally during an MCO upgrade, when the new MCC binary carries a different GeneratedByControllerVersionAnnotationKey value than the one stored in the existing MC annotation.
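The pass/fail criterion in steps 4 and 6 can be sketched as a small check over the watch event types seen for 97-master-generated-kubelet. This is only an illustration: the function name classify and its return strings are hypothetical, and the event names are the standard Kubernetes watch event types.

```go
package main

import "fmt"

// classify inspects the ordered watch event types observed for one
// MachineConfig and reports whether the object was deleted and re-added.
// The function and its return values are hypothetical helpers for this sketch.
func classify(events []string) string {
	deleted := false
	for _, e := range events {
		switch e {
		case "DELETED":
			deleted = true
		case "ADDED":
			if deleted {
				// A DELETED followed by an ADDED means the MC was recreated,
				// which is the behaviour reported without the fix.
				return "recreated"
			}
		}
	}
	if deleted {
		return "deleted"
	}
	return "not recreated"
}

func main() {
	fmt.Println(classify([]string{"DELETED", "ADDED"})) // sequence from step 4
	fmt.Println(classify([]string{"MODIFIED"}))         // sequence from step 6
}
```

A "recreated" result corresponds to the case where a new rendered-master-* is expected to appear.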

Description for the changelog

Running the loop first guarantees that all existing MCs have their GeneratedByControllerVersionAnnotationKey annotation updated before cleanUpDuplicatedMC runs, preventing it from wrongly removing them.

cleanUpDuplicatedMC will only act on MCs not associated with any existing MachineConfigPool.

cleanUpDuplicatedMC's git history shows that its original position was after the create/update MCs loop and was moved to avoid a corner case related to an early exit when no cgroup v2 was present. Then, after defaulting cgroups, that corner case was removed.
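The effect of the reordering can be illustrated with a minimal in-memory simulation. It is a sketch, not the MCO code: the types machineConfig and store and the functions cleanUpStale, reconcilePools and simulate are hypothetical, and it assumes that the cleanup removes any MC whose version annotation does not match the running controller.

```go
package main

import "fmt"

// versionKey stands in for GeneratedByControllerVersionAnnotationKey.
const versionKey = "machineconfiguration.openshift.io/generated-by-controller-version"

// machineConfig is a simplified, hypothetical stand-in for a MachineConfig.
type machineConfig struct {
	Name        string
	Annotations map[string]string
}

// store is an in-memory substitute for the API server.
type store map[string]*machineConfig

// cleanUpStale deletes every MC whose version annotation differs from the
// current controller version (assumed behaviour of the cleanup step).
func cleanUpStale(s store, current string, events *[]string) {
	for name, mc := range s {
		if mc.Annotations[versionKey] != current {
			delete(s, name)
			*events = append(*events, "DELETED "+name)
		}
	}
}

// reconcilePools creates or updates one generated MC per pool and stamps it
// with the current controller version.
func reconcilePools(s store, pools []string, current string, events *[]string) {
	for _, pool := range pools {
		name := "97-" + pool + "-generated-kubelet"
		if mc, ok := s[name]; ok {
			mc.Annotations[versionKey] = current
			*events = append(*events, "MODIFIED "+name)
			continue
		}
		s[name] = &machineConfig{Name: name, Annotations: map[string]string{versionKey: current}}
		*events = append(*events, "ADDED "+name)
	}
}

// simulate runs one sync after an upgrade from version "old" to "new".
func simulate(cleanupFirst bool) []string {
	s := store{"97-master-generated-kubelet": {
		Name:        "97-master-generated-kubelet",
		Annotations: map[string]string{versionKey: "old"},
	}}
	var events []string
	if cleanupFirst {
		cleanUpStale(s, "new", &events)
	}
	reconcilePools(s, []string{"master"}, "new", &events)
	if !cleanupFirst {
		cleanUpStale(s, "new", &events)
	}
	return events
}

func main() {
	fmt.Println("cleanup before loop:", simulate(true))
	fmt.Println("cleanup after loop: ", simulate(false))
}
```

With the cleanup first, the sketch produces a DELETED followed by an ADDED event; with the cleanup last, only a MODIFIED event is produced, which matches the sequences listed in the verification steps.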

Summary by CodeRabbit

  • Bug Fixes
    • Improved kubelet configuration reconciliation by adjusting when duplicate machine configuration cleanup occurs, resulting in more predictable kubelet configuration updates across cluster pools.
  • Tests
    • Updated node configuration tests to reflect the updated machine configuration fetch behavior during reconciliation.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 17, 2026
@openshift-ci-robot

Contributor

@proietfb: This pull request references Jira Issue OCPBUGS-59958, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@coderabbitai

coderabbitai Bot commented Jun 17, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 830309d0-8c63-45aa-8b49-9745f20ea808

📥 Commits

Reviewing files that changed from the base of the PR and between 51e1b0d and 754cec0.

📒 Files selected for processing (1)
  • pkg/controller/kubelet-config/kubelet_config_nodes_test.go

Walkthrough

In syncNodeConfigHandler, the cleanUpDuplicatedMC(managedNodeConfigKeyPrefix) call is relocated from before the controller config and pool processing block to after the pool reconciliation loop finishes and immediately before the existing kubeletconfigs are synced. Test expectations are updated to verify the new call sequence.

Changes

Cleanup call reorder in syncNodeConfigHandler

| Layer / File(s) | Summary |
| --- | --- |
| Reorder cleanUpDuplicatedMC call and update test expectations (pkg/controller/kubelet-config/kubelet_config_nodes.go, pkg/controller/kubelet-config/kubelet_config_nodes_test.go) | cleanUpDuplicatedMC(managedNodeConfigKeyPrefix) is moved from before pool processing to after all pool reconciliation completes and before kubeletconfig syncing begins. Test expectations are updated to include an additional GET call for the worker machine config. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically references the bug fix (OCPBUGS-59958) and clearly summarizes the main change: relocating cleanUpDuplicatedMC to prevent double reboot on Master nodes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies kubelet_config_nodes_test.go which uses standard Go testing, not Ginkgo. No Ginkgo test definitions (It, Describe, Context, etc.) exist in the file, so the check is not applicable. |
| Test Structure And Quality | ✅ Passed | TestNodeConfigDefault follows Ginkgo test quality requirements: uses Go's standard t.Run for test isolation, employs fake clients for API interactions, assertions include meaningful context message... |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Changes are limited to refactoring existing controller code and updating unit test expectations in the kubelet_config_nodes_test.go file. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR only modifies unit tests in pkg/controller/kubelet-config/ using Go's standard testing framework, not new Ginkgo e2e tests. The custom check for SNO compatibility applies only to new Ginkgo e2e... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only controller reconciliation logic for MachineConfig ordering; no deployment manifests, pod scheduling constraints, affinity rules, or topology-aware code changes are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | Modified files are controller package code (not binary entry points) and test assertions, with no process-level stdout writes. Changes relocate a function call and update test expectations, introdu... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Only unit test mock expectations are updated in kubelet_config_nodes_test.go; check does not apply. |
| No-Weak-Crypto | ✅ Passed | PR contains no cryptographic operations. Changes are purely orchestration logic that reorders function calls in a Kubernetes controller, with no weak crypto, custom crypto, or secret comparison iss... |
| Container-Privileges | ✅ Passed | No container/Kubernetes manifests modified in this PR; check only applies to manifest files defining privileged container settings. |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR only moves cleanUpDuplicatedMC call between lines; no new logging statements were introduced that could expose passwords, tokens, API keys, PII, or sensitive data. |


@openshift-ci openshift-ci Bot requested review from umohnani8 and yuqi-zhang June 17, 2026 13:56
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2026

@yuqi-zhang yuqi-zhang left a comment

Contributor


I think this will probably fix the problem, but we did explicitly move this to the current location in #3563

With the commit message:

2. Execute the clean up of duplicate MCs in the kubeletCfg_NodeCfg controller before returning due to empty nodes.config spec

So I'm wondering if that's still relevant, i.e. if you do set a node.config spec, then remove it, is this properly handling it?

I was debating whether or not we should actually add an explicit check and delete instead of relying on the cleanUpDuplicatedMC code, since I'm not sure if it's doing it properly in current version clusters.

@isabella-janssen

Member

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 18, 2026
@openshift-ci-robot

Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-59958, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @asahay19


@openshift-ci openshift-ci Bot requested a review from asahay19 June 18, 2026 16:02
@proietfb

Member Author

I think this will probably fix the problem, but we did explicitly move this to the current location in #3563

From what I understand, this was true until MCO defaulted to cgroupv2 (#3972). Before that, if nodeCfg.Spec was empty, the sync function would return without errors, so if cleanUpDuplicatedMC had been placed after the for loop, it would never have run in that case. With that path removed from the sync function (#3972), the execution of cleanUpDuplicatedMC should be guaranteed.
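The control-flow argument above can be sketched as follows. This is a simplified model, not the real sync function: syncSketch and its three boolean parameters are hypothetical, and the step names only label the order in which work would happen.

```go
package main

import "fmt"

// syncSketch returns the steps executed by one sync, in order.
//   specEmpty        - the node config spec is empty
//   earlyReturn      - the sync returns early on an empty spec (the behaviour
//                      described for the code before #3972)
//   cleanupAfterLoop - the cleanup is placed after the pool loop
func syncSketch(specEmpty, earlyReturn, cleanupAfterLoop bool) []string {
	var steps []string
	if !cleanupAfterLoop {
		steps = append(steps, "cleanup")
	}
	if specEmpty && earlyReturn {
		// Return without error before the pool loop: anything placed after
		// the loop is never reached.
		return steps
	}
	steps = append(steps, "reconcile pools")
	if cleanupAfterLoop {
		steps = append(steps, "cleanup")
	}
	return steps
}

func main() {
	fmt.Println("early return, cleanup after loop: ", syncSketch(true, true, true))
	fmt.Println("early return, cleanup before loop:", syncSketch(true, true, false))
	fmt.Println("no early return, cleanup after:   ", syncSketch(true, false, true))
}
```

Under these assumptions, the cleanup is skipped only when an early return exists and the cleanup sits after the loop; once the early return is gone, placing the cleanup after the loop still guarantees it runs.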

So I'm wondering if that's still relevant, i.e. if you do set a node.config spec, then remove it, is this properly handling it?

Since there are no early returns without errors, I think the kubelet config nodes worker should cover this scenario, but this is not fully true for the controller worker. Unlike the others, for kubelet controller configs I'm seeing 2 more scenarios:

  • If a config has been added and then deleted, that config should not be orphaned due to a deletion or timeout.
  • If a config is part of a custom pool and that custom pool has been deleted, the related MCs (if not deleted by something else) will remain orphaned until the MCO is updated. At that point, those MCs will not carry the GeneratedByControllerVersionAnnotationKey annotation for the latest MCO version, so the orphaned MCs should be deleted because of the mismatched annotation.

So, in theory, if what I've said is true, the last scenario was already present even before #3972 and #3563. In that case, we can investigate further by opening a bug for it.

@yuqi-zhang what do you think?

@yuqi-zhang

Contributor

From what I'm understanding, this was true until MCO defaulted to cgroupv2 (#3972). Before that, if nodeCfg.Spec was empty, the sync function would return without errors. For this reason, if cleanUpDuplicatedMC was placed after the for loop, it would never run. By removing that Path from sync function (#3972) the execution of cleanUpDuplicatedMC should be guaranteed.

Ack, that makes sense, thanks for checking into it. (side note, it's not really just cleaning up duplicated MCs, so the name is kinda misleading, but not a big deal)

If a config is part of a custom pool and that custom pool has been deleted, the related MCs (if not deleted by something else), will remain orphaned until the MCO is updated. In that case, those MCs will not have the updated GeneratedByControllerVersionAnnotationKey annotation with the latest MCO version and orphaned MCs should be deleted because of a mismatched annotation.

If I understand correctly, you're saying that the controller running here to delete the additional MC doesn't key off of pool changes, so until something actually triggers a re-render, we have a bunch of stale MCs. I think this is probably fine since if the pool is gone, nothing would be using these MCs anyways. Since they eventually get cleaned up, it's up to you on whether you want to make this a followup issue or not.

Also, the unit test failures are legitimate (albeit not because you're doing anything wrong I think, just that the expected behaviour has changed?). You probably need to update https://github.com/openshift/machine-config-operator/blob/main/pkg/controller/kubelet-config/kubelet_config_nodes_test.go#L72-L75 based on an initial look. You may have one additional get() for some reason - may be best to check if that's expected first before adding it to the test.

@openshift-ci-robot

Contributor

@proietfb: This pull request references Jira Issue OCPBUGS-59958, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @asahay19


@proietfb

Member Author

If I understand correctly, you're saying that the controller running here to delete the additional MC doesn't key off of pool changes, so until something actually triggers a re-render, we have a bunch of stale MCs. I think this is probably fine since if the pool is gone, nothing would be using these MCs anyways.

Yes exactly

Also, the unit test failures are legitimate (albeit not because you're doing anything wrong I think, just that the expected behaviour has changed?). You probably need to update https://github.com/openshift/machine-config-operator/blob/main/pkg/controller/kubelet-config/kubelet_config_nodes_test.go#L72-L75 based on an initial look. You may have one additional get() for some reason - may be best to check if that's expected first before adding it to the test.

Thank you. Yes, moving the cleanup function after the loop requires an extra get() in the unit test expectations.

@yuqi-zhang yuqi-zhang left a comment

Contributor


/lgtm

Based on the conversation I think this is a fine backportable fix to get rid of the immediate problem. Long term I'd like us to consider something like https://redhat.atlassian.net/browse/MCO-2340 for this as well (way beyond the scope of this PR, just linking for context)

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op-ocl-part1
/test e2e-gcp-op-ocl-part2
/test e2e-gcp-op-part1
/test e2e-gcp-op-part2
/test e2e-gcp-op-single-node
/test e2e-hypershift

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: proietfb, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [proietfb,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuqi-zhang

Contributor

/cherry-pick release-4.22

@openshift-cherrypick-robot


@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.


@proietfb

Member Author

/test e2e-gcp-op-part2
/test e2e-aws-ovn-upgrade
/test e2e-hypershift

@proietfb

proietfb commented Jun 26, 2026

Member Author

/cherry-pick release-4.21

@openshift-cherrypick-robot


@proietfb: once the present PR merges, I will cherry-pick it on top of release-4.21 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.21 release-4.20 release-4.19 release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@proietfb

Member Author

/cherry-pick release-4.20
/cherry-pick release-4.19
/cherry-pick release-4.18

@openshift-cherrypick-robot


@proietfb: once the present PR merges, I will cherry-pick it on top of release-4.18, release-4.19, release-4.20 in new PRs and assign them to you.


@sergiordlr

sergiordlr commented Jun 29, 2026

Contributor

Verified using IPI on AWS.

Reproduce the issue

In a 5.0 cluster without the fix we executed the following steps

  1. Create a kubeletconfig with autoSizingReserved: true in the master pool
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: autosizing-reserved
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""


  2. Manually edit the machineconfiguration.openshift.io/generated-by-controller-version annotation in 97-master-generated-kubelet machineconfig.

  3. Remove the MCC pod and let MCO recreate it

  4. Check that the 97-master-generated-kubelet machineconfig is deleted and recreated

Verify the fix

In a 5.0 cluster with the fix we executed the following steps

  1. Create a kubeletconfig with autoSizingReserved: true in the master pool
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: autosizing-reserved
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""


  2. Manually edit the machineconfiguration.openshift.io/generated-by-controller-version annotation in 97-master-generated-kubelet machineconfig.

  3. Remove the MCC pod and let MCO recreate it

  4. Check that the 97-master-generated-kubelet machineconfig was NOT deleted, but updated.

Verify the fix with an actual upgrade

We executed the following steps

  1. Create a 4.22 cluster
  2. Create a kubeletconfig with autoSizingReserved: true in the master pool
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: autosizing-reserved
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  3. Upgrade the cluster to a 5.0+fix version
  4. Check that the nodes were only rebooted once

The upgrade window is 10:01:45Z – 11:06:45Z.

  Each last reboot line shows when a boot started. A reboot happened during the upgrade if the boot start time falls within that window.

  Masters:

  ┌────────────────┬────────────────────────────┬──────────────────────────┐
  │      Node      │         Boot times         │ Boots within 10:01–11:06 │
  ├────────────────┼────────────────────────────┼──────────────────────────┤
  │ ip-10-0-19-104 │ 08:39, 08:41, 09:26, 10:51 │ 1                        │
  ├────────────────┼────────────────────────────┼──────────────────────────┤
  │ ip-10-0-41-162 │ 08:39, 08:41, 09:32, 11:05 │ 1                        │
  ├────────────────┼────────────────────────────┼──────────────────────────┤
  │ ip-10-0-88-60  │ 08:39, 08:41, 09:37, 10:58 │ 1                        │
  └────────────────┴────────────────────────────┴──────────────────────────┘

  Workers:

  ┌────────────────┬─────────────────────┬──────────────────────────┐
  │      Node      │     Boot times      │ Boots within 10:01–11:06 │
  ├────────────────┼─────────────────────┼──────────────────────────┤
  │ ip-10-0-22-154 │ 08:47, 08:48, 10:47 │ 1                        │
  ├────────────────┼─────────────────────┼──────────────────────────┤
  │ ip-10-0-48-220 │ 08:47, 08:49, 10:50 │ 1                        │
  ├────────────────┼─────────────────────┼──────────────────────────┤
  │ ip-10-0-72-255 │ 08:47, 08:50, 10:53 │ 1                        │
  └────────────────┴─────────────────────┴──────────────────────────┘

  Every node rebooted exactly 1 time during the upgrade.

/verified by @sergiordlr

@sergiordlr

Contributor

/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 29, 2026
@openshift-ci-robot

Contributor

@sergiordlr: This PR has been marked as verified by @sergiordlr.


@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Contributor

@proietfb: all tests passed!


@openshift-merge-bot openshift-merge-bot Bot merged commit 6396296 into openshift:main Jun 29, 2026
17 checks passed
@openshift-ci-robot

Contributor

@proietfb: Jira Issue OCPBUGS-59958: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-59958 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

Closes OCPBUGS-59958

What I did

Moved cleanUpDuplicatedMC to after the loop that creates/updates MachineConfigs.

How to verify it

  1. Deploy a stock cluster (without this fix)
  2. Apply a KubeletConfig with autoSizingReserved: true targeting the master pool and wait for the pool to reach UPDATED: True
  3. Corrupt the GeneratedByControllerVersionAnnotationKey annotation on 97-master-generated-kubelet with an arbitrary value to simulate the pre-upgrade state, then restart the machine-config-controller pod
  4. Without the fix: DELETED followed by ADDED events appear on 97-master-generated-kubelet, and a new rendered-master-* is created
  5. Apply the MCC image containing this fix and repeat steps 3–4
  6. With the fix: only MODIFIED events appear on 97-master-generated-kubelet, with no new rendered-master-*

Note: corrupting the annotation manually is necessary to simulate the version mismatch that occurs naturally during an MCO upgrade, when the new MCC binary carries a different GeneratedByControllerVersionAnnotationKey value than the one stored in the existing MC annotation.

Description for the changelog

Running the loop first guarantees that all existing MCs have their GeneratedByControllerVersionAnnotationKey annotation updated before cleanUpDuplicatedMC runs, preventing it from wrongly removing them.

cleanUpDuplicatedMC will only act on MCs not associated with any existing MachineConfigPool.

cleanUpDuplicatedMC's git history shows that it originally ran after the create/update MCs loop and was later moved to avoid a corner case involving an early exit when cgroup v2 was not present. That corner case was removed after cgroups were defaulted.
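
The ordering argument above can be illustrated with a small, self-contained Go sketch. It is a toy model, not the controller's code: the `store` type, the `upsert` and `cleanup` helpers, the `versionKey` value, and the version strings are all hypothetical, and the sketch assumes that the cleanup step deletes any MachineConfig whose version annotation differs from the running controller's version.

```go
package main

import "fmt"

// versionKey is a hypothetical stand-in for the value of
// GeneratedByControllerVersionAnnotationKey.
const versionKey = "example.test/generated-by-controller-version"

const mcName = "97-master-generated-kubelet"

// store is a toy in-memory replacement for the MachineConfig API.
// It records the watch-style events each operation would produce.
type store struct {
	mcs    map[string]map[string]string // MC name -> annotations
	events []string
}

// upsert models one iteration of the create/update loop: it stamps the
// current controller version on the MC, creating the MC if it is missing.
func (s *store) upsert(name, version string) {
	if ann, ok := s.mcs[name]; ok {
		ann[versionKey] = version
		s.events = append(s.events, "MODIFIED "+name)
		return
	}
	s.mcs[name] = map[string]string{versionKey: version}
	s.events = append(s.events, "ADDED "+name)
}

// cleanup models the cleanup step under the stated assumption: any MC
// whose version annotation does not match the controller is deleted.
func (s *store) cleanup(version string) {
	for name, ann := range s.mcs {
		if ann[versionKey] != version {
			delete(s.mcs, name)
			s.events = append(s.events, "DELETED "+name)
		}
	}
}

// simulate starts from an MC stamped by an older controller and runs the
// two steps in the requested order, returning the recorded events.
func simulate(cleanupFirst bool) []string {
	s := &store{mcs: map[string]map[string]string{
		mcName: {versionKey: "old-version"},
	}}
	if cleanupFirst {
		s.cleanup("new-version")
		s.upsert(mcName, "new-version")
	} else {
		s.upsert(mcName, "new-version")
		s.cleanup("new-version")
	}
	return s.events
}

func main() {
	fmt.Println("cleanup first:", simulate(true))
	fmt.Println("loop first:   ", simulate(false))
}
```

In this model, running the cleanup first yields a DELETED event followed by an ADDED event for 97-master-generated-kubelet, while running the loop first yields a single MODIFIED event. That matches the difference described in the verification steps.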

Summary by CodeRabbit

  • Bug Fixes
      • Improved kubelet configuration reconciliation by adjusting when duplicate machine configuration cleanup occurs, resulting in more predictable kubelet configuration updates across cluster pools.
  • Tests
      • Updated node configuration tests to reflect the updated machine configuration fetch behavior during reconciliation.


@openshift-cherrypick-robot


@yuqi-zhang: new pull request created: #6240

Details

In response to this:

/cherry-pick release-4.22


@openshift-cherrypick-robot


@proietfb: new pull request created: #6241

Details

In response to this:

/cherry-pick release-4.21


@openshift-cherrypick-robot


@proietfb: new pull request created: #6242

Details

In response to this:

/cherry-pick release-4.20
/cherry-pick release-4.19
/cherry-pick release-4.18


@openshift-cherrypick-robot


@proietfb: new pull request created: #6243

Details

In response to this:

/cherry-pick release-4.20
/cherry-pick release-4.19
/cherry-pick release-4.18


@openshift-cherrypick-robot


@proietfb: #6201 failed to apply on top of branch "release-4.18":

Applying: OCPBUGS-59958: move cleanUpDuplicatedMC to after pool loop in syncNodeConfigHandler
Using index info to reconstruct a base tree...
M	pkg/controller/kubelet-config/kubelet_config_nodes.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/kubelet-config/kubelet_config_nodes.go
CONFLICT (content): Merge conflict in pkg/controller/kubelet-config/kubelet_config_nodes.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 OCPBUGS-59958: move cleanUpDuplicatedMC to after pool loop in syncNodeConfigHandler

Details

In response to this:

/cherry-pick release-4.20
/cherry-pick release-4.19
/cherry-pick release-4.18



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.
