OCPBUGS-59958: move cleanUpDuplicatedMC to avoid double reboot on first updated Master node#6201
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@proietfb: This pull request references Jira Issue OCPBUGS-59958, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
WalkthroughIn ChangesCleanup call reorder in syncNodeConfigHandler
Estimated code review effort🎯 2 (Simple) | ⏱️ ~5 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
yuqi-zhang
left a comment
There was a problem hiding this comment.
I think this will probably fix the problem, but we did explicitly move this to the current location in #3563
With the commit message:
2. Execute the clean up of duplicate MCs in the kubeletCfg_NodeCfg controller before returning due to empty nodes.config spec
So I'm wondering if that's still relevant, i.e. if you do set a node.config spec, then remove it, is this properly handling it?
I was debating whether or not we should actually add an explicit check and delete instead of relying on the cleanUpDuplicatedMC code, since I'm not sure if its doing it properly in current version clusters.
|
/jira refresh |
|
@isabella-janssen: This pull request references Jira Issue OCPBUGS-59958, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
From what I'm understanding, this was true until MCO defaulted to cgroupv2 (#3972). Before that, if nodeCfg.Spec was empty, the sync function would return without errors. For this reason, if cleanUpDuplicatedMC was placed after the for loop, it would never run. By removing that Path from sync function (#3972) the execution of cleanUpDuplicatedMC should be guaranteed.
Since there are not early returns without errors, I think the kubelet config nodes worker should cover this scenario, but this is not fully true for controller worker. Unlike the others, for kubelet controller configs I'm seeing 2 more scenarios:
So, in theory, if what I've said it is true, the last scenario was still present even before #3972 and #3563. In that case, we can investigate further by opening a bug related to that case. @yuqi-zhang what do you think about? |
Ack, that makes sense, thanks for checking into it. (side note, it's not really just cleaning up duplicated MCs, so the name is kinda misleading, but not a big deal)
If I understand correctly, you're saying that the controller running here to delete the additional MC doesn't key off of pool changes, so until something actually triggers a re-render, we have a bunch of stale MCs. I think this is probably fine since if the pool is gone, nothing would be using these MCs anyways. Since they eventually get cleaned up, it's up to you on whether you want to make this a followup issue or not. Also, the unit test failures are legitimate (albeit not because you're doing anything wrong I think, just that the expected behaviour has changed?). You probably need to update https://github.com/openshift/machine-config-operator/blob/main/pkg/controller/kubelet-config/kubelet_config_nodes_test.go#L72-L75 based on an initial look. You may have one additional get() for some reason - may be best to check if that's expected first before adding it to the test. |
|
@proietfb: This pull request references Jira Issue OCPBUGS-59958, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Yes exactly
Thank you. Yes, moving the cleanup function after, requires an extra get() to unit tests. |
yuqi-zhang
left a comment
There was a problem hiding this comment.
/lgtm
Based on the conversation I think this is a fine backportable fix to get rid of the immediate problem. Long term I'd like us to consider something like https://redhat.atlassian.net/browse/MCO-2340 for this as well (way beyond the scope of this PR, just linking for context)
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: proietfb, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherry-pick release-4.22 |
|
@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test e2e-gcp-op-part2 |
|
/cherry-pick release-4.21 |
|
@proietfb: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-4.20 |
|
@proietfb: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Verified using IPI on AWS. Reproduce the issueIn a 5.0 cluster without the fix we executed the following steps
Verify the fixIn a 5.0 cluster with the fix we executed the following steps
Verify the fix with an actual upgradeWe executed the following steps
/verified by @sergiordlr |
|
/verified by @sergiordlr |
|
@sergiordlr: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@proietfb: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@proietfb: Jira Issue OCPBUGS-59958: Some pull requests linked via external trackers have merged: The following pull request, linked via external tracker, has not merged: All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-59958 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@yuqi-zhang: new pull request created: #6240 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@proietfb: new pull request created: #6241 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@proietfb: new pull request created: #6242 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@proietfb: new pull request created: #6243 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@proietfb: #6201 failed to apply on top of branch "release-4.18": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Closes OCPBUGS-59958
What I did
Moved
cleanUpDuplicatedMCto after the loop that creates/updates MachineConfigs.How to verify it
KubeletConfigwithautoSizingReserved: truetargeting the master pool and wait for the pool to reachUPDATED: TrueGeneratedByControllerVersionAnnotationKeyannotation on97-master-generated-kubeletwith an arbitrary value to simulate the pre-upgrade state, then restart themachine-config-controllerpodDELETEDfollowed byADDEDevents appear on97-master-generated-kubelet, and a newrendered-master-*is createdMODIFIEDevents appear on97-master-generated-kubelet, with no newrendered-master-*Note: corrupting the annotation manually is necessary to simulate the version mismatch that occurs naturally during an MCO upgrade, when the new MCC binary carries a different
GeneratedByControllerVersionAnnotationKeyvalue than the one stored in the existing MC annotation.Description for the changelog
Running the loop first guarantees that all existing MCs have their
GeneratedByControllerVersionAnnotationKeyannotation updated beforecleanUpDuplicatedMCruns, preventing it from wrongly removing them.cleanUpDuplicatedMCwill only act on MCs not associated with any existing MachineConfigPool.cleanUpDuplicatedMC's git history shows that its original position was after the create/update MCs loop and was moved to avoid a corner case related to an early exit when no cgroup v2 was present. Then, after defaulting cgroups, that corner case was removed.Summary by CodeRabbit