OCPBUGS-88333: Fix bootupd workaround in old nodes#6199
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@pablintino: This pull request references Jira Issue OCPBUGS-88333, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughOS base version parsing now returns numeric major and minor components. Bootloader updates use a new gated update path with a direct node-level attempt for RHEL 9.6+ and container fallback. RHEL target-root binary selection now compares integer major versions. ChangesBootloader and version handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/daemon/bootupd.go`:
- Around line 360-370: The direct call to runCmdSync("bootupctl", "update") does
not properly execute in a systemd context and lacks the INVOCATION_ID
environment variable that bootupctl requires to pass systemd checks. Modify the
bootupctl command invocation to run through systemd-run by changing the command
arguments passed to runCmdSync so that bootupctl executes within a systemd
context, similar to how the container-based fallback path handles INVOCATION_ID
injection. This ensures the node-level bootupctl update attempt will succeed on
RHEL 9.6+ nodes instead of prematurely falling back to the container path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2938802f-abaf-4248-8fb0-bacb0c6e05d4
📒 Files selected for processing (5)
pkg/daemon/bootupd.gopkg/daemon/daemon.gopkg/daemon/osrelease/osrelease.gopkg/daemon/osrelease/osrelease_test.gopkg/daemon/update.go
bfffcd4 to
d9d2cc9
Compare
|
Generally makes sense to me from the MCO POV, I'll leave tagging to coreos folks cc @travier |
|
I think the fastest way to test this would be:
|
|
/jira refresh |
|
@isabella-janssen: This pull request references Jira Issue OCPBUGS-88333, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
travier
left a comment
There was a problem hiding this comment.
I have not tested this but the logic/code LGTM. Thanks!
|
/lgtm |
|
Scheduling tests matching the |
|
/retest-required |
|
Pre-merge verification- Setup: Steps:
|
|
@pablintino: This pull request references Jira Issue OCPBUGS-88333, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Pre-merge re-verification- Setup: Steps:
/label qe-approved |
|
@ptalgulk01: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| if major > 9 || (major == 9 && minor >= 6) { | ||
| logSystem("runBootloaderUpdate: RHEL 9.6+ detected, attempting node-level bootloader update via bootloader-update.service") | ||
| var err error | ||
| if err = runCmdSync("systemctl", "start", "bootloader-update.service"); err == nil { |
There was a problem hiding this comment.
I think this will essentially never fail as the return code here is not about the process launched but whether or not systemd successfully launched the service.
There was a problem hiding this comment.
Instead I think you need to start the unit like you do here, then wait until it completes (or maybe use the --wait option?) and then check it's status again. If anything failed, it should show up in the status.
There was a problem hiding this comment.
I initially tried with --wait but because the service sets RemainAfterExit=yes; it blocks forever. I thought it being a oneshot unit will cause it to return the correct error code?
There was a problem hiding this comment.
$ cat ~/.config/systemd/user/test.service
[Service]
Type=oneshot
ExecStart=bash -c "sleep 2; exit 1"
RemainAfterExit=yes
# Keep this stuff in sync with SYSTEMD_ARGS_BOOTUPD in general
PrivateNetwork=yes
ProtectHome=yes
KillMode=mixed
MountFlags=slave
$ systemctl --user daemon-reload
$ systemctl --user start test.service
$ echo $?
0
There was a problem hiding this comment.
OK, never mind, it does fail as expected. I had a previous remaining instance so systemd was not starting it again.
There was a problem hiding this comment.
So maybe this should be restart instead and then we can add the --wait back. But not sure if that would matter in this case. And if we have to restart it then it means that it already ran, so there should be nothing to do. So maybe we leave this as is.
That looks weird. There should be more output like the versions here.
The ESP is not mounted by default in RHCOS. The following should do it without changing the mount points in the main namespace:
This is weird.
Should not matter here. |
The shim version check (shimIsSafe) looked at the RPM database in /usr, which reflects the system image. For nodes born from older releases, the shim in /usr is up to date but the ESP still contains old bootloader binaries, causing the workaround to be incorrectly skipped. Replace the shim version check with a node-level bootupctl update on RHEL 9.6+ before falling back to the container-based approach. Signed-off-by: Pablo Rodriguez Nava <git@amail.pablintino.eu>
|
/lgtm |
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pablintino, travier The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required /verified by @ptalgulk01 |
|
@djoshy: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
/retest-required |
|
@pablintino: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@pablintino: Jira Issue Verification Checks: Jira Issue OCPBUGS-88333 Jira Issue OCPBUGS-88333 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.22 |
|
@djoshy: new pull request created: #6239 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-28-044905 |
Closes: #OCPBUGS-88333
- What I did
The shim version check (shimIsSafe) looked at the RPM database in /usr, which reflects the system image. For nodes born from older releases, the shim in /usr is up to date but the ESP still contains old bootloader binaries, causing the workaround to be incorrectly skipped.
Replace the shim version check with a node-level bootupctl update on RHEL 9.6+ before falling back to the container-based approach.
- How to verify it
- Description for the changelog
Fix bootloader update being skipped on nodes with stale bootloader binaries
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests