Skip to content

OCPEDGE-2498: Add migration support#83

Open
eggfoobar wants to merge 3 commits into
openshift-eng:mainfrom
eggfoobar:add-migration-support
Open

OCPEDGE-2498: Add migration support#83
eggfoobar wants to merge 3 commits into
openshift-eng:mainfrom
eggfoobar:add-migration-support

Conversation

@eggfoobar

@eggfoobar eggfoobar commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Added support for transitioning a single-node cluster to a 3-node high-availability setup.
    • Added new command options for SNO deployment and cleanup in the help output.
    • Added a guided cleanup flow for reverting mutable topology changes.
  • Bug Fixes

    • Improved retry handling during project setup to better tolerate temporary startup delays.
    • Expanded topology validation and cluster readiness checks for SNO-based workflows.
    • Added DNS, node, and storage cleanup steps to keep cluster state consistent after changes.

eggfoobar and others added 3 commits June 25, 2026 07:01
Add a new Ansible role and make target to perform a one-shot migration
of a platform:None SNO cluster to a 3-node HA cluster. The role creates
two new control-plane VMs, boots them with RHCOS via coreos-installer
ISO ignition embed, waits for nodes to join and etcd to scale, then
transitions the topology to HighlyAvailable.

Usage: make sno-to-3node

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ehila <ehila@redhat.com>
Signed-off-by: ehila <ehila@redhat.com>
@openshift-ci-robot

openshift-ci-robot commented Jun 26, 2026

Copy link
Copy Markdown

@eggfoobar: This pull request references OCPEDGE-2498 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 26, 2026
@openshift-ci openshift-ci Bot requested review from jerpeter1 and qJkee June 26, 2026 06:20
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds SNO deployment and SNO-to-3node transition entrypoints, install-dev SNO support, role defaults and ignition generation, preflight checks, VM provisioning and boot tasks, topology and rollout waits, DNS refresh, verification, and cleanup.

Changes

SNO-to-3node workflow

Layer / File(s) Summary
Command entrypoints and playbooks
deploy/Makefile, deploy/openshift-clusters/scripts/*.sh, deploy/openshift-clusters/*.yml
Adds Make targets, shell wrappers, and playbooks for the SNO transition and cleanup flows.
Install-dev SNO support
deploy/openshift-clusters/roles/dev-scripts/install-dev/files/*, deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml, deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml, deploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.yml
Adds SNO install config examples, ignores config_sno.sh, retries oc project, and extends topology validation.
Role variables and ignition assets
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/*.j2, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
Defines SNO defaults, role variables, Ignition templates, and the MCS CA fetch task.
Preflight checks
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
Queries cluster state, node readiness, etcd, cluster operators, and api-int DNS before the transition.
VM creation
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
Removes stale libvirt state, creates VM disks, defines new VMs, and adds DHCP and DNS reservations.
VM boot and install
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
Discovers the RHCOS live ISO, embeds ignition into per-node ISOs, boots the VMs, waits for install completion, and returns them to disk boot.
Topology transition and rollout waits
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
Scales CVO down, patches the infrastructure topology, and waits for node, etcd, and MCO status changes.
DNS refresh and verification
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml, deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
Rewrites dnsmasq, updates /etc/hosts, and captures post-transition node, topology, etcd, CVO, and MCP state.
Cleanup tasks
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
Destroys the new VMs and removes their disks, NVRAM, ISOs, DHCP reservations, DNS entries, and leases.

Sequence Diagram(s)

sequenceDiagram
  participant Makefile
  participant sno_to_3node_sh as "sno-to-3node.sh"
  participant sno_to_3node_yml as "sno-to-3node.yml"
  participant mutable_topology_role as "mutable-topology/sno-to-3node"
  participant oc
  participant virsh

  Makefile->>sno_to_3node_sh: runs sno-to-3node target
  sno_to_3node_sh->>sno_to_3node_yml: ansible-playbook sno-to-3node.yml
  sno_to_3node_yml->>mutable_topology_role: import_tasks phases
  mutable_topology_role->>oc: preflight, topology patch, CSR approval, and verification
  mutable_topology_role->>virsh: create VMs, boot disks, and update libvirt DNS
  mutable_topology_role->>oc: wait for nodes, etcd, MCO rollout, and final cluster checks
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Suggested reviewers

  • slintes
  • fonta-rh
  • jaypoulz

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error New Ansible debug/echo output prints cluster domains, node IPs, hostnames, and api-int DNS targets without no_log protection. Remove or redact those messages, or gate them behind no_log/verbose-only output; avoid logging cluster domains, IPs, CSR names, and internal DNS records.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and matches the main change: adding migration support for the SNO-to-3-node flow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto ✅ Passed No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB or secret/token comparisons were found in the changed files.
Container-Privileges ✅ Passed No changed container/K8s manifests contain privileged, hostPID/Network/IPC, allowPrivilegeEscalation, SYS_ADMIN, or root-run settings.
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets found in touched files; scans found no literal tokens/passwords/private keys or embedded-credential URLs, and only a CI_TOKEN comment reference.
No-Injection-Vectors ✅ Passed No flagged injection APIs or patterns appear in the changed files; the new shell blocks only interpolate fixed role-owned vars and hardcoded data.
Ai-Attribution ✅ Passed No AI tools are mentioned in the commit/PR text, and the commit only has a Signed-off-by trailer—no Co-Authored-By, Assisted-by, or Generated-by found.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
deploy/Makefile (1)

122-130: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Document the new sno-ipi and sno-agent targets in make help.

Lines 79-83 add both commands, but this help block only advertises sno-to-3node. make help will miss the new entrypoints entirely.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/Makefile` around lines 122 - 130, Update the Makefile help output so
the new `sno-ipi` and `sno-agent` targets are listed alongside the existing
deployment entries in `make help`. Add matching `@echo` lines in the help block
near the `sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.
deploy/openshift-clusters/scripts/deploy-cluster.sh (1)

62-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject the unsupported sno + kcli combination explicitly.

This validation now allows --topology sno for every method, so a direct deploy-cluster.sh --topology sno --method kcli falls through to kcli-install.yml. The PR only adds SNO support through the install-dev/setup flow, so this opens an unsupported path instead of failing fast.

Suggested fix
 if [[ "${METHOD}" != "ipi" && "${METHOD}" != "agent" && "${METHOD}" != "kcli" ]]; then
     echo "Error: Invalid method '${METHOD}'. Must be 'ipi', 'agent', or 'kcli'."
     exit 1
 fi
+
+if [[ "${TOPOLOGY}" == "sno" && "${METHOD}" == "kcli" ]]; then
+    echo "Error: Topology 'sno' only supports 'ipi' or 'agent'."
+    exit 1
+fi

Also applies to: 87-104

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 62 - 69,
The validation in deploy-cluster.sh currently permits every METHOD with
TOPOLOGY=sno, which allows an unsupported sno + kcli path to continue into the
kcli install flow. Update the argument checks around the TOPOLOGY/METHOD
validation blocks to explicitly reject the unsupported combination before any
install logic runs, and keep the failure consistent with the existing validation
messages. Reference the validation section and the kcli-install path so the
guard is placed where the topology and method are first checked.
deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml (1)

26-32: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Include the SNO scenario in the missing-variable error.

Line 31 still only tells agent users to add TNF_IPV4 or TNA_IPV4. Now that sno is a supported prefix on Line 43, a missing AGENT_E2E_TEST_SCENARIO for SNO will produce the wrong remediation.

Proposed fix
-      AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing) or "TNA_IPV4" (arbiter) to
-      the config file.
+      AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing), "TNA_IPV4" (arbiter), or
+      "SNO_IPV4" (sno) to the config file.

Also applies to: 43-43

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`
around lines 26 - 32, The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new wrapper still uses #!/bin/bash, but the repo shell-script rule requires #!/usr/bin/bash. As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The
wrapper script currently uses the non-standard Bash shebang, so update the
script header to match the repository convention by changing the shebang at the
top of sno-to-3node.sh to the repo-standard Bash path. Keep the rest of the
script unchanged, and verify the file still starts with the expected shebang
followed by set -euo pipefail.

Source: Coding guidelines

deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new wrapper still uses #!/bin/bash, but the repo shell-script rule requires #!/usr/bin/bash. As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 -
2, The clean-mutable-topology.sh wrapper uses the wrong Bash shebang, so update
the script header to match the repository standard. Replace the current shebang
at the top of the file with the repo-required Bash interpreter line, keeping the
rest of the script unchanged; this is the only fix needed in the script’s
startup header.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-39: The cleanup playbook currently imports
mutable-topology/sno-to-3node tasks/clean.yml unconditionally, which can destroy
a healthy HA control plane. Add a topology check in clean-mutable-topology.yml
before the import_role so it only runs when the cluster is not already HA, or
require an explicit force_cleanup=true override to proceed. Use the existing
import_role task as the entry point and gate it with a condition or precheck
that prevents destructive cleanup unless the override is set.

In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 35-40: This example hardcodes an x86_64 release image and leaves
out the ARM64-specific Metal3 overrides, so update the config example to
explicitly document how to switch to aarch64/Graviton. Use
OPENSHIFT_RELEASE_IMAGE as the payload architecture selector and add the
required Metal3 container override guidance (IRONIC_IMAGE, VBMC_IMAGE,
SUSHY_TOOLS_IMAGE) in the example so the install path is correct for ARM64
hypervisors. Refer to the config_sno_example.sh example block and keep the
existing installer/image-policy comments consistent with the new ARM64 notes.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 68-72: The cleanup task in clean.yml is deleting the original SNO
node’s api-int DNS entry, which can break the surviving cluster after cleanup.
Update the [clean] Remove api-int from libvirt network DNS task to target only
mutable-topology-added resources associated with sno_new_nodes, and avoid
removing the api-int record for sno_master0_ip in the libvirt network update
logic.
- Around line 37-43: The embedded Python in the cleanup task scripts is
indented, which causes the inline `python3 -c` snippets used in `clean.yml` to
fail before the DHCP cleanup logic runs. Update the scripts in the task blocks
that use `EXISTING_MAC` and the lease cleanup command so the Python content is
left-aligned or otherwise dedented, and keep the logic under the same task names
so `|| true` / `failed_when: false` no longer hides a parse failure.
- Around line 26-30: The per-node ISO cleanup in the [clean] task is using a
hostname-based filename that can collide across clusters on the same hypervisor.
Update the ISO naming scheme in the related ISO creation and cleanup flow to use
a cluster-scoped identifier, such as item.name or a sno_cluster_name-prefixed
name, and make the file path in the cleanup task match that same unique pattern.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 27-32: The DNS cleanup in the `create-vms.yml` task is still
targeting `{{ item.ip }}`, so stale `dns-host` records for `master-1`/`master-2`
with old IPs are not removed before `virsh net-update add dns-host` runs. Update
the cleanup logic around the `virsh net-update` calls to delete entries by
hostname instead of by IP, and then tighten the `dig` verification so it checks
that the resolved answers actually contain `{{ item.ip }}` rather than only a
non-empty response.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 19-24: The MCS reachability check in the ignition task is
currently allowed to fail and continue, which defers a required configuration
error until boot. Update the existing "[ignition] Verify MCS reachability" step
so it aborts when the curl probe to the MCS health endpoint fails, and only keep
a non-failing path if you explicitly add a break-glass override. Use the
existing mcs_health probe logic as the place to enforce fail-fast behavior.
- Around line 13-18: The ignition generation step writes /tmp/master.ign but
never populates sno_master_ign_b64, so auto-install.ign.j2 cannot render
reliably. Update the relevant task flow in ignition.yml to read back the
generated master.ign and set sno_master_ign_b64 before the template that
consumes it, using the existing ignition generation and auto-install rendering
steps as the anchor points. Ensure the variable is produced within the role
rather than assuming an external caller supplies it.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 68-81: The MCO rollout polling in the mco-rollout task currently
suppresses timeout failures by using failed_when: false, which allows the play
to continue even if mcp_poll never reports completion. Update the
polling/verification logic in the mcp_poll task so a timeout is fatal by
default, or make the non-fatal continuation explicitly opt-in via a best-effort
mode. Keep the success check based on mcp_poll.stdout and ensure the final MCO
rollout result step reflects failure when the rollout does not complete.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 80-120: The preflight checks in the etcd probe and the
cluster-operator health check are only logging warnings instead of stopping the
topology transition, which makes this phase non-gating. Update the preflight
logic so the tasks for the etcd pod check and the degraded CO check fail the
play by default, removing the unconditional `|| true` and the `failed_when:
false` behavior; keep any bypass only behind an explicit break-glass override.
Use the existing task names like "[preflight] Check etcd pods" and "[preflight]
Check for degraded cluster operators" to locate the affected steps.
- Around line 129-154: The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 24-31: The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`:
- Around line 28-48: The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`:
- Around line 4-21: The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.

In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`:
- Around line 2-5: The kubeconfig path in sno_kubeconfig_resolved is incorrectly
derived from the default sno_cluster_name, so it can point to the wrong cluster
before discovery happens. Update the mutable-topology SNO-to-3node vars so
sno_cluster_name is not assumed here; instead require sno_cluster_name or
sno_kubeconfig to be provided up front, or change the resolution logic to use a
source that does not depend on the default value. Keep the fix localized to the
sno_kubeconfig_resolved expression and related variable handling.

---

Outside diff comments:
In `@deploy/Makefile`:
- Around line 122-130: Update the Makefile help output so the new `sno-ipi` and
`sno-agent` targets are listed alongside the existing deployment entries in
`make help`. Add matching `@echo` lines in the help block near the
`sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Around line 26-32: The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-69: The validation in deploy-cluster.sh currently permits every
METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to
continue into the kcli install flow. Update the argument checks around the
TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported
combination before any install logic runs, and keep the failure consistent with
the existing validation messages. Reference the validation section and the
kcli-install path so the guard is placed where the topology and method are first
checked.

---

Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The clean-mutable-topology.sh wrapper uses the wrong Bash
shebang, so update the script header to match the repository standard. Replace
the current shebang at the top of the file with the repo-required Bash
interpreter line, keeping the rest of the script unchanged; this is the only fix
needed in the script’s startup header.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The wrapper script currently uses the non-standard Bash
shebang, so update the script header to match the repository convention by
changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash
path. Keep the rest of the script unchanged, and verify the file still starts
with the expected shebang followed by set -euo pipefail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d1bb47f1-0e94-4c52-bf4c-ea4c9efe1020

📥 Commits

Reviewing files that changed from the base of the PR and between 903e071 and 497dfcc.

📒 Files selected for processing (27)
  • deploy/Makefile
  • deploy/openshift-clusters/clean-mutable-topology.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignore
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
  • deploy/openshift-clusters/scripts/clean-mutable-topology.sh
  • deploy/openshift-clusters/scripts/deploy-cluster.sh
  • deploy/openshift-clusters/scripts/sno-to-3node.sh
  • deploy/openshift-clusters/sno-to-3node.yml

Comment on lines +35 to +39
tasks:
- name: Run mutable topology cleanup
import_role:
name: mutable-topology/sno-to-3node
tasks_from: clean.yml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Refuse cleanup once the cluster is already HA unless the caller forces it.

This imports tasks/clean.yml unconditionally, and that role deletes master-1/master-2 plus their disks and DHCP state. On a successful 3-node cluster, removing two control-plane nodes here can drop etcd from 3 members to 1 and take the cluster down. Please gate this on the current control-plane topology, or require an explicit force_cleanup=true override before running the destructive role.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 39,
The cleanup playbook currently imports mutable-topology/sno-to-3node
tasks/clean.yml unconditionally, which can destroy a healthy HA control plane.
Add a topology check in clean-mutable-topology.yml before the import_role so it
only runs when the cluster is not already HA, or require an explicit
force_cleanup=true override to proceed. Use the existing import_role task as the
entry point and gate it with a condition or precheck that prevents destructive
cleanup unless the override is set.

Comment on lines +129 to +154
- name: "[preflight] Add api-int to libvirt network DNS"
# Using virsh net-update (not lineinfile) so that api-int survives addnhosts
# regeneration when create-vms.yml calls virsh net-update for master-1/2.
# libvirt regenerates ostestbm.addnhosts from the network XML on every net-update.
shell: |
# Remove any stale api-int entry (idempotent, ignore errors)
sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \
"<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \
--live --config 2>/dev/null || true
# Add the correct entry - triggers addnhosts regeneration with api-int included
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
changed_when: true

- name: "[preflight] Verify api-int DNS after fix"
shell: |
dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1
register: dns_apiint_verify
changed_when: false
failed_when: dns_apiint_verify.stdout | trim == ""
retries: 3
delay: 2
until: dns_apiint_verify.stdout | trim != ""
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The api-int repair path does not remove or verify stale mappings correctly.

The delete call matches api-int on {{ sno_master0_ip }}, so it only removes a record that already points at the desired IP. If api-int is stale and points somewhere else, the old XML entry survives, and the follow-up dig still passes because it only checks for any answer, not for {{ sno_master0_ip }}. Remove existing api-int entries by hostname/FQDN and verify the resolved answers include the expected IP.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 129 - 154, The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.

Comment on lines +24 to +31
- name: "[dns] Build NM dnsmasq config"
set_fact:
sno_dnsmasq_lines: |
{% for ip in sno_all_node_ips %}
address=/api.{{ sno_cluster_domain }}/{{ ip }}
address=/api-int.{{ sno_cluster_domain }}/{{ ip }}
{% endfor %}
address=/.apps.{{ sno_cluster_domain }}/{{ sno_all_node_ips[0] }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the target file and nearby role files.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/**' \
  'deploy/openshift-clusters/roles/**/tasks/**' \
  'deploy/openshift-clusters/roles/**/files/**' \
  'deploy/openshift-clusters/roles/**/templates/**' | sed -n '1,250p'

echo
echo '--- outline ---'
ast-grep outline deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml --view expanded || true

echo
echo '--- contents ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 9700


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the target file and nearby role files.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/**' \
  'deploy/openshift-clusters/roles/**/tasks/**' \
  'deploy/openshift-clusters/roles/**/files/**' \
  'deploy/openshift-clusters/roles/**/templates/**' | sed -n '1,250p'

echo
echo '--- outline ---'
ast-grep outline deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml --view expanded || true

echo
echo '--- contents ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 9700


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/roles/common \
  deploy/openshift-clusters/roles/dev-scripts/install-dev \
  deploy/openshift-clusters/roles/kcli/kcli-install \
  deploy/openshift-clusters/roles/assisted/assisted-spoke \
  deploy/openshift-clusters/roles/assisted/acm-install

echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml

echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml

echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 16571


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/roles/common \
  deploy/openshift-clusters/roles/dev-scripts/install-dev \
  deploy/openshift-clusters/roles/kcli/kcli-install \
  deploy/openshift-clusters/roles/assisted/assisted-spoke \
  deploy/openshift-clusters/roles/assisted/acm-install

echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml

echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml

echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 16571


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/roles/common \
  deploy/openshift-clusters/roles/dev-scripts/install-dev \
  deploy/openshift-clusters/roles/kcli/kcli-install \
  deploy/openshift-clusters/roles/assisted/assisted-spoke \
  deploy/openshift-clusters/roles/assisted/acm-install

echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml

echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml

echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 16571


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how ingress/router is handled for the SNO-to-3node transition.
rg -n "ingress|router|ingresscontroller|router pod|default ingress|apps\\." \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/docs \
  deploy/openshift-clusters/README.md \
  docs \
  README.md

echo
echo '--- file list around mutable-topology docs ---'
git ls-files 'deploy/openshift-clusters/**README*' 'deploy/openshift-clusters/**/*.md' | rg 'mutable-topology|sno-to-3node|openshift-clusters'

echo
echo '--- update-dns task with context already known ---'
sed -n '1,120p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 1012


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "IngressController|ingresscontroller|router-default|worker|master.*router|taint|schedul" \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/roles

echo
echo '--- mutable-topology role files ---'
git ls-files deploy/openshift-clusters/roles/mutable-topology/sno-to-3node

echo
echo '--- README candidates ---'
fd -a 'README.md' deploy/openshift-clusters/roles/mutable-topology deploy/openshift-clusters | sed -n '1,200p'

Repository: openshift-eng/two-node-toolbox

Length of output: 4257


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "IngressController|ingresscontroller|router-default|worker|master.*router|taint|schedul" \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/roles

echo
echo '--- mutable-topology role files ---'
git ls-files deploy/openshift-clusters/roles/mutable-topology/sno-to-3node

echo
echo '--- README candidates ---'
fd -a 'README.md' deploy/openshift-clusters/roles/mutable-topology deploy/openshift-clusters | sed -n '1,200p'

Repository: openshift-eng/two-node-toolbox

Length of output: 4257


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Gather the SNO-to-3node ingress/DNS intent from role docs and nearby logic.
rg -n "apps|ingress|router|vip|load balancer|load-balancer|dnsmasq" \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/README.md \
  deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

echo
echo '--- resolve_vm_ip.sh ---'
cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

echo
echo '--- clean.yml excerpt ---'
sed -n '60,90p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml

echo
echo '--- preflight.yml excerpt ---'
sed -n '118,155p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 11379


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Gather the SNO-to-3node ingress/DNS intent from role docs and nearby logic.
rg -n "apps|ingress|router|vip|load balancer|load-balancer|dnsmasq" \
  deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
  deploy/openshift-clusters/README.md \
  deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

echo
echo '--- resolve_vm_ip.sh ---'
cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

echo
echo '--- clean.yml excerpt ---'
sed -n '60,90p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml

echo
echo '--- preflight.yml excerpt ---'
sed -n '118,155p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 11379


Don't pin *.apps to sno_all_node_ips[0].
oc get nodes ordering is not guaranteed, so wildcard routes can jump between control-plane IPs and land on a node that isn't serving ingress. Use a stable ingress VIP/load balancer, or publish all ingress node IPs if that's the intended design.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`
around lines 24 - 31, The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.

Comment on lines +28 to +48
- name: "[etcd] Display etcd member list"
shell: |
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>/dev/null || echo "Could not query etcd members"
environment: "{{ sno_oc_env }}"
register: etcd_members
changed_when: false
failed_when: false

- name: "[etcd] Members"
debug:
msg: "{{ etcd_members.stdout }}"

- name: "[etcd] Display etcd endpoint health"
shell: |
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>/dev/null || echo "Could not query endpoint health"
environment: "{{ sno_oc_env }}"
register: etcd_health
changed_when: false
failed_when: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'

printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'

printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -S

Repository: openshift-eng/two-node-toolbox

Length of output: 259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'

printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'

printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -S

Repository: openshift-eng/two-node-toolbox

Length of output: 259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'

printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'

printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -S

Repository: openshift-eng/two-node-toolbox

Length of output: 259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- wait-etcd.yml ---'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'

echo
echo '--- task files under sno-to-3node ---'
fd -a -t f tasks deploy/openshift-clusters/roles/mutable-topology/sno-to-3node

echo
echo '--- references to topology.yml and etcd checks ---'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -S

Repository: openshift-eng/two-node-toolbox

Length of output: 230


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- wait-etcd.yml ---'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml

echo
echo '--- task files under sno-to-3node ---'
find deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -maxdepth 3 -type f | sort

echo
echo '--- references to topology.yml and etcd checks ---'
grep -RInE 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node

Repository: openshift-eng/two-node-toolbox

Length of output: 4602


Gate the transition on etcd health. The loop only waits for 3 Ready pods; the member list and endpoint health steps still ignore failures (failed_when: false / || echo ...). Since topology.yml runs immediately after this task, the role can proceed while etcd membership or quorum is still unhealthy. Make those checks fail the task or include them in the retry condition.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`
around lines 28 - 48, The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.

Comment on lines +4 to +21
PENDING=$(oc get csr -o json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
pending = []
for csr in data.get('items', []):
status = csr.get('status', {})
if not status.get('conditions'):
name = csr['metadata']['name']
pending.append(name)
for name in pending:
print(name)
" 2>/dev/null || true)

if [ -n "$PENDING" ]; then
for CSR in $PENDING; do
oc adm certificate approve "$CSR" 2>/dev/null || true
echo "Approved CSR: $CSR"
done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file and nearby context.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml'
wc -l deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml

# Find related uses of CSR approval / sno_new_nodes in this role and repo.
rg -n --hidden --no-ignore-vcs 'oc adm certificate approve|pending CSR|sno_new_nodes|csr' deploy/openshift-clusters/roles/mutable-topology -S

Repository: openshift-eng/two-node-toolbox

Length of output: 6346


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the role directory to locate adjacent tasks/files.
find deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -maxdepth 3 -type f | sort

# Show the full task file with line numbers if small enough.
stat -c '%n %s bytes %n' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 1520


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the variables that define the new nodes and nearby role flow.
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
printf '\n---\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 1943


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
printf '\n---\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 1943


Restrict CSR approval to the new masters only
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml approves every pending CSR returned by oc get csr. During the migration window, this can approve unrelated or stale requests. Limit approval to the kubelet client/server CSRs for sno_new_nodes before calling oc adm certificate approve.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`
around lines 4 - 21, The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.

Comment on lines +2 to +5
sno_dev_scripts_path: "{{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}"
sno_kubeconfig_resolved: >-
{{ sno_kubeconfig if sno_kubeconfig
else sno_dev_scripts_path ~ '/ocp/' ~ sno_cluster_name ~ '/auth/kubeconfig' }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

sno_cluster_name is not actually auto-detectable here.

deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml sets sno_cluster_name to ostest, and this file immediately bakes that into sno_kubeconfig_resolved. On any non-ostest deployment, the first oc call will use the wrong kubeconfig path and fail before the role can discover the real cluster identity. Require sno_cluster_name/sno_kubeconfig up front, or resolve kubeconfig from a source that does not assume the cluster name.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`
around lines 2 - 5, The kubeconfig path in sno_kubeconfig_resolved is
incorrectly derived from the default sno_cluster_name, so it can point to the
wrong cluster before discovery happens. Update the mutable-topology SNO-to-3node
vars so sno_cluster_name is not assumed here; instead require sno_cluster_name
or sno_kubeconfig to be provided up front, or change the resolution logic to use
a source that does not depend on the default value. Keep the fix localized to
the sno_kubeconfig_resolved expression and related variable handling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants