OCPEDGE-2498: Add migration support#83
Conversation
Add a new Ansible role and make target to perform a one-shot migration of a platform:None SNO cluster to a 3-node HA cluster. The role creates two new control-plane VMs, boots them with RHCOS via coreos-installer ISO ignition embed, waits for nodes to join and etcd to scale, then transitions the topology to HighlyAvailable. Usage: make sno-to-3node Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ehila <ehila@redhat.com>
Signed-off-by: ehila <ehila@redhat.com>
|
@eggfoobar: This pull request references OCPEDGE-2498 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: eggfoobar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAdds SNO deployment and SNO-to-3node transition entrypoints, install-dev SNO support, role defaults and ignition generation, preflight checks, VM provisioning and boot tasks, topology and rollout waits, DNS refresh, verification, and cleanup. ChangesSNO-to-3node workflow
Sequence Diagram(s)sequenceDiagram
participant Makefile
participant sno_to_3node_sh as "sno-to-3node.sh"
participant sno_to_3node_yml as "sno-to-3node.yml"
participant mutable_topology_role as "mutable-topology/sno-to-3node"
participant oc
participant virsh
Makefile->>sno_to_3node_sh: runs sno-to-3node target
sno_to_3node_sh->>sno_to_3node_yml: ansible-playbook sno-to-3node.yml
sno_to_3node_yml->>mutable_topology_role: import_tasks phases
mutable_topology_role->>oc: preflight, topology patch, CSR approval, and verification
mutable_topology_role->>virsh: create VMs, boot disks, and update libvirt DNS
mutable_topology_role->>oc: wait for nodes, etcd, MCO rollout, and final cluster checks
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90+ minutes Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
deploy/Makefile (1)
122-130: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winDocument the new
sno-ipiandsno-agenttargets inmake help.Lines 79-83 add both commands, but this help block only advertises
sno-to-3node.make helpwill miss the new entrypoints entirely.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/Makefile` around lines 122 - 130, Update the Makefile help output so the new `sno-ipi` and `sno-agent` targets are listed alongside the existing deployment entries in `make help`. Add matching `@echo` lines in the help block near the `sno-to-3node` entry, using the same wording/style as the other target descriptions so the new entrypoints are discoverable.deploy/openshift-clusters/scripts/deploy-cluster.sh (1)
62-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReject the unsupported
sno + kclicombination explicitly.This validation now allows
--topology snofor every method, so a directdeploy-cluster.sh --topology sno --method kclifalls through tokcli-install.yml. The PR only adds SNO support through the install-dev/setup flow, so this opens an unsupported path instead of failing fast.Suggested fix
if [[ "${METHOD}" != "ipi" && "${METHOD}" != "agent" && "${METHOD}" != "kcli" ]]; then echo "Error: Invalid method '${METHOD}'. Must be 'ipi', 'agent', or 'kcli'." exit 1 fi + +if [[ "${TOPOLOGY}" == "sno" && "${METHOD}" == "kcli" ]]; then + echo "Error: Topology 'sno' only supports 'ipi' or 'agent'." + exit 1 +fiAlso applies to: 87-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 62 - 69, The validation in deploy-cluster.sh currently permits every METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to continue into the kcli install flow. Update the argument checks around the TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported combination before any install logic runs, and keep the failure consistent with the existing validation messages. Reference the validation section and the kcli-install path so the guard is placed where the topology and method are first checked.deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml (1)
26-32: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winInclude the SNO scenario in the missing-variable error.
Line 31 still only tells agent users to add
TNF_IPV4orTNA_IPV4. Now thatsnois a supported prefix on Line 43, a missingAGENT_E2E_TEST_SCENARIOfor SNO will produce the wrong remediation.Proposed fix
- AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing) or "TNA_IPV4" (arbiter) to - the config file. + AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing), "TNA_IPV4" (arbiter), or + "SNO_IPV4" (sno) to the config file.Also applies to: 43-43
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml` around lines 26 - 32, The missing-AGENT_E2E_TEST_SCENARIO failure message in the config validation task should be updated to include the SNO option now supported by the method-prefix logic. Adjust the fail message in the config validation flow so it lists the correct remediation for all supported prefixes, including the new sno case alongside the existing agent scenarios, and make sure the guidance matches the prefix handling used in the task that selects by method.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new wrapper still uses
#!/bin/bash, but the repo shell-script rule requires#!/usr/bin/bash. As per coding guidelines, "Shell scripts should use#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The wrapper script currently uses the non-standard Bash shebang, so update the script header to match the repository convention by changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash path. Keep the rest of the script unchanged, and verify the file still starts with the expected shebang followed by set -euo pipefail.Source: Coding guidelines
deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new wrapper still uses
#!/bin/bash, but the repo shell-script rule requires#!/usr/bin/bash. As per coding guidelines, "Shell scripts should use#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 - 2, The clean-mutable-topology.sh wrapper uses the wrong Bash shebang, so update the script header to match the repository standard. Replace the current shebang at the top of the file with the repo-required Bash interpreter line, keeping the rest of the script unchanged; this is the only fix needed in the script’s startup header.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-39: The cleanup playbook currently imports
mutable-topology/sno-to-3node tasks/clean.yml unconditionally, which can destroy
a healthy HA control plane. Add a topology check in clean-mutable-topology.yml
before the import_role so it only runs when the cluster is not already HA, or
require an explicit force_cleanup=true override to proceed. Use the existing
import_role task as the entry point and gate it with a condition or precheck
that prevents destructive cleanup unless the override is set.
In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 35-40: This example hardcodes an x86_64 release image and leaves
out the ARM64-specific Metal3 overrides, so update the config example to
explicitly document how to switch to aarch64/Graviton. Use
OPENSHIFT_RELEASE_IMAGE as the payload architecture selector and add the
required Metal3 container override guidance (IRONIC_IMAGE, VBMC_IMAGE,
SUSHY_TOOLS_IMAGE) in the example so the install path is correct for ARM64
hypervisors. Refer to the config_sno_example.sh example block and keep the
existing installer/image-policy comments consistent with the new ARM64 notes.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 68-72: The cleanup task in clean.yml is deleting the original SNO
node’s api-int DNS entry, which can break the surviving cluster after cleanup.
Update the [clean] Remove api-int from libvirt network DNS task to target only
mutable-topology-added resources associated with sno_new_nodes, and avoid
removing the api-int record for sno_master0_ip in the libvirt network update
logic.
- Around line 37-43: The embedded Python in the cleanup task scripts is
indented, which causes the inline `python3 -c` snippets used in `clean.yml` to
fail before the DHCP cleanup logic runs. Update the scripts in the task blocks
that use `EXISTING_MAC` and the lease cleanup command so the Python content is
left-aligned or otherwise dedented, and keep the logic under the same task names
so `|| true` / `failed_when: false` no longer hides a parse failure.
- Around line 26-30: The per-node ISO cleanup in the [clean] task is using a
hostname-based filename that can collide across clusters on the same hypervisor.
Update the ISO naming scheme in the related ISO creation and cleanup flow to use
a cluster-scoped identifier, such as item.name or a sno_cluster_name-prefixed
name, and make the file path in the cleanup task match that same unique pattern.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 27-32: The DNS cleanup in the `create-vms.yml` task is still
targeting `{{ item.ip }}`, so stale `dns-host` records for `master-1`/`master-2`
with old IPs are not removed before `virsh net-update add dns-host` runs. Update
the cleanup logic around the `virsh net-update` calls to delete entries by
hostname instead of by IP, and then tighten the `dig` verification so it checks
that the resolved answers actually contain `{{ item.ip }}` rather than only a
non-empty response.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 19-24: The MCS reachability check in the ignition task is
currently allowed to fail and continue, which defers a required configuration
error until boot. Update the existing "[ignition] Verify MCS reachability" step
so it aborts when the curl probe to the MCS health endpoint fails, and only keep
a non-failing path if you explicitly add a break-glass override. Use the
existing mcs_health probe logic as the place to enforce fail-fast behavior.
- Around line 13-18: The ignition generation step writes /tmp/master.ign but
never populates sno_master_ign_b64, so auto-install.ign.j2 cannot render
reliably. Update the relevant task flow in ignition.yml to read back the
generated master.ign and set sno_master_ign_b64 before the template that
consumes it, using the existing ignition generation and auto-install rendering
steps as the anchor points. Ensure the variable is produced within the role
rather than assuming an external caller supplies it.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 68-81: The MCO rollout polling in the mco-rollout task currently
suppresses timeout failures by using failed_when: false, which allows the play
to continue even if mcp_poll never reports completion. Update the
polling/verification logic in the mcp_poll task so a timeout is fatal by
default, or make the non-fatal continuation explicitly opt-in via a best-effort
mode. Keep the success check based on mcp_poll.stdout and ensure the final MCO
rollout result step reflects failure when the rollout does not complete.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 80-120: The preflight checks in the etcd probe and the
cluster-operator health check are only logging warnings instead of stopping the
topology transition, which makes this phase non-gating. Update the preflight
logic so the tasks for the etcd pod check and the degraded CO check fail the
play by default, removing the unconditional `|| true` and the `failed_when:
false` behavior; keep any bypass only behind an explicit break-glass override.
Use the existing task names like "[preflight] Check etcd pods" and "[preflight]
Check for degraded cluster operators" to locate the affected steps.
- Around line 129-154: The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 24-31: The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`:
- Around line 28-48: The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`:
- Around line 4-21: The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.
In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`:
- Around line 2-5: The kubeconfig path in sno_kubeconfig_resolved is incorrectly
derived from the default sno_cluster_name, so it can point to the wrong cluster
before discovery happens. Update the mutable-topology SNO-to-3node vars so
sno_cluster_name is not assumed here; instead require sno_cluster_name or
sno_kubeconfig to be provided up front, or change the resolution logic to use a
source that does not depend on the default value. Keep the fix localized to the
sno_kubeconfig_resolved expression and related variable handling.
---
Outside diff comments:
In `@deploy/Makefile`:
- Around line 122-130: Update the Makefile help output so the new `sno-ipi` and
`sno-agent` targets are listed alongside the existing deployment entries in
`make help`. Add matching `@echo` lines in the help block near the
`sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Around line 26-32: The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.
In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-69: The validation in deploy-cluster.sh currently permits every
METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to
continue into the kcli install flow. Update the argument checks around the
TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported
combination before any install logic runs, and keep the failure consistent with
the existing validation messages. Reference the validation section and the
kcli-install path so the guard is placed where the topology and method are first
checked.
---
Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The clean-mutable-topology.sh wrapper uses the wrong Bash
shebang, so update the script header to match the repository standard. Replace
the current shebang at the top of the file with the repo-required Bash
interpreter line, keeping the rest of the script unchanged; this is the only fix
needed in the script’s startup header.
In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The wrapper script currently uses the non-standard Bash
shebang, so update the script header to match the repository convention by
changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash
path. Keep the rest of the script unchanged, and verify the file still starts
with the expected shebang followed by set -euo pipefail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d1bb47f1-0e94-4c52-bf4c-ea4c9efe1020
📒 Files selected for processing (27)
deploy/Makefiledeploy/openshift-clusters/clean-mutable-topology.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignoredeploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.shdeploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.ymldeploy/openshift-clusters/scripts/clean-mutable-topology.shdeploy/openshift-clusters/scripts/deploy-cluster.shdeploy/openshift-clusters/scripts/sno-to-3node.shdeploy/openshift-clusters/sno-to-3node.yml
| tasks: | ||
| - name: Run mutable topology cleanup | ||
| import_role: | ||
| name: mutable-topology/sno-to-3node | ||
| tasks_from: clean.yml |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Refuse cleanup once the cluster is already HA unless the caller forces it.
This imports tasks/clean.yml unconditionally, and that role deletes master-1/master-2 plus their disks and DHCP state. On a successful 3-node cluster, removing two control-plane nodes here can drop etcd from 3 members to 1 and take the cluster down. Please gate this on the current control-plane topology, or require an explicit force_cleanup=true override before running the destructive role.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 39,
The cleanup playbook currently imports mutable-topology/sno-to-3node
tasks/clean.yml unconditionally, which can destroy a healthy HA control plane.
Add a topology check in clean-mutable-topology.yml before the import_role so it
only runs when the cluster is not already HA, or require an explicit
force_cleanup=true override to proceed. Use the existing import_role task as the
entry point and gate it with a condition or precheck that prevents destructive
cleanup unless the override is set.
| - name: "[preflight] Add api-int to libvirt network DNS" | ||
| # Using virsh net-update (not lineinfile) so that api-int survives addnhosts | ||
| # regeneration when create-vms.yml calls virsh net-update for master-1/2. | ||
| # libvirt regenerates ostestbm.addnhosts from the network XML on every net-update. | ||
| shell: | | ||
| # Remove any stale api-int entry (idempotent, ignore errors) | ||
| sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \ | ||
| "<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \ | ||
| --live --config 2>/dev/null || true | ||
| # Add the correct entry - triggers addnhosts regeneration with api-int included | ||
| sudo virsh net-update {{ sno_libvirt_network }} add dns-host \ | ||
| "<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \ | ||
| --live --config | ||
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout | ||
| changed_when: true | ||
|
|
||
| - name: "[preflight] Verify api-int DNS after fix" | ||
| shell: | | ||
| dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1 | ||
| register: dns_apiint_verify | ||
| changed_when: false | ||
| failed_when: dns_apiint_verify.stdout | trim == "" | ||
| retries: 3 | ||
| delay: 2 | ||
| until: dns_apiint_verify.stdout | trim != "" | ||
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
The api-int repair path does not remove or verify stale mappings correctly.
The delete call matches api-int on {{ sno_master0_ip }}, so it only removes a record that already points at the desired IP. If api-int is stale and points somewhere else, the old XML entry survives, and the follow-up dig still passes because it only checks for any answer, not for {{ sno_master0_ip }}. Remove existing api-int entries by hostname/FQDN and verify the resolved answers include the expected IP.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 129 - 154, The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.
| - name: "[dns] Build NM dnsmasq config" | ||
| set_fact: | ||
| sno_dnsmasq_lines: | | ||
| {% for ip in sno_all_node_ips %} | ||
| address=/api.{{ sno_cluster_domain }}/{{ ip }} | ||
| address=/api-int.{{ sno_cluster_domain }}/{{ ip }} | ||
| {% endfor %} | ||
| address=/.apps.{{ sno_cluster_domain }}/{{ sno_all_node_ips[0] }} |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the target file and nearby role files.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/**' \
'deploy/openshift-clusters/roles/**/tasks/**' \
'deploy/openshift-clusters/roles/**/files/**' \
'deploy/openshift-clusters/roles/**/templates/**' | sed -n '1,250p'
echo
echo '--- outline ---'
ast-grep outline deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml --view expanded || true
echo
echo '--- contents ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 9700
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the target file and nearby role files.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/**' \
'deploy/openshift-clusters/roles/**/tasks/**' \
'deploy/openshift-clusters/roles/**/files/**' \
'deploy/openshift-clusters/roles/**/templates/**' | sed -n '1,250p'
echo
echo '--- outline ---'
ast-grep outline deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml --view expanded || true
echo
echo '--- contents ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 9700
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/roles/common \
deploy/openshift-clusters/roles/dev-scripts/install-dev \
deploy/openshift-clusters/roles/kcli/kcli-install \
deploy/openshift-clusters/roles/assisted/assisted-spoke \
deploy/openshift-clusters/roles/assisted/acm-install
echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 16571
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/roles/common \
deploy/openshift-clusters/roles/dev-scripts/install-dev \
deploy/openshift-clusters/roles/kcli/kcli-install \
deploy/openshift-clusters/roles/assisted/assisted-spoke \
deploy/openshift-clusters/roles/assisted/acm-install
echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 16571
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect neighboring topology and DNS logic for the intended apps target.
rg -n "sno_all_node_ips|apps\\.|dnsmasq|ingress|router|api-int|api\\." \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/roles/common \
deploy/openshift-clusters/roles/dev-scripts/install-dev \
deploy/openshift-clusters/roles/kcli/kcli-install \
deploy/openshift-clusters/roles/assisted/assisted-spoke \
deploy/openshift-clusters/roles/assisted/acm-install
echo
echo '--- vars/main.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
echo
echo '--- topology.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
echo
echo '--- verify.yml ---'
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 16571
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check how ingress/router is handled for the SNO-to-3node transition.
rg -n "ingress|router|ingresscontroller|router pod|default ingress|apps\\." \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/docs \
deploy/openshift-clusters/README.md \
docs \
README.md
echo
echo '--- file list around mutable-topology docs ---'
git ls-files 'deploy/openshift-clusters/**README*' 'deploy/openshift-clusters/**/*.md' | rg 'mutable-topology|sno-to-3node|openshift-clusters'
echo
echo '--- update-dns task with context already known ---'
sed -n '1,120p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 1012
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "IngressController|ingresscontroller|router-default|worker|master.*router|taint|schedul" \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/roles
echo
echo '--- mutable-topology role files ---'
git ls-files deploy/openshift-clusters/roles/mutable-topology/sno-to-3node
echo
echo '--- README candidates ---'
fd -a 'README.md' deploy/openshift-clusters/roles/mutable-topology deploy/openshift-clusters | sed -n '1,200p'Repository: openshift-eng/two-node-toolbox
Length of output: 4257
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "IngressController|ingresscontroller|router-default|worker|master.*router|taint|schedul" \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/roles
echo
echo '--- mutable-topology role files ---'
git ls-files deploy/openshift-clusters/roles/mutable-topology/sno-to-3node
echo
echo '--- README candidates ---'
fd -a 'README.md' deploy/openshift-clusters/roles/mutable-topology deploy/openshift-clusters | sed -n '1,200p'Repository: openshift-eng/two-node-toolbox
Length of output: 4257
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Gather the SNO-to-3node ingress/DNS intent from role docs and nearby logic.
rg -n "apps|ingress|router|vip|load balancer|load-balancer|dnsmasq" \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/README.md \
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
echo
echo '--- resolve_vm_ip.sh ---'
cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
echo
echo '--- clean.yml excerpt ---'
sed -n '60,90p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
echo
echo '--- preflight.yml excerpt ---'
sed -n '118,155p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 11379
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Gather the SNO-to-3node ingress/DNS intent from role docs and nearby logic.
rg -n "apps|ingress|router|vip|load balancer|load-balancer|dnsmasq" \
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node \
deploy/openshift-clusters/README.md \
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
echo
echo '--- resolve_vm_ip.sh ---'
cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
echo
echo '--- clean.yml excerpt ---'
sed -n '60,90p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
echo
echo '--- preflight.yml excerpt ---'
sed -n '118,155p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 11379
Don't pin *.apps to sno_all_node_ips[0].
oc get nodes ordering is not guaranteed, so wildcard routes can jump between control-plane IPs and land on a node that isn't serving ingress. Use a stable ingress VIP/load balancer, or publish all ingress node IPs if that's the intended design.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`
around lines 24 - 31, The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.
| - name: "[etcd] Display etcd member list" | ||
| shell: | | ||
| ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) | ||
| oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>/dev/null || echo "Could not query etcd members" | ||
| environment: "{{ sno_oc_env }}" | ||
| register: etcd_members | ||
| changed_when: false | ||
| failed_when: false | ||
|
|
||
| - name: "[etcd] Members" | ||
| debug: | ||
| msg: "{{ etcd_members.stdout }}" | ||
|
|
||
| - name: "[etcd] Display etcd endpoint health" | ||
| shell: | | ||
| ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) | ||
| oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>/dev/null || echo "Could not query endpoint health" | ||
| environment: "{{ sno_oc_env }}" | ||
| register: etcd_health | ||
| changed_when: false | ||
| failed_when: false |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'
printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'
printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -SRepository: openshift-eng/two-node-toolbox
Length of output: 259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'
printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'
printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -SRepository: openshift-eng/two-node-toolbox
Length of output: 259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '--- wait-etcd.yml ---\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'
printf '\n--- role task files ---\n'
fd -a -t f 'tasks' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'
printf '\n--- references to topology.yml and wait-etcd ---\n'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -SRepository: openshift-eng/two-node-toolbox
Length of output: 259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- wait-etcd.yml ---'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml | sed -n '1,220p'
echo
echo '--- task files under sno-to-3node ---'
fd -a -t f tasks deploy/openshift-clusters/roles/mutable-topology/sno-to-3node
echo
echo '--- references to topology.yml and etcd checks ---'
rg -n 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -SRepository: openshift-eng/two-node-toolbox
Length of output: 230
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- wait-etcd.yml ---'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
echo
echo '--- task files under sno-to-3node ---'
find deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -maxdepth 3 -type f | sort
echo
echo '--- references to topology.yml and etcd checks ---'
grep -RInE 'topology\.yml|wait-etcd|etcdctl endpoint health|member list|Display etcd' deploy/openshift-clusters/roles/mutable-topology/sno-to-3nodeRepository: openshift-eng/two-node-toolbox
Length of output: 4602
Gate the transition on etcd health. The loop only waits for 3 Ready pods; the member list and endpoint health steps still ignore failures (failed_when: false / || echo ...). Since topology.yml runs immediately after this task, the role can proceed while etcd membership or quorum is still unhealthy. Make those checks fail the task or include them in the retry condition.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`
around lines 28 - 48, The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.
| PENDING=$(oc get csr -o json 2>/dev/null | python3 -c " | ||
| import json, sys | ||
| data = json.load(sys.stdin) | ||
| pending = [] | ||
| for csr in data.get('items', []): | ||
| status = csr.get('status', {}) | ||
| if not status.get('conditions'): | ||
| name = csr['metadata']['name'] | ||
| pending.append(name) | ||
| for name in pending: | ||
| print(name) | ||
| " 2>/dev/null || true) | ||
|
|
||
| if [ -n "$PENDING" ]; then | ||
| for CSR in $PENDING; do | ||
| oc adm certificate approve "$CSR" 2>/dev/null || true | ||
| echo "Approved CSR: $CSR" | ||
| done |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file and nearby context.
git ls-files 'deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml'
wc -l deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
# Find related uses of CSR approval / sno_new_nodes in this role and repo.
rg -n --hidden --no-ignore-vcs 'oc adm certificate approve|pending CSR|sno_new_nodes|csr' deploy/openshift-clusters/roles/mutable-topology -SRepository: openshift-eng/two-node-toolbox
Length of output: 6346
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the role directory to locate adjacent tasks/files.
find deploy/openshift-clusters/roles/mutable-topology/sno-to-3node -maxdepth 3 -type f | sort
# Show the full task file with line numbers if small enough.
stat -c '%n %s bytes %n' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 1520
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the variables that define the new nodes and nearby role flow.
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
printf '\n---\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 1943
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
printf '\n---\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 1943
Restrict CSR approval to the new masters only
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml approves every pending CSR returned by oc get csr. During the migration window, this can approve unrelated or stale requests. Limit approval to the kubelet client/server CSRs for sno_new_nodes before calling oc adm certificate approve.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`
around lines 4 - 21, The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.
| sno_dev_scripts_path: "{{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}" | ||
| sno_kubeconfig_resolved: >- | ||
| {{ sno_kubeconfig if sno_kubeconfig | ||
| else sno_dev_scripts_path ~ '/ocp/' ~ sno_cluster_name ~ '/auth/kubeconfig' }} |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
sno_cluster_name is not actually auto-detectable here.
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml sets sno_cluster_name to ostest, and this file immediately bakes that into sno_kubeconfig_resolved. On any non-ostest deployment, the first oc call will use the wrong kubeconfig path and fail before the role can discover the real cluster identity. Require sno_cluster_name/sno_kubeconfig up front, or resolve kubeconfig from a source that does not assume the cluster name.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`
around lines 2 - 5, The kubeconfig path in sno_kubeconfig_resolved is
incorrectly derived from the default sno_cluster_name, so it can point to the
wrong cluster before discovery happens. Update the mutable-topology SNO-to-3node
vars so sno_cluster_name is not assumed here; instead require sno_cluster_name
or sno_kubeconfig to be provided up front, or change the resolution logic to use
a source that does not depend on the default value. Keep the fix localized to
the sno_kubeconfig_resolved expression and related variable handling.
Summary by CodeRabbit
New Features
Bug Fixes