docs(cozystack-upgrade): add KubeVirt 1.6→1.8 VM cold-restart workflow #7
Cozystack release-1.4 will bump KubeVirt from 1.6.3 to 1.8.2 (cozystack PR #2502). Every VM that was running before the upgrade then fails to live-migrate because the in-memory QEMU device state can't be reloaded by the new QEMU on the target launcher (kubevirt/kubevirt#16386, virtio-net specifically).

Add a known-failures entry covering:

- pre-upgrade: set workloadUpdateMethods=[] and suspend the kubevirt HelmRelease
- post-upgrade: paced cold-restart of all running VMs (with an exclusion list for tenants who can't take the downtime window)
- steady state: re-enable workloadUpdateMethods once the cluster is uniformly on the new launcher image

Also add a SKILL.md red-flag row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the operator catches this before running helm upgrade.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
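As a rough illustration of the pre-upgrade and steady-state steps listed in the commit message, the sketch below prints the `kubectl patch` commands involved and only executes them when `APPLY=1`. The namespace `cozy-kubevirt`, the KubeVirt CR name `kubevirt`, the HelmRelease name `kubevirt`, and the helper names (`run`, `set_update_methods`, `pre_upgrade`, `images_uniform`) are assumptions made for this example, not taken from the PR; the known-failures entry remains the authoritative procedure.

```bash
# Sketch only. KV_NS, KV_NAME and HR_NAME are assumed names; confirm them with
#   kubectl get kubevirt -A    and    kubectl get helmrelease -A
KV_NS="${KV_NS:-cozy-kubevirt}"
KV_NAME="${KV_NAME:-kubevirt}"
HR_NAME="${HR_NAME:-kubevirt}"

# Print the command by default (for review, quoting is not preserved);
# execute it only when APPLY=1.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Set spec.workloadUpdateStrategy.workloadUpdateMethods to the JSON array in $1.
set_update_methods() {
  run kubectl -n "$KV_NS" patch kubevirt "$KV_NAME" --type merge \
    -p "{\"spec\":{\"workloadUpdateStrategy\":{\"workloadUpdateMethods\":$1}}}"
}

# Pre-upgrade: stop automatic workload updates, then suspend the HelmRelease
# so that the manual patch is not reconciled away.
pre_upgrade() {
  set_update_methods '[]'
  run kubectl -n "$KV_NS" patch helmrelease "$HR_NAME" --type merge \
    -p '{"spec":{"suspend":true}}'
}

# Steady-state gate: reads one launcher image per line on stdin and succeeds
# only if exactly one distinct image is in use. Possible input (assumes the
# launcher container is named "compute"):
#   kubectl get pods -A -l kubevirt.io=virt-launcher \
#     -o jsonpath='{range .items[*]}{.spec.containers[?(@.name=="compute")].image}{"\n"}{end}'
images_uniform() {
  [ "$(sort -u | grep -c .)" -eq 1 ]
}
```

Once `images_uniform` succeeds, the previous setting could be restored with something like `set_update_methods '["LiveMigrate"]'`; the value shown is only an example, so it is worth recording the original setting before clearing it.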
Code Review
This pull request adds documentation and a detailed recovery procedure for KubeVirt upgrades from version 1.6.x to 1.8.x, addressing a known issue where live-migrations fail due to a QEMU version bump. The feedback suggests improving the provided bash scripts by using a generic default for the namespace exclusion list and adding a status filter to ensure only running pods are targeted during the phased cold-restart process.
```bash
# 6. Build the worklist of VMIs to restart. Excludes any that the operator
#    must leave alone (replace EXCLUDED_NS as needed).
EXCLUDED_NS=tenant-edoors  # comma-separated if more than one; adjust grep below
```
The example value tenant-edoors is very specific and might be accidentally used if the user copy-pastes the block. It's better to provide an empty default. Also, the comment mentions "adjust grep below" but the implementation uses awk.
Suggested change:

    -EXCLUDED_NS=tenant-edoors  # comma-separated if more than one; adjust grep below
    +EXCLUDED_NS=""  # comma-separated list of namespaces to exclude
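To illustrate how a comma-separated exclusion list with an empty default might be applied, here is a minimal sketch. `filter_excluded` is a hypothetical helper written for this example and is not the PR's actual awk code; it assumes the worklist is a stream of `namespace/name` lines.

```bash
# Sketch only: filter_excluded is a hypothetical helper, not the PR's script.
EXCLUDED_NS="${EXCLUDED_NS:-}"   # e.g. "tenant-a,tenant-b"; empty excludes nothing

# Reads "namespace/name" lines on stdin and prints only those whose namespace
# is not listed in EXCLUDED_NS. Matching is exact, so excluding "tenant-a"
# does not also drop "tenant-ab" the way a substring grep would.
filter_excluded() {
  awk -F/ -v excluded="$EXCLUDED_NS" '
    BEGIN {
      n = split(excluded, list, ",")
      for (i = 1; i <= n; i++) if (list[i] != "") skip[list[i]] = 1
    }
    !($1 in skip)
  '
}

# Possible usage (not executed here):
#   kubectl get vmi -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
#     | filter_excluded > worklist.txt
```

With an empty `EXCLUDED_NS` the filter passes every line through, so a copy-pasted block restarts all VMs unless the operator opts a tenant out explicitly.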
    pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
      -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
To make sure the script targets the active workload and skips pods in Terminating or Failed states (which can linger when a VM is unhealthy), it's safer to filter for Running pods.
Suggested change:

    -pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
    -  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
    +pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
    +  --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
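One caveat that may be worth checking: `Terminating` is a status that kubectl displays rather than a pod phase, so a launcher pod that is being deleted can still match `status.phase=Running`. The sketch below combines the suggested field selector with a check for an empty `metadata.deletionTimestamp`. `launcher_pod` is a hypothetical helper for illustration, not code from the PR.

```bash
# Sketch only: launcher_pod is a hypothetical helper, not the PR's script.
# Prints the name of the first Running virt-launcher pod for a VMI that is
# not already being deleted; prints nothing if no such pod is found.
launcher_pod() {
  local ns="$1" vmi="$2"
  kubectl -n "$ns" get pods \
    -l "kubevirt.io=virt-launcher,vm.kubevirt.io/name=$vmi" \
    --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionTimestamp}{"\n"}{end}' \
    2>/dev/null |
    awk -F'\t' '$2 == "" { print $1; exit }'   # keep pods without a deletionTimestamp
}

# Possible usage (not executed here):
#   pod=$(launcher_pod "$ns" "$vmi")
#   [ -n "$pod" ] || echo "no active launcher for $ns/$vmi" >&2
```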
Summary
Adds a procedure to the `cozystack-upgrade` skill for the KubeVirt 1.6.x → 1.8.x bump that's coming with Cozystack `release-1.4` (cozystack/cozystack#2502).

When that upgrade is applied via `helm upgrade cozystack`, every VM that was running pre-upgrade fails to live-migrate afterwards because the new QEMU can't reload the old in-memory `virtio-net` device state (kubevirt/kubevirt#16386). KubeVirt's `workloadUpdateMethods` keeps retrying, and the cluster ends up flapping.

Validated end-to-end on staging (hidora-hikube-lab) and production (hidora-hikube): 161 running VMs, ~85 minutes total, no customer-visible incidents.
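For a sense of the pacing, 161 VMs in about 85 minutes works out to roughly 32 seconds per VM on average, assuming the restarts ran strictly one at a time. A paced loop along those lines might look like the sketch below. It is not the loop from `known-failures.md`: the helper names, the `PAUSE` and `WAIT_TIMEOUT` knobs, and the use of `virtctl restart` followed by `kubectl wait` are assumptions for illustration. Commands are only printed unless `APPLY=1`.

```bash
# Sketch only. PAUSE and WAIT_TIMEOUT are assumed knobs, not values from the PR.
PAUSE="${PAUSE:-30}"                   # seconds to wait between VMs
WAIT_TIMEOUT="${WAIT_TIMEOUT:-300s}"   # how long to wait for a VM to come back

# Print the command by default; execute it only when APPLY=1.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Reads "namespace/name" lines on stdin and cold-restarts one VM at a time.
paced_restart() {
  while IFS=/ read -r ns vm; do
    [ -n "$ns" ] && [ -n "$vm" ] || continue   # skip blank or malformed lines
    # </dev/null keeps the commands from consuming the worklist on stdin.
    if ! run virtctl restart "$vm" -n "$ns" </dev/null; then
      echo "restart failed: $ns/$vm" >&2
      continue
    fi
    # Assumption: the VM's Ready condition is a good enough "back up" signal.
    # It may still read True for a moment right after the restart request.
    run kubectl -n "$ns" wait vm "$vm" --for=condition=Ready \
      --timeout="$WAIT_TIMEOUT" </dev/null
    run sleep "$PAUSE"
  done
}

# Possible usage (not executed here):
#   APPLY=1 paced_restart < worklist.txt
```

Running it first without `APPLY=1` gives a plain list of the commands in order, which can be shared with VM owners when agreeing on the downtime window.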
Changes
- `references/known-failures.md`: new entry #8 with the exact pre-upgrade prep (`workloadUpdateMethods: []`, suspend the `kubevirt` HR), the paced cold-restart loop, post-upgrade verification, and the steady-state cleanup.
- `SKILL.md`: adds a red-flag table row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the skill catches this before running `helm upgrade`.

The flow is built around the conventional Cozystack upgrade path (`helm upgrade cozystack ...`), not ad-hoc `make apply`. Coordination with VM owners is the main requirement: every non-excluded VM gets ~30-60s downtime in a controlled order.

Why "do not merge"
Blocked on cozystack/cozystack#2502 (the actual KubeVirt 1.8.2 bump). This skill change describes the workflow for a Cozystack release that doesn't exist yet — merging earlier would point users at a procedure they don't need.
Merge condition: merge once cozystack/cozystack#2502 lands in a Cozystack release (currently targeted at `release-1.4`). If a better upstream fix appears for kubevirt/kubevirt#16386 before then (e.g. a way to pin per-VMI launcher images so existing VMs don't need cold-restart), revisit this PR, since the workflow may no longer be needed.