
docs(cozystack-upgrade): add KubeVirt 1.6→1.8 VM cold-restart workflow #7

Draft

kvaps wants to merge 1 commit into main from feat/kubevirt-1.6-to-1.8-vm-restart

Conversation


kvaps (Member) commented Apr 28, 2026

Summary

Adds a procedure to the cozystack-upgrade skill for the KubeVirt 1.6.x → 1.8.x bump that's coming with Cozystack release-1.4 (cozystack/cozystack#2502).

When that upgrade is applied via helm upgrade cozystack, every VM that was running pre-upgrade fails to live-migrate afterwards because the new QEMU can't reload the old in-memory virtio-net device state (kubevirt/kubevirt#16386). KubeVirt's workloadUpdateMethods machinery keeps retrying the failed migrations, so the cluster ends up flapping.

Validated end-to-end on staging (hidora-hikube-lab) and production (hidora-hikube): 161 running VMs, ~85 minutes total, no customer-visible incidents.

Changes

  • references/known-failures.md — new entry #8 with the exact pre-upgrade prep (workloadUpdateMethods: [], suspend the kubevirt HR; sketched below), the paced cold-restart loop, post-upgrade verification, and the steady-state cleanup.
  • SKILL.md — adds a red-flag table row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the skill catches this before running helm upgrade.
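
A minimal sketch of that pre-upgrade prep, assuming the KubeVirt CR and its HelmRelease are both named kubevirt in the cozy-kubevirt namespace and reconciled by Flux (those names are assumptions; the authoritative steps live in references/known-failures.md):

```bash
# Stop KubeVirt from auto-migrating running VMs onto the new launcher image;
# workloadUpdateMethods sits under spec.workloadUpdateStrategy in the KubeVirt CR.
kubectl -n cozy-kubevirt patch kubevirt kubevirt --type=merge \
  -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":[]}}}'

# Suspend the HelmRelease so Flux doesn't reconcile the patch away mid-upgrade.
flux suspend helmrelease kubevirt -n cozy-kubevirt
```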

The flow is built around the conventional Cozystack upgrade path (helm upgrade cozystack ...), not ad-hoc make apply. Coordination with VM owners is the main requirement: every non-excluded VM takes ~30-60s of downtime, in a controlled order (see the sketch below).
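
For illustration, the paced restart loop could look roughly like this; the 30 s pacing and the assumption that each VMI is owned by a VirtualMachine of the same name are mine, not the skill's:

```bash
EXCLUDED_NS=""   # comma-separated namespaces whose VMs must keep running

kubectl get vmi -A --no-headers \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
while read -r ns vmi; do
  # Skip tenants that can't take the downtime window.
  case ",$EXCLUDED_NS," in *",$ns,"*) echo "skip $ns/$vmi"; continue ;; esac
  echo "cold-restarting $ns/$vmi"
  virtctl restart "$vmi" -n "$ns"   # assumes the VM shares the VMI's name
  sleep 30                          # pace restarts so only one VM is down at a time
done
```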

Why "do not merge"

Blocked on cozystack/cozystack#2502 (the actual KubeVirt 1.8.2 bump). This skill change describes the workflow for a Cozystack release that doesn't exist yet — merging earlier would point users at a procedure they don't need.

Merge condition: merge once cozystack/cozystack#2502 lands in a Cozystack release (currently targeted at release-1.4). If a better upstream fix appears for kubevirt/kubevirt#16386 before then (e.g. a way to pin per-VMI launcher images so existing VMs don't need cold-restart), revisit this PR — the workflow may no longer be needed.

Cozystack release-1.4 will bump KubeVirt from 1.6.3 to 1.8.2 (cozystack PR
#2502). Every VM that was running before the upgrade then fails to live-migrate
because the in-memory QEMU device state can't be reloaded by the new QEMU on
the target launcher (kubevirt/kubevirt#16386, virtio-net specifically).

Add a known-failures entry covering:
- pre-upgrade: set workloadUpdateMethods=[] and suspend the kubevirt HelmRelease
- post-upgrade: paced cold-restart of all running VMs (with an exclusion list
  for tenants who can't take the downtime window)
- steady state: re-enable workloadUpdateMethods once the cluster is uniformly
  on the new launcher image

Also add a SKILL.md red-flag row and a top-level "KubeVirt 1.6.x → 1.8.x
special handling" note so the operator catches this before running helm upgrade.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
kvaps added the documentation (Improvements or additions to documentation) and do not merge (Do not merge until linked dependency is resolved) labels on Apr 28, 2026.
coderabbitai Bot commented Apr 28, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.



gemini-code-assist Bot left a comment


Code Review

This pull request adds documentation and a detailed recovery procedure for KubeVirt upgrades from version 1.6.x to 1.8.x, addressing a known issue where live-migrations fail due to a QEMU version bump. The feedback suggests improving the provided bash scripts by using a generic default for the namespace exclusion list and adding a status filter to ensure only running pods are targeted during the phased cold-restart process.

```bash
# 6. Build the worklist of VMIs to restart. Excludes any that the operator
# must leave alone (replace EXCLUDED_NS as needed).
EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
```

Severity: medium

The example value tenant-edoors is very specific and might be accidentally used if the user copy-pastes the block. It's better to provide an empty default. Also, the comment mentions "adjust grep below" but the implementation uses awk.

Suggested change:

```diff
-EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
+EXCLUDED_NS="" # comma-separated list of namespaces to exclude
```

Comment on lines +330 to +331:

```bash
pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```
Severity: medium

To ensure the script targets the active workload and avoids issues with pods in Terminating or Failed states (which might exist if a VM is undergoing issues), it's safer to filter for Running pods.

Suggested change:

```diff
-pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
-  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
+  --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```
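
Taken together with these suggestions, a post-restart verification pass might look like the sketch below; the v1.6 image substring and the HelmRelease location are assumptions, so substitute the tags and names your cluster actually uses:

```bash
# Any virt-launcher pod still on a 1.6.x image means a VM was missed.
kubectl get pods -A -l kubevirt.io=virt-launcher -o \
  jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
  | grep 'virt-launcher:v1.6' || echo "all launchers are on the new image"

# Once the output is clean, resume reconciliation so the steady-state
# workloadUpdateMethods value is restored from the HelmRelease values.
flux resume helmrelease kubevirt -n cozy-kubevirt
```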
