
[docs] Add vGPU setup guide for GPU sharing between VMs#467

Draft
lexfrei wants to merge 14 commits into main from docs/gpu-vgpu-setup

Conversation

Contributor

@lexfrei lexfrei commented Apr 2, 2026

What this PR does

Adds a practical guide for running VMs with NVIDIA vGPU on Cozystack to the existing GPU passthrough page.

The guide covers the SR-IOV vGPU path used by current data-centre GPUs (L4, L40, L40S, B100) on the vGPU 20.x driver branch. The mediated-devices path used by older GPUs (Pascal–Ampere) is explicitly out of scope and the reader is pointed at upstream NVIDIA docs.

Steps documented:

  • Build the proprietary vGPU Manager container image from github.com/NVIDIA/gpu-driver-container (the older gitlab.com/nvidia/container-images/driver is archived).
  • Deploy GPU Operator with the vgpu variant via Package CR (depends on cozystack#2323, "[gpu-operator] Update to v26.3.1 and add experimental vGPU variant").
  • Assign vGPU profiles to SR-IOV VFs via the current_vgpu_type sysfs attribute (sketched after this list), with a periodic profile-loader DaemonSet skeleton and an explicit experimental warning.
  • Configure DLS licensing via ClientConfigToken (the legacy NLS ServerAddress= / ServerPort=7070 flow no longer applies); a cloud-init sketch follows the list.
  • Patch the KubeVirt CR with permittedHostDevices.pciHostDevices (after kubevirt/kubevirt#16890, "vGPU: SRIOV support"; the first stable KubeVirt release with the patch is targeted at v1.9.0); see the CR fragment below.
  • Sample VirtualMachine (raw kubevirt.io/v1) using a CDI DataVolume so the rootfs has room for in-VM driver install, with a cloudInitNoCloud disk that drops the licensing token, gridd.conf, an SSH key, and the build dependencies.
  • vGPU profile reference table for L40S with the Q/A/B suffix taxonomy.
  • Warning about the 2.4 GiB containerDisk root overflow during in-VM driver install (we observed SIGBUS from a write into an mmap of a file the kernel could no longer extend; the new example sidesteps it via DataVolume).
  • Talos is explicitly noted as not recommended for vGPU; passthrough on Talos is unaffected.
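
For orientation, the out-of-band profile-assignment step has roughly the following shape. This is a hedged sketch rather than the guide's literal text: the VF bus address is a placeholder, and the numeric profile IDs are driver-version-dependent, so they must be read from creatable_vgpu_types on the actual hardware.

```bash
# List the vGPU profile IDs the host driver will accept for this VF.
# 0000:41:00.4 is a placeholder VF address; find yours with lspci -nn.
cat /sys/bus/pci/devices/0000:41:00.4/nvidia/creatable_vgpu_types

# Assign a profile. printf avoids the trailing newline that some driver
# versions reject with "invalid argument"; <profile-id> is a placeholder.
printf '%s' <profile-id> > /sys/bus/pci/devices/0000:41:00.4/nvidia/current_vgpu_type

# The assignment resets on reboot, which is what the profile-loader
# DaemonSet skeleton in the guide automates.
```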
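
The KubeVirt CR change is, in outline, the fragment below. It assumes the L40S case where the SR-IOV VFs share the PF device ID 10de:26b9 (verify with lspci -nn on other generations) and an illustrative resource name; merge it with any existing passthrough entries rather than replacing them.

```yaml
# Fragment of the KubeVirt CR (kubectl edit kubevirt); sketch only.
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:26B9"            # vendor:device pair of the VFs
          resourceName: nvidia.com/NVIDIA_L40S-24Q  # must match the assigned profile
```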
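
The licensing drop can be pictured as the cloud-init fragment below, consistent with the /etc/nvidia paths and the 0744 token mode cited in the commit log; the token body is elided and the whole fragment is illustrative, not the guide's exact manifest.

```yaml
#cloud-config
write_files:
  # ClientConfigToken generated on your DLS instance; 0744 per the NVIDIA
  # Licensing User Guide, since nvidia-gridd need not run as the file owner.
  - path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok
    permissions: '0744'
    content: |
      <token from the DLS instance>
  - path: /etc/nvidia/gridd.conf
    permissions: '0644'
    content: |
      FeatureType=1  # 1 = vGPU (vPC/vWS), 2 = vCS
# After the GRID guest driver install, remember: systemctl enable --now nvidia-gridd
```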

Release note

NONE


netlify Bot commented Apr 2, 2026

Deploy Preview for cozystack ready!

  • 🔨 Latest commit: 4aaabe3
  • 🔍 Latest deploy log: https://app.netlify.com/projects/cozystack/deploys/69f342cc00c09300076d1f2d
  • 😎 Deploy Preview: https://deploy-preview-467--cozystack.netlify.app

Contributor

coderabbitai Bot commented Apr 2, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7aa47ec4-301c-4e56-8f23-74afce92ea71

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Documentation replaces the previous GPU-sharing overview with a focused vGPU (mediated device) guide covering prerequisites, NVIDIA vGPU licensing, GPU Operator vgpu variant, vGPU Manager image build, NVIDIA License Server wiring, KubeVirt mediated device config, VM examples, and vGPU profile details.

Changes

Cohort / File(s): GPU vGPU Documentation (content/en/docs/v1/virtualization/gpu.md)
Summary: Rewrote GPU sharing section into a full vGPU (mdev) guide: new prerequisites and licensing notes; instructions to build/publish vGPU Manager driver container; GPU Operator variant: vgpu installation and node labeling; NVIDIA License Server Secret/ConfigMap example and Package wiring; KubeVirt mediatedDevices configuration; VM example requesting nvidia.com/<profile> and verification; added L40S vGPU profiles table; clarified open-source vGPU wording.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through docs with glee today,
Tucked mdev notes and profiles away,
Built driver images, licenses in tow,
VMs now share GPUs—watch them go! 🎉

🚥 Pre-merge checks

✅ Passed checks (3 passed)

  • Title check ✅ Passed: The title accurately describes the main change: adding a comprehensive vGPU setup guide for GPU sharing between virtual machines, which aligns with the expanded documentation covering vGPU Manager deployment, licensing, KubeVirt configuration, and VM examples.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check ✅ Passed: Check skipped: CodeRabbit's high-level summary is enabled.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request provides comprehensive documentation for configuring NVIDIA vGPU sharing for virtual machines, including prerequisites, image building, operator installation, and licensing setup. The review feedback suggests clarifying the FeatureType parameter in the licensing configuration and updating the example command prompt to maintain consistency with the platform's virtual machine naming conventions.

```yaml
gridd.conf: |
  ServerAddress=nls.example.com
  ServerPort=443
  FeatureType=1
```

medium

It is helpful to clarify what the FeatureType value represents to assist users in customizing their configuration. In the NVIDIA Grid configuration, 1 corresponds to the "NVIDIA vGPU" (vPC/vWS) feature, while 2 is for "NVIDIA Virtual Compute Server" (vCS).

Suggested change:

```diff
-FeatureType=1
+FeatureType=1 # 1 for vGPU
```

```console
ubuntu@gpu-vgpu:~$ nvidia-smi
```

medium

For consistency with the GPU passthrough example (line 194) and Cozystack's default naming convention for virtual machine instances, the hostname in the command prompt should include the virtual-machine- prefix. Since the VM name is defined as gpu-vgpu on line 352, the resulting instance name is virtual-machine-gpu-vgpu.

Suggested change:

```diff
-ubuntu@gpu-vgpu:~$ nvidia-smi
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
```

@lexfrei lexfrei self-assigned this Apr 2, 2026
@lexfrei lexfrei marked this pull request as ready for review April 2, 2026 12:51
@lexfrei lexfrei requested review from kvaps and lllamnyp as code owners April 2, 2026 12:51
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)

360-385: Name the manifest file explicitly before the apply command.

kubectl apply -f vmi-vgpu.yaml appears without first labeling the YAML block as vmi-vgpu.yaml (unlike the earlier passthrough example). Adding a filename label right above the manifest would remove ambiguity for copy/paste users.

✏️ Suggested doc tweak
 ### 5. Create a Virtual Machine with vGPU

+**vmi-vgpu.yaml**:
+
 ```yaml
 apiVersion: apps.cozystack.io/v1alpha1
 appVersion: '*'
 kind: VirtualMachine
 ...

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In content/en/docs/v1/virtualization/gpu.md around lines 360 - 385, add an
explicit filename label above the YAML manifest block so users know the file
name to save before running kubectl; specifically, annotate the VirtualMachine
manifest block with "vmi-vgpu.yaml" (the same name used in the kubectl apply -f
vmi-vgpu.yaml command) by placing the filename line immediately before
the ```yaml fence, ensuring consistency between the manifest and the
kubectl apply invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 150cd704-7d01-4c3d-86f5-81294e0a31b5

📥 Commits

Reviewing files that changed from the base of the PR and between 624a38c and 468dd7b.

📒 Files selected for processing (1)
  • content/en/docs/v1/virtualization/gpu.md

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)

279-288: Make the imagePullSecrets snippet fully qualified to prevent misplacement.

At Line 279, the snippet is context-trimmed and can be pasted under the wrong key. Please show the full values path to avoid broken Package configuration.

Proposed doc patch
-    ```yaml
-    gpu-operator:
-      vgpuManager:
-        repository: registry.example.com/nvidia
-        version: "550.90.05"
-        imagePullSecrets:
-        - name: nvidia-registry-secret
-    ```
+    ```yaml
+    components:
+      gpu-operator:
+        values:
+          gpu-operator:
+            vgpuManager:
+              repository: registry.example.com/nvidia
+              version: "550.90.05"
+              imagePullSecrets:
+              - name: nvidia-registry-secret
+    ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@content/en/docs/v1/virtualization/gpu.md` around lines 279 - 288, The snippet
for imagePullSecrets is too context-trimmed and can be pasted under the wrong
key; update the example so it shows the full values path (wrap the existing
gpu-operator.vgpuManager block under components -> gpu-operator -> values ->
gpu-operator -> vgpuManager) so users see the complete hierarchy and the
imagePullSecrets entry (refer to symbols: components, gpu-operator, values,
gpu-operator, vgpuManager, imagePullSecrets) and replace the trimmed snippet
with this fully-qualified version.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d6caa76-d2a5-4856-97eb-524345257f6b

📥 Commits

Reviewing files that changed from the base of the PR and between 468dd7b and 492f318.

📒 Files selected for processing (1)
  • content/en/docs/v1/virtualization/gpu.md

Comment on lines +358 to +391
### 5. Create a Virtual Machine with vGPU

```yaml
apiVersion: apps.cozystack.io/v1alpha1
appVersion: '*'
kind: VirtualMachine
metadata:
  name: gpu-vgpu
  namespace: tenant-example
spec:
  running: true
  instanceProfile: ubuntu
  instanceType: u1.medium
  systemDisk:
    image: ubuntu
    storage: 5Gi
    storageClass: replicated
  gpus:
    - name: nvidia.com/NVIDIA_L40S-24Q
  cloudInit: |
    #cloud-config
    password: ubuntu
    chpasswd: { expire: False }
```

```bash
kubectl apply -f vmi-vgpu.yaml
```

Once the VM is running, log in and verify the vGPU is available:

```bash
virtctl console virtual-machine-gpu-vgpu
```

⚠️ Potential issue | 🟡 Minor

Add an explicit VM readiness check before opening console.

After Line 384, jumping directly to virtctl console can fail intermittently if the VM/VMI is not ready yet. Add a wait/check step to keep the flow deterministic.

Proposed doc patch
 ```bash
 kubectl apply -f vmi-vgpu.yaml
 ```
+
+Wait until the VM instance is ready:
+
+```bash
+kubectl get vmi -n tenant-example -w
+```
 
 Once the VM is running, log in and verify the vGPU is available:


🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @content/en/docs/v1/virtualization/gpu.md around lines 358 - 391, Add an
explicit readiness check for the VirtualMachineInstance before calling "virtctl
console": after applying vmi-vgpu.yaml (kubectl apply -f vmi-vgpu.yaml) add a
step that waits for the VMI to become Ready (e.g., using "kubectl get vmi -n
tenant-example -w" or "kubectl wait --for=condition=Ready vmi/gpu-vgpu -n
tenant-example") so the subsequent "virtctl console virtual-machine-gpu-vgpu"
call won't fail intermittently.



@IvanHunters IvanHunters self-assigned this Apr 9, 2026
@lexfrei lexfrei marked this pull request as draft April 12, 2026 11:42
@IvanHunters IvanHunters removed their assignment Apr 22, 2026
lexfrei added a commit to cozystack/cozystack that referenced this pull request Apr 30, 2026
…ording

Mirroring the corresponding cleanup in cozystack/website#467 review:

* KubeVirt v1.9.0 ETA: drop the 'July 2026' date — KubeVirt does not
  publish hard release dates and the in-page wording was inconsistent
  across the file. Replace with 'Targeted at the next minor release
  (v1.9.0); track the PR for the actual release tag', matching the
  website doc.
* Kernel-headers timing: 'The build downloads kernel headers at
  container start time' conflated docker build with pod start; the
  Dockerfile's entrypoint downloads them at runtime. Reword.
* Replace guillemets with ASCII quotes for consistency with the rest
  of the file and the upstream chart docs.
* nvidia-smi -q license check: drop the unnecessary -A 1 — License
  Status value is on the same line as the field name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added 14 commits April 30, 2026 14:53
Add practical instructions for deploying GPU Operator with vGPU variant:
- Building proprietary vGPU Manager container image
- Deploying with vgpu variant via Package CR
- NLS license server configuration
- KubeVirt mediatedDeviceTypes setup
- vGPU profile reference table for L40S
- VM creation example with vGPU resource

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replace simplified Containerfile with NVIDIA's Makefile-based build
system from gitlab.com/nvidia/container-images/driver. The GPU Operator
expects pre-compiled kernel modules, not a raw .run file.

Add EULA warning about public redistribution of vGPU driver images.
Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
- Switch licensing config from ConfigMap to Secret (configMapName deprecated)
- Add FeatureType comment explaining values (1=vGPU, 2=vCS)
- Fix console hostname to match Cozystack naming convention (virtual-machine- prefix)

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The original draft documented mediated devices (mdev) — that path
does not apply to current NVIDIA data-centre GPUs (L4, L40, L40S,
B100). On the vGPU 17/20 driver branch these GPUs use SR-IOV with
per-VF NVIDIA sysfs; KubeVirt advertises VFs via
permittedHostDevices.pciHostDevices, not mediatedDevices.

This commit rewrites the vGPU section accordingly:

- Add a driver-model note at the top making the mdev/SR-IOV split
  explicit. Pascal–Ampere readers are sent to upstream NVIDIA docs.
- Replace mediatedDeviceTypes / mediatedDevices / gpus: with
  pciHostDevices / hostDevices throughout.
- Document KubeVirt v1.9.0 (ETA July 2026, kubevirt/kubevirt#16890
  shipping in that release) as a hard prerequisite — current released
  tags v1.6.x/1.7.x/1.8.x do not include the patch and backports are
  not planned.
- Replace the legacy NLS block (ServerAddress=, ServerPort=7070,
  FeatureType=1) with the DLS ClientConfigToken flow.
- Switch build instructions to the upstream NVIDIA repo
  (github.com/NVIDIA/gpu-driver-container) — the gitlab repo is
  archived.
- Note explicitly that Talos is not recommended for vGPU (NVIDIA
  redistribution restrictions, siderolabs/extensions#461 closed
  won't-fix); passthrough on Talos is unaffected.
- Add a warning that a 2.4 GiB containerDisk overlay is too small
  to install the GRID guest driver in-place — recommend a CDI
  DataVolume of 20 GiB+ for non-throwaway tests.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323 docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The
  vGPU Manager DaemonSet works, but profile assignment is currently
  out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and
  resets on every reboot — without an automated mechanism the variant
  silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload
  to vm-vgpu out of the box (cozystack#2323), so per-node labelling is
  optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart
  reads it per-component (vgpuManager.imagePullSecrets,
  driver.imagePullSecrets, …); placing it at the package root would
  silently render as a no-op and the DaemonSet would ImagePullBackOff
  without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the
  driver-model commentary applies to both architectures, but they
  should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID
  (10de:26b9), so a single pciVendorSelector matches both. Add lspci
  sanity-check guidance for other generations where PF/VF IDs may
  differ.
* Add a migration note in the Licensing section: chart v25 → v26
  deprecated driver.licensingConfig.configMapName in favour of
  secretName; old key still works but emits a deprecation marker in
  the CRD. SR-IOV vGPU does not consume the host-side licensing knob
  at all — relevant only for passthrough operators carrying old
  configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends
  on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0
  (unreleased) — neither is in any current Cozystack release. Without
  this banner the page reads like a supported workflow.

* Drop the incorrect claim that gpus: expects an mdev resource. The
  upstream KubeVirt VirtualMachine spec accepts both PCI and mdev
  resource names under either gpus: or hostDevices:. Restate the
  hostDevices vs gpus distinction as convention, not a hard
  constraint, and explain the cozystack/v1alpha1 wrapper trade-off.

* Drop the self-contradicting parenthetical that called
  pciVendorSelector 'the device ID, not the vendor:device pair' —
  the example right above already uses the vendor:device tuple
  (10DE:26B9), and the upstream KubeVirt godoc documents the field
  exactly as that tuple.

* Re-quote siderolabs/extensions#461. The issue was closed with
  stateReason=COMPLETED, not not_planned; the literal phrase 'won't
  fix' does not appear in the closure. Replace with the substantive
  reason from rothgar's closing comment.

* Fix pre-existing PCI ID error in the passthrough lspci sample
  output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the
  prior text said 10de:26b9 which is L40S. The rest of the
  passthrough section already uses 2236 correctly, so this was a
  copy-paste artefact.

* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish
  hard release dates; replace 'ETA July 2026' with 'targeted for
  the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ofile-loader DaemonSet skeleton

Round 2 of review fixes for the vgpu section.

Blocking changes:

* Rewrite the sample VirtualMachine as a runnable end-to-end example.
  The previous manifest used a 2.4 GiB containerDisk (too small for
  in-VM driver install), had no cloud-init credentials, and provided
  no path for the licensing token to land on the guest filesystem.
  The new manifest uses a CDI DataVolume sized at 20 GiB, ships a
  cloudInitNoCloud disk that drops the ClientConfigToken and
  gridd.conf into /etc/nvidia/, installs build-essential / dkms /
  linux-headers-generic via cloud-init packages, and wires an SSH key
  for virtctl ssh access. Move the rootfs-overflow warning above the
  manifest where the reader can act on it.

* Add a minimal profile-loader DaemonSet snippet to step 3.
  Previously the doc warned that profile assignment is out-of-band
  but offered no concrete bridge from 'manual kubectl exec' to
  something repeatable. The new ConfigMap + DaemonSet skeleton reads
  bus-id=profile-id pairs and writes them through host /sys, so a
  reader has a starting point that survives node reboots.

* Tighten the gpus: vs hostDevices: claim. The KubeVirt API accepts
  both for PCI and mdev resource names; the runtime semantics differ
  (gpus: adds virtio-vga display, hostDevices: does not). State this
  honestly instead of asserting strict equivalence.

* Clarify that whether the cozystack apps.cozystack.io/v1alpha1
  wrapper's gpus: field correctly resolves SR-IOV vGPU resource names
  is not yet validated; raw kubevirt.io/v1 is the safe path for now.

Polish:

* KubeVirt v1.9.0 wording: drop the 'July 2026' guess (KubeVirt does
  not publish hard release dates) and unify across the page.
* registry.example.com: call out as RFC 2606 placeholder so readers
  do not copy-paste it.
* Kernel-headers timing: 'build downloads' was wrong — the
  entrypoint downloads them at pod start, not docker build.
* nvidia-smi license grep: drop unnecessary -A 1.
* PCI ID example for A10 in the passthrough section: was 10de:26b9
  (which is L40S); fixed to 10de:2236.
* Re-quote siderolabs/extensions#461 closure with the actual reason
  (NVIDIA does not publicly distribute the vGPU guest driver) rather
  than the editorial 'won't fix' phrasing.
* Replace guillemets with ASCII quotes for consistency with the rest
  of the docs site.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ures

Round 3 of review fixes for the vgpu guide.

Blocking fixes:

* Profile-loader DaemonSet skeleton: drop fragile parsing that broke
  on whitespace lines / inline ConfigMap comments, replace
  'sleep infinity' with a periodic re-apply loop so the skeleton
  copes with PCIe re-enumeration during the pod's lifetime, mark
  the volumes block read-only, and call out in the prose that
  production-grade implementations need additional safeguards.
* Replace ClientConfigToken file mode 0600 with 0744 per the NVIDIA
  Licensing User Guide. nvidia-gridd does not necessarily run as the
  file owner; 0600 silently breaks license activation.
* Cloud-init users: block: containerDisks/ubuntu pre-provisions the
  ubuntu user, so redefining it via users: silently ignores the
  ssh_authorized_keys block. Use top-level ssh_authorized_keys instead
  and document the gotcha inline.
* Replace the literal '1155' profile id in the example with a
  <profile-id> placeholder and warn that numeric IDs are
  driver-version-dependent and must be read from
  /sys/.../creatable_vgpu_types on the actual hardware.

Worth-fixing items:

* Inline imagePullSecrets as a commented-out variant inside the main
  Package CR snippet rather than a disconnected second snippet, so
  readers do not paste the first block and miss the auth.
* Add a one-sentence merge note for 'kubectl edit kubevirt' so
  readers do not overwrite passthrough pciHostDevices when adding
  vGPU entries.
* Add concrete .run delivery instructions (virtctl scp + virtctl
  ssh) so the guest-driver install step is reproducible.
* Document the runStrategy: Always vs running: true difference
  between raw kubevirt.io/v1 and the cozystack wrapper.
* Explain the Q/A/B profile suffix taxonomy in the profile table
  intro so readers know why the table only shows Q variants.
* De-duplicate the v1.9.0 / kubevirt#16890 paragraph that appeared
  verbatim in two places — keep the canonical version in
  Prerequisites, link to it from the top-of-section banner.

Polish:

* Split the Talos non-recommendation off into its own bullet under
  Prerequisites instead of a paragraph-long mixed sentence.
* Drop the RFC 2606 meta-commentary on registry.example.com — just
  tell the reader to replace it.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the
  page now also covers vGPU as a co-equal section. Update title /
  linkTitle / description / lead paragraph to reflect both flows
  and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections
  (step 4 still had 0600, contradicting step 6's 0744 with the
  NVIDIA-cited justification). 0600 silently breaks license
  activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures:
  read-before-write so manual out-of-band changes are visible (and
  so the kernel does not reject writes while a VM holds the VF and
  the script logs an error every minute on every busy VF), strip
  trailing #-comments from ConfigMap lines, log on malformed lines
  instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs,
  manual kubectl exec changes to current_vgpu_type are reverted
  within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20'
  framing was misleading because the SR-IOV / pciHostDevices flow
  pairs specifically with branch 20 (driver 595.x); the example
  build is a 595.x .run, and a reader on a 17.x subscription would
  build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the
  text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine
  wrapper passes deviceName straight through to KubeVirt; what's
  missing for SR-IOV vGPU is the hostDevices: surface for headless
  setups and end-to-end exercise on real hardware. Be precise about
  which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU
  operator does not 'fix' their setup by switching to secretName
  and then wonder why nothing changed.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e for production

Round 5 of review fixes for the gpu page.

Blocking fixes:

* Fix the passthrough example kind: it was VirtualMachine, but the
  Cozystack apps.cozystack.io/v1alpha1 wrapper kind is VMInstance
  (verified against vm-instance-rd in cozystack/cozystack and against
  every other doc in content/en/docs/v1/virtualization/). The vGPU
  section anchors its commentary on the wrapper to this example, so
  the wrong kind there propagated into the new section by reference.
  Drop appVersion: '*' which is not a VMInstance field.
* Replace the fabricated 'gpus: adds virtio-vga display semantics'
  claim with the actual difference: gpus: carries an optional
  virtualGPUOptions field whose display.enabled defaults to true.
  hostDevices: has no such field. Verified against
  kubevirt/kubevirt staging/src/kubevirt.io/api/core/v1/schema.go.
* DataVolume in the vgpu example was missing storageClassName; on a
  cluster without a default StorageClass the import pod hangs
  Pending. Pin to 'replicated' (matching the passthrough example)
  with a comment that operators should adjust.
* Add a systemctl enable --now nvidia-gridd step after the .run
  install. The .run installs the unit but does not necessarily
  start it on first boot, so a reader following the doc end-to-end
  saw 'Unlicensed' even with a correct token — not a token issue,
  the daemon was simply not running.
* Profile-loader DaemonSet hardening:
  - SIGTERM/SIGINT trap so kubelet does not need to SIGKILL after
    terminationGracePeriodSeconds on rolling updates.
  - 'sleep & wait' so the trap fires immediately mid-sleep.
  - printf '%s' instead of echo so no trailing newline reaches
    sysfs (some driver versions reject 'invalid argument').
* docker build: add --platform linux/amd64 explicitly. Apple Silicon
  build hosts produce arm64 images native, and the kubelet pull on
  amd64 GPU nodes fails with 'no matching manifest'. Also add a
  one-line docker login note.
* virtctl scp / virtctl ssh: pass --namespace tenant-example
  explicitly. virtctl defaults to 'default', not the VM's namespace;
  on a multi-tenant cluster the commands as previously written
  fail with 'not found in namespace default'.

Worth-fixing:

* Soften the Q/A/B profile family description: not all suffix
  variants are available on all GPUs; partition sizes vary per
  GPU and per family.
* VF count caveat: '16 VFs per L40S' was misleading because Ada
  Lovelace VF count is profile-dependent (24Q → 2 VFs). State the
  framebuffer-division relationship explicitly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…VIDIA upstream

NVIDIA's gpu-driver-container repository owns the build path; keeping
a parallel set of docker build / push commands in this docs page
duplicates upstream documentation and goes stale every time their
Dockerfile or build args change. Replace the in-line build snippet
with a paragraph linking to the upstream repo and the NVIDIA
Licensing Portal as the sources of truth; keep only the
Cozystack-specific bits (private registry / EULA note).

This also resolves a class of review feedback that asked for
incrementally more build-side detail in the doc (--platform,
docker login, --build-arg names, etc.) — none of that belongs
here when NVIDIA already documents it in their repo.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… pod spec

Round 6 of review fixes for the vgpu section.

* Rename '### Prerequisites' (vGPU section) to '### vGPU Prerequisites'
  so Hugo does not collide its slug with '## Prerequisites' from the
  passthrough section above. The previous link from the top-of-section
  alert silently landed on passthrough prerequisites; now it points
  at #vgpu-prerequisites where it belongs.

* Drop the 'running: true vs runStrategy: Always — both mean the same
  thing' aside. The Cozystack VMInstance wrapper publishes runStrategy
  in its OpenAPI schema; running: true survives only as an
  undocumented compat fallback. Documenting the equivalence locks
  readers into a deprecated path. The two examples already use
  different fields because they are different kinds, which speaks
  for itself.

* Add a 'Last verified' line in the experimental-status alert with
  concrete versions: KubeVirt main nightly, cozystack#2323 head,
  vGPU 20.0 host driver, GRID 595.58.03 guest driver. Future readers
  can tell whether the guide still applies to their cluster.

* Profile-loader DaemonSet:
  - Add resources.requests / limits (cpu 10m, memory 16Mi/32Mi) so
    a privileged pod on every GPU node has at least minimal kubelet
    protection against OOM under host memory pressure.
  - Add terminationGracePeriodSeconds: 5 so the SIGTERM trap added
    in the previous round actually has time to run before kubelet
    SIGKILLs the pod on rolling updates.

* Soften the 'variant CR will be rejected' claim. The actual failure
  mode on a current Cozystack release is silent: kubectl edit kubevirt
  is accepted, but the released virt-handler does not advertise
  SR-IOV VFs, so allocatable resources stay at zero with no error.

* Fix the VF count caveat. '16 VFs at -1Q, 2 at -24Q — total
  framebuffer divided by per-profile framebuffer' was wrong because
  PCIe SR-IOV VF count is the upper bound (hardware-dependent) and
  framebuffer division is a separate per-profile cap. State both
  correctly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…et for GPU-tainted nodes

Round 7 of review fixes. After this round the iterations have hit
diminishing returns — each pass surfaces a fresh batch of
documentation-quality concerns, none of which block the technical
correctness of the guide. Captured here are the substantive items.

* Add a citation URL for NVIDIA's Licensing User Guide on the
  ClientConfigToken file mode (0744). Two snippets in the doc had
  the same recommendation; both now link the source.

* Profile-loader DaemonSet additions:
  - tolerations matching the common nvidia.com/gpu NoSchedule taint
    so the DS actually schedules on GPU-tainted nodes (operators
    adjust the key to their tainting scheme).
  - priorityClassName: system-node-critical so the DS is not the
    first to be evicted under host memory pressure — losing it
    means VMs lose their GPU on the next reboot when
    current_vgpu_type resets.
  - First-failure logging for write rejections. Earlier the script
    silently swallowed all failures (rationale: 'VM holds the VF')
    which also hid persistent ConfigMap typos. Log once per bus,
    clear the flag on success so a recurring real failure surfaces
    again later.

* Align the passthrough VMInstance example to runStrategy: Always
  (was running: true). The Cozystack vm-instance chart accepts both,
  but runStrategy is the canonical schema field; running: true is a
  legacy fallback. Two examples in the same page should not
  demonstrate two different conventions for the same field.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…unify title

* After 'kind: VirtualMachine' → 'kind: VMInstance' a few rounds back,
  the example output around the apply / kubectl get vmi / virtctl
  console blocks still showed the old 'virtualmachines.apps.cozystack.io'
  resource path and 'virtual-machine-gpu' VM name. Cozystack's
  VMInstance kind produces 'vminstances.apps.cozystack.io/<name>' on
  apply and the underlying VirtualMachineInstance is named
  'vm-instance-<name>' (matching vm-image.md / vm-instance.md). Update
  the apply output and every virtual-machine-gpu reference in the
  shell prompts and console targets accordingly.

* Front-matter title used 'or', linkTitle used 'and' for the same
  page. Unify on 'and' across title / linkTitle / description.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei force-pushed the docs/gpu-vgpu-setup branch from 6ece350 to 4aaabe3 on April 30, 2026 11:53