[docs] Add vGPU setup guide for GPU sharing between VMs#467
Conversation
📝 Walkthrough

Documentation replaces the previous GPU-sharing overview with a focused vGPU (mediated device) guide covering prerequisites, NVIDIA vGPU licensing, the GPU Operator vgpu variant, the vGPU Manager image build, NVIDIA License Server wiring, KubeVirt mediated device config, VM examples, and vGPU profile details.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Code Review
This pull request provides comprehensive documentation for configuring NVIDIA vGPU sharing for virtual machines, including prerequisites, image building, operator installation, and licensing setup. The review feedback suggests clarifying the FeatureType parameter in the licensing configuration and updating the example command prompt to maintain consistency with the platform's virtual machine naming conventions.
```yaml
gridd.conf: |
  ServerAddress=nls.example.com
  ServerPort=443
  FeatureType=1
```
It is helpful to clarify what the FeatureType value represents to assist users in customizing their configuration. In the NVIDIA Grid configuration, 1 corresponds to the "NVIDIA vGPU" (vPC/vWS) feature, while 2 is for "NVIDIA Virtual Compute Server" (vCS).
Suggested change:

```diff
-FeatureType=1
+FeatureType=1 # 1 for vGPU
```
```console
ubuntu@gpu-vgpu:~$ nvidia-smi
```
For consistency with the GPU passthrough example (line 194) and Cozystack's default naming convention for virtual machine instances, the hostname in the command prompt should include the virtual-machine- prefix. Since the VM name is defined as gpu-vgpu on line 352, the resulting instance name is virtual-machine-gpu-vgpu.
Suggested change:

```diff
-ubuntu@gpu-vgpu:~$ nvidia-smi
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
```
🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)
360-385: Name the manifest file explicitly before the apply command.
`kubectl apply -f vmi-vgpu.yaml` appears without first labeling the YAML block as `vmi-vgpu.yaml` (unlike the earlier passthrough example). Adding a filename label right above the manifest would remove ambiguity for copy/paste users.

✏️ Suggested doc tweak:
```diff
 ### 5. Create a Virtual Machine with vGPU

+**vmi-vgpu.yaml**:
+
```

(The label goes immediately above the YAML manifest block that starts with `apiVersion: apps.cozystack.io/v1alpha1`, `kind: VirtualMachine`.)
Actionable comments posted: 1
🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)
279-288: Make the `imagePullSecrets` snippet fully qualified to prevent misplacement. At line 279, the snippet is context-trimmed and can be pasted under the wrong key. Please show the full values path to avoid broken Package configuration.
Proposed doc patch
```diff
-gpu-operator:
-  vgpuManager:
-    repository: registry.example.com/nvidia
-    version: "550.90.05"
-    imagePullSecrets:
-      - name: nvidia-registry-secret
+components:
+  gpu-operator:
+    values:
+      gpu-operator:
+        vgpuManager:
+          repository: registry.example.com/nvidia
+          version: "550.90.05"
+          imagePullSecrets:
+            - name: nvidia-registry-secret
```
### 5. Create a Virtual Machine with vGPU

```yaml
apiVersion: apps.cozystack.io/v1alpha1
appVersion: '*'
kind: VirtualMachine
metadata:
  name: gpu-vgpu
  namespace: tenant-example
spec:
  running: true
  instanceProfile: ubuntu
  instanceType: u1.medium
  systemDisk:
    image: ubuntu
    storage: 5Gi
    storageClass: replicated
  gpus:
    - name: nvidia.com/NVIDIA_L40S-24Q
  cloudInit: |
    #cloud-config
    password: ubuntu
    chpasswd: { expire: False }
```

```bash
kubectl apply -f vmi-vgpu.yaml
```

Once the VM is running, log in and verify the vGPU is available:

```bash
virtctl console virtual-machine-gpu-vgpu
```
Add an explicit VM readiness check before opening console.
After Line 384, jumping directly to virtctl console can fail intermittently if the VM/VMI is not ready yet. Add a wait/check step to keep the flow deterministic.
Proposed doc patch
```bash
kubectl apply -f vmi-vgpu.yaml
```

Wait until the VM instance is ready:

```bash
kubectl get vmi -n tenant-example -w
```

Once the VM is running, log in and verify the vGPU is available:
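The same review comment also suggests `kubectl wait` as a blocking alternative; a minimal sketch, assuming the VMI keeps the manifest's `gpu-vgpu` name in the `tenant-example` namespace:

```bash
# Block until the VirtualMachineInstance reports Ready, then open the console.
kubectl wait --for=condition=Ready vmi/gpu-vgpu -n tenant-example --timeout=10m
virtctl console virtual-machine-gpu-vgpu
```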
…ording

Mirroring the corresponding cleanup in cozystack/website#467 review:

* KubeVirt v1.9.0 ETA: drop the 'July 2026' date — KubeVirt does not publish hard release dates and the in-page wording was inconsistent across the file. Replace with 'Targeted at the next minor release (v1.9.0); track the PR for the actual release tag', matching the website doc.
* Kernel-headers timing: 'The build downloads kernel headers at container start time' conflated docker build with pod start; the Dockerfile's entrypoint downloads them at runtime. Reword.
* Replace guillemets with ASCII quotes for consistency with the rest of the file and the upstream chart docs.
* nvidia-smi -q license check: drop the unnecessary -A 1 — License Status value is on the same line as the field name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
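As an illustration of the last bullet, the status value sits on the same line as the field name, so a plain grep is enough; a sketch (exact output wording varies by driver branch):

```console
$ nvidia-smi -q | grep "License Status"
    License Status                    : Licensed (Expiry: ...)
```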
Add practical instructions for deploying GPU Operator with vGPU variant:

- Building proprietary vGPU Manager container image
- Deploying with vgpu variant via Package CR
- NLS license server configuration
- KubeVirt mediatedDeviceTypes setup
- vGPU profile reference table for L40S
- VM creation example with vGPU resource

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file.

Add EULA warning about public redistribution of vGPU driver images.

Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
- Switch licensing config from ConfigMap to Secret (configMapName deprecated)
- Add FeatureType comment explaining values (1=vGPU, 2=vCS)
- Fix console hostname to match Cozystack naming convention (virtual-machine- prefix)

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
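A rough sketch of the Secret-based wiring this commit describes; the key name follows the configMapName-to-secretName migration noted in a later round, and the Secret name is an assumption:

```yaml
# gpu-operator values fragment (sketch only; verify against the chart in use).
gpu-operator:
  driver:
    licensingConfig:
      secretName: licensing-config   # replaces the deprecated configMapName
      # the referenced Secret carries gridd.conf (FeatureType, server details)
```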
The original draft documented mediated devices (mdev) — that path does not apply to current NVIDIA data-centre GPUs (L4, L40, L40S, B100). On the vGPU 17/20 driver branch these GPUs use SR-IOV with per-VF NVIDIA sysfs; KubeVirt advertises VFs via permittedHostDevices.pciHostDevices, not mediatedDevices. This commit rewrites the vGPU section accordingly:

- Add a driver-model note at the top making the mdev/SR-IOV split explicit. Pascal–Ampere readers are sent to upstream NVIDIA docs.
- Replace mediatedDeviceTypes / mediatedDevices / gpus: with pciHostDevices / hostDevices throughout.
- Document KubeVirt v1.9.0 (ETA July 2026, kubevirt/kubevirt#16890 shipping in that release) as a hard prerequisite — current released tags v1.6.x/1.7.x/1.8.x do not include the patch and backports are not planned.
- Replace the legacy NLS block (ServerAddress=, ServerPort=7070, FeatureType=1) with the DLS ClientConfigToken flow.
- Switch build instructions to the upstream NVIDIA repo (github.com/NVIDIA/gpu-driver-container) — the gitlab repo is archived.
- Note explicitly that Talos is not recommended for vGPU (NVIDIA redistribution restrictions, siderolabs/extensions#461 closed won't-fix); passthrough on Talos is unaffected.
- Add a warning that a 2.4 GiB containerDisk overlay is too small to install the GRID guest driver in-place — recommend a CDI DataVolume of 20 GiB+ for non-throwaway tests.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
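A condensed sketch of the pciHostDevices wiring this rewrite describes, using the L40S device ID mentioned later in this PR; the resource name is an assumption:

```yaml
# KubeVirt CR fragment (under spec.configuration): advertise SR-IOV vGPU VFs.
permittedHostDevices:
  pciHostDevices:
    - pciVendorSelector: "10DE:26B9"             # L40S PF/VF vendor:device pair
      resourceName: nvidia.com/NVIDIA_L40S-24Q   # assumed resource name
```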
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323 docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The vGPU Manager DaemonSet works, but profile assignment is currently out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and resets on every reboot — without an automated mechanism the variant silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload to vm-vgpu out of the box (cozystack#2323), so per-node labelling is optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart reads it per-component (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …); placing it at the package root would silently render-no-op and the DaemonSet would ImagePullBackOff without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the driver-model commentary applies to both architectures, but they should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID (10de:26b9), so a single pciVendorSelector matches both. Add lspci sanity-check guidance for other generations where PF/VF IDs may differ.
* Add a migration note in the Licensing section: chart v25 → v26 deprecated driver.licensingConfig.configMapName in favour of secretName; old key still works but emits a deprecation marker in the CRD. SR-IOV vGPU does not consume the host-side licensing knob at all — relevant only for passthrough operators carrying old configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
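The out-of-band assignment in the first bullet looks roughly like this on the host; the PCI address is a placeholder and the numeric profile id must be read from the creatable list on the actual hardware (a later round also switches the write to `printf '%s'` to avoid a trailing newline):

```bash
# List assignable profile ids for one VF, then set one (placeholder values).
cat /sys/bus/pci/devices/0000:3b:00.4/nvidia/creatable_vgpu_types
echo "<profile-id>" > /sys/bus/pci/devices/0000:3b:00.4/nvidia/current_vgpu_type
```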
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0 (unreleased) — neither is in any current Cozystack release. Without this banner the page reads like a supported workflow.
* Drop the incorrect claim that gpus: expects an mdev resource. The upstream KubeVirt VirtualMachine spec accepts both PCI and mdev resource names under either gpus: or hostDevices:. Restate the hostDevices vs gpus distinction as convention, not a hard constraint, and explain the cozystack/v1alpha1 wrapper trade-off.
* Drop the self-contradicting parenthetical that called pciVendorSelector 'the device ID, not the vendor:device pair' — the example right above already uses the vendor:device tuple (10DE:26B9), and the upstream KubeVirt godoc documents the field exactly as that tuple.
* Re-quote siderolabs/extensions#461. The issue was closed with stateReason=COMPLETED, not not_planned; the literal phrase 'won't fix' does not appear in the closure. Replace with the substantive reason from rothgar's closing comment.
* Fix pre-existing PCI ID error in the passthrough lspci sample output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the prior text said 10de:26b9 which is L40S. The rest of the passthrough section already uses 2236 correctly, so this was a copy-paste artefact.
* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish hard release dates; replace 'ETA July 2026' with 'targeted for the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
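For reference, the two attachment surfaces contrasted in the second bullet look like this in a raw kubevirt.io/v1 spec; the resource name is assumed from the L40S profile used elsewhere in this PR:

```yaml
# VirtualMachine devices fragment: both fields accept PCI and mdev resource names.
domain:
  devices:
    gpus:                                        # optionally carries virtualGPUOptions
      - name: vgpu0
        deviceName: nvidia.com/NVIDIA_L40S-24Q
    # hostDevices:                               # alternative surface, no display options
    #   - name: vgpu0
    #     deviceName: nvidia.com/NVIDIA_L40S-24Q
```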
…ofile-loader DaemonSet skeleton

Round 2 of review fixes for the vgpu section.

Blocking changes:

* Rewrite the sample VirtualMachine as a runnable end-to-end example. The previous manifest used a 2.4 GiB containerDisk (too small for in-VM driver install), had no cloud-init credentials, and provided no path for the licensing token to land on the guest filesystem. The new manifest uses a CDI DataVolume sized at 20 GiB, ships a cloudInitNoCloud disk that drops the ClientConfigToken and gridd.conf into /etc/nvidia/, installs build-essential / dkms / linux-headers-generic via cloud-init packages, and wires an SSH key for virtctl ssh access. Move the rootfs-overflow warning above the manifest where the reader can act on it.
* Add a minimal profile-loader DaemonSet snippet to step 3. Previously the doc warned that profile assignment is out-of-band but offered no concrete bridge from 'manual kubectl exec' to something repeatable. The new ConfigMap + DaemonSet skeleton reads bus-id=profile-id pairs and writes them through host /sys, so a reader has a starting point that survives node reboots.
* Tighten the gpus: vs hostDevices: claim. The KubeVirt API accepts both for PCI and mdev resource names; the runtime semantics differ (gpus: adds virtio-vga display, hostDevices: does not). State this honestly instead of asserting strict equivalence.
* Clarify that whether the cozystack apps.cozystack.io/v1alpha1 wrapper's gpus: field correctly resolves SR-IOV vGPU resource names is not yet validated; raw kubevirt.io/v1 is the safe path for now.

Polish:

* KubeVirt v1.9.0 wording: drop the 'July 2026' guess (KubeVirt does not publish hard release dates) and unify across the page.
* registry.example.com: call out as RFC 2606 placeholder so readers do not copy-paste it.
* Kernel-headers timing: 'build downloads' was wrong — the entrypoint downloads them at pod start, not docker build.
* nvidia-smi license grep: drop unnecessary -A 1.
* PCI ID example for A10 in the passthrough section: was 10de:26b9 (which is L40S); fixed to 10de:2236.
* Re-quote siderolabs/extensions#461 closure with the actual reason (NVIDIA does not publicly distribute the vGPU guest driver) rather than the editorial 'won't fix' phrasing.
* Replace guillemets with ASCII quotes for consistency with the rest of the docs site.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ures

Round 3 of review fixes for the vgpu guide.

Blocking fixes:

* Profile-loader DaemonSet skeleton: drop fragile parsing that broke on whitespace lines / inline ConfigMap comments, replace 'sleep infinity' with a periodic re-apply loop so the skeleton copes with PCIe re-enumeration during the pod's lifetime, mark the volumes block read-only, and call out in the prose that production-grade implementations need additional safeguards.
* Replace ClientConfigToken file mode 0600 with 0744 per the NVIDIA Licensing User Guide. nvidia-gridd does not necessarily run as the file owner; 0600 silently breaks license activation.
* Cloud-init users: block: containerDisks/ubuntu pre-provisions the ubuntu user, so redefining it via users: silently ignores the ssh_authorized_keys block. Use top-level ssh_authorized_keys instead and document the gotcha inline.
* Replace the literal '1155' profile id in the example with a <profile-id> placeholder and warn that numeric IDs are driver-version-dependent and must be read from /sys/.../creatable_vgpu_types on the actual hardware.

Worth-fixing items:

* Inline imagePullSecrets as a commented-out variant inside the main Package CR snippet rather than a disconnected second snippet, so readers do not paste the first block and miss the auth.
* Add a one-sentence merge note for 'kubectl edit kubevirt' so readers do not overwrite passthrough pciHostDevices when adding vGPU entries.
* Add concrete .run delivery instructions (virtctl scp + virtctl ssh) so the guest-driver install step is reproducible.
* Document the runStrategy: Always vs running: true difference between raw kubevirt.io/v1 and the cozystack wrapper.
* Explain the Q/A/B profile suffix taxonomy in the profile table intro so readers know why the table only shows Q variants.
* De-duplicate the v1.9.0 / kubevirt#16890 paragraph that appeared verbatim in two places — keep the canonical version in Prerequisites, link to it from the top-of-section banner.

Polish:

* Split the Talos non-recommendation off into its own bullet under Prerequisites instead of a paragraph-long mixed sentence.
* Drop the RFC 2606 meta-commentary on registry.example.com — just tell the reader to replace it.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
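Pulling the round 2 and 3 cloud-init pieces together, a trimmed sketch of the user-data; paths follow the standard GRID guest locations, and the token, key, and FeatureType value are placeholders:

```yaml
#cloud-config
# Top-level ssh_authorized_keys (a users: block would be ignored for the
# pre-provisioned ubuntu user, per the gotcha noted above).
ssh_authorized_keys:
  - ssh-ed25519 AAAA... user@example
write_files:
  - path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok
    permissions: "0744"                 # per the NVIDIA Licensing User Guide
    content: |
      <paste the ClientConfigToken here>
  - path: /etc/nvidia/gridd.conf
    content: |
      FeatureType=1
packages:                               # build deps for the GRID .run installer
  - build-essential
  - dkms
  - linux-headers-generic
```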
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the page now also covers vGPU as a co-equal section. Update title / linkTitle / description / lead paragraph to reflect both flows and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections (step 4 still had 0600, contradicting step 6's 0744 with the NVIDIA-cited justification). 0600 silently breaks license activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures: read-before-write so manual out-of-band changes are visible (and so the kernel does not reject writes while a VM holds the VF and the script logs an error every minute on every busy VF), strip trailing #-comments from ConfigMap lines, log on malformed lines instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs, manual kubectl exec changes to current_vgpu_type are reverted within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20' framing was misleading because the SR-IOV / pciHostDevices flow pairs specifically with branch 20 (driver 595.x); the example build is a 595.x .run, and a reader on a 17.x subscription would build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine wrapper passes deviceName straight through to KubeVirt; what's missing for SR-IOV vGPU is the hostDevices: surface for headless setups and end-to-end exercise on real hardware. Be precise about which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU operator does not 'fix' their setup by switching to secretName and then wonder why nothing changed.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e for production
Round 5 of review fixes for the gpu page.
Blocking fixes:
* Fix the passthrough example kind: it was VirtualMachine, but the
Cozystack apps.cozystack.io/v1alpha1 wrapper kind is VMInstance
(verified against vm-instance-rd in cozystack/cozystack and against
every other doc in content/en/docs/v1/virtualization/). The vGPU
section anchors its commentary on the wrapper to this example, so
the wrong kind there propagated into the new section by reference.
Drop appVersion: '*' which is not a VMInstance field.
* Replace the fabricated 'gpus: adds virtio-vga display semantics'
claim with the actual difference: gpus: carries an optional
virtualGPUOptions field whose display.enabled defaults to true.
hostDevices: has no such field. Verified against
kubevirt/kubevirt staging/src/kubevirt.io/api/core/v1/schema.go.
* DataVolume in the vgpu example was missing storageClassName; on a
cluster without a default StorageClass the import pod hangs
Pending. Pin to 'replicated' (matching the passthrough example)
with a comment that operators should adjust.
* Add a systemctl enable --now nvidia-gridd step after the .run
install. The .run installs the unit but does not necessarily
start it on first boot, so a reader following the doc end-to-end
saw 'Unlicensed' even with a correct token — not a token issue,
the daemon was simply not running.
* Profile-loader DaemonSet hardening:
- SIGTERM/SIGINT trap so kubelet does not need to SIGKILL after
terminationGracePeriodSeconds on rolling updates.
- 'sleep & wait' so the trap fires immediately mid-sleep.
- printf '%s' instead of echo so no trailing newline reaches
sysfs (some driver versions reject 'invalid argument').
* docker build: add --platform linux/amd64 explicitly. Apple Silicon
build hosts produce arm64 images native, and the kubelet pull on
amd64 GPU nodes fails with 'no matching manifest'. Also add a
one-line docker login note.
* virtctl scp / virtctl ssh: pass --namespace tenant-example
explicitly. virtctl defaults to 'default', not the VM's namespace,
on a multi-tenant cluster the commands as previously written
fail with 'not found in namespace default'.
Worth-fixing:
* Soften the Q/A/B profile family description: not all suffix
variants are available on all GPUs; partition sizes vary per
GPU and per family.
* VF count caveat: '16 VFs per L40S' was misleading because Ada
Lovelace VF count is profile-dependent (24Q → 2 VFs). State the
framebuffer-division relationship explicitly.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
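A compressed sketch of the loader's write path after these hardening rounds; `VF_SYSFS` and `VGPU_TYPE` are placeholder variables, not names from the actual doc:

```bash
#!/bin/sh
# Read before write, newline-free write, and a trap so SIGTERM lands mid-sleep.
trap 'exit 0' TERM INT
while true; do
  cur=$(cat "$VF_SYSFS/current_vgpu_type")
  if [ "$cur" != "$VGPU_TYPE" ]; then
    printf '%s' "$VGPU_TYPE" > "$VF_SYSFS/current_vgpu_type" \
      || echo "failed to set $VGPU_TYPE on $VF_SYSFS" >&2
  fi
  sleep 60 & wait $!
done
```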
…VIDIA upstream

NVIDIA's gpu-driver-container repository owns the build path; keeping a parallel set of docker build / push commands in this docs page duplicates upstream documentation and goes stale every time their Dockerfile or build args change. Replace the in-line build snippet with a paragraph linking to the upstream repo and the NVIDIA Licensing Portal as the sources of truth; keep only the Cozystack-specific bits (private registry / EULA note).

This also resolves a class of review feedback that asked for incrementally more build-side detail in the doc (--platform, docker login, --build-arg names, etc.) — none of that belongs here when NVIDIA already documents it in their repo.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… pod spec
Round 6 of review fixes for the vgpu section.
* Rename '### Prerequisites' (vGPU section) to '### vGPU Prerequisites'
so Hugo does not collide its slug with '## Prerequisites' from the
passthrough section above. The previous link from the top-of-section
alert silently landed on passthrough prerequisites; now it points
at #vgpu-prerequisites where it belongs.
* Drop the 'running: true vs runStrategy: Always — both mean the same
thing' aside. The Cozystack VMInstance wrapper publishes runStrategy
in its OpenAPI schema; running: true survives only as an
undocumented compat fallback. Documenting the equivalence locks
readers into a deprecated path. The two examples already use
different fields because they are different kinds, which speaks
for itself.
* Add a 'Last verified' line in the experimental-status alert with
concrete versions: KubeVirt main nightly, cozystack#2323 head,
vGPU 20.0 host driver, GRID 595.58.03 guest driver. Future readers
can tell whether the guide still applies to their cluster.
* Profile-loader DaemonSet:
- Add resources.requests / limits (cpu 10m, memory 16Mi/32Mi) so
a privileged pod on every GPU node has at least minimal kubelet
protection against OOM under host memory pressure.
- Add terminationGracePeriodSeconds: 5 so the SIGTERM trap added
in the previous round actually has time to run before kubelet
SIGKILLs the pod on rolling updates.
* Soften the 'variant CR will be rejected' claim. The actual failure
mode on a current Cozystack release is silent: kubectl edit kubevirt
is accepted, but the released virt-handler does not advertise
SR-IOV VFs, so allocatable resources stay at zero with no error.
* Fix the VF count caveat. '16 VFs at -1Q, 2 at -24Q — total
framebuffer divided by per-profile framebuffer' was wrong because
PCIe SR-IOV VF count is the upper bound (hardware-dependent) and
framebuffer division is a separate per-profile cap. State both
correctly.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…et for GPU-tainted nodes
Round 7 of review fixes. After this round the iterations have hit
diminishing returns — each pass surfaces a fresh batch of
documentation-quality concerns, none of which block the technical
correctness of the guide. Captured here are the substantive items.
* Add a citation URL for NVIDIA's Licensing User Guide on the
ClientConfigToken file mode (0744). Two snippets in the doc had
the same recommendation; both now link the source.
* Profile-loader DaemonSet additions:
- tolerations matching the common nvidia.com/gpu NoSchedule taint
so the DS actually schedules on GPU-tainted nodes (operators
adjust the key to their tainting scheme).
- priorityClassName: system-node-critical so the DS is not the
first to be evicted under host memory pressure — losing it
means VMs lose their GPU on the next reboot when
current_vgpu_type resets.
- First-failure logging for write rejections. Earlier the script
silently swallowed all failures (rationale: 'VM holds the VF')
which also hid persistent ConfigMap typos. Log once per bus,
clear the flag on success so a recurring real failure surfaces
again later.
* Align the passthrough VMInstance example to runStrategy: Always
(was running: true). The Cozystack vm-instance chart accepts both,
but runStrategy is the canonical schema field; running: true is a
legacy fallback. Two examples in the same page should not
demonstrate two different conventions for the same field.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
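For orientation, the scheduling and resource guards from rounds 6 and 7 add up to roughly this pod-spec fragment; the taint key is the common one named in the commit and should be adjusted to the cluster's tainting scheme, and the container name is an assumption:

```yaml
# Profile-loader DaemonSet pod-spec fragment (values taken from the commit text).
spec:
  priorityClassName: system-node-critical
  terminationGracePeriodSeconds: 5
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vgpu-profile-loader        # assumed container name
      resources:
        requests:
          cpu: 10m
          memory: 16Mi
        limits:
          memory: 32Mi
```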
…unify title

* After 'kind: VirtualMachine' → 'kind: VMInstance' a few rounds back, the example output around the apply / kubectl get vmi / virtctl console blocks still showed the old 'virtualmachines.apps.cozystack.io' resource path and 'virtual-machine-gpu' VM name. Cozystack's VMInstance kind produces 'vminstances.apps.cozystack.io/<name>' on apply and the underlying VirtualMachineInstance is named 'vm-instance-<name>' (matching vm-image.md / vm-instance.md). Update the apply output and every virtual-machine-gpu reference in the shell prompts and console targets accordingly.
* Front-matter title used 'or', linkTitle used 'and' for the same page. Unify on 'and' across title / linkTitle / description.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
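Illustratively, with the passthrough VM named gpu, the aligned names look roughly like this (exact kubectl output wording may differ):

```console
$ kubectl apply -f vm.yaml
vminstance.apps.cozystack.io/gpu created
$ virtctl console vm-instance-gpu --namespace tenant-example
ubuntu@vm-instance-gpu:~$
```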
Force-pushed from 6ece350 to 4aaabe3.
What this PR does
Adds a practical guide for running VMs with NVIDIA vGPU on Cozystack to the existing GPU passthrough page.
The guide covers the SR-IOV vGPU path used by current data-centre GPUs (L4, L40, L40S, B100) on the vGPU 20.x driver branch. The mediated-devices path used by older GPUs (Pascal–Ampere) is explicitly out of scope and the reader is pointed at upstream NVIDIA docs.
Steps documented:
- Building the vGPU Manager image from `github.com/NVIDIA/gpu-driver-container` (the older `gitlab.com/nvidia/container-images/driver` is archived).
- Deploying the `vgpu` variant via Package CR (depends on "[gpu-operator] Update to v26.3.1 and add experimental vGPU variant" cozystack#2323).
- Assigning vGPU profiles per VF (the `current_vgpu_type` sysfs), with a periodic profile-loader DaemonSet skeleton and an explicit experimental warning.
- Licensing through the DLS `ClientConfigToken` flow (the legacy `ServerAddress=` / `ServerPort=7070` flow no longer applies).
- Advertising the VFs to KubeVirt via `permittedHostDevices.pciHostDevices` (after "vGPU: SRIOV support" kubevirt/kubevirt#16890; first stable KubeVirt release with the patch is targeted at v1.9.0).
- An end-to-end `VirtualMachine` (raw `kubevirt.io/v1`) using a CDI `DataVolume` so the rootfs has room for in-VM driver install, with a `cloudInitNoCloud` disk that drops the licensing token, `gridd.conf`, an SSH key, and the build dependencies.
- The rootfs-overflow failure mode (`SIGBUS` from a write into an mmap of a file the kernel could no longer extend); the new example sidesteps it via `DataVolume`.

Release note