[gpu-operator] Update to v26.3.1 and add experimental vGPU variant#2323
Conversation
Important: Review skipped (draft detected). Configuration used: Path: .coderabbit.yaml; Review profile: CHILL; Plan: Pro; Run ID: …
📝 Walkthrough: Adds vGPU support to the GPU Operator (new …). Changes: …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin
    participant BuildServer
    participant Registry
    participant GPU_Operator
    participant NLS as "NVIDIA License Server"
    participant KubeVirt
    Admin->>BuildServer: Provide NVIDIA .run files & EULA
    BuildServer->>Registry: Build & push vgpu-manager driver image
    Admin->>GPU_Operator: Install Package CR with variant=vgpu (values-talos-vgpu)
    GPU_Operator->>Registry: Pull vgpu-manager image
    GPU_Operator->>NLS: Reference licensing Secret (gridd.conf) -> configure licensing
    Admin->>KubeVirt: Create VirtualMachine spec referencing vGPU resource
    KubeVirt->>GPU_Operator: Request mediated device allocation on host
    GPU_Operator->>Registry: Ensure driver/kernel modules loaded (vgpuManager)
    GPU_Operator->>KubeVirt: Expose vGPU resource to VM (permittedHostDevices/mediatedDeviceTypes)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request introduces NVIDIA vGPU support, updates the GPU Operator to version v26.3.0, and upgrades Node Feature Discovery (NFD) to 0.18.3. Key changes include the addition of a vgpu package variant, transition to HTTP-based health probes in NFD, and the adoption of a dedicated utility for CRD management. Feedback suggests updating the documentation to use Secrets for licensing configuration to align with security best practices and removing the deprecated cdi.default field from the ClusterPolicy template.
1. Create a ConfigMap with the NLS client configuration in the `cozy-gpu-operator` namespace:

   ```yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: licensing-config
     namespace: cozy-gpu-operator
   data:
     gridd.conf: |
       ServerAddress=nls.example.com
       ServerPort=443
       FeatureType=1
   ```

2. Reference the ConfigMap in the Package values:

   ```yaml
   gpu-operator:
     vgpuManager:
       repository: registry.example.com/nvidia
       version: "550.90.05"
     driver:
       licensingConfig:
         configMapName: licensing-config
   ```
The example uses ConfigMap and configMapName for licensing configuration. However, the updated CRDs (nvidia.com_clusterpolicies.yaml and nvidia.com_nvidiadrivers.yaml) mark configMapName as deprecated in favor of secretName. It is recommended to update the documentation to use a Secret for licensing configuration to align with the new best practices for security.
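For reference, a Secret-based sketch of the same configuration (same placeholder NLS endpoint as the quoted example; only the object kind and the values key change):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
stringData:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

The Package values would then set `driver.licensingConfig.secretName: licensing-config` in place of `configMapName`.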
```yaml
{{- if .Values.cdi.default }}
default: {{ .Values.cdi.default }}
{{- end }}
```
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/gpu-vgpu.md`:
- Around line 81-109: The doc incorrectly instructs creating a ConfigMap and
using driver.licensingConfig.configMapName; update it to create a Secret (kind:
Secret) containing the gridd.conf under stringData and change the package values
example to use driver.licensingConfig.secretName: licensing-config so the
chart/schema picks up the NLS licensing configuration.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`:
- Around line 97-100: The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`:
- Around line 917-924: The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 51f9626b-f0b3-498a-82e0-c845bb0b24ad
⛔ Files ignored due to path filters (1)
`packages/system/gpu-operator/charts/gpu-operator/Chart.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (32)
- docs/gpu-vgpu.md
- packages/core/platform/sources/gpu-operator.yaml
- packages/system/gpu-operator/Makefile
- packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/README.md
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/_helpers.tpl
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/prometheus.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/topologyupdater.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/worker.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/values.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_clusterpolicies.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/cleanup_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/clusterpolicy.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/extra-objects.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nodefeaturerules.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nvidiadriver.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/operator.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/role.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/rolebinding.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/serviceaccount.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/upgrade_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/validations.yaml
- packages/system/gpu-operator/charts/gpu-operator/values.yaml
- packages/system/gpu-operator/values-talos-vgpu.yaml
```yaml
{{- with .Values.master.resources }}
resources:
  {{- toYaml . | nindent 8 }}
{{- end }}
```
Resources block is misplaced outside the container spec.
The resources block is rendered at the pod spec level (after tolerations) rather than inside the nfd-master container definition. This would cause a Kubernetes API validation error since resources is not a valid pod spec field.
🐛 Proposed fix to move resources inside container spec

The resources block should be inside the container definition (after line 83, inside the nfd-master container):

```diff
       args:
         - "-prune"
       {{- if .Values.master.instance | empty | not }}
         - "-instance={{ .Values.master.instance }}"
       {{- end }}
+      {{- with .Values.master.resources }}
+      resources:
+        {{- toYaml . | nindent 12 }}
+      {{- end }}
       restartPolicy: Never
       {{- with .Values.master.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.affinity }}
       affinity:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.tolerations }}
       tolerations:
         {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- with .Values.master.resources }}
-      resources:
-        {{- toYaml . | nindent 8 }}
-      {{- end }}
 {{- end }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`
around lines 97 - 100, The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
```diff
 operator:
   description: |-
     Operator represents a key's relationship to the value.
-    Valid operators are Exists and Equal. Defaults to Equal.
+    Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
     Exists is equivalent to wildcard for value, so that a pod can
     tolerate all taints of a particular category.
+    Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
   type: string
```
🧩 Analysis chain
🌐 Web query:
Do Kubernetes Pod tolerations support `Lt` and `Gt` operators behind the `TaintTolerationComparisonOperators` feature gate, and in which Kubernetes versions is that feature available or enabled by default?
💡 Result:
Yes, Kubernetes Pod tolerations support the Lt (Less Than) and Gt (Greater Than) operators behind the TaintTolerationComparisonOperators feature gate. These operators enable numeric comparisons between taint values and toleration values (both must be valid 64-bit positive integers), allowing threshold-based scheduling. The feature is available starting in Kubernetes v1.35 as an alpha feature (disabled by default). It must be explicitly enabled via the --feature-gates=TaintTolerationComparisonOperators=true flag on the API server and kube-scheduler. As of 2026-04-02 (during v1.36 development), it remains alpha and is not enabled by default in any version.
Citations:
- 1: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration
- 2: https://github.com/kubernetes/api/blob/master/core/v1/toleration.go
- 3: https://kubernetes.io/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/
- 4: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
🏁 Script executed:

```shell
#!/bin/bash
# Search for Kubernetes version constraints in the repository
fd -t f -e yaml -e txt -e md | xargs grep -l "kubernetes" | head -20
# Also search for version info in common locations
if [ -f "Chart.yaml" ]; then echo "=== Chart.yaml ===" && cat Chart.yaml; fi
if [ -f "requirements.yaml" ]; then echo "=== requirements.yaml ===" && cat requirements.yaml; fi
if [ -f "README.md" ]; then echo "=== README.md (first 50 lines) ===" && head -50 README.md; fi
if [ -f "go.mod" ]; then echo "=== go.mod ===" && cat go.mod; fi
# Search for version constraints in the gpu-operator chart
if [ -d "packages/system/gpu-operator" ]; then
  echo "=== Files in gpu-operator ==="
  ls -la packages/system/gpu-operator/ | head -20
fi
```

Repository: cozystack/cozystack
Length of output: 10781
🏁 Script executed:

```shell
#!/bin/bash
# Search for references to TaintTolerationComparisonOperators or kubernetes version requirements
rg -i "kubernetes|k8s|version|feature.*gate|taint" --type yaml -A 2 -B 2 | head -100
```

Repository: cozystack/cozystack
Length of output: 13930
🏁 Script executed:

```shell
#!/bin/bash
# Check if this is a vendored chart by looking for upstream references
if [ -d "packages/system/gpu-operator/charts" ]; then
  echo "=== GPU Operator Chart Structure ==="
  find packages/system/gpu-operator/charts -type f -name "Chart.yaml" -o -name "values.yaml" | xargs head -20
fi
```

Repository: cozystack/cozystack
Length of output: 2228
Verify that the Lt/Gt toleration operators are actually usable on the Kubernetes versions the chart supports.
The GPU operator chart supports Kubernetes >= 1.16.0, but the TaintTolerationComparisonOperators feature gate is available only in Kubernetes v1.35+ and is disabled by default. Users can define valid NVIDIADriver specs with Lt/Gt operators that will be rejected by the API server on Kubernetes versions 1.16–1.34 or on v1.35+ without the feature gate explicitly enabled. Either update the CRD documentation to clarify this requirement or update the chart's kubeVersion constraint to reflect the actual compatibility.
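For illustration, a toleration using one of the numeric operators might look like the following (hypothetical taint key and value; the API server rejects this on clusters older than v1.35, or on v1.35+ without the gate enabled on kube-apiserver and kube-scheduler):

```yaml
tolerations:
- key: nvidia.com/driver-build   # hypothetical taint key
  operator: Gt                   # numeric comparison against the taint's value
  value: "550"                   # both taint and toleration values must be positive integers
  effect: NoSchedule
```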
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`
around lines 917 - 924, The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
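If the `kubeVersion` route were taken instead, the change would be a one-line Chart.yaml edit, sketched here (the `-0` suffix is the usual Helm idiom so pre-release Kubernetes versions still satisfy the constraint):

```yaml
kubeVersion: '>= 1.35.0-0'
```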
Testing Report: gpu-operator v26.3.0 on Talos Linux

Tested this PR on a Talos v1.12.6 cluster (3× VM nodes with NVIDIA A10 GPUs, Kubernetes v1.34.3).

Critical: Talos OS version parsing failure

The ClusterPolicy enters … This is a known bug in gpu-operator v26.3.0 (NVIDIA/gpu-operator#2239). The … Already fixed upstream in NVIDIA/gpu-operator#2244, released as v26.3.1 on 2026-04-18. The vendored chart should be bumped from v26.3.0 to v26.3.1. This affects all Talos deployments regardless of variant (…).

Secondary: nvidia-operator-validator permission error

On nodes where the validator pod was re-rolled, the … This happens during …

Summary

The primary blocker is the OS version parsing bug — any Talos cluster with GPU nodes will hit it. Bumping the vendored chart to v26.3.1 should resolve this. The validator permission issue needs verification after the bump.
force-pushed from 783b51b to 5a61a33
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323.

docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The vGPU Manager DaemonSet works, but profile assignment is currently out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and resets on every reboot — without an automated mechanism the variant silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload to vm-vgpu out of the box (cozystack#2323), so per-node labelling is optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart reads it per-component (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …); placing it at the package root would silently render as a no-op and the DaemonSet would ImagePullBackOff without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the driver-model commentary applies to both architectures, but they should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID (10de:26b9), so a single pciVendorSelector matches both. Add lspci sanity-check guidance for other generations where PF/VF IDs may differ.
* Add a migration note in the Licensing section: chart v25 → v26 deprecated driver.licensingConfig.configMapName in favour of secretName; the old key still works but emits a deprecation marker in the CRD. SR-IOV vGPU does not consume the host-side licensing knob at all — relevant only for passthrough operators carrying old configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0 (unreleased) — neither is in any current Cozystack release. Without this banner the page reads like a supported workflow.
* Drop the incorrect claim that gpus: expects an mdev resource. The upstream KubeVirt VirtualMachine spec accepts both PCI and mdev resource names under either gpus: or hostDevices:. Restate the hostDevices vs gpus distinction as convention, not a hard constraint, and explain the cozystack/v1alpha1 wrapper trade-off.
* Drop the self-contradicting parenthetical that called pciVendorSelector 'the device ID, not the vendor:device pair' — the example right above already uses the vendor:device tuple (10DE:26B9), and the upstream KubeVirt godoc documents the field exactly as that tuple.
* Re-quote siderolabs/extensions#461. The issue was closed with stateReason=COMPLETED, not not_planned; the literal phrase 'won't fix' does not appear in the closure. Replace with the substantive reason from rothgar's closing comment.
* Fix pre-existing PCI ID error in the passthrough lspci sample output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the prior text said 10de:26b9 which is L40S. The rest of the passthrough section already uses 2236 correctly, so this was a copy-paste artefact.
* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish hard release dates; replace 'ETA July 2026' with 'targeted for the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the page now also covers vGPU as a co-equal section. Update title / linkTitle / description / lead paragraph to reflect both flows and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections (step 4 still had 0600, contradicting step 6's 0744 with the NVIDIA-cited justification). 0600 silently breaks license activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures: read-before-write so manual out-of-band changes are visible (and so the kernel does not reject writes while a VM holds the VF and the script logs an error every minute on every busy VF), strip trailing #-comments from ConfigMap lines, log on malformed lines instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs, manual kubectl exec changes to current_vgpu_type are reverted within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20' framing was misleading because the SR-IOV / pciHostDevices flow pairs specifically with branch 20 (driver 595.x); the example build is a 595.x .run, and a reader on a 17.x subscription would build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine wrapper passes deviceName straight through to KubeVirt; what's missing for SR-IOV vGPU is the hostDevices: surface for headless setups and end-to-end exercise on real hardware. Be precise about which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU operator does not 'fix' their setup by switching to secretName and then wonder why nothing changed.
Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
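The read-before-write and comment-stripping behaviour described in that commit can be sketched as a small POSIX-sh reconcile step. This is an illustrative reconstruction, not the DaemonSet script shipped in the PR; the ConfigMap line format, the `SYSFS_BASE` override, and the type ID used below are assumptions.

```shell
#!/bin/sh
# Illustrative sketch only — not the PR's actual profile-loader script.
# Assumed ConfigMap line format: "<pci-address> <vgpu-type-id>  # optional comment"
# Assumed sysfs layout on SR-IOV vGPU hosts:
#   $SYSFS_BASE/<pci-address>/nvidia/current_vgpu_type
SYSFS_BASE="${SYSFS_BASE:-/sys/bus/pci/devices}"

apply_line() {
  orig_line=$1
  # Strip trailing #-comments and surrounding whitespace from the ConfigMap line.
  line=$(printf '%s' "$orig_line" | sed 's/#.*$//; s/^[[:space:]]*//; s/[[:space:]]*$//')
  [ -z "$line" ] && return 0
  set -- $line
  if [ "$#" -ne 2 ]; then
    echo "malformed line: $orig_line" >&2   # log instead of an opaque "failed to set"
    return 1
  fi
  f="$SYSFS_BASE/$1/nvidia/current_vgpu_type"
  # Read before write: skip VFs already at the desired type, so a VF held by a
  # running VM does not produce a rejected write on every reconcile tick.
  cur=$(cat "$f" 2>/dev/null)
  [ "$cur" = "$2" ] && return 0
  echo "$2" > "$f"
}
```

A reconcile loop would then feed each ConfigMap line through `apply_line` once a minute, which is also why manual sysfs edits are reverted within 60 s while the DaemonSet runs.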
Update vendored upstream chart to v26.3.0. Key changes:

- CDI enabled by default
- Driver version 580.126.20
- NFD subchart updated to 0.18.3
- Component versions bumped (toolkit v1.19.0, device-plugin v0.19.0, dcgm-exporter 4.5.2)
- Pin chart version in Makefile for reproducibility

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Add vGPU mode as a new variant alongside the existing passthrough (default) variant. This enables sharing a single physical GPU across multiple VMs using NVIDIA mediated devices.

Changes:

- Add values-talos-vgpu.yaml with vGPU Manager and vGPU Device Manager enabled
- Add 'vgpu' variant to PackageSource
- Add documentation for vGPU setup (driver image, NLS licensing, profiles, KubeVirt)

Users select the vgpu variant via Package CR and provide their proprietary vGPU Manager image coordinates through component values.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file. Add EULA notice about public redistribution of vGPU driver images. Add note about NLS ServerPort being deployment-dependent. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
ConfigMap-based licensing configuration (configMapName) is deprecated in GPU Operator v26.x. Switch to Secret-based configuration (secretName) as recommended by the updated CRD schema. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pull latest patch release: GDRCopy v2.5.2, Kata Sandbox Device Plugin v0.0.3, multi-arch digest fix for KubeVirt GPU Device Plugin (no longer amd64-only), and OS release name detection fix. NFD remains v0.18.3. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ncy, DLS-based licensing

Rewrites docs/gpu-vgpu.md to match the actual NVIDIA driver model on current data-centre GPUs. Key changes:

- Drop the implicit assumption that vGPU uses mediated devices. Ada Lovelace and newer (L4, L40, L40S, B100) on the vGPU 17/20 branch use SR-IOV with per-VF NVIDIA sysfs (current_vgpu_type); KubeVirt advertises them via permittedHostDevices.pciHostDevices.
- Document KubeVirt PR #16890 as a hard dependency for the SR-IOV path. The PR is in main since 2026-04-10 and will ship in v1.9.0 (ETA July 2026); released tags v1.6.x/1.7.x/1.8.x do not include it and backports are not planned.
- Replace the legacy NLS configuration block (ServerAddress=, ServerPort=7070, FeatureType=1) with the current DLS flow: ClientConfigToken file in /etc/nvidia/ClientConfigToken/, gridd.conf with FeatureType=0 (auto-detect).
- Switch the build instructions from the archived gitlab.com/nvidia/container-images/driver to the upstream github.com/NVIDIA/gpu-driver-container.
- Switch sample VirtualMachine spec from gpus: (mdev resource) to hostDevices: (PCI resource).
- Note explicitly that Talos is not recommended for the vGPU variant — the proprietary .run cannot be redistributed in a Talos system extension, and siderolabs/extensions#461 was closed won't-fix. Passthrough on Talos is unaffected.

Rename packages/system/gpu-operator/values-talos-vgpu.yaml to values-vgpu.yaml — the overrides (sandbox workloads, pod-side driver and device plugin disabled) are not Talos-specific, they apply to any VM-only GPU node. Update PackageSource accordingly.

Default vgpuDeviceManager.enabled: false. The DaemonSet walks /sys/class/mdev_bus/, which does not exist on Ada+ — keeping it on by default would crashloop modern hardware. Operators of older GPUs (Pascal/Volta/Turing/Ampere) can flip the flag back on.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ng example
Addresses follow-up review of the vGPU variant + chart bump.
* values.yaml: pin ccManager.enabled=false and vgpuDeviceManager.enabled=false.
Upstream chart v26.x flipped both to true. ccManager auto-enables CC
mode on Hopper-class hardware after upgrade — a non-trivial behavior
change that should be opt-in. vgpuDeviceManager walks
/sys/class/mdev_bus/, which does not exist on Ada+ (L4/L40/L40S/B100
with vGPU 17/20). Pinning both off keeps the default footprint as it
was on v25.x; users explicitly opt in.
* hack/dcgm-default-counters.csv: refresh from dcgm-exporter tag
4.5.1-4.8.0 (matches the image bumped earlier in this PR). The
previous snapshot was pinned to 4.1.1-4.0.4. Adds DCGM_FI_DEV_FB_RESERVED
and other fields introduced upstream. Header comment updated.
hack/check-gpu-recording-rules.bats passes against the refreshed
fixture.
* docs/gpu-vgpu.md (Licensing section):
- Drop the wrong recommendation to wire the licensing Secret through
gpu-operator's driver.licensingConfig.secretName. That mounts into
the host driver pod, where SR-IOV vGPU does not consume a license.
Document guest-side delivery (cloud-init write_files for
/etc/nvidia/ClientConfigToken/ and /etc/nvidia/gridd.conf) end to
end.
- Correct FeatureType comment. Per the NVIDIA Virtual GPU Software
Licensing User Guide, FeatureType=0 is the unlicensed state, not
auto-detect; FeatureType=1 is the value that asks the driver to
auto-select the license type from the configured vGPU profile.
Switch the example to FeatureType=1 and list all four valid values
with their meaning.
- Add nvidia-smi verification snippet inside the guest.
* docs/gpu-vgpu.md (KubeVirt section): note that L40S SR-IOV VFs share
the PF device ID (10de:26b9), so a single pciVendorSelector matches
both. Recommend lspci -nn -d 10de: as a sanity check on other
generations where PF/VF IDs may differ.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
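For illustration, the KubeVirt fragment implied by that L40S note might look like this (a sketch: `resourceName` is a placeholder, and 10DE:26B9 is the L40S PF/VF device ID quoted in the commit text):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:26B9"      # matches both PF and VFs on L40S
        resourceName: nvidia.com/L40S_vGPU  # placeholder resource name
```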
…path, mark vgpu variant experimental

Addresses second review pass. Three blockers + one polish:

* values-vgpu.yaml: set sandboxWorkloads.defaultWorkload=vm-vgpu. Without this the upstream default 'container' applies on any node without the per-node nvidia.com/gpu.workload.config label, the operator selects the container workload (driver/devicePlugin DaemonSets), and those are disabled in this variant — so nothing runs on the node and vgpu-manager never starts. The cluster-wide default makes the variant functional out of the box; per-node labels still override it for mixed clusters.
* docs/gpu-vgpu.md: imagePullSecrets is per-component on this chart (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …). The Package CR example previously placed the field at gpu-operator: root where it would be silently ignored at render. Move it under vgpuManager: where the chart actually reads it.
* docs/gpu-vgpu.md, release note: mark the vgpu variant as experimental and call out the missing profile-assignment loop explicitly. Until an SR-IOV-aware controller writes current_vgpu_type per VF (and re-applies after every reboot — PCIe re-enumeration resets the value), the variant is not safe for production. The docs now lead with that warning instead of burying it in a caveat.
* Rename packages/system/gpu-operator/values-talos.yaml to values-passthrough.yaml. The overrides (sandboxWorkloads.enabled, driver and devicePlugin disabled) apply to any host OS using the vfio-pci passthrough path — not Talos specifically. Update PackageSource ref accordingly. Comment in the file documents the rename so future readers understand the prior name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
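The defaultWorkload fix described above corresponds to roughly this values fragment (a sketch; the key layout is the upstream chart's sandboxWorkloads block):

```yaml
sandboxWorkloads:
  enabled: true
  defaultWorkload: vm-vgpu   # upstream default is "container"
```

Per-node `nvidia.com/gpu.workload.config` labels still override this cluster-wide default in mixed clusters.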
… Ada/Blackwell

* examples/README.md and examples/values-native-talos.yaml header comment still pointed at values-talos.yaml after the rename to values-passthrough.yaml. Update both to the new name.
* values-vgpu.yaml: drop the leftover vgpuDeviceManager.config block. The chart only renders the config when vgpuDeviceManager.enabled is true, and we keep it false on Ada+; the orphan config.default key was scaffolding without effect.
* docs/gpu-vgpu.md and values.yaml comment: stop grouping B100 under Ada Lovelace. B100 is Blackwell; the SR-IOV vGPU mechanism applies to both architectures but they should be named correctly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…licy
Adds hack/check-gpu-operator-variants.bats. The test renders both
package variants with helm template and asserts the resulting
ClusterPolicy spec for the values that this PR cares about:
* ccManager.enabled=false (pinned in values.yaml; upstream v26 default
was flipped to true)
* vgpuDeviceManager.enabled=false (same — and on Ada+/Blackwell the
DaemonSet would crashloop because /sys/class/mdev_bus is empty)
* sandboxWorkloads.defaultWorkload — vm-passthrough on the default
variant, vm-vgpu on the vgpu variant; without these the upstream
'container' default makes either variant a silent no-op on
unlabelled GPU nodes
* vgpuManager.enabled=true on the vgpu variant; driver / devicePlugin
enabled=false on both variants (proprietary host driver path is
out of scope for both passthrough and SR-IOV vGPU)
Negative-tested locally: deleting any of these settings makes the
corresponding test fail, so it actually guards regressions on the
next chart bump.
Also addresses last-pass review comments:
* values-passthrough.yaml: set sandboxWorkloads.defaultWorkload=
vm-passthrough so the default variant is symmetric with the vgpu
variant — neither relies on per-node labels for the variant to
activate. Without this the upstream default 'container' would
apply on unlabelled GPU nodes and vfio-manager would never start.
* values.yaml: rewrite the vgpuDeviceManager comment to lead with
the load-bearing reason (mdev path missing on Ada+/Blackwell)
instead of the secondary 'stays at zero pods on passthrough'
rationalisation.
* docs/gpu-vgpu.md: add a 'Migrating from chart v25.x' subsection
in the Licensing section pointing at the configMapName ->
secretName rename. Old key still works upstream but emits a
deprecation warning; PR body release note already calls this out
but the docs were missing the migration breadcrumb.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ote siderolabs#461

Two technical inaccuracies caught during review:

* The claim 'gpus: expects an mdev resource and will not match a pciHostDevices entry' is wrong. Upstream KubeVirt (pkg/virt-launcher/virtwrap/device/hostdevice/gpu/hostdev.go) iterates both the PCI pool (v1.PCIResourcePrefix) and the mdev pool (v1.MDevResourcePrefix) and accepts whichever resolves. Reword to describe gpus: vs hostDevices: as a style convention rather than a hard technical constraint.
* siderolabs/extensions#461 was closed with stateReason=COMPLETED, not not_planned. Quoting the closure as 'won't fix' is editorial recasting that does not match the actual closing comment. Replace with the substantive reason from rothgar's closing comment: NVIDIA does not publicly distribute the vGPU guest driver, requires Enterprise Portal access, and Sidero cannot support vGPU until NVIDIA changes its licensing terms.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ording

Mirroring the corresponding cleanup in cozystack/website#467 review:

* KubeVirt v1.9.0 ETA: drop the 'July 2026' date — KubeVirt does not publish hard release dates and the in-page wording was inconsistent across the file. Replace with 'Targeted at the next minor release (v1.9.0); track the PR for the actual release tag', matching the website doc.
* Kernel-headers timing: 'The build downloads kernel headers at container start time' conflated docker build with pod start; the Dockerfile's entrypoint downloads them at runtime. Reword.
* Replace guillemets with ASCII quotes for consistency with the rest of the file and the upstream chart docs.
* nvidia-smi -q license check: drop the unnecessary -A 1 — the License Status value is on the same line as the field name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
NVIDIA's License System User Guide recommends 744 on the ClientConfigToken file so nvidia-gridd (which does not necessarily run as the file owner) can read it. Previous example used 0600, which can silently break license activation when nvidia-gridd runs as a service user. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Apple Silicon / arm64 build hosts produce arm64 images native; GPU nodes are amd64 and the kubelet pull fails with 'no matching manifest' when the image is built without --platform. Document the flag and add a one-line docker login note for private registries. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
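The documented flags might be used as follows (registry, image name, and tag are placeholders, not the actual build coordinates):

```shell
# One-time login for a private registry
docker login registry.example.com

# GPU nodes are amd64; without --platform an Apple Silicon host produces
# an arm64-only image and the kubelet pull fails with 'no matching manifest'
docker build --platform linux/amd64 \
  -t registry.example.com/nvidia/vgpu-manager:595.58.02 .
docker push registry.example.com/nvidia/vgpu-manager:595.58.02
```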
…ream NVIDIA's gpu-driver-container repository owns the build path — keeping a parallel set of build instructions in our docs duplicates upstream documentation and goes stale every time the upstream Dockerfile or build args change. Replace the docker build / docker push snippet with a paragraph pointing at the upstream repo and the NVIDIA Licensing Portal as the sources of truth, keep only the items that are Cozystack-specific (private-registry / EULA note). Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pin a concrete reference set so future readers can tell whether the guide is still accurate against the cluster they're running: KubeVirt main nightly 20260429_74d7c52588, this PR's vgpu variant, NVIDIA vGPU 20.0 host driver 595.58.02, GRID guest driver 595.58.03. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Free-floating 'NVIDIA recommends' claims are flagged as unverifiable in review. Add the doc URL inline so future readers can verify the permission model decision. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323 docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The vGPU Manager DaemonSet works, but profile assignment is currently out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and resets on every reboot — without an automated mechanism the variant silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload to vm-vgpu out of the box (cozystack#2323), so per-node labelling is optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart reads it per-component (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …); placing it at the package root would silently render-no-op and the DaemonSet would ImagePullBackOff without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the driver-model commentary applies to both architectures, but they should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID (10de:26b9), so a single pciVendorSelector matches both. Add lspci sanity-check guidance for other generations where PF/VF IDs may differ.
* Add a migration note in the Licensing section: chart v25 → v26 deprecated driver.licensingConfig.configMapName in favour of secretName; old key still works but emits a deprecation marker in the CRD. SR-IOV vGPU does not consume the host-side licensing knob at all — relevant only for passthrough operators carrying old configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
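The per-component imagePullSecrets placement called out above might look like this in Package CR component values (component keys follow the upstream chart; registry and secret names are illustrative):

```yaml
vgpuManager:
  enabled: true
  repository: registry.example.com/nvidia
  image: vgpu-manager
  version: "595.58.02"
  imagePullSecrets:        # read per-component; a package-root key is silently ignored
    - nvidia-registry-creds
```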
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0 (unreleased) — neither is in any current Cozystack release. Without this banner the page reads like a supported workflow.
* Drop the incorrect claim that gpus: expects an mdev resource. The upstream KubeVirt VirtualMachine spec accepts both PCI and mdev resource names under either gpus: or hostDevices:. Restate the hostDevices vs gpus distinction as convention, not a hard constraint, and explain the cozystack/v1alpha1 wrapper trade-off.
* Drop the self-contradicting parenthetical that called pciVendorSelector 'the device ID, not the vendor:device pair' — the example right above already uses the vendor:device tuple (10DE:26B9), and the upstream KubeVirt godoc documents the field exactly as that tuple.
* Re-quote siderolabs/extensions#461. The issue was closed with stateReason=COMPLETED, not not_planned; the literal phrase 'won't fix' does not appear in the closure. Replace with the substantive reason from rothgar's closing comment.
* Fix pre-existing PCI ID error in the passthrough lspci sample output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the prior text said 10de:26b9 which is L40S. The rest of the passthrough section already uses 2236 correctly, so this was a copy-paste artefact.
* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish hard release dates; replace 'ETA July 2026' with 'targeted for the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
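For illustration, the pciVendorSelector / permittedHostDevices wiring discussed above might be expressed like this in the KubeVirt CR (the resource name is a free choice; verify the vendor:device tuple with lspci -nn on your hardware):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        # L40S: PF and SR-IOV VFs share device ID 26b9, so one selector
        # matches both; other generations may need separate entries.
        - pciVendorSelector: "10DE:26B9"
          resourceName: nvidia.com/L40S
```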
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the page now also covers vGPU as a co-equal section. Update title / linkTitle / description / lead paragraph to reflect both flows and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections (step 4 still had 0600, contradicting step 6's 0744 with the NVIDIA-cited justification). 0600 silently breaks license activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures: read-before-write so manual out-of-band changes are visible (and so the kernel does not reject writes while a VM holds the VF and the script logs an error every minute on every busy VF), strip trailing #-comments from ConfigMap lines, log on malformed lines instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs, manual kubectl exec changes to current_vgpu_type are reverted within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20' framing was misleading because the SR-IOV / pciHostDevices flow pairs specifically with branch 20 (driver 595.x); the example build is a 595.x .run, and a reader on a 17.x subscription would build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine wrapper passes deviceName straight through to KubeVirt; what's missing for SR-IOV vGPU is the hostDevices: surface for headless setups and end-to-end exercise on real hardware. Be precise about which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU operator does not 'fix' their setup by switching to secretName and then wonder why nothing changed.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
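A condensed sketch of the hardened read-before-write loop described above (the sysfs layout and ConfigMap line format are assumptions for illustration, not the exact manifest from the PR):

```shell
#!/bin/sh
# Sketch: apply a desired vGPU profile id to a VF's current_vgpu_type file.
set -u

# Strip a trailing #-comment and surrounding whitespace from a config line;
# malformed/empty lines come back empty and can be logged by the caller.
clean_line() {
  l="${1%%#*}"
  echo "$l" | xargs
}

# Read before write: skip VFs already at the desired value, so busy VFs
# (the kernel rejects writes while a VM holds the VF) are not hammered
# with a failing write every cycle; log a warning instead of failing hard.
apply_profile() {
  f="$1"; want="$2"
  cur="$(cat "$f" 2>/dev/null || echo '?')"
  if [ "$cur" != "$want" ]; then
    echo "$want" > "$f" 2>/dev/null || echo "WARN: could not set $f (VF busy?)" >&2
  fi
}
```

The 60-second revert side-effect follows directly from this shape: any manual sysfs write that differs from the ConfigMap value fails the comparison on the next pass and is overwritten, so the ConfigMap is the only durable place to change a profile.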
Build job in CI failed because the test relied on bats-specific machinery the in-repo cozytest.sh runner does not provide:

* BATS_TEST_DIRNAME is undefined when the file is sourced as a plain shell script (cozytest.sh's mode); switch to BATS_TEST_FILENAME with a $0 fallback, matching hack/check-gpu-recording-rules.bats (REPO_ROOT pattern).
* setup()/teardown() are bats lifecycle hooks; cozytest.sh runs each @test in a fresh subshell with set -u and never invokes them. With TMP only assigned in setup(), every test exited immediately on 'TMP: unbound variable'. Inline the mktemp + trap inside each @test so both runners work the same.

Verified locally that bash hack/cozytest.sh hack/check-gpu-operator-variants.bats now reports all six tests OK.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
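The runner-agnostic pieces can be sketched as follows (illustrative, not the verbatim test file):

```shell
#!/bin/sh
set -u
# bats defines BATS_TEST_FILENAME; under 'bash hack/cozytest.sh' the file is
# sourced as plain shell where it is unset, so fall back to $0 (safe with
# set -u because ${VAR:-default} expands without tripping nounset).
test_file="${BATS_TEST_FILENAME:-$0}"
REPO_ROOT="$(cd "$(dirname "$test_file")" && pwd)"

# setup()/teardown() never run under cozytest.sh, so each test body does
# its own temp-dir creation and cleanup inline.
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
echo rendered > "$TMP/out.yaml"
```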
What this PR does
Updates the NVIDIA GPU Operator from v25.3.0 to v26.3.1 and adds an experimental `vgpu` variant for SR-IOV vGPU on Ada Lovelace and newer GPUs.

Changes:
* Add `values-vgpu.yaml` for the vGPU variant with `defaultWorkload: vm-vgpu` so the chart activates the vGPU Manager DaemonSet without per-node labelling.
* Add the `vgpu` variant to PackageSource alongside the existing `default` (passthrough).
* Rename `values-talos.yaml` → `values-passthrough.yaml` and pin `defaultWorkload: vm-passthrough` for symmetry with the vgpu variant. The overrides apply to any host OS using vfio-pci, not Talos specifically.
* Update `hack/dcgm-default-counters.csv` to match the new dcgm-exporter image tag (4.5.1-4.8.0) — adds `DCGM_FI_DEV_FB_RESERVED` and other fields introduced upstream, so `hack/check-gpu-recording-rules.bats` passes against the new fixture.
* Add `hack/check-gpu-operator-variants.bats`, which renders both variants via `helm template` and asserts the pinned ClusterPolicy values, so the next chart bump cannot silently un-pin them.

The `vgpu` variant assumes the user supplies a proprietary vGPU Manager image via Package CR component values — the .run cannot be redistributed.

The `vgpu` variant is experimental

NVIDIA's `vgpu-device-manager` walks `/sys/class/mdev_bus/`, which does not exist on Ada Lovelace and Blackwell — the DaemonSet errors with "no parent devices found" and is disabled by default. Until an SR-IOV-aware controller is shipped, profile assignment is an out-of-band step (`echo <id> > /sys/.../current_vgpu_type` per VF) that must be re-applied after every node reboot. Without it, `permittedHostDevices.pciHostDevices` reports zero allocatable resources. The variant ships so the chart bump and the underlying skeleton (DaemonSet + image plumbing + docs) land together; production deployments need an automated profile-assignment mechanism in addition to this PR.

KubeVirt requirement
KubeVirt advertises VFs via `permittedHostDevices.pciHostDevices` rather than `mediatedDevices`. This requires kubevirt/kubevirt#16890 ("vGPU: SRIOV support", merged 2026-04-10), which is main-only — released tags `v1.6.x`/`v1.7.x`/`v1.8.x` do not include it. The patch is targeted at the next minor KubeVirt release (v1.9.0); track the PR for the actual release tag. Backports to released branches are not planned.

Behavior changes inherited from upstream chart v25 → v26
The upstream chart flipped two defaults that are pinned back to `false` here, so `default` (passthrough) deployments do not silently change footprint on upgrade:

* `ccManager.enabled` was flipped to `true` upstream — would auto-enable Confidential Computing on Hopper hardware (H100/H200/B100) after upgrade.
* `vgpuDeviceManager.enabled` was flipped to `true` upstream — the DaemonSet would render even in passthrough mode.

Both stay opt-in via Package CR overrides. The new bats test guards the pin.
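The pinned block might look like this in the package values (keys per the upstream gpu-operator chart; surrounding structure abbreviated):

```yaml
ccManager:
  enabled: false          # upstream v26 default is true; would enable CC on Hopper
vgpuDeviceManager:
  enabled: false          # mdev path is absent on Ada Lovelace / Blackwell
```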
Other inherited changes that are user-visible and worth pinning down before upgrade:

* `driver.licensingConfig.configMapName` is deprecated upstream in favour of `driver.licensingConfig.secretName`. The deprecation marker lives in the CRD's `description` field (visible via `kubectl explain`) — there is no helm-render-time warning; the chart template silently prefers `secretName` and falls back to `configMapName`. Operators with existing licensing overrides should migrate the key on the next chart upgrade.
* `driver.version` upstream default jumped from `570.x` to `580.x`. Invisible on the `default` variant (which sets `driver.enabled: false`), user-visible for anyone running the upstream Ubuntu driver path.
* `vfioManager.image` switched from `nvcr.io/nvidia/cuda:12.8.1-base-ubi9` to `nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.10.0`. Affects every `default` (passthrough) deployment on upgrade — different image, different layers, different image-pull surface. Worth pre-mirroring for air-gapped clusters.
* `dcgmExporter.image` switched to the `-distroless` flavour. Smaller surface, fewer base-image CVEs, but no shell — operators who used to `kubectl exec` into the exporter for ad-hoc debugging will need to switch to `kubectl debug` or ephemeral containers.
* CDI (Container Device Interface) is enabled by default in v26.3.1; this is upstream's recommended setting and is left as-is.
* The `node-feature-discovery` subchart source switched from the GitHub Pages Helm repo to `oci://registry.k8s.io/nfd/charts`. Air-gapped installations need to mirror the new OCI URL.

Talos
Talos is not recommended for the vGPU variant. NVIDIA does not publicly distribute the vGPU guest driver — it requires NVIDIA Enterprise Portal access — and Sidero closed siderolabs/extensions#461 noting that they cannot support vGPU "unless NVIDIA changes their licensing terms or provides us a way to obtain, test, and distribute the software". Building a Talos system extension that bakes in the vGPU driver is therefore not feasible without a private fork that violates the EULA. Passthrough (the `default` variant) on Talos is unaffected.

Release note