[gpu-operator] Update to v26.3.1 and add experimental vGPU variant#2323
Conversation
Important: Review skipped (draft detected). Configuration used: Path: .coderabbit.yaml; Review profile: CHILL; Plan: Pro; Run ID: …
📝 Walkthrough: Adds vGPU support to the GPU Operator (new …). Changes: …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin
    participant BuildServer
    participant Registry
    participant GPU_Operator
    participant NLS as "NVIDIA License Server"
    participant KubeVirt
    Admin->>BuildServer: Provide NVIDIA .run files & EULA
    BuildServer->>Registry: Build & push vgpu-manager driver image
    Admin->>GPU_Operator: Install Package CR with variant=vgpu (values-talos-vgpu)
    GPU_Operator->>Registry: Pull vgpu-manager image
    GPU_Operator->>NLS: Reference licensing Secret (gridd.conf) -> configure licensing
    Admin->>KubeVirt: Create VirtualMachine spec referencing vGPU resource
    KubeVirt->>GPU_Operator: Request mediated device allocation on host
    GPU_Operator->>Registry: Ensure driver/kernel modules loaded (vgpuManager)
    GPU_Operator->>KubeVirt: Expose vGPU resource to VM (permittedHostDevices/mediatedDeviceTypes)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request introduces NVIDIA vGPU support, updates the GPU Operator to version v26.3.0, and upgrades Node Feature Discovery (NFD) to 0.18.3. Key changes include the addition of a vgpu package variant, transition to HTTP-based health probes in NFD, and the adoption of a dedicated utility for CRD management. Feedback suggests updating the documentation to use Secrets for licensing configuration to align with security best practices and removing the deprecated cdi.default field from the ClusterPolicy template.
1. Create a ConfigMap with the NLS client configuration in the `cozy-gpu-operator` namespace:

   ```yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: licensing-config
     namespace: cozy-gpu-operator
   data:
     gridd.conf: |
       ServerAddress=nls.example.com
       ServerPort=443
       FeatureType=1
   ```

2. Reference the ConfigMap in the Package values:

   ```yaml
   gpu-operator:
     vgpuManager:
       repository: registry.example.com/nvidia
       version: "550.90.05"
     driver:
       licensingConfig:
         configMapName: licensing-config
   ```
The example uses ConfigMap and configMapName for licensing configuration. However, the updated CRDs (nvidia.com_clusterpolicies.yaml and nvidia.com_nvidiadrivers.yaml) mark configMapName as deprecated in favor of secretName. It is recommended to update the documentation to use a Secret for licensing configuration to align with the new best practices for security.
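For reference, a Secret-based sketch of the same configuration (same placeholder NLS endpoint as the quoted example; only the object kind and the values key change):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
stringData:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

The Package values would then set `driver.licensingConfig.secretName: licensing-config` in place of `configMapName`.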
```yaml
{{- if .Values.cdi.default }}
default: {{ .Values.cdi.default }}
{{- end }}
```
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/gpu-vgpu.md`:
- Around line 81-109: The doc incorrectly instructs creating a ConfigMap and
using driver.licensingConfig.configMapName; update it to create a Secret (kind:
Secret) containing the gridd.conf under stringData and change the package values
example to use driver.licensingConfig.secretName: licensing-config so the
chart/schema picks up the NLS licensing configuration.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`:
- Around line 97-100: The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`:
- Around line 917-924: The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 51f9626b-f0b3-498a-82e0-c845bb0b24ad
⛔ Files ignored due to path filters (1)
`packages/system/gpu-operator/charts/gpu-operator/Chart.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (32)
- docs/gpu-vgpu.md
- packages/core/platform/sources/gpu-operator.yaml
- packages/system/gpu-operator/Makefile
- packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/README.md
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/_helpers.tpl
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/prometheus.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/topologyupdater.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/worker.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/values.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_clusterpolicies.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/cleanup_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/clusterpolicy.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/extra-objects.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nodefeaturerules.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nvidiadriver.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/operator.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/role.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/rolebinding.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/serviceaccount.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/upgrade_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/validations.yaml
- packages/system/gpu-operator/charts/gpu-operator/values.yaml
- packages/system/gpu-operator/values-talos-vgpu.yaml
```yaml
{{- with .Values.master.resources }}
resources:
  {{- toYaml . | nindent 8 }}
{{- end }}
```
Resources block is misplaced outside the container spec.
The resources block is rendered at the pod spec level (after tolerations) rather than inside the nfd-master container definition. This would cause a Kubernetes API validation error since resources is not a valid pod spec field.
🐛 Proposed fix to move resources inside container spec

The resources block should be inside the container definition (after line 83, inside the nfd-master container):

```diff
       args:
         - "-prune"
       {{- if .Values.master.instance | empty | not }}
         - "-instance={{ .Values.master.instance }}"
       {{- end }}
+      {{- with .Values.master.resources }}
+      resources:
+        {{- toYaml . | nindent 12 }}
+      {{- end }}
       restartPolicy: Never
       {{- with .Values.master.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.affinity }}
       affinity:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.tolerations }}
       tolerations:
         {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- with .Values.master.resources }}
-      resources:
-        {{- toYaml . | nindent 8 }}
-      {{- end }}
 {{- end }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`
around lines 97 - 100, The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
```diff
 operator:
   description: |-
     Operator represents a key's relationship to the value.
-    Valid operators are Exists and Equal. Defaults to Equal.
+    Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
     Exists is equivalent to wildcard for value, so that a pod can
     tolerate all taints of a particular category.
+    Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
   type: string
```
🧩 Analysis chain
🌐 Web query:
Do Kubernetes Pod tolerations support `Lt` and `Gt` operators behind the `TaintTolerationComparisonOperators` feature gate, and in which Kubernetes versions is that feature available or enabled by default?
💡 Result:
Yes, Kubernetes Pod tolerations support the Lt (Less Than) and Gt (Greater Than) operators behind the TaintTolerationComparisonOperators feature gate. These operators enable numeric comparisons between taint values and toleration values (both must be valid 64-bit positive integers), allowing threshold-based scheduling. The feature is available starting in Kubernetes v1.35 as an alpha feature (disabled by default). It must be explicitly enabled via the --feature-gates=TaintTolerationComparisonOperators=true flag on the API server and kube-scheduler. As of 2026-04-02 (during v1.36 development), it remains alpha and is not enabled by default in any version.
Citations:
- 1: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration
- 2: https://github.com/kubernetes/api/blob/master/core/v1/toleration.go
- 3: https://kubernetes.io/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/
- 4: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
🏁 Script executed:

```shell
#!/bin/bash
# Search for Kubernetes version constraints in the repository
fd -t f -e yaml -e txt -e md | xargs grep -l "kubernetes" | head -20
# Also search for version info in common locations
if [ -f "Chart.yaml" ]; then echo "=== Chart.yaml ===" && cat Chart.yaml; fi
if [ -f "requirements.yaml" ]; then echo "=== requirements.yaml ===" && cat requirements.yaml; fi
if [ -f "README.md" ]; then echo "=== README.md (first 50 lines) ===" && head -50 README.md; fi
if [ -f "go.mod" ]; then echo "=== go.mod ===" && cat go.mod; fi
# Search for version constraints in the gpu-operator chart
if [ -d "packages/system/gpu-operator" ]; then
  echo "=== Files in gpu-operator ==="
  ls -la packages/system/gpu-operator/ | head -20
fi
```

Repository: cozystack/cozystack
Length of output: 10781
🏁 Script executed:

```shell
#!/bin/bash
# Search for references to TaintTolerationComparisonOperators or kubernetes version requirements
rg -i "kubernetes|k8s|version|feature.*gate|taint" --type yaml -A 2 -B 2 | head -100
```

Repository: cozystack/cozystack
Length of output: 13930
🏁 Script executed:

```shell
#!/bin/bash
# Check if this is a vendored chart by looking for upstream references
if [ -d "packages/system/gpu-operator/charts" ]; then
  echo "=== GPU Operator Chart Structure ==="
  find packages/system/gpu-operator/charts -type f -name "Chart.yaml" -o -name "values.yaml" | xargs head -20
fi
```

Repository: cozystack/cozystack
Length of output: 2228
Verify that the Lt/Gt toleration operators are actually usable on the Kubernetes versions the chart supports.
The GPU operator chart supports Kubernetes >= 1.16.0, but the TaintTolerationComparisonOperators feature gate is available only in Kubernetes v1.35+ and is disabled by default. Users can define valid NVIDIADriver specs with Lt/Gt operators that will be rejected by the API server on Kubernetes versions 1.16–1.34 or on v1.35+ without the feature gate explicitly enabled. Either update the CRD documentation to clarify this requirement or update the chart's kubeVersion constraint to reflect the actual compatibility.
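For illustration, a toleration using one of the numeric operators might look like the following (hypothetical taint key and value; the API server rejects this on clusters older than v1.35, or on v1.35+ without the gate enabled on kube-apiserver and kube-scheduler):

```yaml
tolerations:
- key: nvidia.com/driver-build   # hypothetical taint key
  operator: Gt                   # numeric comparison against the taint's value
  value: "550"                   # both taint and toleration values must be positive integers
  effect: NoSchedule
```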
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`
around lines 917 - 924, The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
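If the `kubeVersion` route were taken instead, the change would be a one-line Chart.yaml edit, sketched here (the `-0` suffix is the usual Helm idiom so pre-release Kubernetes versions still satisfy the constraint):

```yaml
kubeVersion: '>= 1.35.0-0'
```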
Testing Report: gpu-operator v26.3.0 on Talos Linux

Tested this PR on a Talos v1.12.6 cluster (3× VM nodes with NVIDIA A10 GPUs, Kubernetes v1.34.3).

Critical: Talos OS version parsing failure

The ClusterPolicy enters … This is a known bug in gpu-operator v26.3.0 (NVIDIA/gpu-operator#2239). The … Already fixed upstream in NVIDIA/gpu-operator#2244, released as v26.3.1 on 2026-04-18. The vendored chart should be bumped from v26.3.0 to v26.3.1. This affects all Talos deployments regardless of variant (…).

Secondary: nvidia-operator-validator permission error

On nodes where the validator pod was re-rolled, the … This happens during …

Summary

The primary blocker is the OS version parsing bug — any Talos cluster with GPU nodes will hit it. Bumping the vendored chart to v26.3.1 should resolve this. The validator permission issue needs verification after the bump.
force-pushed from 783b51b to 5a61a33
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323.

docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The vGPU Manager DaemonSet works, but profile assignment is currently out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and resets on every reboot — without an automated mechanism the variant silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload to vm-vgpu out of the box (cozystack#2323), so per-node labelling is optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart reads it per-component (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …); placing it at the package root would silently render as a no-op and the DaemonSet would ImagePullBackOff without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the driver-model commentary applies to both architectures, but they should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID (10de:26b9), so a single pciVendorSelector matches both. Add lspci sanity-check guidance for other generations where PF/VF IDs may differ.
* Add a migration note in the Licensing section: chart v25 → v26 deprecated driver.licensingConfig.configMapName in favour of secretName; the old key still works but emits a deprecation marker in the CRD. SR-IOV vGPU does not consume the host-side licensing knob at all — relevant only for passthrough operators carrying old configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0 (unreleased) — neither is in any current Cozystack release. Without this banner the page reads like a supported workflow.
* Drop the incorrect claim that gpus: expects an mdev resource. The upstream KubeVirt VirtualMachine spec accepts both PCI and mdev resource names under either gpus: or hostDevices:. Restate the hostDevices vs gpus distinction as convention, not a hard constraint, and explain the cozystack/v1alpha1 wrapper trade-off.
* Drop the self-contradicting parenthetical that called pciVendorSelector 'the device ID, not the vendor:device pair' — the example right above already uses the vendor:device tuple (10DE:26B9), and the upstream KubeVirt godoc documents the field exactly as that tuple.
* Re-quote siderolabs/extensions#461. The issue was closed with stateReason=COMPLETED, not not_planned; the literal phrase 'won't fix' does not appear in the closure. Replace with the substantive reason from rothgar's closing comment.
* Fix pre-existing PCI ID error in the passthrough lspci sample output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the prior text said 10de:26b9 which is L40S. The rest of the passthrough section already uses 2236 correctly, so this was a copy-paste artefact.
* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish hard release dates; replace 'ETA July 2026' with 'targeted for the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the page now also covers vGPU as a co-equal section. Update title / linkTitle / description / lead paragraph to reflect both flows and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections (step 4 still had 0600, contradicting step 6's 0744 with the NVIDIA-cited justification). 0600 silently breaks license activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures: read-before-write so manual out-of-band changes are visible (and so the kernel does not reject writes while a VM holds the VF and the script logs an error every minute on every busy VF), strip trailing #-comments from ConfigMap lines, log on malformed lines instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs, manual kubectl exec changes to current_vgpu_type are reverted within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20' framing was misleading because the SR-IOV / pciHostDevices flow pairs specifically with branch 20 (driver 595.x); the example build is a 595.x .run, and a reader on a 17.x subscription would build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine wrapper passes deviceName straight through to KubeVirt; what's missing for SR-IOV vGPU is the hostDevices: surface for headless setups and end-to-end exercise on real hardware. Be precise about which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU operator does not 'fix' their setup by switching to secretName and then wonder why nothing changed.
Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
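The read-before-write and comment-stripping behaviour described in that commit can be sketched as a small POSIX-sh reconcile step. This is an illustrative reconstruction, not the DaemonSet script shipped in the PR; the ConfigMap line format, the `SYSFS_BASE` override, and the type ID used below are assumptions.

```shell
#!/bin/sh
# Illustrative sketch only — not the PR's actual profile-loader script.
# Assumed ConfigMap line format: "<pci-address> <vgpu-type-id>  # optional comment"
# Assumed sysfs layout on SR-IOV vGPU hosts:
#   $SYSFS_BASE/<pci-address>/nvidia/current_vgpu_type
SYSFS_BASE="${SYSFS_BASE:-/sys/bus/pci/devices}"

apply_line() {
  orig_line=$1
  # Strip trailing #-comments and surrounding whitespace from the ConfigMap line.
  line=$(printf '%s' "$orig_line" | sed 's/#.*$//; s/^[[:space:]]*//; s/[[:space:]]*$//')
  [ -z "$line" ] && return 0
  set -- $line
  if [ "$#" -ne 2 ]; then
    echo "malformed line: $orig_line" >&2   # log instead of an opaque "failed to set"
    return 1
  fi
  f="$SYSFS_BASE/$1/nvidia/current_vgpu_type"
  # Read before write: skip VFs already at the desired type, so a VF held by a
  # running VM does not produce a rejected write on every reconcile tick.
  cur=$(cat "$f" 2>/dev/null)
  [ "$cur" = "$2" ] && return 0
  echo "$2" > "$f"
}
```

A reconcile loop would then feed each ConfigMap line through `apply_line` once a minute, which is also why manual sysfs edits are reverted within 60 s while the DaemonSet runs.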
Update vendored upstream chart to v26.3.0. Key changes:

- CDI enabled by default
- Driver version 580.126.20
- NFD subchart updated to 0.18.3
- Component versions bumped (toolkit v1.19.0, device-plugin v0.19.0, dcgm-exporter 4.5.2)
- Pin chart version in Makefile for reproducibility

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Add vGPU mode as a new variant alongside the existing passthrough (default) variant. This enables sharing a single physical GPU across multiple VMs using NVIDIA mediated devices.

Changes:

- Add values-talos-vgpu.yaml with vGPU Manager and vGPU Device Manager enabled
- Add 'vgpu' variant to PackageSource
- Add documentation for vGPU setup (driver image, NLS licensing, profiles, KubeVirt)

Users select the vgpu variant via Package CR and provide their proprietary vGPU Manager image coordinates through component values.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file. Add EULA notice about public redistribution of vGPU driver images. Add note about NLS ServerPort being deployment-dependent. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
ConfigMap-based licensing configuration (configMapName) is deprecated in GPU Operator v26.x. Switch to Secret-based configuration (secretName) as recommended by the updated CRD schema. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pull latest patch release: GDRCopy v2.5.2, Kata Sandbox Device Plugin v0.0.3, multi-arch digest fix for KubeVirt GPU Device Plugin (no longer amd64-only), and OS release name detection fix. NFD remains v0.18.3. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ncy, DLS-based licensing

Rewrites docs/gpu-vgpu.md to match the actual NVIDIA driver model on current data-centre GPUs. Key changes:

- Drop the implicit assumption that vGPU uses mediated devices. Ada Lovelace and newer (L4, L40, L40S, B100) on the vGPU 17/20 branch use SR-IOV with per-VF NVIDIA sysfs (current_vgpu_type); KubeVirt advertises them via permittedHostDevices.pciHostDevices.
- Document KubeVirt PR #16890 as a hard dependency for the SR-IOV path. The PR is in main since 2026-04-10 and will ship in v1.9.0 (ETA July 2026); released tags v1.6.x/1.7.x/1.8.x do not include it and backports are not planned.
- Replace the legacy NLS configuration block (ServerAddress=, ServerPort=7070, FeatureType=1) with the current DLS flow: ClientConfigToken file in /etc/nvidia/ClientConfigToken/, gridd.conf with FeatureType=0 (auto-detect).
- Switch the build instructions from the archived gitlab.com/nvidia/container-images/driver to the upstream github.com/NVIDIA/gpu-driver-container.
- Switch sample VirtualMachine spec from gpus: (mdev resource) to hostDevices: (PCI resource).
- Note explicitly that Talos is not recommended for the vGPU variant — the proprietary .run cannot be redistributed in a Talos system extension, and siderolabs/extensions#461 was closed won't-fix. Passthrough on Talos is unaffected.

Rename packages/system/gpu-operator/values-talos-vgpu.yaml to values-vgpu.yaml — the overrides (sandbox workloads, pod-side driver and device plugin disabled) are not Talos-specific, they apply to any VM-only GPU node. Update PackageSource accordingly.

Default vgpuDeviceManager.enabled: false. The DaemonSet walks /sys/class/mdev_bus/, which does not exist on Ada+ — keeping it on by default would crashloop modern hardware. Operators of older GPUs (Pascal/Volta/Turing/Ampere) can flip the flag back on.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ng example
Addresses follow-up review of the vGPU variant + chart bump.
* values.yaml: pin ccManager.enabled=false and vgpuDeviceManager.enabled=false.
Upstream chart v26.x flipped both to true. ccManager auto-enables CC
mode on Hopper-class hardware after upgrade — a non-trivial behavior
change that should be opt-in. vgpuDeviceManager walks
/sys/class/mdev_bus/, which does not exist on Ada+ (L4/L40/L40S/B100
with vGPU 17/20). Pinning both off keeps the default footprint as it
was on v25.x; users explicitly opt in.
* hack/dcgm-default-counters.csv: refresh from dcgm-exporter tag
4.5.1-4.8.0 (matches the image bumped earlier in this PR). The
previous snapshot was pinned to 4.1.1-4.0.4. Adds DCGM_FI_DEV_FB_RESERVED
and other fields introduced upstream. Header comment updated.
hack/check-gpu-recording-rules.bats passes against the refreshed
fixture.
* docs/gpu-vgpu.md (Licensing section):
- Drop the wrong recommendation to wire the licensing Secret through
gpu-operator's driver.licensingConfig.secretName. That mounts into
the host driver pod, where SR-IOV vGPU does not consume a license.
Document guest-side delivery (cloud-init write_files for
/etc/nvidia/ClientConfigToken/ and /etc/nvidia/gridd.conf) end to
end.
- Correct FeatureType comment. Per the NVIDIA Virtual GPU Software
Licensing User Guide, FeatureType=0 is the unlicensed state, not
auto-detect; FeatureType=1 is the value that asks the driver to
auto-select the license type from the configured vGPU profile.
Switch the example to FeatureType=1 and list all four valid values
with their meaning.
- Add nvidia-smi verification snippet inside the guest.
* docs/gpu-vgpu.md (KubeVirt section): note that L40S SR-IOV VFs share
the PF device ID (10de:26b9), so a single pciVendorSelector matches
both. Recommend lspci -nn -d 10de: as a sanity check on other
generations where PF/VF IDs may differ.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
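For illustration, the KubeVirt fragment implied by that L40S note might look like this (a sketch: `resourceName` is a placeholder, and 10DE:26B9 is the L40S PF/VF device ID quoted in the commit text):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:26B9"      # matches both PF and VFs on L40S
        resourceName: nvidia.com/L40S_vGPU  # placeholder resource name
```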
…path, mark vgpu variant experimental

Addresses second review pass. Three blockers + one polish:

* values-vgpu.yaml: set sandboxWorkloads.defaultWorkload=vm-vgpu. Without this the upstream default 'container' applies on any node without the per-node nvidia.com/gpu.workload.config label, the operator selects the container workload (driver/devicePlugin DaemonSets), and those are disabled in this variant — so nothing runs on the node and vgpu-manager never starts. The cluster-wide default makes the variant functional out of the box; per-node labels still override it for mixed clusters.
* docs/gpu-vgpu.md: imagePullSecrets is per-component on this chart (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …). The Package CR example previously placed the field at gpu-operator: root where it would be silently ignored at render. Move it under vgpuManager: where the chart actually reads it.
* docs/gpu-vgpu.md, release note: mark the vgpu variant as experimental and call out the missing profile-assignment loop explicitly. Until an SR-IOV-aware controller writes current_vgpu_type per VF (and re-applies after every reboot — PCIe re-enumeration resets the value), the variant is not safe for production. The docs now lead with that warning instead of burying it in a caveat.
* Rename packages/system/gpu-operator/values-talos.yaml to values-passthrough.yaml. The overrides (sandboxWorkloads.enabled, driver and devicePlugin disabled) apply to any host OS using the vfio-pci passthrough path — not Talos specifically. Update PackageSource ref accordingly. Comment in the file documents the rename so future readers understand the prior name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
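The defaultWorkload fix described above corresponds to roughly this values fragment (a sketch; the key layout is the upstream chart's sandboxWorkloads block):

```yaml
sandboxWorkloads:
  enabled: true
  defaultWorkload: vm-vgpu   # upstream default is "container"
```

Per-node `nvidia.com/gpu.workload.config` labels still override this cluster-wide default in mixed clusters.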
… Ada/Blackwell

* examples/README.md and examples/values-native-talos.yaml header comment still pointed at values-talos.yaml after the rename to values-passthrough.yaml. Update both to the new name.
* values-vgpu.yaml: drop the leftover vgpuDeviceManager.config block. The chart only renders the config when vgpuDeviceManager.enabled is true, and we keep it false on Ada+; the orphan config.default key was scaffolding without effect.
* docs/gpu-vgpu.md and values.yaml comment: stop grouping B100 under Ada Lovelace. B100 is Blackwell; the SR-IOV vGPU mechanism applies to both architectures but they should be named correctly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…licy
Adds hack/check-gpu-operator-variants.bats. The test renders both
package variants with helm template and asserts the resulting
ClusterPolicy spec for the values that this PR cares about:
* ccManager.enabled=false (pinned in values.yaml; upstream v26 default
was flipped to true)
* vgpuDeviceManager.enabled=false (same — and on Ada+/Blackwell the
DaemonSet would crashloop because /sys/class/mdev_bus is empty)
* sandboxWorkloads.defaultWorkload — vm-passthrough on the default
variant, vm-vgpu on the vgpu variant; without these the upstream
'container' default makes either variant a silent no-op on
unlabelled GPU nodes
* vgpuManager.enabled=true on the vgpu variant; driver / devicePlugin
enabled=false on both variants (proprietary host driver path is
out of scope for both passthrough and SR-IOV vGPU)
Negative-tested locally: deleting any of these settings makes the
corresponding test fail, so it actually guards regressions on the
next chart bump.
Also addresses last-pass review comments:
* values-passthrough.yaml: set sandboxWorkloads.defaultWorkload=
vm-passthrough so the default variant is symmetric with the vgpu
variant — neither relies on per-node labels for the variant to
activate. Without this the upstream default 'container' would
apply on unlabelled GPU nodes and vfio-manager would never start.
* values.yaml: rewrite the vgpuDeviceManager comment to lead with
the load-bearing reason (mdev path missing on Ada+/Blackwell)
instead of the secondary 'stays at zero pods on passthrough'
rationalisation.
* docs/gpu-vgpu.md: add a 'Migrating from chart v25.x' subsection
in the Licensing section pointing at the configMapName ->
secretName rename. Old key still works upstream but emits a
deprecation warning; PR body release note already calls this out
but the docs were missing the migration breadcrumb.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ote siderolabs#461

Two technical inaccuracies caught during review:

* The claim 'gpus: expects an mdev resource and will not match a pciHostDevices entry' is wrong. Upstream KubeVirt (pkg/virt-launcher/virtwrap/device/hostdevice/gpu/hostdev.go) iterates both the PCI pool (v1.PCIResourcePrefix) and the mdev pool (v1.MDevResourcePrefix) and accepts whichever resolves. Reword to describe gpus: vs hostDevices: as a style convention rather than a hard technical constraint.
* siderolabs/extensions#461 was closed with stateReason=COMPLETED, not not_planned. Quoting the closure as 'won't fix' is editorial recasting that does not match the actual closing comment. Replace with the substantive reason from rothgar's closing comment: NVIDIA does not publicly distribute the vGPU guest driver, requires Enterprise Portal access, and Sidero cannot support vGPU until NVIDIA changes its licensing terms.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ording

Mirroring the corresponding cleanup in cozystack/website#467 review:

* KubeVirt v1.9.0 ETA: drop the 'July 2026' date — KubeVirt does not publish hard release dates and the in-page wording was inconsistent across the file. Replace with 'Targeted at the next minor release (v1.9.0); track the PR for the actual release tag', matching the website doc.
* Kernel-headers timing: 'The build downloads kernel headers at container start time' conflated docker build with pod start; the Dockerfile's entrypoint downloads them at runtime. Reword.
* Replace guillemets with ASCII quotes for consistency with the rest of the file and the upstream chart docs.
* nvidia-smi -q license check: drop the unnecessary -A 1 — the License Status value is on the same line as the field name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
NVIDIA's License System User Guide recommends 744 on the ClientConfigToken file so nvidia-gridd (which does not necessarily run as the file owner) can read it. Previous example used 0600, which can silently break license activation when nvidia-gridd runs as a service user. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Apple Silicon / arm64 build hosts produce arm64 images native; GPU nodes are amd64 and the kubelet pull fails with 'no matching manifest' when the image is built without --platform. Document the flag and add a one-line docker login note for private registries. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
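The documented flags might be used as follows (registry, image name, and tag are placeholders, not the actual build coordinates):

```shell
# One-time login for a private registry
docker login registry.example.com

# GPU nodes are amd64; without --platform an Apple Silicon host produces
# an arm64-only image and the kubelet pull fails with 'no matching manifest'
docker build --platform linux/amd64 \
  -t registry.example.com/nvidia/vgpu-manager:595.58.02 .
docker push registry.example.com/nvidia/vgpu-manager:595.58.02
```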
…ream NVIDIA's gpu-driver-container repository owns the build path — keeping a parallel set of build instructions in our docs duplicates upstream documentation and goes stale every time the upstream Dockerfile or build args change. Replace the docker build / docker push snippet with a paragraph pointing at the upstream repo and the NVIDIA Licensing Portal as the sources of truth, keep only the items that are Cozystack-specific (private-registry / EULA note). Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pin a concrete reference set so future readers can tell whether the guide is still accurate against the cluster they're running: KubeVirt main nightly 20260429_74d7c52588, this PR's vgpu variant, NVIDIA vGPU 20.0 host driver 595.58.02, GRID guest driver 595.58.03. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Free-floating 'NVIDIA recommends' claims are flagged as unverifiable in review. Add the doc URL inline so future readers can verify the permission model decision. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323 docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The vGPU Manager DaemonSet works, but profile assignment is currently out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and resets on every reboot — without an automated mechanism the variant silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload to vm-vgpu out of the box (cozystack#2323), so per-node labelling is optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart reads it per-component (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …); placing it at the package root would silently render-no-op and the DaemonSet would ImagePullBackOff without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the driver-model commentary applies to both architectures, but they should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID (10de:26b9), so a single pciVendorSelector matches both. Add lspci sanity-check guidance for other generations where PF/VF IDs may differ.
* Add a migration note in the Licensing section: chart v25 → v26 deprecated driver.licensingConfig.configMapName in favour of secretName; old key still works but emits a deprecation marker in the CRD. SR-IOV vGPU does not consume the host-side licensing knob at all — relevant only for passthrough operators carrying old configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
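The per-component imagePullSecrets placement called out above might look like this in Package CR component values (component keys follow the upstream chart; registry and secret names are illustrative):

```yaml
vgpuManager:
  enabled: true
  repository: registry.example.com/nvidia
  image: vgpu-manager
  version: "595.58.02"
  imagePullSecrets:        # read per-component; a package-root key is silently ignored
    - nvidia-registry-creds
```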
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0 (unreleased) — neither is in any current Cozystack release. Without this banner the page reads like a supported workflow.
* Drop the incorrect claim that gpus: expects an mdev resource. The upstream KubeVirt VirtualMachine spec accepts both PCI and mdev resource names under either gpus: or hostDevices:. Restate the hostDevices vs gpus distinction as convention, not a hard constraint, and explain the cozystack/v1alpha1 wrapper trade-off.
* Drop the self-contradicting parenthetical that called pciVendorSelector 'the device ID, not the vendor:device pair' — the example right above already uses the vendor:device tuple (10DE:26B9), and the upstream KubeVirt godoc documents the field exactly as that tuple.
* Re-quote siderolabs/extensions#461. The issue was closed with stateReason=COMPLETED, not not_planned; the literal phrase 'won't fix' does not appear in the closure. Replace with the substantive reason from rothgar's closing comment.
* Fix pre-existing PCI ID error in the passthrough lspci sample output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the prior text said 10de:26b9 which is L40S. The rest of the passthrough section already uses 2236 correctly, so this was a copy-paste artefact.
* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish hard release dates; replace 'ETA July 2026' with 'targeted for the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
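For illustration, the pciVendorSelector / permittedHostDevices wiring discussed above might be expressed like this in the KubeVirt CR (the resource name is a free choice; verify the vendor:device tuple with lspci -nn on your hardware):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        # L40S: PF and SR-IOV VFs share device ID 26b9, so one selector
        # matches both; other generations may need separate entries.
        - pciVendorSelector: "10DE:26B9"
          resourceName: nvidia.com/L40S
```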
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the page now also covers vGPU as a co-equal section. Update title / linkTitle / description / lead paragraph to reflect both flows and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections (step 4 still had 0600, contradicting step 6's 0744 with the NVIDIA-cited justification). 0600 silently breaks license activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures: read-before-write so manual out-of-band changes are visible (and so the kernel does not reject writes while a VM holds the VF and the script logs an error every minute on every busy VF), strip trailing #-comments from ConfigMap lines, log on malformed lines instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs, manual kubectl exec changes to current_vgpu_type are reverted within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20' framing was misleading because the SR-IOV / pciHostDevices flow pairs specifically with branch 20 (driver 595.x); the example build is a 595.x .run, and a reader on a 17.x subscription would build the wrong manager image.

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine wrapper passes deviceName straight through to KubeVirt; what's missing for SR-IOV vGPU is the hostDevices: surface for headless setups and end-to-end exercise on real hardware. Be precise about which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU operator does not 'fix' their setup by switching to secretName and then wonder why nothing changed.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
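A condensed sketch of the hardened read-before-write loop described above (the sysfs layout and ConfigMap line format are assumptions for illustration, not the exact manifest from the PR):

```shell
#!/bin/sh
# Sketch: apply a desired vGPU profile id to a VF's current_vgpu_type file.
set -u

# Strip a trailing #-comment and surrounding whitespace from a config line;
# malformed/empty lines come back empty and can be logged by the caller.
clean_line() {
  l="${1%%#*}"
  echo "$l" | xargs
}

# Read before write: skip VFs already at the desired value, so busy VFs
# (the kernel rejects writes while a VM holds the VF) are not hammered
# with a failing write every cycle; log a warning instead of failing hard.
apply_profile() {
  f="$1"; want="$2"
  cur="$(cat "$f" 2>/dev/null || echo '?')"
  if [ "$cur" != "$want" ]; then
    echo "$want" > "$f" 2>/dev/null || echo "WARN: could not set $f (VF busy?)" >&2
  fi
}
```

The 60-second revert side-effect follows directly from this shape: any manual sysfs write that differs from the ConfigMap value fails the comparison on the next pass and is overwritten, so the ConfigMap is the only durable place to change a profile.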
Build job in CI failed because the test relied on bats-specific machinery the in-repo cozytest.sh runner does not provide:

* BATS_TEST_DIRNAME is undefined when the file is sourced as a plain shell script (cozytest.sh's mode); switch to BATS_TEST_FILENAME with a $0 fallback, matching hack/check-gpu-recording-rules.bats (REPO_ROOT pattern).
* setup()/teardown() are bats lifecycle hooks; cozytest.sh runs each @test in a fresh subshell with set -u and never invokes them. With TMP only assigned in setup(), every test exited immediately on 'TMP: unbound variable'. Inline the mktemp + trap inside each @test so both runners work the same.

Verified locally that bash hack/cozytest.sh hack/check-gpu-operator-variants.bats now reports all six tests OK.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
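The runner-agnostic pieces can be sketched as follows (illustrative, not the verbatim test file):

```shell
#!/bin/sh
set -u
# bats defines BATS_TEST_FILENAME; under 'bash hack/cozytest.sh' the file is
# sourced as plain shell where it is unset, so fall back to $0 (safe with
# set -u because ${VAR:-default} expands without tripping nounset).
test_file="${BATS_TEST_FILENAME:-$0}"
REPO_ROOT="$(cd "$(dirname "$test_file")" && pwd)"

# setup()/teardown() never run under cozytest.sh, so each test body does
# its own temp-dir creation and cleanup inline.
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
echo rendered > "$TMP/out.yaml"
```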
What this PR does
Updates the NVIDIA GPU Operator from v25.3.0 to v26.3.1 and adds an experimental `vgpu` variant for SR-IOV vGPU on Ada Lovelace and newer GPUs.

Changes:
* Add `values-vgpu.yaml` for the vGPU variant with `defaultWorkload: vm-vgpu` so the chart activates the vGPU Manager DaemonSet without per-node labelling.
* Add the `vgpu` variant to PackageSource alongside the existing `default` (passthrough).
* Rename `values-talos.yaml` → `values-passthrough.yaml` and pin `defaultWorkload: vm-passthrough` for symmetry with the vgpu variant. The overrides apply to any host OS using vfio-pci, not Talos specifically.
* Update `hack/dcgm-default-counters.csv` to match the new dcgm-exporter image tag (4.5.1-4.8.0) — adds `DCGM_FI_DEV_FB_RESERVED` and other fields introduced upstream, so `hack/check-gpu-recording-rules.bats` passes against the new fixture.
* Add `hack/check-gpu-operator-variants.bats`, which renders both variants via `helm template` and asserts the pinned ClusterPolicy values, so the next chart bump cannot silently un-pin them.

The `vgpu` variant assumes the user supplies a proprietary vGPU Manager image via Package CR component values — the .run cannot be redistributed.

The `vgpu` variant is experimental

NVIDIA's `vgpu-device-manager` walks `/sys/class/mdev_bus/`, which does not exist on Ada Lovelace and Blackwell — the DaemonSet errors with "no parent devices found" and is disabled by default. Until an SR-IOV-aware controller is shipped, profile assignment is an out-of-band step (`echo <id> > /sys/.../current_vgpu_type` per VF) that must be re-applied after every node reboot. Without it, `permittedHostDevices.pciHostDevices` reports zero allocatable resources. The variant ships so the chart bump and the underlying skeleton (DaemonSet + image plumbing + docs) land together; production deployments need an automated profile-assignment mechanism in addition to this PR.

KubeVirt requirement
KubeVirt advertises VFs via `permittedHostDevices.pciHostDevices` rather than `mediatedDevices`. This requires kubevirt/kubevirt#16890 ("vGPU: SRIOV support", merged 2026-04-10), which is main-only — released tags `v1.6.x`/`v1.7.x`/`v1.8.x` do not include it. The patch is targeted at the next minor KubeVirt release (v1.9.0); track the PR for the actual release tag. Backports to released branches are not planned.

Behavior changes inherited from upstream chart v25 → v26
The upstream chart flipped two defaults that are pinned back to `false` here, so `default` (passthrough) deployments do not silently change footprint on upgrade:

* `ccManager.enabled` was flipped to `true` upstream — would auto-enable Confidential Computing on Hopper hardware (H100/H200/B100) after upgrade.
* `vgpuDeviceManager.enabled` was flipped to `true` upstream — the DaemonSet would render even in passthrough mode.

Both stay opt-in via Package CR overrides. The new bats test guards the pin.
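The pinned block might look like this in the package values (keys per the upstream gpu-operator chart; surrounding structure abbreviated):

```yaml
ccManager:
  enabled: false          # upstream v26 default is true; would enable CC on Hopper
vgpuDeviceManager:
  enabled: false          # mdev path is absent on Ada Lovelace / Blackwell
```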
Other inherited changes that are user-visible and worth pinning down before upgrade:

* `driver.licensingConfig.configMapName` is deprecated upstream in favour of `driver.licensingConfig.secretName`. The deprecation marker lives in the CRD's `description` field (visible via `kubectl explain`) — there is no helm-render-time warning; the chart template silently prefers `secretName` and falls back to `configMapName`. Operators with existing licensing overrides should migrate the key on the next chart upgrade.
* `driver.version` upstream default jumped from `570.x` to `580.x`. Invisible on the `default` variant (which sets `driver.enabled: false`), user-visible for anyone running the upstream Ubuntu driver path.
* `vfioManager.image` switched from `nvcr.io/nvidia/cuda:12.8.1-base-ubi9` to `nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.10.0`. Affects every `default` (passthrough) deployment on upgrade — different image, different layers, different image-pull surface. Worth pre-mirroring for air-gapped clusters.
* `dcgmExporter.image` switched to the `-distroless` flavour. Smaller surface, fewer base-image CVEs, but no shell — operators who used to `kubectl exec` into the exporter for ad-hoc debugging will need to switch to `kubectl debug` or ephemeral containers.
* CDI (Container Device Interface) is enabled by default in v26.3.1; this is upstream's recommended setting and is left as-is.
* The `node-feature-discovery` subchart source switched from the GitHub Pages Helm repo to `oci://registry.k8s.io/nfd/charts`. Air-gapped installations need to mirror the new OCI URL.

Talos
Talos is not recommended for the vGPU variant. NVIDIA does not publicly distribute the vGPU guest driver — it requires NVIDIA Enterprise Portal access — and Sidero closed siderolabs/extensions#461 noting that they cannot support vGPU "unless NVIDIA changes their licensing terms or provides us a way to obtain, test, and distribute the software". Building a Talos system extension that bakes in the vGPU driver is therefore not feasible without a private fork that violates the EULA. Passthrough (the `default` variant) on Talos is unaffected.

Release note