
[gpu-operator] Update to v26.3.1 and add experimental vGPU variant#2323

Draft
lexfrei wants to merge 18 commits into main from feat/update-gpu-operator-v26.3.0

Conversation

@lexfrei
Contributor

@lexfrei lexfrei commented Apr 2, 2026

What this PR does

Updates the NVIDIA GPU Operator from v25.3.0 to v26.3.1 and adds an experimental vgpu variant for SR-IOV vGPU on Ada Lovelace and newer GPUs.

Changes:

  • Update vendored upstream chart to v26.3.1 (CDI enabled by default, component version bumps).
  • Pin chart version in Makefile for reproducibility.
  • Add values-vgpu.yaml for the vGPU variant with defaultWorkload: vm-vgpu so the chart activates the vGPU Manager DaemonSet without per-node labelling.
  • Add vgpu variant to PackageSource alongside the existing default (passthrough).
  • Rename values-talos.yaml → values-passthrough.yaml and pin defaultWorkload: vm-passthrough for symmetry with the vgpu variant. The overrides apply to any host OS using vfio-pci, not Talos specifically.
  • Refresh hack/dcgm-default-counters.csv to match the new dcgm-exporter image tag (4.5.1-4.8.0) — adds DCGM_FI_DEV_FB_RESERVED and other fields introduced upstream. hack/check-gpu-recording-rules.bats passes against the new fixture.
  • Add hack/check-gpu-operator-variants.bats that renders both variants via helm template and asserts the pinned ClusterPolicy values, so the next chart bump cannot silently un-pin them.
  • Add internal documentation for the vGPU bring-up.
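
For context on the counters refresh above: dcgm-exporter's counters file uses a `field, prometheus-type, help` triplet per line. A hedged illustration of how the newly added field would appear (help strings paraphrased, not copied from the fixture):

```csv
# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_DEV_FB_FREE,     gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
```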

The vgpu variant assumes the user supplies a proprietary vGPU Manager image via Package CR component values — the .run cannot be redistributed.
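
As a sketch of what that override could look like (registry and driver version are placeholders, not defaults shipped by this PR):

```yaml
# Hypothetical Package CR component values supplying a privately built
# vGPU Manager image; the .run file itself is never redistributed.
gpu-operator:
  vgpuManager:
    repository: registry.example.com/nvidia   # private registry hosting the built image
    version: "550.90.05"                      # vGPU Manager driver version baked into it
```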

vgpu variant is experimental

NVIDIA's vgpu-device-manager walks /sys/class/mdev_bus/, which does not exist on Ada Lovelace and Blackwell — the DaemonSet errors with "no parent devices found" and is disabled by default. Until an SR-IOV-aware controller is shipped, profile assignment is an out-of-band step (echo <id> > /sys/.../current_vgpu_type per VF) that must be re-applied after every node reboot. Without it, permittedHostDevices.pciHostDevices reports zero allocatable resources. The variant ships so the chart bump and the underlying skeleton (DaemonSet + image plumbing + docs) land together; production deployments need an automated profile-assignment mechanism in addition to this PR.
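
The out-of-band step can be sketched as a small helper. This is hypothetical glue, not part of the PR: the sysfs layout shown (virtfn*/nvidia/current_vgpu_type) follows NVIDIA's SR-IOV vGPU documentation, and the PF address and type ID in the example invocation are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the per-VF profile assignment described above.
# assign_vgpu_profiles PF TYPE_ID [SYSFS_ROOT]
#   PF      - PCI address of the physical function, e.g. 0000:41:00.0
#   TYPE_ID - numeric vGPU type ID to write into each VF
#   SYSFS   - sysfs root, parameterised so the logic can be dry-run tested
assign_vgpu_profiles() {
  pf="$1"; type_id="$2"; sysfs="${3:-/sys}"
  for vf in "$sysfs/bus/pci/devices/$pf"/virtfn*; do
    [ -e "$vf" ] || continue
    # Nothing persists this: it must be re-applied after every node reboot.
    echo "$type_id" > "$vf/nvidia/current_vgpu_type"
  done
}

# Example invocation (placeholder PF address and vGPU type ID):
# assign_vgpu_profiles 0000:41:00.0 796
```

A production setup would run something like this from a systemd unit or a privileged DaemonSet until an SR-IOV-aware controller exists.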

KubeVirt requirement

KubeVirt advertises VFs via permittedHostDevices.pciHostDevices rather than mediatedDevices. This requires kubevirt/kubevirt#16890 ("vGPU: SRIOV support", merged 2026-04-10) which is main-only — released tags v1.6.x / v1.7.x / v1.8.x do not include it. The patch is targeted at the next minor KubeVirt release (v1.9.0); track the PR for the actual release tag. Backports to released branches are not planned.
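
For reference, advertising the VFs would look roughly like the following KubeVirt CR fragment. The PCI ID shown is the L40S PF/VF ID mentioned later in this thread; the resourceName is a placeholder, and the schema is the upstream permittedHostDevices API:

```yaml
# Illustrative only: adjust pciVendorSelector per GPU generation and pick
# a resourceName matching your VM specs.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10de:26b9"        # L40S: PF and VF share this ID
          resourceName: "nvidia.com/L40S_VGPU"  # hypothetical resource name
```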

Behavior changes inherited from upstream chart v25 → v26

The upstream chart flipped two defaults that are pinned back to false here, so default (passthrough) deployments do not silently change footprint on upgrade:

  • ccManager.enabled was flipped to true upstream — would auto-enable Confidential Computing on Hopper hardware (H100/H200/B100) after upgrade.
  • vgpuDeviceManager.enabled was flipped to true upstream — DaemonSet would render even in passthrough mode.

Both stay opt-in via Package CR overrides. The new bats test guards the pin.
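
Illustratively, the pins could read as follows in the variant values file (a sketch; the keys are the upstream chart's top-level component sections):

```yaml
# Pin upstream-flipped v26 defaults back to false (sketch of values-passthrough.yaml)
ccManager:
  enabled: false          # upstream default flipped to true (auto-enables CC on Hopper)
vgpuDeviceManager:
  enabled: false          # upstream default flipped to true (renders DaemonSet in passthrough)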

Other inherited changes that are user-visible and worth pinning down before upgrade:

  • driver.licensingConfig.configMapName is deprecated upstream in favour of driver.licensingConfig.secretName. The deprecation marker lives in the CRD's description field (visible via kubectl explain) — there is no helm-render-time warning, the chart template silently prefers secretName and falls back to configMapName. Operators with existing licensing overrides should migrate the key on the next chart upgrade.
  • driver.version upstream default jumped from 570.x to 580.x. Invisible on the default variant (which sets driver.enabled: false), user-visible for anyone running the upstream Ubuntu driver path.
  • vfioManager.image switched from nvcr.io/nvidia/cuda:12.8.1-base-ubi9 to nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.10.0. Affects every default (passthrough) deployment on upgrade — different image, different layer, different image-pull surface. Worth pre-mirroring for air-gapped clusters.
  • dcgmExporter.image switched to the -distroless flavour. Smaller surface, fewer base-image CVEs, but no shell — operators who used to kubectl exec into the exporter for ad-hoc debugging will need to switch to kubectl debug or container ephemeral images.
  • CDI (Container Device Interface) is enabled by default in v26.3.1; this is upstream's recommended setting and is left as-is.
  • node-feature-discovery subchart source switched from the GitHub Pages Helm repo to oci://registry.k8s.io/nfd/charts. Air-gapped installations need to mirror the new OCI URL.
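
For operators carrying licensing overrides, the migration named in the first bullet is a one-key rename (sketch; the referenced object itself must also move from a ConfigMap to a Secret):

```yaml
# Before (deprecated as of chart v26):
#   driver:
#     licensingConfig:
#       configMapName: licensing-config
# After (preferred):
driver:
  licensingConfig:
    secretName: licensing-config
```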

Talos

Talos is not recommended for the vGPU variant. NVIDIA does not publicly distribute the vGPU guest driver — it requires NVIDIA Enterprise Portal access — and Sidero closed siderolabs/extensions#461 noting that they cannot support vGPU "unless NVIDIA changes their licensing terms or provides us a way to obtain, test, and distribute the software". Building a Talos system extension that bakes in the vGPU driver is therefore not feasible without a private fork that violates the EULA. Passthrough (default variant) on Talos is unaffected.

Release note

[gpu-operator] Update NVIDIA GPU Operator to v26.3.1 and add experimental `vgpu` variant for SR-IOV vGPU on Ada Lovelace / Blackwell (L4, L40, L40S, B100; Ubuntu host; requires KubeVirt v1.9.0+ for `pciHostDevices` advertising). The variant ships skeleton support — profile assignment (`current_vgpu_type` per VF) is currently out of band and must be automated by the operator before production use. On the existing `default` variant: `ccManager` and `vgpuDeviceManager` upstream defaults are pinned back to `false` to preserve prior behavior on chart upgrade; `defaultWorkload` is now pinned to `vm-passthrough` (was unset, falling back to `container`); `driver.licensingConfig.configMapName` is deprecated by upstream in favour of `secretName`; `driver.version` upstream default moves 570 → 580; `vfioManager.image` switches from CUDA UBI9 to k8s-driver-manager (relevant for air-gapped mirrors); `dcgmExporter` image switched to the `-distroless` flavour. NFD subchart source switched to `oci://registry.k8s.io/nfd/charts`.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 2, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e706f581-6ce4-4e61-b9b0-1626595c2314

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


Adds vGPU support to the GPU Operator (new vgpu variant and Talos vGPU values), upgrades GPU Operator to v26.3.0 and Node Feature Discovery to v0.18.3, migrates NFD probes from gRPC to HTTP, enhances multiple CRDs/templates for licensing, vGPU/kernel-module config, CDI/NRI and kata sandbox integration, and adds documentation.

Changes

Cohort / File(s) Summary
Documentation
docs/gpu-vgpu.md
New end-to-end guide for configuring GPU Operator with NVIDIA vGPU: prerequisites, driver image build/publish, Package CR deployment, NLS licensing Secret, vGPU profile selection, vgpuDeviceManager config, and KubeVirt integration examples.
Platform Variant
packages/core/platform/sources/gpu-operator.yaml
Added a new vgpu spec.variant referencing values-talos-vgpu.yaml, enabling a vGPU-specific installation path alongside the existing default variant.
Chart Pinning & Metadata
packages/system/gpu-operator/Makefile, packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
Pinned Helm pull to --version v26.3.0; bumped chart appVersion/version to v26.3.0 and updated NFD dependency repo/version to oci://registry.k8s.io/nfd/charts 0.18.3.
Node Feature Discovery (NFD) chart
packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/...
Migrated NFD components from gRPC probes to HTTP /healthz, consolidated metrics/health ports into single port, added dnsPolicy values, imagePullSecrets helper, PodDisruptionBudget templates, and updated values/README. CRDs extended with Ge/Le/GeLe, type, and rule vars/varsTemplate.
CRD Extensions
.../crds/nvidia.com_clusterpolicies.yaml, .../crds/nvidia.com_nvidiadrivers.yaml
Extended ClusterPolicy and NVIDIADrivers CRDs: added vgpuManager/kernelModuleConfig, licensing secretName (deprecated ConfigMap), secretEnv, podSecurityContext, kataSandboxDevicePlugin, sandboxWorkloads.mode, NRI plugin/CI-related fields, expanded toleration operators, and added disabled state.
Templates & Rendering Logic
packages/system/gpu-operator/charts/gpu-operator/templates/*.yaml
Numerous template changes: namespace-scoped RBAC/SA, use manage-crds binary for CRD apply/delete in hooks, added extraObjects template, refined ClusterPolicy rendering for new fields (vgpu/kata/CDI), conditional NVIDIADriver/license secret handling, simplified nodefeaturerules to kernel-module rules, and added validation preventing NRI enablement when CDI is disabled.
Values & Talos vGPU overrides
packages/system/gpu-operator/charts/gpu-operator/values.yaml, packages/system/gpu-operator/values-talos-vgpu.yaml
Major values updates: enabled CDI by default with nriPluginEnabled: false, introduced sandboxWorkloads.mode, switched many env lists to env: [], added extraObjects, consolidated NFD ports to 8080, added PodDisruptionBudgets, driver/licensing secret wiring, and added values-talos-vgpu.yaml enabling vgpuManager/vgpuDeviceManager and disabling the standard driver/devicePlugin for Talos vGPU setups.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin
    participant BuildServer
    participant Registry
    participant GPU_Operator
    participant NLS as "NVIDIA License Server"
    participant KubeVirt

    Admin->>BuildServer: Provide NVIDIA .run files & EULA
    BuildServer->>Registry: Build & push vgpu-manager driver image
    Admin->>GPU_Operator: Install Package CR with variant=vgpu (values-talos-vgpu)
    GPU_Operator->>Registry: Pull vgpu-manager image
    GPU_Operator->>NLS: Reference licensing Secret (gridd.conf) -> configure licensing
    Admin->>KubeVirt: Create VirtualMachine spec referencing vGPU resource
    KubeVirt->>GPU_Operator: Request mediated device allocation on host
    GPU_Operator->>Registry: Ensure driver/kernel modules loaded (vgpuManager)
    GPU_Operator->>KubeVirt: Expose vGPU resource to VM (permittedHostDevices/mediatedDeviceTypes)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 I nibble bytes and build a crate,
I stitch vGPU to the cluster gate.
From driver image to license key,
I hop through pods so VMs roam free.
Hooray — kernels, secrets, configs done; let rabbits dance beneath the sun!

🚥 Pre-merge checks (✅ 3 passed)

  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Title check: ✅ Passed. The title accurately summarizes the main changes: updating GPU Operator to v26.3.0 and adding a vGPU variant for GPU sharing, which are the primary objectives of this PR.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.



@lexfrei lexfrei self-assigned this Apr 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces NVIDIA vGPU support, updates the GPU Operator to version v26.3.0, and upgrades Node Feature Discovery (NFD) to 0.18.3. Key changes include the addition of a vgpu package variant, transition to HTTP-based health probes in NFD, and the adoption of a dedicated utility for CRD management. Feedback suggests updating the documentation to use Secrets for licensing configuration to align with security best practices and removing the deprecated cdi.default field from the ClusterPolicy template.

Comment thread docs/gpu-vgpu.md Outdated
Comment on lines +79 to +104
1. Create a ConfigMap with the NLS client configuration in the `cozy-gpu-operator` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
data:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

2. Reference the ConfigMap in the Package values:

```yaml
gpu-operator:
  vgpuManager:
    repository: registry.example.com/nvidia
    version: "550.90.05"
  driver:
    licensingConfig:
      configMapName: licensing-config
```
Contributor


medium

The example uses ConfigMap and configMapName for licensing configuration. However, the updated CRDs (nvidia.com_clusterpolicies.yaml and nvidia.com_nvidiadrivers.yaml) mark configMapName as deprecated in favor of secretName. It is recommended to update the documentation to use a Secret for licensing configuration to align with the new best practices for security.

Comment on lines +152 to +154
    {{- if .Values.cdi.default }}
    default: {{ .Values.cdi.default }}
    {{- end }}
Contributor


medium

The cdi.default field is deprecated in the ClusterPolicy CRD and is no longer used by the GPU Operator (setting cdi.enabled=true now automatically configures it as the default). This block should be removed to align with the updated CRD schema and avoid confusion.

@lexfrei lexfrei marked this pull request as ready for review April 2, 2026 12:51
@dosubot dosubot Bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files) and kind/feature (Categorizes issue or PR as related to a new feature) labels Apr 2, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/gpu-vgpu.md`:
- Around line 81-109: The doc incorrectly instructs creating a ConfigMap and
using driver.licensingConfig.configMapName; update it to create a Secret (kind:
Secret) containing the gridd.conf under stringData and change the package values
example to use driver.licensingConfig.secretName: licensing-config so the
chart/schema picks up the NLS licensing configuration.

In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`:
- Around line 97-100: The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.

In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`:
- Around line 917-924: The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51f9626b-f0b3-498a-82e0-c845bb0b24ad

📥 Commits

Reviewing files that changed from the base of the PR and between 73ec5a5 and ab5c618.

⛔ Files ignored due to path filters (1)
  • packages/system/gpu-operator/charts/gpu-operator/Chart.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • docs/gpu-vgpu.md
  • packages/core/platform/sources/gpu-operator.yaml
  • packages/system/gpu-operator/Makefile
  • packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/Chart.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/README.md
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/_helpers.tpl
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master-pdb.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc-pdb.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/prometheus.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/topologyupdater.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/worker.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/values.yaml
  • packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_clusterpolicies.yaml
  • packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/cleanup_crd.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/clusterpolicy.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/extra-objects.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/nodefeaturerules.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/nvidiadriver.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/operator.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/role.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/rolebinding.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/serviceaccount.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/upgrade_crd.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/validations.yaml
  • packages/system/gpu-operator/charts/gpu-operator/values.yaml
  • packages/system/gpu-operator/values-talos-vgpu.yaml

Comment thread packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml Outdated
Comment on lines +97 to +100
      {{- with .Values.master.resources }}
      resources:
        {{- toYaml . | nindent 8 }}
      {{- end }}
Contributor


⚠️ Potential issue | 🔴 Critical

Resources block is misplaced outside the container spec.

The resources block is rendered at the pod spec level (after tolerations) rather than inside the nfd-master container definition. This would cause a Kubernetes API validation error since resources is not a valid pod spec field.

🐛 Proposed fix to move resources inside container spec

The resources block should be inside the container definition (after line 83, inside the nfd-master container):

           args:
             - "-prune"
             {{- if .Values.master.instance | empty | not }}
             - "-instance={{ .Values.master.instance }}"
             {{- end }}
+          {{- with .Values.master.resources }}
+          resources:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
       restartPolicy: Never
       {{- with .Values.master.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.affinity }}
       affinity:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.tolerations }}
       tolerations:
         {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- with .Values.master.resources }}
-      resources:
-        {{- toYaml . | nindent 8 }}
-      {{- end }}
 {{- end }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`
around lines 97 - 100, The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.

Comment on lines 917 to 924
 operator:
   description: |-
     Operator represents a key's relationship to the value.
-    Valid operators are Exists and Equal. Defaults to Equal.
+    Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
     Exists is equivalent to wildcard for value, so that a pod can
     tolerate all taints of a particular category.
+    Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
   type: string
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Do Kubernetes Pod tolerations support `Lt` and `Gt` operators behind the `TaintTolerationComparisonOperators` feature gate, and in which Kubernetes versions is that feature available or enabled by default?

💡 Result:

Yes, Kubernetes Pod tolerations support the Lt (Less Than) and Gt (Greater Than) operators behind the TaintTolerationComparisonOperators feature gate. These operators enable numeric comparisons between taint values and toleration values (both must be valid 64-bit positive integers), allowing threshold-based scheduling. The feature is available starting in Kubernetes v1.35 as an alpha feature (disabled by default). It must be explicitly enabled via the --feature-gates=TaintTolerationComparisonOperators=true flag on the API server and kube-scheduler. As of 2026-04-02 (during v1.36 development), it remains alpha and is not enabled by default in any version.



🏁 Script executed:

#!/bin/bash
# Search for Kubernetes version constraints in the repository
fd -t f -e yaml -e txt -e md | xargs grep -l "kubernetes" | head -20

# Also search for version info in common locations
if [ -f "Chart.yaml" ]; then echo "=== Chart.yaml ===" && cat Chart.yaml; fi
if [ -f "requirements.yaml" ]; then echo "=== requirements.yaml ===" && cat requirements.yaml; fi
if [ -f "README.md" ]; then echo "=== README.md (first 50 lines) ===" && head -50 README.md; fi
if [ -f "go.mod" ]; then echo "=== go.mod ===" && cat go.mod; fi

# Search for version constraints in the gpu-operator chart
if [ -d "packages/system/gpu-operator" ]; then
  echo "=== Files in gpu-operator ==="
  ls -la packages/system/gpu-operator/ | head -20
fi

Repository: cozystack/cozystack

Length of output: 10781


🏁 Script executed:

#!/bin/bash
# Search for references to TaintTolerationComparisonOperators or kubernetes version requirements
rg -i "kubernetes|k8s|version|feature.*gate|taint" --type yaml -A 2 -B 2 | head -100

Repository: cozystack/cozystack

Length of output: 13930


🏁 Script executed:

#!/bin/bash
# Check if this is a vendored chart by looking for upstream references
if [ -d "packages/system/gpu-operator/charts" ]; then
  echo "=== GPU Operator Chart Structure ==="
  find packages/system/gpu-operator/charts -type f -name "Chart.yaml" -o -name "values.yaml" | xargs head -20
fi

Repository: cozystack/cozystack

Length of output: 2228


Verify Lt/Gt toleration operators are supported in supported Kubernetes versions.

The GPU operator chart supports Kubernetes >= 1.16.0, but the TaintTolerationComparisonOperators feature gate is available only in Kubernetes v1.35+ and is disabled by default. Users can define valid NVIDIADriver specs with Lt/Gt operators that will be rejected by the API server on Kubernetes versions 1.16–1.34 or on v1.35+ without the feature gate explicitly enabled. Either update the CRD documentation to clarify this requirement or update the chart's kubeVersion constraint to reflect the actual compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`
around lines 917 - 924, The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.

@IvanHunters IvanHunters self-assigned this Apr 9, 2026
@lexfrei lexfrei marked this pull request as draft April 10, 2026 11:46
@Arsolitt
Contributor

Testing Report: gpu-operator v26.3.0 on Talos Linux

Tested this PR on a Talos v1.12.6 cluster (3× VM nodes with NVIDIA A10 GPUs, Kubernetes v1.34.3).

Critical: Talos OS version parsing failure

The ClusterPolicy enters notReady state immediately after upgrade:

failed to retrieve GPU node OS tag: error processing OS major version v1: strconv.Atoi: parsing "v1": invalid syntax

This is a known bug in gpu-operator v26.3.0 (NVIDIA/gpu-operator#2239). The getGPUNodeOSTag() function in controllers/state_manager.go runs unconditionally during ClusterPolicy controller init for any cluster with GPU-labeled nodes — it is not gated by driver.enabled or any other ClusterPolicy setting. Since Talos sets VERSION_ID=v1.12.6 (with v prefix), strconv.Atoi fails on the major version.

Already fixed upstream in NVIDIA/gpu-operator#2244, released as v26.3.1 on 2026-04-18. The vendored chart should be bumped from v26.3.0 to v26.3.1.

This affects all Talos deployments regardless of variant (default or vgpu).

Secondary: nvidia-operator-validator permission error

On nodes where the validator pod was re-rolled, the driver-validation init container crashes with:

error creating symlink creator: failed to create device node nvidia-modeset: permission denied

This happens during /dev/char symlink creation. Could not fully isolate this issue since the ClusterPolicy failure above prevents normal reconciliation. Might be resolved by v26.3.1 as well, but worth verifying separately. See NVIDIA/gpu-operator#1687 for related Talos context.

Summary

The primary blocker is the OS version parsing bug — any Talos cluster with GPU nodes will hit it. Bumping the vendored chart to v26.3.1 should resolve this. The validator permission issue needs verification after the bump.
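
The parse failure is easy to reproduce in isolation. The sketch below mimics the failing step as reconstructed from the error message, not the actual upstream source of getGPUNodeOSTag():

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOSMajor mimics the failing step: the major component of VERSION_ID
// is fed verbatim to strconv.Atoi (hypothetical reconstruction; the real
// function lives in controllers/state_manager.go).
func parseOSMajor(versionID string) (int, error) {
	major := strings.SplitN(versionID, ".", 2)[0]
	return strconv.Atoi(major)
}

func main() {
	// Talos /etc/os-release sets VERSION_ID with a "v" prefix.
	if _, err := parseOSMajor("v1.12.6"); err != nil {
		fmt.Println(err) // prints: strconv.Atoi: parsing "v1": invalid syntax
	}
}
```

Stripping a leading "v" (or using a semver-aware parser) before the Atoi call is the shape of the upstream fix that landed in v26.3.1.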

@IvanHunters IvanHunters removed their assignment Apr 22, 2026
@github-actions github-actions Bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) and removed the do-not-merge label Apr 27, 2026
@lexfrei lexfrei force-pushed the feat/update-gpu-operator-v26.3.0 branch from 783b51b to 5a61a33 on April 28, 2026 at 20:05
@github-actions github-actions Bot added the area/virtualization label (Issues or PRs related to virtualization: kubevirt, cdi, vmi, vm-import) Apr 28, 2026
@lexfrei lexfrei changed the title [gpu-operator] Update to v26.3.0 and add vGPU variant [gpu-operator] Update to v26.3.1 and add experimental vGPU variant Apr 30, 2026
lexfrei added a commit to cozystack/website that referenced this pull request Apr 30, 2026
…stack#2323 fixes

Mirrors the corrections that landed in cozystack/cozystack#2323 docs/gpu-vgpu.md:

* Lead the vgpu install section with an experimental-status alert. The
  vGPU Manager DaemonSet works, but profile assignment is currently
  out-of-band (echo <id> > /sys/.../current_vgpu_type per VF) and
  resets on every reboot — without an automated mechanism the variant
  silently advertises zero allocatable resources.
* Note that the vgpu variant now sets sandboxWorkloads.defaultWorkload
  to vm-vgpu out of the box (cozystack#2323), so per-node labelling is
  optional rather than required.
* Move imagePullSecrets under vgpuManager: in the example. The chart
  reads it per-component (vgpuManager.imagePullSecrets,
  driver.imagePullSecrets, …); placing it at the package root would
  silently render-no-op and the DaemonSet would ImagePullBackOff
  without an obvious error.
* Split B100 (Blackwell) from L4/L40/L40S (Ada Lovelace) — the
  driver-model commentary applies to both architectures, but they
  should be named correctly.
* Note that on L40S the SR-IOV VFs share the PF device ID
  (10de:26b9), so a single pciVendorSelector matches both. Add lspci
  sanity-check guidance for other generations where PF/VF IDs may
  differ.
* Add a migration note in the Licensing section: chart v25 → v26
  deprecated driver.licensingConfig.configMapName in favour of
  secretName; old key still works but emits a deprecation marker in
  the CRD. SR-IOV vGPU does not consume the host-side licensing knob
  at all — relevant only for passthrough operators carrying old
  configMapName overrides.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added a commit to cozystack/website that referenced this pull request Apr 30, 2026
Six fixes from review:

* Add a top-of-section warning that the entire vgpu workflow depends
  on cozystack/cozystack#2323 (still in draft) and KubeVirt v1.9.0
  (unreleased) — neither is in any current Cozystack release. Without
  this banner the page reads like a supported workflow.

* Drop the incorrect claim that gpus: expects an mdev resource. The
  upstream KubeVirt VirtualMachine spec accepts both PCI and mdev
  resource names under either gpus: or hostDevices:. Restate the
  hostDevices vs gpus distinction as convention, not a hard
  constraint, and explain the cozystack/v1alpha1 wrapper trade-off.

* Drop the self-contradicting parenthetical that called
  pciVendorSelector 'the device ID, not the vendor:device pair' —
  the example right above already uses the vendor:device tuple
  (10DE:26B9), and the upstream KubeVirt godoc documents the field
  exactly as that tuple.

* Re-quote siderolabs/extensions#461. The issue was closed with
  stateReason=COMPLETED, not not_planned; the literal phrase 'won't
  fix' does not appear in the closure. Replace with the substantive
  reason from rothgar's closing comment.

* Fix pre-existing PCI ID error in the passthrough lspci sample
  output: A10 is 10de:2236 (verified against pci-ids.ucw.cz), the
  prior text said 10de:26b9 which is L40S. The rest of the
  passthrough section already uses 2236 correctly, so this was a
  copy-paste artefact.

* Soften the KubeVirt v1.9.0 ETA wording. KubeVirt does not publish
  hard release dates; replace 'ETA July 2026' with 'targeted for
  the v1.9.0 release; track the PR for the actual release tag'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added a commit to cozystack/website that referenced this pull request Apr 30, 2026
… to vGPU 20.x

Round 4 of review fixes for the vgpu section.

Blocking fixes:

* Front-matter title was 'Running VMs with GPU Passthrough' but the
  page now also covers vGPU as a co-equal section. Update title /
  linkTitle / description / lead paragraph to reflect both flows
  and add a jump-link to the vGPU section.
* Unify ClientConfigToken file mode at 0744 across both sections
  (step 4 still had 0600, contradicting step 6's 0744 with the
  NVIDIA-cited justification). 0600 silently breaks license
  activation when nvidia-gridd runs as a service user.
* Harden the profile-loader DaemonSet against copy-paste failures:
  read-before-write so manual out-of-band changes are visible (and
  so the kernel does not reject writes while a VM holds the VF and
  the script logs an error every minute on every busy VF), strip
  trailing #-comments from ConfigMap lines, log on malformed lines
  instead of an opaque 'failed to set'.
* Add an explicit side-effect callout: while this DaemonSet runs,
  manual kubectl exec changes to current_vgpu_type are reverted
  within 60 s. Edit the ConfigMap rather than the sysfs file.
* Pin the SR-IOV path to vGPU 20.x explicitly. The previous '17/20'
  framing was misleading because the SR-IOV / pciHostDevices flow
  pairs specifically with branch 20 (driver 595.x); the example
  build is a 595.x .run, and a reader on a 17.x subscription would
  build the wrong manager image.
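The profile-loader hardening above (read-before-write, `#`-comment stripping, logging on malformed lines) can be sketched as one reconcile pass. Assumptions, not the shipped script: config lines are `<vf-address> <type-id>`, the sysfs layout follows NVIDIA's SR-IOV vGPU convention, and `SYSFS_ROOT` is a test-only override:

```shell
#!/usr/bin/env bash
# Sketch of one reconcile pass of the profile-loader. Assumptions:
# lines are "<vf-address> <type-id>" with optional trailing #-comments;
# sysfs layout per NVIDIA SR-IOV vGPU; SYSFS_ROOT is test-only.
set -uo pipefail

reconcile_profiles() {
  local conf="$1" root="${SYSFS_ROOT:-/sys}"
  while IFS= read -r line; do
    line="${line%%#*}"                     # strip trailing #-comments
    set -- $line                           # split into whitespace fields
    [ $# -eq 0 ] && continue               # blank / comment-only line
    if [ $# -ne 2 ]; then
      echo "malformed line: '$line'" >&2   # log instead of an opaque failure
      continue
    fi
    local node="$root/bus/pci/devices/$1/nvidia/current_vgpu_type"
    # Read before write: a VF already at the desired profile (or held
    # busy by a VM) is not rewritten and re-logged on every pass.
    [ "$(cat "$node")" = "$2" ] && continue
    echo "$2" > "$node" || echo "failed to set $1 -> $2" >&2
  done < "$conf"
}
```

In the DaemonSet this pass would run in a loop (e.g. every 60 s), which is why manual sysfs edits get reverted while it is running.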

Worth-fixing:

* Reword the cozystack/cozystack#2323 reference time-stable so the
  text does not need editing once the upstream PR merges.
* Tighten the wrapper gpus: claim. The Cozystack VirtualMachine
  wrapper passes deviceName straight through to KubeVirt; what's
  missing for SR-IOV vGPU is the hostDevices: surface for headless
  setups and end-to-end exercise on real hardware. Be precise about
  which gap is which.
* Add (passthrough only) to the licensing migration alert so a vGPU
  operator does not 'fix' their setup by switching to secretName
  and then wonder why nothing changed.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added 3 commits April 30, 2026 14:52
Update vendored upstream chart to v26.3.0. Key changes:
- CDI enabled by default
- Driver version 580.126.20
- NFD subchart updated to 0.18.3
- Component versions bumped (toolkit v1.19.0, device-plugin v0.19.0, dcgm-exporter 4.5.2)
- Pin chart version in Makefile for reproducibility

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Add vGPU mode as a new variant alongside the existing passthrough (default) variant.
This enables sharing a single physical GPU across multiple VMs using NVIDIA mediated devices.

Changes:
- Add values-talos-vgpu.yaml with vGPU Manager and vGPU Device Manager enabled
- Add 'vgpu' variant to PackageSource
- Add documentation for vGPU setup (driver image, NLS licensing, profiles, KubeVirt)

Users select the vgpu variant via Package CR and provide their proprietary
vGPU Manager image coordinates through component values.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replace simplified Containerfile with NVIDIA's Makefile-based build
system from gitlab.com/nvidia/container-images/driver. The GPU Operator
expects pre-compiled kernel modules, not a raw .run file.

Add EULA notice about public redistribution of vGPU driver images.
Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added 14 commits April 30, 2026 14:52
ConfigMap-based licensing configuration (configMapName) is deprecated
in GPU Operator v26.x. Switch to Secret-based configuration (secretName)
as recommended by the updated CRD schema.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pull latest patch release: GDRCopy v2.5.2, Kata Sandbox Device Plugin
v0.0.3, multi-arch digest fix for KubeVirt GPU Device Plugin (no longer
amd64-only), and OS release name detection fix. NFD remains v0.18.3.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ncy, DLS-based licensing

Rewrites docs/gpu-vgpu.md to match the actual NVIDIA driver model on
current data-centre GPUs. Key changes:

- Drop the implicit assumption that vGPU uses mediated devices. Ada
  Lovelace and newer (L4, L40, L40S, B100) on the vGPU 17/20 branch
  use SR-IOV with per-VF NVIDIA sysfs (current_vgpu_type); KubeVirt
  advertises them via permittedHostDevices.pciHostDevices.
- Document KubeVirt PR #16890 as a hard dependency for the SR-IOV
  path. The PR is in main since 2026-04-10 and will ship in v1.9.0
  (ETA July 2026); released tags v1.6.x/1.7.x/1.8.x do not include
  it and backports are not planned.
- Replace the legacy NLS configuration block (ServerAddress=,
  ServerPort=7070, FeatureType=1) with the current DLS flow:
  ClientConfigToken file in /etc/nvidia/ClientConfigToken/, gridd.conf
  with FeatureType=0 (auto-detect).
- Switch the build instructions from the archived
  gitlab.com/nvidia/container-images/driver to the upstream
  github.com/NVIDIA/gpu-driver-container.
- Switch sample VirtualMachine spec from gpus: (mdev resource) to
  hostDevices: (PCI resource).
- Note explicitly that Talos is not recommended for the vGPU variant
  — the proprietary .run cannot be redistributed in a Talos system
  extension, and siderolabs/extensions#461 was closed won't-fix.
  Passthrough on Talos is unaffected.
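The `hostDevices:` spec change mentioned above can be sketched as the following VirtualMachine fragment. The resource name and alias are placeholders; `deviceName` must match the `resourceName` configured in KubeVirt's `permittedHostDevices`:

```yaml
# Sketch only: nvidia.com/L40S and vgpu1 are hypothetical placeholders.
spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - name: vgpu1
              deviceName: nvidia.com/L40S   # must match permittedHostDevices resourceName
```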

Rename packages/system/gpu-operator/values-talos-vgpu.yaml to
values-vgpu.yaml — the overrides (sandbox workloads, pod-side driver
and device plugin disabled) are not Talos-specific, they apply to
any VM-only GPU node. Update PackageSource accordingly.

Default vgpuDeviceManager.enabled: false. The DaemonSet walks
/sys/class/mdev_bus/, which does not exist on Ada+ — keeping it on
by default would crashloop modern hardware. Operators of older GPUs
(Pascal/Volta/Turing/Ampere) can flip the flag back on.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ng example

Addresses follow-up review of the vGPU variant + chart bump.

* values.yaml: pin ccManager.enabled=false and vgpuDeviceManager.enabled=false.
  Upstream chart v26.x flipped both to true. ccManager auto-enables CC
  mode on Hopper-class hardware after upgrade — a non-trivial behavior
  change that should be opt-in. vgpuDeviceManager walks
  /sys/class/mdev_bus/, which does not exist on Ada+ (L4/L40/L40S/B100
  with vGPU 17/20). Pinning both off keeps the default footprint as it
  was on v25.x; users explicitly opt in.
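As a sketch, the pins described above amount to roughly this values.yaml fragment:

```yaml
# Keep the v25.x default footprint; users opt in explicitly.
ccManager:
  enabled: false          # upstream v26.x default flipped this to true
vgpuDeviceManager:
  enabled: false          # walks /sys/class/mdev_bus/, absent on Ada+ SR-IOV
```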

* hack/dcgm-default-counters.csv: refresh from dcgm-exporter tag
  4.5.1-4.8.0 (matches the image bumped earlier in this PR). The
  previous snapshot was pinned to 4.1.1-4.0.4. Adds DCGM_FI_DEV_FB_RESERVED
  and other fields introduced upstream. Header comment updated.
  hack/check-gpu-recording-rules.bats passes against the refreshed
  fixture.

* docs/gpu-vgpu.md (Licensing section):
  - Drop the wrong recommendation to wire the licensing Secret through
    gpu-operator's driver.licensingConfig.secretName. That mounts into
    the host driver pod, where SR-IOV vGPU does not consume a license.
    Document guest-side delivery (cloud-init write_files for
    /etc/nvidia/ClientConfigToken/ and /etc/nvidia/gridd.conf) end to
    end.
  - Correct FeatureType comment. Per the NVIDIA Virtual GPU Software
    Licensing User Guide, FeatureType=0 is the unlicensed state, not
    auto-detect; FeatureType=1 is the value that asks the driver to
    auto-select the license type from the configured vGPU profile.
    Switch the example to FeatureType=1 and list all four valid values
    with their meaning.
  - Add nvidia-smi verification snippet inside the guest.
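Guest-side delivery can be sketched as a cloud-init fragment. The token body and file names are placeholders; the `FeatureType` comments reflect only the two values discussed above:

```yaml
#cloud-config
# Sketch: guest-side license delivery. Token content is a placeholder.
write_files:
  - path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok
    permissions: '0744'    # per NVIDIA's guidance; 0600 can break nvidia-gridd
    content: |
      <token downloaded from the DLS instance>
  - path: /etc/nvidia/gridd.conf
    permissions: '0644'
    content: |
      # 0 = unlicensed; 1 = auto-select license type from the vGPU profile
      FeatureType=1
```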

* docs/gpu-vgpu.md (KubeVirt section): note that L40S SR-IOV VFs share
  the PF device ID (10de:26b9), so a single pciVendorSelector matches
  both. Recommend lspci -nn -d 10de: as a sanity check on other
  generations where PF/VF IDs may differ.
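For illustration, the corresponding KubeVirt fragment might look like this; `resourceName` is a hypothetical choice, and on other generations the IDs should first be verified with `lspci -nn -d 10de:`:

```yaml
# Sketch: resourceName is a hypothetical choice, not a fixed convention.
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:26B9"   # L40S: PF and SR-IOV VFs share this ID
          resourceName: nvidia.com/L40S
```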

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…path, mark vgpu variant experimental

Addresses second review pass. Three blockers + one polish:

* values-vgpu.yaml: set sandboxWorkloads.defaultWorkload=vm-vgpu.
  Without this the upstream default 'container' applies on any node
  without the per-node nvidia.com/gpu.workload.config label, the
  operator selects the container workload (driver/devicePlugin
  DaemonSets), and those are disabled in this variant — so nothing
  runs on the node and vgpu-manager never starts. The cluster-wide
  default makes the variant functional out of the box; per-node
  labels still override it for mixed clusters.
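A sketch of the resulting values-vgpu.yaml override (the `enabled` flag is assumed from the variant description, since sandbox workloads must be on for the default to apply):

```yaml
sandboxWorkloads:
  enabled: true            # assumed: required for defaultWorkload to take effect
  defaultWorkload: vm-vgpu # per-node nvidia.com/gpu.workload.config still overrides
```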

* docs/gpu-vgpu.md: imagePullSecrets is per-component on this chart
  (vgpuManager.imagePullSecrets, driver.imagePullSecrets, …). The
  Package CR example previously placed the field at gpu-operator: root
  where it would be silently ignored at render. Move it under
  vgpuManager: where the chart actually reads it.
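A hedged sketch of the corrected Package CR placement; registry, image, version, and secret names are placeholders, and the component keys follow the chart's per-component convention:

```yaml
# Sketch only: all names below are placeholders.
gpu-operator:
  vgpuManager:
    enabled: true
    repository: registry.example.com/nvidia
    image: vgpu-manager
    version: "595.58.02"
    imagePullSecrets:
      - my-registry-secret   # read per component; silently ignored at the chart root
```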

* docs/gpu-vgpu.md, release note: mark the vgpu variant as
  experimental and call out the missing profile-assignment loop
  explicitly. Until an SR-IOV-aware controller writes
  current_vgpu_type per VF (and re-applies after every reboot —
  PCIe re-enumeration resets the value), the variant is not safe
  for production. The docs now lead with that warning instead of
  burying it in a caveat.

* Rename packages/system/gpu-operator/values-talos.yaml to
  values-passthrough.yaml. The overrides (sandboxWorkloads.enabled,
  driver and devicePlugin disabled) apply to any host OS using the
  vfio-pci passthrough path — not Talos specifically. Update
  PackageSource ref accordingly. Comment in the file documents the
  rename so future readers understand the prior name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… Ada/Blackwell

* examples/README.md and examples/values-native-talos.yaml header
  comment still pointed at values-talos.yaml after the rename to
  values-passthrough.yaml. Update both to the new name.
* values-vgpu.yaml: drop the leftover vgpuDeviceManager.config block.
  The chart only renders the config when vgpuDeviceManager.enabled is
  true, and we keep it false on Ada+; the orphan config.default key
  was scaffolding without effect.
* docs/gpu-vgpu.md and values.yaml comment: stop grouping B100 under
  Ada Lovelace. B100 is Blackwell; the SR-IOV vGPU mechanism applies
  to both architectures but they should be named correctly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…licy

Adds hack/check-gpu-operator-variants.bats. The test renders both
package variants with helm template and asserts the resulting
ClusterPolicy spec for the values that this PR cares about:

  * ccManager.enabled=false (pinned in values.yaml; upstream v26 default
    was flipped to true)
  * vgpuDeviceManager.enabled=false (same — and on Ada+/Blackwell the
    DaemonSet would crashloop because /sys/class/mdev_bus is empty)
  * sandboxWorkloads.defaultWorkload — vm-passthrough on the default
    variant, vm-vgpu on the vgpu variant; without these the upstream
    'container' default makes either variant a silent no-op on
    unlabelled GPU nodes
  * vgpuManager.enabled=true on the vgpu variant; driver / devicePlugin
    enabled=false on both variants (proprietary host driver path is
    out of scope for both passthrough and SR-IOV vGPU)

Negative-tested locally: deleting any of these settings makes the
corresponding test fail, so it actually guards regressions on the
next chart bump.
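The assertion style can be illustrated with a simplified helper. This is a hypothetical reduction: the real .bats test renders the chart with `helm template` and inspects the resulting ClusterPolicy, whereas here the rendered output is inlined:

```shell
#!/usr/bin/env bash
# Simplified illustration of the assertion style; the real test renders
# the chart with `helm template` instead of using an inline snippet.
set -uo pipefail

assert_rendered_value() {
  local rendered="$1" key="$2" want="$3"
  grep -Eq "^[[:space:]]*${key}:[[:space:]]*${want}[[:space:]]*\$" <<<"$rendered"
}

rendered='
spec:
  sandboxWorkloads:
    enabled: true
    defaultWorkload: vm-vgpu
'
assert_rendered_value "$rendered" defaultWorkload vm-vgpu \
  && echo "defaultWorkload pinned"
```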

Also addresses last-pass review comments:

  * values-passthrough.yaml: set sandboxWorkloads.defaultWorkload=
    vm-passthrough so the default variant is symmetric with the vgpu
    variant — neither relies on per-node labels for the variant to
    activate. Without this the upstream default 'container' would
    apply on unlabelled GPU nodes and vfio-manager would never start.
  * values.yaml: rewrite the vgpuDeviceManager comment to lead with
    the load-bearing reason (mdev path missing on Ada+/Blackwell)
    instead of the secondary 'stays at zero pods on passthrough'
    rationalisation.
  * docs/gpu-vgpu.md: add a 'Migrating from chart v25.x' subsection
    in the Licensing section pointing at the configMapName ->
    secretName rename. Old key still works upstream but emits a
    deprecation warning; PR body release note already calls this out
    but the docs were missing the migration breadcrumb.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ote siderolabs#461

Two technical inaccuracies caught during review:

* The claim 'gpus: expects an mdev resource and will not match a
  pciHostDevices entry' is wrong. Upstream KubeVirt
  (pkg/virt-launcher/virtwrap/device/hostdevice/gpu/hostdev.go) iterates
  both the PCI pool (v1.PCIResourcePrefix) and the mdev pool
  (v1.MDevResourcePrefix) and accepts whichever resolves. Reword to
  describe gpus: vs hostDevices: as a style convention rather than a
  hard technical constraint.

* siderolabs/extensions#461 was closed with stateReason=COMPLETED, not
  not_planned. Quoting the closure as 'won't fix' is editorial
  recasting that does not match the actual closing comment. Replace
  with the substantive reason from rothgar's closing comment: NVIDIA
  does not publicly distribute the vGPU guest driver, requires
  Enterprise Portal access, and Sidero cannot support vGPU until
  NVIDIA changes its licensing terms.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ording

Mirroring the corresponding cleanup in cozystack/website#467 review:

* KubeVirt v1.9.0 ETA: drop the 'July 2026' date — KubeVirt does not
  publish hard release dates and the in-page wording was inconsistent
  across the file. Replace with 'Targeted at the next minor release
  (v1.9.0); track the PR for the actual release tag', matching the
  website doc.
* Kernel-headers timing: 'The build downloads kernel headers at
  container start time' conflated docker build with pod start; the
  Dockerfile's entrypoint downloads them at runtime. Reword.
* Replace guillemets with ASCII quotes for consistency with the rest
  of the file and the upstream chart docs.
* nvidia-smi -q license check: drop the unnecessary -A 1 — License
  Status value is on the same line as the field name.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
NVIDIA's License System User Guide recommends 744 on the
ClientConfigToken file so nvidia-gridd (which does not necessarily
run as the file owner) can read it. Previous example used 0600,
which can silently break license activation when nvidia-gridd
runs as a service user.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Apple Silicon / arm64 build hosts produce arm64 images natively; GPU
nodes are amd64, so the kubelet pull fails with 'no matching
manifest' when the image is built without --platform. Document the
flag and add a one-line docker login note for private registries.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ream

NVIDIA's gpu-driver-container repository owns the build path —
keeping a parallel set of build instructions in our docs duplicates
upstream documentation and goes stale every time the upstream
Dockerfile or build args change. Replace the docker build / docker
push snippet with a paragraph pointing at the upstream repo and the
NVIDIA Licensing Portal as the sources of truth, keep only the
items that are Cozystack-specific (private-registry / EULA note).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pin a concrete reference set so future readers can tell whether the
guide is still accurate against the cluster they're running:
KubeVirt main nightly 20260429_74d7c52588, this PR's vgpu variant,
NVIDIA vGPU 20.0 host driver 595.58.02, GRID guest driver 595.58.03.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Free-floating 'NVIDIA recommends' claims are flagged as unverifiable
in review. Add the doc URL inline so future readers can verify the
permission model decision.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei force-pushed the feat/update-gpu-operator-v26.3.0 branch from 0e2dfad to b5c1e95 on April 30, 2026 at 11:53
Build job in CI failed because the test relied on bats-specific
machinery the in-repo cozytest.sh runner does not provide:

* BATS_TEST_DIRNAME is undefined when the file is sourced as a plain
  shell script (cozytest.sh's mode); switch to BATS_TEST_FILENAME
  with a $0 fallback, matching hack/check-gpu-recording-rules.bats
  (REPO_ROOT pattern).
* setup()/teardown() are bats lifecycle hooks; cozytest.sh runs each
  @test in a fresh subshell with set -u and never invokes them. With
  TMP only assigned in setup(), every test exited immediately on
  'TMP: unbound variable'. Inline the mktemp + trap inside each
  @test so both runners work the same.

Verified locally that bash hack/cozytest.sh hack/check-gpu-operator-variants.bats
now reports all six tests OK.
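The runner-agnostic pattern described above can be sketched as follows (an assumption of the shape, mirroring the REPO_ROOT pattern referenced in the commit message):

```shell
#!/usr/bin/env bash
# Sketch of the runner-agnostic test preamble: works under bats (which
# sets BATS_TEST_FILENAME) and when sourced as a plain script with set -u.
set -uo pipefail

TEST_FILE="${BATS_TEST_FILENAME:-$0}"   # $0 fallback for plain-shell runners
REPO_ROOT="$(cd "$(dirname "$TEST_FILE")/.." && pwd)"

# Inline per-test temp dir instead of bats setup()/teardown(), so a
# runner that never calls the lifecycle hooks still gets a valid TMP.
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
```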

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Labels

* area/virtualization: Issues or PRs related to virtualization (kubevirt, cdi, vmi, vm-import)
* do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress
* kind/feature: Categorizes issue or PR as related to a new feature
* size:XXL: This PR changes 1000+ lines, ignoring generated files

Development

Successfully merging this pull request may close these issues.

[Request] Support for Nvidia vGPU drivers

3 participants