
chore(deps): update dependency nvidia/nvidia-container-toolkit to v1.19.0 #140

Merged
surajssd merged 1 commit into main from renovate/nvidia-nvidia-container-toolkit-1.x on Apr 30, 2026
Conversation

renovate bot commented on Mar 12, 2026

This PR contains the following updates:

Package | Update | Change
NVIDIA/nvidia-container-toolkit | minor | 1.18.2 -> 1.19.0

Release Notes

NVIDIA/nvidia-container-toolkit (NVIDIA/nvidia-container-toolkit)

v1.19.0

Compare Source

  • Promote v1.19.0-rc.7 to v1.19.0

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate bot force-pushed the renovate/nvidia-nvidia-container-toolkit-1.x branch from 8d0e539 to cd3218b on April 14, 2026 17:23
renovate bot force-pushed the renovate/nvidia-nvidia-container-toolkit-1.x branch 2 times, most recently from 8222e85 to d3a918b on April 29, 2026 20:09
renovate bot force-pushed the renovate/nvidia-nvidia-container-toolkit-1.x branch 3 times, most recently from fa538d0 to 4e8df2d on April 29, 2026 21:31
renovate bot force-pushed the renovate/nvidia-nvidia-container-toolkit-1.x branch from 4e8df2d to b2920a2 on April 29, 2026 21:32
@surajssd (Member) commented:

Testing aks-gpu nvidia-container-toolkit 1.18.2 -> 1.19.0

  • Key verification: Whether nvidia-ctk 1.19.0 works correctly.

Step 1: Login and set up environment

source az-login.sh env azcore-linux-k8s-dev
export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
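
Optionally, before deploying anything, it can help to confirm the chosen GPU VM size is offered in the target region (an extra sanity check, not part of the original flow; az vm list-skus is a standard Azure CLI command):

# An empty table or a restriction in the output means the SKU is not usable here
az vm list-skus --location "${AZURE_REGION}" --size "${NODE_POOL_VM_SIZE}" --output table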

Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with --gpu-driver=none so we can manually install the driver from the aks-gpu PR branch.

git clone https://github.com/Azure/aks-rdma-infiniband
cd aks-rdma-infiniband

./tests/setup-infra/deploy-aks.sh deploy-aks &&
    ./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none
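
Once the node pool is up, confirm kubectl points at the new cluster and that the GPU node carries the accelerator=nvidia label relied on in later steps (an added check; skip the get-credentials call if the helper scripts already fetch credentials):

az aks get-credentials --resource-group "${AZURE_RESOURCE_GROUP}" --name "${CLUSTER_NAME}" --overwrite-existing
kubectl get nodes -l accelerator=nvidia -o wide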

Step 3: Build aks-gpu image from PR #140 branch

Check out the PR branch and build a custom aks-gpu image with the new nvidia-container-toolkit version.

gh pr checkout 140

DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}-ctk1.19.0"
export DRIVER_VERSION
docker build --push \
    --build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
    -t "${IMG}" .
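
As an optional sanity check that the tagged image landed in the registry (docker buildx imagetools ships with recent Docker releases):

docker buildx imagetools inspect "${IMG}"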

Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does via configGPUDrivers() in cse_config.sh.

GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash

Once on the node:

mkdir -p /opt/{actions,gpu}
ctr image pull quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0
ctr run --privileged \
    --net-host \
    --with-ns pid:/proc/1/ns/pid \
    --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
    --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
    -t quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0 \
    gpuinstall /entrypoint.sh install
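
A quick way to confirm the installer finished and the driver is in place before moving on (an added convenience check; if the modules are not loaded yet, the nvidia-modprobe call in the next step handles that):

# The nvidia kernel modules should be listed once the install completes
lsmod | grep -i nvidia
# Prints the installed driver version if the module is loaded
cat /proc/driver/nvidia/version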

Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver and nvidia-container-runtime binary, but does not update containerd's config. In AgentBaker this is done beforehand via a pre-rendered containerd config template (containerd.toml.gtpl). Without this step, GPU workload pods will fail with:

exec /usr/bin/nvidia-smi: no such file or directory

Still on the GPU node (debug pod / chroot):

which nvidia-container-runtime
nvidia-modprobe -u -c0
ldconfig

Use nvidia-ctk to configure containerd. This creates a drop-in file at /etc/containerd/conf.d/99-nvidia.toml that sets nvidia as the default runtime.

WARNING: Do not use sed to patch config.toml directly. The AKS containerd config uses imports = ["/etc/containerd/conf.d/*.toml"] and raw edits can break the CRI plugin, causing unknown service runtime.v1.RuntimeService errors that take the node offline.
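
Before running the configure command, it can be worth verifying that the drop-in directory really is imported by the main config (an added check; the path matches the imports line quoted in the warning above):

grep -n 'imports' /etc/containerd/config.toml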

nvidia-ctk runtime configure --runtime=containerd --set-as-default

Restart containerd and kubelet to pick up the new config:

systemctl restart containerd
systemctl restart kubelet
systemctl status containerd
systemctl status kubelet

Verify the drop-in config:

cat /etc/containerd/conf.d/99-nvidia.toml
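
Based on what nvidia-ctk runtime configure typically writes (treat the exact key names as an assumption; the layout can vary across toolkit versions), the drop-in should set nvidia as the default runtime and point at the nvidia-container-runtime binary:

# Both greps should match if --set-as-default took effect (assumed key names)
grep -n 'default_runtime_name = "nvidia"' /etc/containerd/conf.d/99-nvidia.toml
grep -n 'nvidia-container-runtime' /etc/containerd/conf.d/99-nvidia.toml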

Step 6: Verify driver and nvidia-ctk installation

nvidia-smi
nvidia-ctk --version
# Expected: 1.19.0
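
It is also worth confirming that the runtime binary itself reports the new version and that containerd registered the nvidia runtime (added checks; the crictl call assumes crictl is configured on the node, which is normally the case on AKS):

nvidia-container-runtime --version
# Should show an "nvidia" entry under the CRI runtimes if crictl is configured
crictl info | grep -A3 '"nvidia"'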

Fabric Manager and Persistenced (NVSwitch SKUs only)

NOTE: Fabric Manager only works on NVSwitch SKUs (ND96 H100, ND96 A100, etc.). On PCIe single-GPU SKUs like Standard_NC24ads_A100_v4 it will fail with NV_WARN_NOTHING_TO_DO -- that is expected and can be ignored.

NOTE: nvidia-persistenced is only available on NVSwitch SKUs (ND-series). On PCIe single-GPU SKUs the service does not exist.

systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service

systemctl start nvidia-persistenced.service
systemctl status nvidia-persistenced.service
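
If it is unclear whether a given node actually has NVSwitch, one way to tell from the node is the GPU topology matrix: NVSwitch SKUs show NV# links between GPUs, while a single-GPU PCIe SKU such as Standard_NC24ads_A100_v4 will not (an added check, not part of the original steps):

nvidia-smi topo -m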

Step 7: Test GPU workload

From outside the debug pod, deploy the nvidia device plugin and a test GPU pod.

Deploy the NVIDIA device plugin DaemonSet

Apply the nvidia-device-plugin manifest:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
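
Before scheduling the test pod, confirm the device plugin pod is running and that the GPU is advertised as allocatable on the node (the namespace matches the manifest above; nvidia.com/gpu is the standard resource name the plugin registers):

kubectl -n gpu-resources get pods -o wide
# Should print "1" for the A100 node once the plugin has registered the GPU
kubectl get nodes -l accelerator=nvidia \
    -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}{"\n"}'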

Deploy a test GPU pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Verify

kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
kubectl logs test-gpu

Cleanup

kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait

surajssd merged commit bccc5b9 into main on Apr 30, 2026
4 checks passed
surajssd deleted the renovate/nvidia-nvidia-container-toolkit-1.x branch on April 30, 2026 04:03