## Testing aks-gpu nvidia-container-toolkit 1.18.2 -> 1.19.0
### Step 1: Login and set up environment

```sh
source az-login.sh env azcore-linux-k8s-dev

export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
```

### Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with `--gpu-driver=none`:

```sh
git clone https://github.com/Azure/aks-rdma-infiniband
```
```sh
cd aks-rdma-infiniband
./tests/setup-infra/deploy-aks.sh deploy-aks &&
  ./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none
```

### Step 3: Build aks-gpu image from PR #140 branch

Check out the PR branch and build a custom aks-gpu image with the new nvidia-container-toolkit version:

```sh
gh pr checkout 140
```
```sh
DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}-ctk1.19.0"
export DRIVER_VERSION
docker build --push \
  --build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
  -t "$IMG" .
```

### Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does.

```sh
GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
```
```sh
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash
```

Once on the node:

```sh
mkdir -p /opt/{actions,gpu}
ctr image pull quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0
ctr run --privileged \
  --net-host \
  --with-ns pid:/proc/1/ns/pid \
  --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
  --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
  -t quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0 \
  gpuinstall /entrypoint.sh install
```

### Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver and the toolkit, but containerd still has to be pointed at nvidia-container-runtime; until then, GPU pods fail with `exec /usr/bin/nvidia-smi: no such file or directory`. Still on the GPU node (debug pod / chroot):

```sh
which nvidia-container-runtime
```
```sh
nvidia-modprobe -u -c0
ldconfig
```

Use `nvidia-ctk` to configure containerd:

```sh
nvidia-ctk runtime configure --runtime=containerd --set-as-default
```

Restart containerd and kubelet to pick up the new config:

```sh
systemctl restart containerd
systemctl restart kubelet
systemctl status containerd
systemctl status kubelet
```

Verify the drop-in config:

```sh
cat /etc/containerd/conf.d/99-nvidia.toml
```

### Step 6: Verify driver and nvidia-ctk installation

```sh
nvidia-smi
```
```sh
nvidia-ctk --version
# Expected: 1.19.0
```

#### Fabric Manager and Persistenced (NVSwitch SKUs only)

```sh
systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
systemctl start nvidia-persistenced.service
```
```sh
systemctl status nvidia-persistenced.service
```

### Step 7: Test GPU workload

From outside the debug pod, deploy the NVIDIA device plugin and a test GPU pod.

#### Deploy the NVIDIA device plugin DaemonSet

```sh
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
```

#### Deploy a test GPU pod

```sh
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

#### Verify

```sh
kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
```
```sh
kubectl logs test-gpu
```

#### Cleanup

```sh
kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait
```
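The version check in Step 6 can be scripted instead of read by eye. This is a sketch, not part of the PR: `check_ctk_version` is a hypothetical helper, and the sample `nvidia-ctk --version` output strings are illustrative; the helper just pulls the first semver out of whatever the command prints and compares it to the expected release.

```shell
# Sketch: assert the installed toolkit matches the expected release.
# check_ctk_version is a hypothetical helper, not part of aks-rdma-infiniband.
EXPECTED_CTK="1.19.0"

check_ctk_version() {
  # $1: raw output of `nvidia-ctk --version`
  # Extract the first x.y.z version string and compare to EXPECTED_CTK.
  ver=$(printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
  [ "$ver" = "$EXPECTED_CTK" ]
}
```

On the node this could run as `check_ctk_version "$(nvidia-ctk --version)" && echo "toolkit OK"`, failing the test run early if the old 1.18.2 binary is still on the path.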
---

This PR contains the following updates:

| Package | Change |
|---|---|
| NVIDIA/nvidia-container-toolkit (NVIDIA/nvidia-container-toolkit) | `1.18.2` → `1.19.0` |

**Release Notes**

#### v1.19.0

**Configuration**

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

This PR was generated by Mend Renovate.