## Testing aks-gpu nvidia-container-toolkit 1.18.2 -> 1.19.0
### Step 1: Login and set up environment

```sh
source az-login.sh env azcore-linux-k8s-dev

export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
```

### Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with `--gpu-driver=none`:

```sh
git clone https://github.com/Azure/aks-rdma-infiniband
```
```sh
cd aks-rdma-infiniband
./tests/setup-infra/deploy-aks.sh deploy-aks &&
  ./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none
```

### Step 3: Build aks-gpu image from PR #140 branch

Check out the PR branch and build a custom aks-gpu image with the new nvidia-container-toolkit version:

```sh
gh pr checkout 140
```
```sh
DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}-ctk1.19.0"
export DRIVER_VERSION
docker build --push \
  --build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
  -t "$IMG" .
```

### Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does.

```sh
GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
```
```sh
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash
```

Once on the node:

```sh
mkdir -p /opt/{actions,gpu}
ctr image pull quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0
ctr run --privileged \
  --net-host \
  --with-ns pid:/proc/1/ns/pid \
  --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
  --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
  -t quay.io/surajd/aks-gpu:580.126.09-ctk1.19.0 \
  gpuinstall /entrypoint.sh install
```

### Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver and the toolkit, but containerd still has to be pointed at nvidia-container-runtime; until then, GPU pods fail with `exec /usr/bin/nvidia-smi: no such file or directory`. Still on the GPU node (debug pod / chroot):

```sh
which nvidia-container-runtime
```
```sh
nvidia-modprobe -u -c0
ldconfig
```

Use `nvidia-ctk` to configure containerd:

```sh
nvidia-ctk runtime configure --runtime=containerd --set-as-default
```

Restart containerd and kubelet to pick up the new config:

```sh
systemctl restart containerd
systemctl restart kubelet
systemctl status containerd
systemctl status kubelet
```

Verify the drop-in config:

```sh
cat /etc/containerd/conf.d/99-nvidia.toml
```

### Step 6: Verify driver and nvidia-ctk installation

```sh
nvidia-smi
```
```sh
nvidia-ctk --version
# Expected: 1.19.0
```

#### Fabric Manager and Persistenced (NVSwitch SKUs only)

```sh
systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
systemctl start nvidia-persistenced.service
```
```sh
systemctl status nvidia-persistenced.service
```

### Step 7: Test GPU workload

From outside the debug pod, deploy the NVIDIA device plugin and a test GPU pod.

#### Deploy the NVIDIA device plugin DaemonSet

```sh
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
```

#### Deploy a test GPU pod

```sh
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

#### Verify

```sh
kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
```
```sh
kubectl logs test-gpu
```

#### Cleanup

```sh
kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait
```
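The version check in Step 6 can be scripted instead of read by eye. This is a sketch, not part of the PR: `check_ctk_version` is a hypothetical helper, and the sample `nvidia-ctk --version` output strings are illustrative; the helper just pulls the first semver out of whatever the command prints and compares it to the expected release.

```shell
# Sketch: assert the installed toolkit matches the expected release.
# check_ctk_version is a hypothetical helper, not part of aks-rdma-infiniband.
EXPECTED_CTK="1.19.0"

check_ctk_version() {
  # $1: raw output of `nvidia-ctk --version`
  # Extract the first x.y.z version string and compare to EXPECTED_CTK.
  ver=$(printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
  [ "$ver" = "$EXPECTED_CTK" ]
}
```

On the node this could run as `check_ctk_version "$(nvidia-ctk --version)" && echo "toolkit OK"`, failing the test run early if the old 1.18.2 binary is still on the path.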
---

This PR contains the following updates:

| Package | Change |
|---|---|
| NVIDIA/nvidia-container-toolkit (NVIDIA/nvidia-container-toolkit) | `1.18.2` → `1.19.0` |

**Release Notes**

#### v1.19.0

**Configuration**

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

This PR was generated by Mend Renovate.