Ruigao/fix 1.6 docker #181

Open: hippogr wants to merge 9 commits into dev from ruigao/fix-1.6-docker

@hippogr hippogr commented May 18, 2026

Fix several issues in release 1.6:

  1. Correct the CIDR IP address ranges in the Cilium deployment YAML.
  2. Add a dpkg command to recover from broken dpkg states when deploying VMSS extensions.
  3. Fix the missing openpaidbsdk package in the job-status-change-notification Docker image.
  4. Add the /usr/share/datacenter-gpu-manager-4/bindings/python3 path to the job exporter.
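
For item 1, one way to sanity-check a cluster-pool CIDR before applying the manifest is to confirm that it does not overlap other ranges used by the cluster and that it leaves room for enough per-node subnets. The sketch below uses Python's standard ipaddress module. The function names and any CIDR values passed to them are hypothetical illustrations, not the values in cilium.yaml from this PR.

```python
import ipaddress


def find_overlaps(pod_cidr, reserved_cidrs):
    """Return the reserved CIDRs that overlap the pod CIDR (hypothetical helper)."""
    pod_net = ipaddress.ip_network(pod_cidr, strict=True)
    overlaps = []
    for cidr in reserved_cidrs:
        if pod_net.overlaps(ipaddress.ip_network(cidr, strict=True)):
            overlaps.append(cidr)
    return overlaps


def node_subnet_count(pod_cidr, node_mask_size):
    """Number of per-node subnets of the given prefix length that fit in the pool."""
    pod_net = ipaddress.ip_network(pod_cidr, strict=True)
    if node_mask_size < pod_net.prefixlen:
        raise ValueError("per-node mask must not be shorter than the pool prefix")
    return 2 ** (node_mask_size - pod_net.prefixlen)
```

With strict=True, a CIDR whose host bits are set (for example a network address typed with the wrong prefix length) raises ValueError instead of being silently normalized, which is the kind of mistake this check is meant to surface.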

Copilot AI left a comment

Pull request overview

This PR targets release 1.6 deployment reliability by aligning Cilium cluster-pool CIDR settings, hardening VMSS extension scripts against broken dpkg states, and ensuring required dependencies/paths are present in container images and exporters.

Changes:

  • Adjust Cilium cluster-pool IPv4 CIDR and corresponding Envoy internal CIDR ranges.
  • Add dpkg --configure -a to AKS VM extension helper scripts to recover from interrupted dpkg states before running apt/dpkg operations.
  • Fix missing openpaidbsdk contents in the job-status-change-notification image build and add an additional DCGM Python bindings path for the NVIDIA exporter.
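
The bindings-path change in the last bullet can be sketched as follows. This is a minimal illustration, not the code in nvidia_exporter.py: the helper name add_bindings_paths is hypothetical, and the legacy path is an assumed earlier install location that this PR does not confirm. Only the datacenter-gpu-manager-4 path comes from the PR description.

```python
import os
import sys

# Path added by this PR (taken from the PR description).
DCGM4_BINDINGS = "/usr/share/datacenter-gpu-manager-4/bindings/python3"
# Assumed earlier install location; not confirmed by this PR.
LEGACY_BINDINGS = "/usr/local/dcgm/bindings/python3"


def add_bindings_paths(candidates, path_list=None, exists=os.path.isdir):
    """Append each existing candidate directory to path_list once.

    Returns the list of directories that were actually added.
    """
    if path_list is None:
        path_list = sys.path
    added = []
    for candidate in candidates:
        if exists(candidate) and candidate not in path_list:
            path_list.append(candidate)
            added.append(candidate)
    return added
```

Calling add_bindings_paths([DCGM4_BINDINGS, LEGACY_BINDINGS]) before importing the DCGM modules would let the exporter pick up whichever location is present on the host. Skipping directories that do not exist keeps sys.path free of dead entries.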

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/job-exporter/src/Moneo/src/worker/exporters/nvidia_exporter.py | Adds an additional DCGM Python bindings search path to support alternate install locations. |
| src/alert-manager/build/job-status-change-notification.common.dockerfile | Copies the full openpaidbsdk directory so the image includes the actual package content, not just metadata. |
| contrib/aks/scripts/install-fuse.sh | Runs dpkg --configure -a after waiting for dpkg locks to mitigate broken dpkg states during extension installs. |
| contrib/aks/scripts/config-ipoib.sh | Adds the same dpkg recovery step before apt operations in the IPoIB configuration flow. |
| contrib/aks/k8s-deploy/cilium.yaml | Narrows the cluster-pool CIDR to /16 and updates Envoy internal CIDR prefix lengths accordingly. |
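
The ordering used in the two AKS scripts (recover dpkg first, then run the package operation) can be illustrated in Python. The scripts themselves are shell, so this is only a sketch of the sequence. The helper name recover_then_run is hypothetical, and the lock-waiting step that the scripts perform beforehand is omitted.

```python
import subprocess

# Recovery command named in the review summary.
DPKG_RECOVER_CMD = ["dpkg", "--configure", "-a"]


def recover_then_run(package_cmd, run=subprocess.run):
    """Run the dpkg recovery step, then the given package command.

    check=True makes subprocess.run raise CalledProcessError if either
    command exits non-zero, so a failed recovery stops the install.
    """
    run(DPKG_RECOVER_CMD, check=True)
    return run(list(package_cmd), check=True)
```

The run parameter exists so the sequence can be exercised without touching the host's package database, for example by passing a function that only records the commands it receives.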

3 participants