Commit a656ed7

ci(docker): use prebuilt Rust binaries by default (#1027)
* ci(docker): use prebuilt Rust binaries by default

  Flip Docker image builds to consume staged native Rust artifacts, remove in-Docker Rust build stages, and publish per-arch images with a manifest merge. Add local staging support for prebuilt gateway and sandbox binaries so development image builds continue to work without CI artifacts.

  Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>

* ci(docker): address prebuilt build review feedback

* ci(rust): allow existing vfio complexity

* ci(rust): pin toolchain to 1.95

---------

Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>
1 parent ee2de81 commit a656ed7

28 files changed

Lines changed: 725 additions & 504 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -184,7 +184,7 @@ Component images (server, sandbox) can reach kubelet via two paths:
 
 **Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).
 
-Gateway image builds now stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate (including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`) is copied into the staged workspace there.
+Gateway and cluster image builds consume Rust binaries staged at `deploy/docker/.build/prebuilt-binaries/<arch>/`. In CI these come from the reusable Rust native build workflow; locally `tasks/scripts/docker-build-image.sh` runs `tasks/scripts/stage-prebuilt-binaries.sh` before invoking Docker unless `PREBUILT_AUTO_STAGE=0` is set.
 
 ```bash
 # Verify image refs currently used by openshell deployment
@@ -368,7 +368,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
 | Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted |
 | gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
-| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
+| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without a staged `openshell-sandbox` prebuilt binary. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
 | `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |
 | `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |
 
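The staging layout named in the updated troubleshooting text can be checked by hand before a local image build. The following is a minimal sketch under stated assumptions: `check_staged_binary` is a hypothetical helper that does not exist in the repository, and only the directory layout `deploy/docker/.build/prebuilt-binaries/<arch>/` and the binary names are taken from this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report whether a prebuilt
# binary is staged for the given architecture under a repository root.
check_staged_binary() {
  local root="$1" arch="$2" binary="$3"
  local path="${root}/deploy/docker/.build/prebuilt-binaries/${arch}/${binary}"
  if [[ -x "$path" ]]; then
    # The file exists and is executable, so an image build could copy it in.
    echo "staged: $path"
    return 0
  fi
  echo "missing: $path" >&2
  return 1
}
```

Running `check_staged_binary . amd64 openshell-sandbox` from the repository root before `mise run docker:build:cluster` would show whether the missing-supervisor case from the table above is likely to occur.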
.github/workflows/docker-build.yml

Lines changed: 202 additions & 26 deletions
@@ -8,7 +8,7 @@ on:
         required: true
         type: string
       timeout-minutes:
-        description: "Job timeout in minutes"
+        description: "Per-arch Docker image job timeout in minutes"
         required: false
         type: number
         default: 20
@@ -23,7 +23,7 @@ on:
         type: string
         default: "linux/amd64,linux/arm64"
       runner:
-        description: "GitHub Actions runner label"
+        description: "Deprecated; per-arch native runners are selected automatically"
         required: false
         type: string
         default: "build-amd64"
@@ -35,17 +35,121 @@ on:
 
 env:
   MISE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-  SCCACHE_MEMCACHED_ENDPOINT: ${{ vars.SCCACHE_MEMCACHED_ENDPOINT }}
 
 permissions:
   contents: read
   packages: write
 
+defaults:
+  run:
+    shell: bash
+
 jobs:
+  resolve:
+    name: Resolve build plan
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.resolve.outputs.matrix }}
+      platform_count: ${{ steps.resolve.outputs.platform_count }}
+      arches: ${{ steps.resolve.outputs.arches }}
+      binary_component: ${{ steps.resolve.outputs.binary_component }}
+      binary_name: ${{ steps.resolve.outputs.binary_name }}
+      artifact_prefix: ${{ steps.resolve.outputs.artifact_prefix }}
+    steps:
+      - name: Resolve component and platform matrix
+        id: resolve
+        run: |
+          set -euo pipefail
+
+          component="${{ inputs.component }}"
+          case "$component" in
+            gateway)
+              binary_component=gateway
+              binary_name=openshell-gateway
+              ;;
+            supervisor|cluster)
+              binary_component=sandbox
+              binary_name=openshell-sandbox
+              ;;
+            *)
+              echo "unsupported component: $component" >&2
+              exit 1
+              ;;
+          esac
+
+          platform_input="${{ inputs.platform }}"
+          platform_input="${platform_input//[[:space:]]/}"
+          if [[ -z "$platform_input" ]]; then
+            echo "platform input must not be empty" >&2
+            exit 1
+          fi
+
+          IFS=',' read -r -a platforms <<< "$platform_input"
+          matrix='{"include":['
+          arches=()
+          count=0
+
+          for platform in "${platforms[@]}"; do
+            case "$platform" in
+              linux/amd64)
+                arch=amd64
+                runner=linux-amd64-cpu8
+                ;;
+              linux/arm64)
+                arch=arm64
+                runner=linux-arm64-cpu8
+                ;;
+              *)
+                echo "unsupported platform: $platform" >&2
+                echo "supported platforms: linux/amd64, linux/arm64" >&2
+                exit 1
+                ;;
+            esac
+
+            if [[ $count -gt 0 ]]; then
+              matrix+=','
+            fi
+            matrix+='{"platform":"'"$platform"'","arch":"'"$arch"'","runner":"'"$runner"'"}'
+            arches+=("$arch")
+            count=$((count + 1))
+          done
+
+          matrix+=']}'
+          {
+            echo "matrix=$matrix"
+            echo "platform_count=$count"
+            echo "arches=${arches[*]}"
+            echo "binary_component=$binary_component"
+            echo "binary_name=$binary_name"
+            echo "artifact_prefix=rust-binary-${component}-${binary_component}"
+          } >> "$GITHUB_OUTPUT"
+
+  rust-binary:
+    name: Rust ${{ needs.resolve.outputs.binary_component }} (${{ matrix.arch }})
+    needs: resolve
+    permissions:
+      contents: read
+      packages: read
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJSON(needs.resolve.outputs.matrix) }}
+    uses: ./.github/workflows/shadow-rust-native-build.yml
+    with:
+      component: ${{ needs.resolve.outputs.binary_component }}
+      arch: ${{ matrix.arch }}
+      cargo-version: ${{ inputs['cargo-version'] }}
+      features: openshell-core/dev-settings
+      artifact-name: ${{ needs.resolve.outputs.artifact_prefix }}-linux-${{ matrix.arch }}
+    secrets: inherit
+
   build:
-    name: Build ${{ inputs.component }}
-    runs-on: ${{ inputs.runner }}
-    timeout-minutes: ${{ inputs.timeout-minutes }}
+    name: Build ${{ inputs.component }} (${{ matrix.arch }})
+    needs: [resolve, rust-binary]
+    runs-on: ${{ matrix.runner }}
+    timeout-minutes: ${{ inputs['timeout-minutes'] }}
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJSON(needs.resolve.outputs.matrix) }}
     container:
       image: ghcr.io/nvidia/openshell/ci:latest
       credentials:
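The `resolve` job in this hunk builds the matrix JSON by string concatenation, which is easy to get wrong by one comma. The sketch below re-creates that logic as a standalone function so the output can be inspected locally. It is an illustration, not code from the repository: the name `build_matrix` is hypothetical, while the platform-to-runner mapping and the JSON shape are copied from the step above.

```bash
#!/usr/bin/env bash
# Hypothetical standalone version of the resolve step's matrix construction.
# Prints the matrix JSON for a comma-separated platform list.
build_matrix() {
  # Strip all whitespace, as the workflow does, so "a, b" and "a,b" match.
  local platform_input="${1//[[:space:]]/}"
  local -a platforms
  local matrix='{"include":[' count=0 platform arch runner

  if [[ -z "$platform_input" ]]; then
    echo "platform input must not be empty" >&2
    return 1
  fi

  IFS=',' read -r -a platforms <<< "$platform_input"
  for platform in "${platforms[@]}"; do
    case "$platform" in
      linux/amd64) arch=amd64; runner=linux-amd64-cpu8 ;;
      linux/arm64) arch=arm64; runner=linux-arm64-cpu8 ;;
      *) echo "unsupported platform: $platform" >&2; return 1 ;;
    esac
    # Separate entries with a comma, but never emit a leading one.
    if [[ $count -gt 0 ]]; then
      matrix+=','
    fi
    matrix+='{"platform":"'"$platform"'","arch":"'"$arch"'","runner":"'"$runner"'"}'
    count=$((count + 1))
  done

  echo "${matrix}]}"
}
```

Piping the result of `build_matrix "linux/amd64,linux/arm64"` through a JSON parser such as `jq .` is a quick way to confirm that the string `fromJSON` receives is well formed.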
@@ -54,11 +158,14 @@ jobs:
       options: --privileged
       volumes:
         - /var/run/docker.sock:/var/run/docker.sock
+        # Expose the nv-gha-runners buildkitd.toml registry mirror config
+        # inside the container so setup-buildx can read it.
+        - /etc/buildkit:/etc/buildkit:ro
     env:
-      IMAGE_TAG: ${{ github.sha }}
+      IMAGE_TAG: ${{ needs.resolve.outputs.platform_count == '1' && github.sha || format('{0}-{1}', github.sha, matrix.arch) }}
       IMAGE_REGISTRY: ghcr.io/nvidia/openshell
       DOCKER_PUSH: ${{ inputs.push && '1' || '0' }}
-      DOCKER_PLATFORM: ${{ matrix.platform }}
+      DOCKER_PLATFORM: ${{ matrix.platform }}
     steps:
       - uses: actions/checkout@v4
         with:
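The new `IMAGE_TAG` expression uses the `cond && a || b` idiom: a single-platform build keeps the bare commit SHA as its tag, while each job of a multi-platform build pushes `<sha>-<arch>`. The `merge` job later in this diff then combines those per-arch tags under the bare SHA. The sketch below restates both rules in shell so they can be tried without a workflow run. The function names `image_tag` and `manifest_command` are hypothetical, and `manifest_command` only prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the tag scheme in this workflow.

# Tag pushed by one build job: bare SHA for a single platform,
# "<sha>-<arch>" when more than one platform is built.
image_tag() {
  local sha="$1" arch="$2" platform_count="$3"
  if [[ "$platform_count" == "1" ]]; then
    echo "$sha"
  else
    echo "${sha}-${arch}"
  fi
}

# Print (do not execute) the manifest merge command for the given arches.
# Usage: manifest_command IMAGE SHA ARCH...
manifest_command() {
  local image="$1" sha="$2"
  shift 2
  local -a refs=()
  local arch
  for arch in "$@"; do
    refs+=("${image}:${sha}-${arch}")
  done
  echo docker buildx imagetools create --prefer-index=false \
    -t "${image}:${sha}" "${refs[@]}"
}
```

Because the merge job is skipped when `platform_count` is `1`, the bare-SHA tag is written exactly once in either mode: by the build job for a single platform, or by the merge job otherwise.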
@@ -67,30 +174,99 @@ jobs:
       - name: Mark workspace safe for git
         run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
 
-      - name: Fetch tags
-        run: git fetch --tags --force
-
-      - name: Compute cargo version
-        id: version
-        run: |
-          set -eu
-          if [[ -n "${{ inputs.cargo-version }}" ]]; then
-            echo "cargo_version=${{ inputs.cargo-version }}" >> "$GITHUB_OUTPUT"
-          else
-            echo "cargo_version=$(uv run python tasks/scripts/release.py get-version --cargo)" >> "$GITHUB_OUTPUT"
-          fi
+      - name: Install tools
+        run: mise install --locked
 
       - name: Log in to GHCR
+        if: ${{ inputs.push }}
         run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
 
-      - name: Set up Docker Buildx
+      - name: Set up buildx (local driver)
         uses: ./.github/actions/setup-buildx
+        with:
+          driver: local
+          buildkitd-config: /etc/buildkit/buildkitd.toml
+
+      - name: Download Rust binary artifact
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ needs.resolve.outputs.artifact_prefix }}-linux-${{ matrix.arch }}
+          path: prebuilt-rust-binary
+
+      - name: Stage Rust binary in Docker build context
+        run: |
+          set -euo pipefail
+          binary="${{ needs.resolve.outputs.binary_name }}"
+          download_dir="prebuilt-rust-binary"
+          stage="deploy/docker/.build/prebuilt-binaries/${{ matrix.arch }}"
+          found="$(find "$download_dir" -type f -name "$binary" -print -quit)"
+          if [[ -z "$found" ]]; then
+            echo "missing downloaded artifact file: $binary" >&2
+            find "$download_dir" -maxdepth 4 -type f -print >&2 || true
+            exit 1
+          fi
+          mkdir -p "$stage"
+          install -m 0755 "$found" "$stage/$binary"
+          ls -lh "$stage/"
 
       - name: Build ${{ inputs.component }} image
         env:
           DOCKER_BUILDER: openshell
-          OPENSHELL_CARGO_VERSION: ${{ steps.version.outputs.cargo_version }}
-          # Enable dev-settings feature for test settings (dummy_bool, dummy_int)
-          # used by e2e tests.
-          EXTRA_CARGO_FEATURES: openshell-core/dev-settings
-        run: mise run --no-deps build:docker:${{ inputs.component }}
+        run: |
+          set -euo pipefail
+          mise exec -- tasks/scripts/docker-build-image.sh "${{ inputs.component }}" \
+            --cache-from "type=gha,scope=${{ inputs.component }}-${{ matrix.arch }}" \
+            --cache-to "type=gha,mode=max,scope=${{ inputs.component }}-${{ matrix.arch }}"
+
+      - name: Smoke check ${{ inputs.component }} image
+        if: ${{ !inputs.push }}
+        run: |
+          set -euo pipefail
+          image="${IMAGE_REGISTRY}/${{ inputs.component }}:${IMAGE_TAG}"
+          case "${{ inputs.component }}" in
+            gateway)
+              output="$(docker run --rm --platform "${{ matrix.platform }}" "$image" --version)"
+              echo "$output"
+              grep -q '^openshell-gateway ' <<<"$output"
+              ;;
+            supervisor)
+              output="$(docker run --rm --platform "${{ matrix.platform }}" "$image" --version)"
+              echo "$output"
+              grep -q '^openshell-sandbox ' <<<"$output"
+              ;;
+            cluster)
+              output="$(docker run --rm --platform "${{ matrix.platform }}" --entrypoint /opt/openshell/bin/openshell-sandbox "$image" --version)"
+              echo "$output"
+              grep -q '^openshell-sandbox ' <<<"$output"
+              ;;
+          esac
+
+  merge:
+    name: Merge ${{ inputs.component }} manifest
+    needs: [resolve, build]
+    if: ${{ inputs.push && needs.resolve.outputs.platform_count != '1' }}
+    runs-on: linux-amd64-cpu8
+    timeout-minutes: 10
+    container:
+      image: ghcr.io/nvidia/openshell/ci:latest
+      credentials:
+        username: ${{ github.actor }}
+        password: ${{ secrets.GITHUB_TOKEN }}
+      volumes:
+        - /var/run/docker.sock:/var/run/docker.sock
+    steps:
+      - name: Log in to GHCR
+        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
+
+      - name: Create multi-arch manifest
+        run: |
+          set -euo pipefail
+          image="ghcr.io/nvidia/openshell/${{ inputs.component }}"
+          refs=()
+          for arch in ${{ needs.resolve.outputs.arches }}; do
+            refs+=("${image}:${GITHUB_SHA}-${arch}")
+          done
+          docker buildx imagetools create \
+            --prefer-index=false \
+            -t "${image}:${GITHUB_SHA}" \
+            "${refs[@]}"
.github/workflows/release-vm-kernel.yml

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ jobs:
       - name: Install dependencies
         run: |
           set -euo pipefail
-          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain 1.95.0
           echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
           brew install lld dtc xz
 
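One way to confirm that the pinned toolchain is the one actually in use after this step is to compare the reported compiler version with the pin. The following is a minimal sketch, assuming `rustc --version` prints a line beginning with `rustc <version>`. The helper `toolchain_matches` is hypothetical and not part of the workflow; it takes the version line as an argument so it can be exercised without Rust installed.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if the `rustc --version` line reports
# exactly the pinned version (for example "rustc 1.95.0 (...)").
toolchain_matches() {
  local version_output="$1" pinned="$2"
  # Require a space or end of string after the version so that a longer
  # version number sharing the same prefix does not match.
  [[ "$version_output" == "rustc ${pinned} "* || "$version_output" == "rustc ${pinned}" ]]
}
```

In a workflow step this could be used as `toolchain_matches "$(rustc --version)" 1.95.0 || { echo "unexpected toolchain" >&2; exit 1; }`.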