Emit transfer.v1.local unpack_config for containerd_snapshotter #5
larsyencken wants to merge 1 commit into far-main
containerd 2.2.2 requires an explicit unpack_config entry per snapshotter; without one, image pulls on ZFS workers fail with "no unpack platforms defined". Also add an imports = ["/etc/containerd/conf.d/*.toml"] line so post-bringup drop-ins (e.g. nvidia toolkit's 99-nvidia.toml) are picked up without a second manual edit. FOUND-795
@@ -1,5 +1,7 @@
version = 3

imports = ["/etc/containerd/conf.d/*.toml"]
🔴 The new imports line hardcodes /etc/containerd/conf.d/*.toml while the rest of the template and role uses the configurable containerd_cfg_dir (default /etc/containerd). If a user overrides containerd_cfg_dir (e.g. to /opt/containerd), config.toml will be written under the override but containerd will still scan the hardcoded /etc/containerd/conf.d/, silently dropping their drop-ins. Fix: imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"].
Extended reasoning...
What the bug is
The new line
imports = ["/etc/containerd/conf.d/*.toml"]
hardcodes the path /etc/containerd/conf.d/*.toml, but everywhere else in the role the configurable variable containerd_cfg_dir is used to derive containerd-config paths.
Why this is inconsistent with the rest of the role
- roles/container-engine/containerd/defaults/main.yml:81 declares containerd_cfg_dir: /etc/containerd (overridable).
- roles/container-engine/containerd/tasks/main.yml:
  - line 36 creates {{ containerd_cfg_dir }} itself,
  - line 68 templates config.toml to {{ containerd_cfg_dir }}/config.toml,
  - lines 79/86 manage {{ containerd_cfg_dir }}/certs.d/hosts files.
- Within this same template, containerd_cfg_dir is already used at:
  - line 58 for base_runtime_spec ("{{ containerd_cfg_dir }}/{{ runtime.base_runtime_spec }}"),
  - line 88 for [plugins."io.containerd.cri.v1.images".registry] config_path = "{{ containerd_cfg_dir }}/certs.d".
The new imports line is the only place that bakes in /etc/containerd literally.
Concrete proof / step-by-step
- User sets containerd_cfg_dir: /opt/containerd in their inventory.
- The role creates /opt/containerd/ (tasks/main.yml:36).
- The template renders config.toml and Ansible writes it to /opt/containerd/config.toml (tasks/main.yml:68).
- The rendered file contains imports = ["/etc/containerd/conf.d/*.toml"], pointing at a directory the role never creates or manages under the override.
- The user (or another role, e.g. nvidia toolkit) drops 99-nvidia.toml into /opt/containerd/conf.d/ (mirroring where everything else lives).
- containerd starts and never sees the drop-in: it reads only the empty/non-existent /etc/containerd/conf.d/. The PR description specifically motivates imports for picking up post-bringup drop-ins, so this silently defeats the feature for the override case.
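To make the mismatch in the steps above concrete, here is a sketch of the top of the rendered file under the override. It assumes containerd_cfg_dir: /opt/containerd as in the walkthrough; the commented-out line shows what the templated form suggested in this review would render to, not output captured from a real run.

```toml
# Sketch of /opt/containerd/config.toml as rendered by this PR
# (assumes the inventory sets containerd_cfg_dir: /opt/containerd).
version = 3

# Hardcoded in the template: the override is ignored, so containerd
# looks for drop-ins outside the directory the role manages.
imports = ["/etc/containerd/conf.d/*.toml"]

# With imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"] in the
# template, the same line would instead render as:
# imports = ["/opt/containerd/conf.d/*.toml"]
```

For default deployments both forms render identically, which is why the problem only surfaces once the variable is overridden.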
Why existing code doesn’t prevent it
Nothing else in the role enforces the /etc/containerd literal — defaults match the hardcoded path, so for default deployments the bug is invisible. There is no validation that containerd_cfg_dir equals /etc/containerd, and no symlink or fallback creating /etc/containerd/conf.d/ when an override is in use.
Impact
Limited but real: any operator who overrides containerd_cfg_dir gets a config that silently disagrees with itself. Drop-ins land in {{ containerd_cfg_dir }}/conf.d/ (the natural location, and where the rest of the role keeps subdirectories like certs.d) but imports looks elsewhere. Failures are silent — containerd just doesn’t apply the drop-in.
Fix
One-line change at roles/container-engine/containerd/templates/config.toml.j2:3:
imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]
This matches the convention already used at lines 58 and 88 of the same template and at all containerd_cfg_dir callsites in tasks/main.yml.
[plugins."io.containerd.transfer.v1.local"]
[[plugins."io.containerd.transfer.v1.local".unpack_config]]
differ = ""
platform = "linux/amd64"
snapshotter = "{{ containerd_snapshotter }}"
🔴 The new unpack_config block hardcodes platform = "linux/amd64", which regresses arm64/arm hosts: on aarch64 nodes the declared unpack platform won't match the host and containerd will reproduce the very "no unpack platforms defined" failure this PR is meant to fix. Replace with platform = "linux/{{ image_arch }}" (or linux/{{ host_architecture }}) to stay consistent with how the rest of the codebase handles arch-aware values.
Extended reasoning...
What the bug is
The new [[plugins."io.containerd.transfer.v1.local".unpack_config]] block at roles/container-engine/containerd/templates/config.toml.j2:93-97 hardcodes:
platform = "linux/amd64"
This is the same template that is rendered on every containerd node regardless of CPU architecture.
Why it's wrong in this codebase
Kubespray is explicitly multi-arch. The defaults already provide arch-aware variables:
- roles/kubespray_defaults/defaults/main/main.yml:734-743 defines host_architecture mapping (x86_64→amd64, aarch64→arm64, armv7l→arm).
- roles/kubespray_defaults/defaults/main/download.yml:75 defines image_arch defaulting to host_architecture.
These are used pervasively throughout the project for arch-aware download URLs, checksums, and container image refs (kubelet, kubectl, etcd, cni, containerd itself, etc.). Hardcoding linux/amd64 in this template breaks that pattern.
The failure mode this re-introduces
containerd 2.x's transfer.v1.local selects an unpacker by matching the image's manifest platform against the configured unpack_config entries. If none match the host platform, containerd logs "Unpack configuration not supported, skipping" and pulls fail with "unable to initialize unpacker: no unpack platforms defined" — which is exactly the symptom the PR description cites as the motivation for this change. So on arm64/arm workers this PR replaces a previously-implicit working default with an explicitly-wrong configuration that resurfaces that exact failure.
Step-by-step proof on an aarch64 worker
- Operator runs the playbook against an aarch64 ZFS worker. Ansible facts set ansible_architecture = aarch64, so host_architecture = arm64 and image_arch = arm64.
- The containerd role renders /etc/containerd/config.toml from this template. The hardcoded line emits platform = "linux/amd64".
- containerd starts and registers a single unpacker for linux/amd64 with the configured snapshotter.
- kubelet asks containerd to pull a multi-arch sandbox/pod image. The CRI plugin resolves the manifest list and selects the linux/arm64 manifest for the host.
- transfer.v1.local walks its unpack_config list looking for an entry whose platform matches linux/arm64. None match (linux/amd64 ≠ linux/arm64).
- containerd logs "Unpack configuration not supported, skipping" and the pull fails with "unable to initialize unpacker: no unpack platforms defined", the precise failure the PR set out to fix, now triggered on arm64 instead of being avoided.
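As an illustration of the walkthrough above, this is roughly what the block would look like once rendered on an aarch64 worker. The snapshotter value "zfs" is an assumption based on the ZFS workers mentioned in the PR description; the real value is whatever containerd_snapshotter is set to.

```toml
# Sketch of the rendered block on an aarch64 host
# (host_architecture = arm64, image_arch = arm64).
[plugins."io.containerd.transfer.v1.local"]
  [[plugins."io.containerd.transfer.v1.local".unpack_config]]
    differ = ""
    # Hardcoded in the template, so it does not follow the host:
    # the only declared unpack platform is amd64 while the images
    # selected for this node are linux/arm64.
    platform = "linux/amd64"
    # Assumed value of containerd_snapshotter on a ZFS worker.
    snapshotter = "zfs"
```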
Fix
One-line change:
platform = "linux/{{ image_arch }}"
(Or linux/{{ host_architecture }}; both resolve to the right value, and either matches the convention already used by the rest of the role.)
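For context, a sketch of how the whole templated block might read with that change applied. The indentation is illustrative, and the rendered values in the comments follow the host_architecture mapping quoted earlier in this comment rather than output from an actual run.

```toml
# Sketch of the block in config.toml.j2 with the arch-aware platform.
[plugins."io.containerd.transfer.v1.local"]
  [[plugins."io.containerd.transfer.v1.local".unpack_config]]
    differ = ""
    # Renders per host, following the role defaults:
    #   x86_64  -> "linux/amd64"
    #   aarch64 -> "linux/arm64"
    #   armv7l  -> "linux/arm"
    platform = "linux/{{ image_arch }}"
    snapshotter = "{{ containerd_snapshotter }}"
```

On amd64 hosts this renders to exactly what the PR emits today, so the change should only alter behaviour on non-amd64 nodes.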
Summary
- [plugins."io.containerd.transfer.v1.local"] block with unpack_config parameterised on containerd_snapshotter. Without this, containerd 2.2.2 logs Unpack configuration not supported, skipping and image pulls on ZFS workers fail with unable to initialize unpacker: no unpack platforms defined.
- imports = ["/etc/containerd/conf.d/*.toml"] at the top of the config so post-bringup drop-ins (e.g. nvidia toolkit's 99-nvidia.toml) are picked up without a second manual edit.
Companion patch to #4 (FOUND-794), which fixed the bootstrap ctr run snapshotter side; this PR fixes the daemon-config side. Together they remove the manual post-bringup containerd patch we've been doing on every fresh ZFS worker.
Linear: FOUND-795
Test plan
- transfer.v1.local injection is needed and post-join daemonsets (kube-proxy, nvidia toolkit, etc.) reach Running on first try.
- /etc/containerd/config.toml on a non-ZFS node still loads (block is harmless with snapshotter = "overlayfs").