
Emit transfer.v1.local unpack_config for containerd_snapshotter#5

Open
larsyencken wants to merge 1 commit into far-main from lars/found-795-zfs-unpack-config

Conversation

@larsyencken

Summary

  • Adds [plugins."io.containerd.transfer.v1.local"] block with unpack_config parameterised on containerd_snapshotter. Without this, containerd 2.2.2 logs Unpack configuration not supported, skipping and image pulls on ZFS workers fail with unable to initialize unpacker: no unpack platforms defined.
  • Adds imports = ["/etc/containerd/conf.d/*.toml"] at the top of the config so post-bringup drop-ins (e.g. nvidia toolkit's 99-nvidia.toml) are picked up without a second manual edit.

Companion patch to #4 (FOUND-794), which fixed the bootstrap ctr run snapshotter side; this PR fixes the daemon-config side. Together they remove the manual post-bringup containerd patch we've been doing on every fresh ZFS worker.
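
Together the two changes would make the rendered /etc/containerd/config.toml on a ZFS worker look roughly like this (an illustrative sketch; unrelated sections of the real config are omitted):

```toml
version = 3

# New: pick up post-bringup drop-ins such as the nvidia toolkit's 99-nvidia.toml.
imports = ["/etc/containerd/conf.d/*.toml"]

# New: explicit unpack configuration, shown here as rendered with
# containerd_snapshotter = "zfs" on a ZFS worker.
[plugins."io.containerd.transfer.v1.local"]
  [[plugins."io.containerd.transfer.v1.local".unpack_config]]
    differ = ""
    platform = "linux/amd64"
    snapshotter = "zfs"
```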

Linear: FOUND-795

Test plan

  • Bump submodule pin in flamingo, cycle a ZFS worker, confirm that no manual transfer.v1.local injection is needed and post-join daemonsets (kube-proxy, nvidia toolkit, etc.) reach Running on first try.
  • Confirm the rendered /etc/containerd/config.toml on a non-ZFS node still loads (block is harmless with snapshotter = "overlayfs").

containerd 2.2.2 requires an explicit unpack_config entry per snapshotter;
without one, image pulls on ZFS workers fail with "no unpack platforms
defined". Also add an imports = ["/etc/containerd/conf.d/*.toml"] line so
post-bringup drop-ins (e.g. nvidia toolkit's 99-nvidia.toml) are picked
up without a second manual edit.

FOUND-795
@@ -1,5 +1,7 @@
version = 3

imports = ["/etc/containerd/conf.d/*.toml"]


🔴 The new imports line hardcodes /etc/containerd/conf.d/*.toml while the rest of the template and role uses the configurable containerd_cfg_dir (default /etc/containerd). If a user overrides containerd_cfg_dir (e.g. to /opt/containerd), config.toml will be written under the override but containerd will still scan the hardcoded /etc/containerd/conf.d/, silently dropping their drop-ins. Fix: imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"].

Extended reasoning...

What the bug is

The new line

imports = ["/etc/containerd/conf.d/*.toml"]

hardcodes the path /etc/containerd/conf.d/*.toml, but everywhere else in the role the configurable variable containerd_cfg_dir is used to derive containerd-config paths.

Why this is inconsistent with the rest of the role

  • roles/container-engine/containerd/defaults/main.yml:81 declares containerd_cfg_dir: /etc/containerd (overridable).
  • roles/container-engine/containerd/tasks/main.yml:
    • line 36 creates {{ containerd_cfg_dir }} itself,
    • line 68 templates config.toml to {{ containerd_cfg_dir }}/config.toml,
    • lines 79/86 manage {{ containerd_cfg_dir }}/certs.d/ hosts files.
  • Within this same template, containerd_cfg_dir is already used at:
    • line 58 for base_runtime_spec ("{{ containerd_cfg_dir }}/{{ runtime.base_runtime_spec }}"),
    • line 88 for [plugins."io.containerd.cri.v1.images".registry] config_path = "{{ containerd_cfg_dir }}/certs.d".

The new imports line is the only place that bakes in /etc/containerd literally.

Concrete proof / step-by-step

  1. User sets containerd_cfg_dir: /opt/containerd in their inventory.
  2. The role creates /opt/containerd/ (tasks/main.yml:36).
  3. The template renders config.toml and Ansible writes it to /opt/containerd/config.toml (tasks/main.yml:68).
  4. The rendered file contains imports = ["/etc/containerd/conf.d/*.toml"] — pointing at a directory the role never creates or manages under the override.
  5. The user (or another role, e.g. nvidia toolkit) drops 99-nvidia.toml into /opt/containerd/conf.d/ (mirroring where everything else lives).
  6. containerd starts and never sees the drop-in: it reads only the empty/non-existent /etc/containerd/conf.d/. The PR description specifically motivates imports for picking up post-bringup drop-ins, so this silently defeats the feature for the override case.

Why existing code doesn’t prevent it

Nothing else in the role enforces the /etc/containerd literal — defaults match the hardcoded path, so for default deployments the bug is invisible. There is no validation that containerd_cfg_dir equals /etc/containerd, and no symlink or fallback creating /etc/containerd/conf.d/ when an override is in use.

Impact

Limited but real: any operator who overrides containerd_cfg_dir gets a config that silently disagrees with itself. Drop-ins land in {{ containerd_cfg_dir }}/conf.d/ (the natural location, and where the rest of the role keeps subdirectories like certs.d) but imports looks elsewhere. Failures are silent — containerd just doesn’t apply the drop-in.

Fix

One-line change at roles/container-engine/containerd/templates/config.toml.j2:3:

imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]

This matches the convention already used at lines 58 and 88 of the same template and at all containerd_cfg_dir callsites in tasks/main.yml.
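
A minimal sketch of how the suggested template line would render, using the role default and the hypothetical /opt/containerd override from the example above:

```toml
# Template line (suggested fix in config.toml.j2):
#   imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]

# Rendered with the default containerd_cfg_dir (/etc/containerd):
imports = ["/etc/containerd/conf.d/*.toml"]

# Rendered with containerd_cfg_dir: /opt/containerd in the inventory:
# imports = ["/opt/containerd/conf.d/*.toml"]
```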

Comment on lines +93 to +97
[plugins."io.containerd.transfer.v1.local"]
[[plugins."io.containerd.transfer.v1.local".unpack_config]]
differ = ""
platform = "linux/amd64"
snapshotter = "{{ containerd_snapshotter }}"

🔴 The new unpack_config block hardcodes platform = "linux/amd64", which regresses arm64/arm hosts: on aarch64 nodes the declared unpack platform won't match the host and containerd will reproduce the very "no unpack platforms defined" failure this PR is meant to fix. Replace with platform = "linux/{{ image_arch }}" (or linux/{{ host_architecture }}) to stay consistent with how the rest of the codebase handles arch-aware values.

Extended reasoning...

What the bug is

The new [[plugins."io.containerd.transfer.v1.local".unpack_config]] block at roles/container-engine/containerd/templates/config.toml.j2:93-97 hardcodes:

platform = "linux/amd64"

This is the same template that is rendered on every containerd node regardless of CPU architecture.

Why it's wrong in this codebase

Kubespray is explicitly multi-arch. The defaults already provide arch-aware variables:

  • roles/kubespray_defaults/defaults/main/main.yml:734-743 defines the host_architecture mapping (x86_64 → amd64, aarch64 → arm64, armv7l → arm).
  • roles/kubespray_defaults/defaults/main/download.yml:75 defines image_arch defaulting to host_architecture.

These are used pervasively throughout the project for arch-aware download URLs, checksums, and container image refs (kubelet, kubectl, etcd, cni, containerd itself, etc.). Hardcoding linux/amd64 in this template breaks that pattern.

The failure mode this re-introduces

containerd 2.x's transfer.v1.local selects an unpacker by matching the image's manifest platform against the configured unpack_config entries. If none match the host platform, containerd logs "Unpack configuration not supported, skipping" and pulls fail with "unable to initialize unpacker: no unpack platforms defined" — which is exactly the symptom the PR description cites as the motivation for this change. So on arm64/arm workers this PR replaces a previously-implicit working default with an explicitly-wrong configuration that resurfaces that exact failure.

Step-by-step proof on an aarch64 worker

  1. Operator runs the playbook against an aarch64 ZFS worker. Ansible facts set ansible_architecture = aarch64, so host_architecture = arm64 and image_arch = arm64.
  2. The containerd role renders /etc/containerd/config.toml from this template. The hardcoded line emits platform = "linux/amd64".
  3. containerd starts and registers a single unpacker for linux/amd64 with the configured snapshotter.
  4. kubelet asks containerd to pull a multi-arch sandbox/pod image. The CRI plugin resolves the manifest list and selects the linux/arm64 manifest for the host.
  5. transfer.v1.local walks its unpack_config list looking for an entry whose platform matches linux/arm64. None match (linux/amd64 ≠ linux/arm64).
  6. containerd logs "Unpack configuration not supported, skipping" and the pull fails with "unable to initialize unpacker: no unpack platforms defined" — the precise failure the PR set out to fix, now triggered on arm64 instead of being avoided.

Fix

One-line change:

platform = "linux/{{ image_arch }}"

(Or linux/{{ host_architecture }} — both resolve to the right value, and either matches the convention already used by the rest of the role.)
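
A sketch of the block with the arch-aware fix applied, as it would render on an aarch64 ZFS worker (image_arch = arm64 and containerd_snapshotter = "zfs" are assumed here for illustration):

```toml
# Rendered unpack_config on an aarch64 host once the template uses
# platform = "linux/{{ image_arch }}" instead of a hardcoded value:
[plugins."io.containerd.transfer.v1.local"]
  [[plugins."io.containerd.transfer.v1.local".unpack_config]]
    differ = ""
    platform = "linux/arm64"
    snapshotter = "zfs"
```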
