fix(ci): ride out transient registry blips in nightly image pulls by jsugg · Pull Request #191 · jsugg/strongclaw

jsugg · 2026-06-17T18:33:22Z

Summary

The nightly Linux Fresh Host job has been failing intermittently. RCA of the latest failed run (27669020598) showed the ensure-images prepull step exiting 1 with every image dying at the same ~15s wall:

Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection

That ~15s is the Docker daemon's own registry HTTP client timeout reaching Docker Hub — i.e. a transient registry connectivity blip, not a daemon fault. The pull helper already retries, but these registry timeouts were not recognized as a distinct transient class, so they fell through to the short generic backoff (2s, 4s) and exhausted the retry budget within seconds — explaining the pass/fail/pass flakiness.

Changes

Classify transient registry failures (context deadline exceeded, request canceled while waiting for connection, i/o timeout, tls handshake timeout, DNS errors, …) as a separate retry class from daemon-connectivity faults, in the hosted-docker pull helper.
Apply a longer, capped exponential backoff for that class (10s → 20s → 40s → 45s) so a brief upstream outage is ridden out across the retry budget instead of being burned in a few seconds. Generic failures keep the existing short backoff.
Raise the Linux fresh-host pull retry budget to 5 so the new backoff has room to work.
Added unit coverage asserting the exponential-capped schedule and that the helper retries to success once the registry recovers.

Local: 38 passed, 4 skipped (hosted-docker unit + CI workflow contract suites); ruff lint/format and mypy clean on the changed files.

Note (not addressed here)

The Linux fresh-host job reads the pull knobs from the top-level workflow env, so it ignores the per-caller docker_pull_* inputs (only the macOS job wires them through). Bumping the budget therefore lives in the shared env. Wiring the Linux job to honor per-caller inputs is a reasonable follow-up but is out of scope for this fix.

The Linux nightly fresh-host job failed intermittently when 'docker pull' could not reach registry-1.docker.io ('context deadline exceeded' / 'request canceled while waiting for connection'). These registry network timeouts were not classified as transient, so the pull helper retried them with the short generic backoff (2s, 4s) and burned its retry budget within seconds. Classify registry connectivity timeouts as a distinct transient failure and apply a longer, capped exponential backoff (10s, 20s, 40s, 45s) so a brief upstream outage is ridden out across the retry budget. Raise the Linux fresh-host pull retry budget to 5 so the backoff has room to work. Add unit coverage for the new backoff schedule.

jsugg merged commit 146ca99 into main Jun 17, 2026
33 of 35 checks passed

jsugg deleted the nightly-registry-pull-retry-hardening branch June 17, 2026 23:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): ride out transient registry blips in nightly image pulls#191

fix(ci): ride out transient registry blips in nightly image pulls#191
jsugg merged 1 commit into
mainfrom
nightly-registry-pull-retry-hardening

jsugg commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jsugg commented Jun 17, 2026

Summary

Changes

Note (not addressed here)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant