From e836a8ec605219b664c43909aa06c9f2bdc9672e Mon Sep 17 00:00:00 2001
From: Pulkit Pareek
Date: Fri, 15 May 2026 12:55:53 +0530
Subject: [PATCH] B02 Phase 2 hotfix: 127.0.0.1 not localhost in verifier healthcheck
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The deploy after PR #35 succeeded in building both containers but the
verifier never became 'healthy' from Docker's perspective:

    Connecting to localhost:3001 ([::1]:3001)
    wget: can't connect to remote host: Connection refused

Root cause: alpine ships busybox wget. Busybox wget resolves
`localhost` to ::1 (IPv6) first and does NOT fall back to 127.0.0.1
(IPv4) on refusal. The verifier binds 0.0.0.0 (IPv4-only). Connection
refused on every healthcheck, container marked unhealthy after 3
retries, zeroauth-prod (which depends on it via depends_on:
service_healthy) never started.

Result: prod was 502 for ~3 minutes between 07:21 UTC and 07:25 UTC
until I manually started zeroauth-prod with --no-deps via SSH. That
restored service. The verifier was running and responding to requests
fine the whole time; only the healthcheck command was wrong.

Fix: use the literal 127.0.0.1 in both the Dockerfile HEALTHCHECK and
the compose-level healthcheck. The two are redundant by design:
compose-level wins for `docker compose` orchestration; Dockerfile
HEALTHCHECK wins for `docker run` outside compose. Both need to be
correct. Comment added in both places explaining why localhost is
wrong, so the next operator doesn't revert.

Production state right now: zeroauth-prod is up + healthy via the
manual --no-deps recovery. The verifier is up + responding but marked
unhealthy by Docker (cosmetic: it doesn't block anything since prod is
now running without the dependency wait). After this hotfix deploys,
both will be healthy and the dependency edge reactivates on next
restart.

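The address-family mismatch behind the root cause can be reproduced
outside Docker with a minimal Python sketch. This is an illustration
only, not the verifier's code and not busybox wget: `can_connect` is a
hypothetical helper, the listener is a stand-in bound to an ephemeral
port, and the sketch skips name resolution entirely by probing the two
literal addresses that `localhost` can map to.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Attempt one TCP connection to a literal IP address, no fallback."""
    # Hypothetical helper: pick the address family from the literal itself.
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_STREAM) as client:
            client.settimeout(timeout)
            client.connect((host, port))
        return True
    except OSError:
        # Covers "connection refused" as well as hosts without IPv6.
        return False


# Stand-in for the verifier: an IPv4-only listener on 0.0.0.0.
# Port 0 asks the OS for a free ephemeral port (assumption for the demo).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 0))
server.listen(1)
port = server.getsockname()[1]

ipv4_ok = can_connect("127.0.0.1", port)  # what the fixed healthcheck hits
ipv6_ok = can_connect("::1", port)        # what `localhost` resolved to

server.close()

print(f"127.0.0.1:{port} reachable: {ipv4_ok}")
print(f"[::1]:{port} reachable: {ipv6_ok}")
```

On a typical Linux host this should report the IPv4 loopback as
reachable and the IPv6 loopback as not, which is the same split the
healthcheck ran into: a client that tries only ::1 never reaches a
listener bound to 0.0.0.0.
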
Verified locally:

    docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health
    → {"status":"ok","version":"0.1.0","vkeyAvailable":true,...}

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 Dockerfile         | 6 +++++-
 docker-compose.yml | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 72d8a79..1911a43 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -107,8 +107,12 @@ ENV VERIFIER_PORT=3001
 
 EXPOSE 3001
 
+# NOTE: 127.0.0.1 not localhost. Alpine's busybox wget resolves
+# `localhost` to IPv6 (::1) first; the verifier binds IPv4 0.0.0.0,
+# so the IPv6 connection is refused and busybox bails without falling
+# back to IPv4. Using the literal IPv4 address sidesteps the resolver.
 HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
-    CMD wget --no-verbose --tries=1 --spider http://localhost:3001/health || exit 1
+    CMD wget --no-verbose --tries=1 --spider http://127.0.0.1:3001/health || exit 1
 
 CMD ["node", "dist/server.js"]
diff --git a/docker-compose.yml b/docker-compose.yml
index f8cdac4..5e13b59 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -50,8 +50,11 @@ services:
       - VERIFIER_CIRCUIT_VERSION=v1
       - LOG_LEVEL=info
     restart: unless-stopped
+    # 127.0.0.1 (not localhost) because alpine busybox wget hits IPv6 first
+    # and the verifier binds 0.0.0.0 (IPv4 only). The Dockerfile carries
+    # the same fix; this is belt-and-braces for compose-level overrides.
     healthcheck:
-      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:3001/health']
+      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://127.0.0.1:3001/health']
       interval: 30s
       timeout: 10s
       retries: 3