
HOTFIX: 127.0.0.1 not localhost in verifier healthcheck (B02 Phase 2 follow-up)#36

Merged
pulkitpareek18 merged 1 commit into main from hotfix-verifier-healthcheck on May 15, 2026

@pulkitpareek18
Collaborator

Hotfix for PR #35. Production returned 502 for ~3 minutes between 07:21 and 07:25 UTC because the new verifier container failed its healthcheck, so `zeroauth-prod` (which depends on it via `condition: service_healthy`) never started.

Root cause

Alpine ships busybox `wget`. Busybox `wget` resolves `localhost` to `::1` (IPv6) first and does NOT fall back to `127.0.0.1` (IPv4) on refusal. The verifier binds `0.0.0.0`, which is IPv4-only, so nothing is listening on `[::1]:3001` and every healthcheck probe is refused.

```text
Connecting to localhost:3001 ([::1]:3001)
wget: can't connect to remote host: Connection refused
```

The verifier was running and serving HTTP perfectly the whole time. Only the healthcheck command was wrong.
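The difference between the two loopback names can be checked directly from inside the container. The following is a minimal sketch, not part of this PR: the function name `probe_verifier_health` is hypothetical, while the container name, port, and `/health` path are taken from the commands in this description.

```bash
# Sketch: probe the health endpoint via both loopback names from inside
# the container. probe_verifier_health is a hypothetical helper name.
probe_verifier_health() {
  local container="${1:-zeroauth-verifier}"
  local port="${2:-3001}"
  local host
  for host in localhost 127.0.0.1; do
    # -q suppresses progress output; -O /dev/null discards the response body.
    if docker exec "$container" wget -q -O /dev/null "http://${host}:${port}/health"; then
      echo "${host}: ok"
    else
      echo "${host}: failed"
    fi
  done
}
```

Running `probe_verifier_health` against the container described here should, if the diagnosis above holds, report a failure for `localhost` and success for `127.0.0.1`.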

Manual recovery already done

I SSH'd to the VPS at 07:25 UTC and ran:

```bash
cd /opt/zeroauth && docker compose --profile prod up -d --no-deps zeroauth-prod
```

That started `zeroauth-prod` without waiting for the `service_healthy` dependency. Production has been serving traffic normally since. The API actually IS hitting the verifier (its `VERIFIER_URL=http://zeroauth-verifier:3001` env was preserved); the verifier service itself works fine, and only Docker's healthcheck status is wrong.

So this PR is a "correctness restoration", not an emergency: without this fix, the next `docker compose up -d --build --remove-orphans` (e.g. the next deploy) would hang on the dependency wait again.

What changed

Two-line fix in two places:

  • `Dockerfile` verifier-production stage HEALTHCHECK: `localhost` → `127.0.0.1`
  • `docker-compose.yml` zeroauth-verifier healthcheck: same.

Both carry a comment explaining why `localhost` is wrong, so the next operator doesn't revert.
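Beyond the comments, a small guard could catch an accidental revert before it reaches a deploy. This is only a sketch and is not included in this PR: the function name `find_localhost_healthchecks` and the regular expression are assumptions, and the pattern only recognises `wget` probes whose URL host is literally `localhost`.

```bash
# Sketch: list lines where a wget-based probe still targets "localhost".
# find_localhost_healthchecks is a hypothetical helper, not part of the repo.
find_localhost_healthchecks() {
  # -n prints line numbers, -H prints file names, -E enables extended regex.
  # Exit status is 0 when at least one offending line is found, 1 when none.
  grep -nHE 'wget.*https?://localhost[:/]' "$@"
}
```

A CI step might call `find_localhost_healthchecks Dockerfile docker-compose.yml` and fail the build when it exits 0. The file names are the two touched by this PR; where such a check would run is left open.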

Verified

```bash
ssh root@104.207.143.14 \
'docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health'

{"status":"ok","version":"0.1.0","vkeyAvailable":true,"uptimeSeconds":202}

```

Test plan

  • Local: `docker build --target verifier-production` builds clean
  • On VPS: `wget http://127.0.0.1:3001/health` from inside the verifier container returns 200 + valid JSON
  • CI green on this PR
  • After merge: deploy completes, BOTH containers healthy (not just zeroauth-prod), and the dependency edge re-activates cleanly
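The last item of the test plan can be checked by reading the health status Docker records for each container. The sketch below assumes both services have a healthcheck and that the container names match the service names `zeroauth-verifier` and `zeroauth-prod`. The helper names `container_health` and `all_healthy` are hypothetical.

```bash
# Sketch: read Docker's recorded health status for a container.
# Falls back to "unknown" when inspect fails (for example, no such
# container, or no healthcheck defined for it).
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null || echo unknown
}

# Succeeds only if every named container currently reports "healthy".
all_healthy() {
  local name
  for name in "$@"; do
    [ "$(container_health "$name")" = "healthy" ] || return 1
  done
}
```

After the deploy, `all_healthy zeroauth-verifier zeroauth-prod && echo "both healthy"` would confirm the post-merge expectation under those assumptions.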

🤖 Generated with Claude Code

The deploy after PR #35 succeeded in building both containers but the
verifier never became 'healthy' from Docker's perspective:

  Connecting to localhost:3001 ([::1]:3001)
  wget: can't connect to remote host: Connection refused

Root cause: alpine ships busybox wget. Busybox wget resolves `localhost`
to ::1 (IPv6) first and does NOT fall back to 127.0.0.1 (IPv4) on
refusal. The verifier binds 0.0.0.0 (IPv4-only). Connection refused
on every healthcheck, container marked unhealthy after 3 retries,
zeroauth-prod (which depends on it via depends_on: service_healthy)
never started.

Result: prod was 502 for ~3 minutes between 07:21 UTC and 07:25 UTC
until I manually started zeroauth-prod with --no-deps via SSH. That
restored service. The verifier was running and responding to requests
fine the whole time — only the healthcheck command was wrong.

Fix: use the literal 127.0.0.1 in both the Dockerfile HEALTHCHECK
and the compose-level healthcheck. The two are redundant by design:
compose-level wins for `docker compose` orchestration; Dockerfile
HEALTHCHECK wins for `docker run` outside compose. Both need to be
correct.

Comment added in both places explaining why localhost is wrong, so
the next operator doesn't revert.

Production state right now: zeroauth-prod is up + healthy via the
manual --no-deps recovery. The verifier is up + responding but
marked unhealthy by Docker (cosmetic — it doesn't block anything
since prod is now running without the dependency wait). After this
hotfix deploys, both will be healthy and the dependency edge
reactivates on next restart.

Verified locally:
  docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health
  → {"status":"ok","version":"0.1.0","vkeyAvailable":true,...}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 07:26
@pulkitpareek18 pulkitpareek18 merged commit 0a0bae3 into main May 15, 2026
1 of 2 checks passed
@pulkitpareek18 pulkitpareek18 deleted the hotfix-verifier-healthcheck branch May 15, 2026 07:26
@pulkitpareek18 pulkitpareek18 review requested due to automatic review settings May 15, 2026 07:50
pulkitpareek18 added a commit that referenced this pull request May 15, 2026
Task 4 of today. Formally records the decision Pulkit made yesterday
when he picked Plan B over Plan A. Captures the three reasons
single-engineer velocity beat the brainstorm's Rust spec, what we
gave up (reproducible-build provenance, smaller transitive surface,
unsafe-discipline) and what we kept (cross-repo HTTP shape stays
Rust-compatible if we ever swap).

Also pins the inline-fallback retirement plan:
- 2026-05-15: verifier shipped, inline path unused but compiled-in
- 2026-05-16 → 2026-06-06: 3-week soak in prod
- 2026-06-08: PR to delete verifyInline + snarkjs from root deps +
  refuse-to-start when VERIFIER_URL is unset
- 2026-06-09: prod runs verifier-only

References the three shipping PRs (#35 cutover, #36 healthcheck hotfix,
#37 SQLite audit log) + the plan-mode design doc + the B02 build
prompt that we rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pulkitpareek18 added a commit that referenced this pull request May 28, 2026
First issue of the BFSI v1 compliance roadmap, owned by Agent #36
(Chief Compliance Officer). Covers the four certification tracks that
gate the 12-month plan: DPDP Act 2023, the four binding RBI Master
Directions (IT Governance, Digital Lending, Digital Payment Security
Controls, KYC), SOC 2 Type I + Type II, and ISO/IEC 27001:2022. The
RBI Sandbox application is tracked alongside as a Q3 deliverable.

Eight sections per the agent-36 W1-Mon ticket:
1. Scope (in/out + India primary, GCC/UK secondary v2 lookahead).
2. Frameworks tracked with auditor + counsel relationships.
3. Q1-Q4 milestones aligned to the phase map in
   docs/plan/bfsi-v1/00-README.md.
4. Per-quarter deliverables table (D-Qn-NN IDs, owner agent, target
   week, dependencies) covering the year end-to-end.
5. Audit calendar weeks 1-52 listing every external interaction.
6. Vendor + counsel calendar (DPDP counsel, external cryptographer,
   SOC 2 auditor, ISO lead auditor, smart-contract audit firm,
   RBI counsel, bug bounty platform, evidence collector tool).
7. Open dependencies + risks (R-COMP-01..08) with owner + mitigation
   for each. Explicitly captures the three risks called out in the
   ticket: DPDP rule notification mid-evidence, evidence-collector
   tool slip, trusted-setup ceremony slip blocking ISO certification.
8. Document hygiene rules: quarterly retros in
   docs/compliance/retros/, regulator interaction log in
   docs/compliance/regulator-log.md, evidence pack rotation each
   quarter.

Cross-references docs/plan/bfsi-v1/06-ways-of-working.md for the
escalation path and docs/threat_model.md for the attack catalogue
that control narratives map to. Calls out the trusted-setup ceremony
artefact at docs/cryptography/trusted-setup-ceremony.md as the input
to ISO Annex A.5.31 and SOC 2 CC6.1 evidence.

[no-test] markdown-only deliverable per ticket.

Reviewer: Agent #1.
pulkitpareek18 pushed a commit that referenced this pull request May 28, 2026
First issue of the enterprise risk register at docs/compliance/risk/enterprise-risk-register-v1.md. Captures the 10 baseline commercial, operational, regulatory, strategic, security, and financial risks that the founder, CCO, CRO, and Risk & Audit lead carry on their dashboards. Distinct from docs/threat_model.md, which holds the technical attack catalogue (A-NN rows). Each enterprise risk references the threat-model rows it relates to so the two documents stay bidirectionally linked per the §6.5 operating principle.

Document deliverable A40-W1-Mon from docs/plan/bfsi-v1/agents/agent-40-risk-audit.md. Pairs with the compliance roadmap at docs/compliance/compliance-roadmap-v1.md whose §7 holds the thinner compliance-bearing subset; this register is the authoritative copy. References docs/threat_model.md throughout (A-02, A-07, A-09, A-10, A-13, A-17, A-21, A-22, A-28) and docs/cryptography/trusted-setup-ceremony.md (R-ENT-04, R-ENT-07) and docs/compliance/privacy/data-inventory-v1.md (R-ENT-03 scoping).

Risks classified by likelihood (1..5) x impact (1..5) with appetite bands accept <= 6, review 7-12, reject >= 13. At v1 all residuals sit in the auto-accept band after mitigation. Cadence is weekly walk by Agent #40, monthly review with Agent #1 + #36 + #42 on the 15th, quarterly board review in the last week of each Q, plus event-driven triggers per §6.3. Sign-offs in §7.

[no-test] markdown-only documentation deliverable. Next review 2026-06-01 per A40-W2-Mon ticket which updates the register with commit hashes for closed mitigations.