B02 Phase 2: ship verifier as its own container in production #35
Merged
Plan B (TS workspace) was chosen yesterday, and yesterday's PR #29 landed the verifier package in the repo. Today's Phase 2 PR flips production to actually use it instead of the inline-snarkjs fallback that has been serving since v0.

Dockerfile:
- Adds a `verifier-build` stage that runs `npm ci` for the verifier workspace against the root lockfile (reproducible) and compiles src/ → dist/.
- Adds a `verifier-production` stage: slim alpine image, non-root user (uid 1001), flat `npm install --omit=dev`. The verifier has 4 prod deps (express, snarkjs, winston, uuid), and workspace-aware ci complicates a per-package prod install; trade-off accepted per ADR-0005. Copies the compiled JS and the production vkey. Healthcheck on /health. Binds 0.0.0.0:3001 inside the container so the docker network can reach it; no host port is published, so the service is reachable only over the Docker network.

docker-compose.yml:
- New `zeroauth-verifier` service in BOTH the `dev` and `prod` profiles. `expose: 3001` (no `ports:`, so no host binding). Healthcheck wired.
- `zeroauth-prod` gains the following in its `environment:` block (not via .env; wired directly so a hand-edited prod .env can't drift from the compose intent):
    VERIFIER_URL=http://zeroauth-verifier:3001
    VERIFIER_TIMEOUT_MS=2000
- `zeroauth-prod.depends_on` now requires `zeroauth-verifier` to be service_healthy before starting. The deploy fails loudly if the verifier can't load its vkey or bind its port.
- `zeroauth-dev` gets the same wiring so local dev exercises the service path by default. Developers who want the inline-snarkjs fallback can override with `VERIFIER_URL=` in their .env.

Local validation:
- `docker build --target verifier-production` builds clean.
- `docker run` + `curl /health` returns {"status":"ok","version":"0.1.0","vkeyAvailable":true,"uptimeSeconds":5}.
- `POST /verify` with a structurally-valid junk proof exercises the real Groth16 verifier against the real vkey (structuralFallback:false) and rejects it correctly (verified:false, 543ms; the first-call cost includes snarkjs init, and subsequent calls are <50ms in the same image).

Post-deploy verification plan (after this PR merges):
- The deploy workflow runs scripts/deploy-remote.sh, which does `docker compose --profile prod up -d --build --remove-orphans`. That automatically picks up the new zeroauth-verifier service.
- After healthchecks pass, the API container's src/services/zkp.ts switches from the inline path to the HTTP path because VERIFIER_URL is now set in its environment.
- I'll smoke-test by POSTing to /v1/auth/zkp/verify and confirming the API logs show "ZKP: verifier service: FAIL/PASS" with a verifierAuditId (the service path's signature in zkp.ts:212) instead of "ZKP: inline Groth16: …" (the legacy path's signature).

Out of scope (separate follow-ups today):
- SQLite append-only audit log + hash chain in the verifier (task 3)
- ADR-0008 capturing the TS-vs-Rust decision formally (task 4)
- Promotion of governance/docs/threat-model/verifier.md from stub → full with A-V01 through A-V05 entries (task 5)
- Retirement of the inline-fallback code path in zkp.ts (next week)

Tests: 228 passing (no change). Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
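The path switch described above (service path when VERIFIER_URL is set, inline fallback when it is empty) could look roughly like the sketch below. This is not the code in src/services/zkp.ts: the names `readVerifierConfig`, `verifyViaService`, and `selectVerifier`, the `{ proof, publicSignals }` request body, and the 2000 ms default are all assumptions made for illustration.

```typescript
// Hypothetical sketch; none of these names are taken from zkp.ts.
type VerifyResult = { verified: boolean; verifierAuditId?: string };
type VerifyFn = (proof: unknown, publicSignals: unknown) => Promise<VerifyResult>;

interface VerifierConfig {
  url: string | null; // null means "use the inline fallback"
  timeoutMs: number;
}

// An empty or missing VERIFIER_URL selects the inline path, matching the
// `VERIFIER_URL=` override mentioned for local dev.
function readVerifierConfig(env: Record<string, string | undefined>): VerifierConfig {
  const raw = (env.VERIFIER_URL ?? "").trim();
  const parsed = Number(env.VERIFIER_TIMEOUT_MS);
  return {
    url: raw === "" ? null : raw.replace(/\/+$/, ""), // drop trailing slashes
    timeoutMs: Number.isFinite(parsed) && parsed > 0 ? parsed : 2000, // assumed default
  };
}

// POSTs the proof to the verifier service and aborts after timeoutMs.
// The request body shape is an assumption, not the documented contract.
async function verifyViaService(
  url: string,
  timeoutMs: number,
  proof: unknown,
  publicSignals: unknown,
): Promise<VerifyResult> {
  const res = await fetch(`${url}/verify`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ proof, publicSignals }),
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!res.ok) throw new Error(`verifier service returned HTTP ${res.status}`);
  return (await res.json()) as VerifyResult;
}

// Returns the inline function unchanged when no URL is configured.
function selectVerifier(cfg: VerifierConfig, inline: VerifyFn): VerifyFn {
  const { url, timeoutMs } = cfg;
  if (url === null) return inline;
  return (proof, publicSignals) => verifyViaService(url, timeoutMs, proof, publicSignals);
}
```

Reading the configuration once and returning a single function keeps the decision in one place, which should make it simpler to delete the inline branch later.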
pulkitpareek18 added a commit that referenced this pull request on May 15, 2026

The deploy after PR #35 succeeded in building both containers, but the verifier never became 'healthy' from Docker's perspective:

    Connecting to localhost:3001 ([::1]:3001)
    wget: can't connect to remote host: Connection refused

Root cause: alpine ships busybox wget. Busybox wget resolves `localhost` to ::1 (IPv6) first and does NOT fall back to 127.0.0.1 (IPv4) on refusal. The verifier binds 0.0.0.0 (IPv4-only). The result was connection refused on every healthcheck, the container marked unhealthy after 3 retries, and zeroauth-prod (which depends on it via depends_on: service_healthy) never started.

Result: prod was 502 for ~3 minutes between 07:21 UTC and 07:25 UTC until I manually started zeroauth-prod with --no-deps via SSH. That restored service. The verifier was running and responding to requests fine the whole time; only the healthcheck command was wrong.

Fix: use the literal 127.0.0.1 in both the Dockerfile HEALTHCHECK and the compose-level healthcheck. The two are redundant by design: compose-level wins for `docker compose` orchestration; Dockerfile HEALTHCHECK wins for `docker run` outside compose. Both need to be correct. A comment was added in both places explaining why localhost is wrong, so the next operator doesn't revert it.

Production state right now: zeroauth-prod is up + healthy via the manual --no-deps recovery. The verifier is up + responding but marked unhealthy by Docker (cosmetic: it doesn't block anything since prod is now running without the dependency wait). After this hotfix deploys, both will be healthy and the dependency edge reactivates on next restart.

Verified locally:

    docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health
    → {"status":"ok","version":"0.1.0","vkeyAvailable":true,...}

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
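The hotfix keeps busybox wget and only swaps `localhost` for 127.0.0.1. As an illustration of the same idea, here is a minimal Node-based probe that pins the IPv4 loopback literal, so resolver ordering cannot matter. It is a swapped-in technique, not what the Dockerfile or compose file uses; `healthUrl`, `isHealthy`, and `probe` are hypothetical names, and treating `vkeyAvailable: true` as a requirement for "healthy" is an assumption.

```typescript
import * as http from "node:http";

// Fields taken from the /health response quoted above.
interface HealthBody {
  status: string;
  vkeyAvailable: boolean;
}

// Literal IPv4 loopback: "localhost" may resolve to ::1 first, and a server
// bound to 0.0.0.0 is not listening on IPv6.
function healthUrl(port: number): string {
  return `http://127.0.0.1:${port}/health`;
}

// Assumption: healthy means HTTP 200, status "ok", and a loaded vkey.
function isHealthy(statusCode: number, rawBody: string): boolean {
  if (statusCode !== 200) return false;
  try {
    const body = JSON.parse(rawBody) as Partial<HealthBody>;
    return body.status === "ok" && body.vkeyAvailable === true;
  } catch {
    return false; // a non-JSON body counts as unhealthy
  }
}

// Resolves true or false and never rejects, so a caller can map the result
// straight to an exit code, e.g.
//   probe(3001).then((ok) => process.exit(ok ? 0 : 1));
function probe(port: number, timeoutMs = 2000): Promise<boolean> {
  return new Promise((resolve) => {
    const req = http.get(healthUrl(port), { timeout: timeoutMs }, (res) => {
      let raw = "";
      res.setEncoding("utf8");
      res.on("data", (chunk: string) => (raw += chunk));
      res.on("end", () => resolve(isHealthy(res.statusCode ?? 0, raw)));
    });
    // The "timeout" event does not abort the request by itself.
    req.on("timeout", () => req.destroy(new Error("health probe timed out")));
    req.on("error", () => resolve(false));
  });
}
```

Whether a Node probe is preferable to wget in the image is a separate trade-off; the point here is only that the address is a literal and never goes through name resolution.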
pulkitpareek18 added a commit that referenced this pull request on May 15, 2026

Task 4 of today. Formally records the decision Pulkit made yesterday when he picked Plan B over Plan A. Captures the three reasons single-engineer velocity beat the brainstorm's Rust spec, what we gave up (reproducible-build provenance, smaller transitive surface, unsafe-discipline) and what we kept (cross-repo HTTP shape stays Rust-compatible if we ever swap).

Also pins the inline-fallback retirement plan:
- 2026-05-15: verifier shipped, inline path unused but compiled-in
- 2026-05-16 → 2026-06-06: 3-week soak in prod
- 2026-06-08: PR to delete verifyInline + snarkjs from root deps + refuse-to-start when VERIFIER_URL is unset
- 2026-06-09: prod runs verifier-only

References the three shipping PRs (#35 cutover, #36 healthcheck hotfix, #37 SQLite audit log) + the plan-mode design doc + the B02 build prompt that we rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
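The "refuse-to-start when VERIFIER_URL is unset" step planned for 2026-06-08 could be sketched as a startup guard like the one below. The function name `requireVerifierUrl` and the error messages are hypothetical; the actual PR may validate differently.

```typescript
// Hypothetical startup guard for the post-retirement world, where no inline
// fallback remains to fall back to.
function requireVerifierUrl(env: Record<string, string | undefined>): URL {
  const raw = (env.VERIFIER_URL ?? "").trim();
  if (raw === "") {
    throw new Error("VERIFIER_URL is unset; refusing to start without a verifier service");
  }

  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error(`VERIFIER_URL is not a valid URL: ${raw}`);
  }

  // Only HTTP(S) endpoints make sense for the verifier's HTTP API.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`VERIFIER_URL must use http or https, got ${url.protocol}`);
  }
  return url;
}
```

Calling this once at process start, before the server begins listening, turns a misconfigured environment into an immediate crash instead of a failure on the first verification request.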
pulkitpareek18 added a commit that referenced this pull request on May 15, 2026

* QA log - 2026-05-15 (HOLD, surrogate green, email NOW delivering)
* B02 Phase 2: ship verifier as its own container in prod
pulkitpareek18 added a commit that referenced this pull request on May 28, 2026

Delivers the A35-W3-Mon outline + A35-W4-Mon full script combined into a
single 898-line operator runbook for the 22-minute Anchor Bank demo
defined in docs/plan/bfsi-v1/02-bank-demo.md.

Twelve sections cover the entire room-time:

1. Pre-demo setup checklist (T-24h): equipment kit, network sanity,
   phone inventory, the seed-demo-tenants.ts live-key handling, dashboard
   and Basescan tab prep, dry-run, sleep.
2. Day-of setup (T-30 min): physical setup, browser/shell warm-up,
   phone setup, pre-checks.
3. Opening 30-second pitch (verbatim from 02-bank-demo.md operator
   script).
4-9. Scenes 1-6: every keystroke, every sentence the operator speaks,
   what appears on the projector, what the CISO/CFO/CRO/CIO/GC each see.
   Scene 3 includes the substitution-attack demonstration. Scene 4
   includes the \d users + SELECT * FROM users + DPDP 2(t) reading
   moment. Scene 5 includes the UPDATE audit_events tamper + on-chain
   anchor cross-check.
10. Q&A bank: 13 questions sourced from 02-bank-demo.md with prepared
   2-3 sentence operator answers.
11. Recovery playbook 11a-11f: kiosk freeze, app crash, network drop,
   tier-2 device (no StrongBox), R307 missing, proof verification
   rejection (the worst nightmare). Each has a calm-recovery script.
12. Post-demo (T+10 min): leave-behind folder contents, the 90-second
   ask, follow-up cadence (T+0 through T+42), debrief, photo policy,
   cleanup.

Two appendices: operator wallet-card contact list + timing reference.

References docs/plan/bfsi-v1/02-bank-demo.md as the canonical demo spec,
docs/plan/bfsi-v1/01-pain-points.md for the P1-P10 cross-references, and
scripts/seed-demo-tenants.ts for the exact tenant + API-key format.

Owner: Agent #35 (writer-compliance) + Agent #45 (solutions architect).

[no-test] markdown-only.
Task 2 of today's plan. Flips production from inline-snarkjs to the dedicated verifier container that's been shipped-but-unused since PR #29 yesterday.
What changes
Dockerfile
docker-compose.yml
Local validation
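```bash
docker build --target verifier-production -t zeroauth-verifier:test .   # builds clean
docker run -d --rm -p 3099:3001 zeroauth-verifier:test                  # host port 3099 → container port 3001
curl http://127.0.0.1:3099/health   # {"status":"ok","vkeyAvailable":true,"version":"0.1.0","uptimeSeconds":5}
# junk-proof.json is a placeholder for a structurally-valid junk proof; a JSON request body is assumed.
curl -X POST -H 'Content-Type: application/json' -d @junk-proof.json http://127.0.0.1:3099/verify
# rejects the junk proof with structuralFallback:false (real Groth16 verify ran)
```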
Post-deploy verification
After this merges + the deploy workflow runs:
Test plan
Out of scope (separate PRs today)
🤖 Generated with Claude Code