
B02 Phase 2: ship verifier as its own container in production #35

Merged
pulkitpareek18 merged 2 commits into main from dev
May 15, 2026

@pulkitpareek18
Collaborator

Task 2 of today's plan. Flips production from inline-snarkjs to the dedicated verifier container that's been shipped-but-unused since PR #29 yesterday.

What changes

Dockerfile

  • New `verifier-build` stage — npm-ci against the root lockfile (reproducible build), compiles `verifier/src/` → `verifier/dist/`.
  • New `verifier-production` stage — slim alpine, non-root uid 1001, flat `npm install --omit=dev` against `verifier/package.json` (4 prod deps: express, snarkjs, winston, uuid). Copies the compiled JS + the production vkey. Binds `0.0.0.0:3001` inside the container (Docker network only — no host binding).

docker-compose.yml

  • New `zeroauth-verifier` service in `['dev', 'prod']` profiles. `expose: 3001` with no `ports:` mapping, so the service is reachable only on the Docker network and is never published on the host. Healthcheck wired.
  • `zeroauth-prod` gains:
    • `VERIFIER_URL=http://zeroauth-verifier:3001` in `environment:` (not in .env, which eliminates drift)
    • `VERIFIER_TIMEOUT_MS=2000`
    • `depends_on: zeroauth-verifier: condition: service_healthy` (deploy fails loud if verifier can't load vkey or bind port)
  • `zeroauth-dev` gets the same wiring so local dev hits the service path by default. Override with `VERIFIER_URL=` (empty) in your `.env` to keep the inline-fallback for fast iteration.
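
The switch described above (service path when `VERIFIER_URL` is set, inline fallback when it is empty) might look roughly like the sketch below. Only `VERIFIER_URL` and `VERIFIER_TIMEOUT_MS` come from this PR; `resolveVerifierMode`, `VerifierMode`, and the 2000 ms default are hypothetical names and values chosen for illustration, not the actual code in `zkp.ts`.

```typescript
// Hypothetical sketch: pick the verification path from the environment.
// Only the two env var names are taken from this PR; the rest is assumed.
type VerifierMode =
  | { kind: "service"; url: string; timeoutMs: number }
  | { kind: "inline" };

// Mirrors the compose value; the real default in zkp.ts is not shown in this PR.
const DEFAULT_TIMEOUT_MS = 2000;

function resolveVerifierMode(env: Record<string, string | undefined>): VerifierMode {
  const url = (env.VERIFIER_URL ?? "").trim();
  // Unset or empty (`VERIFIER_URL=` in .env) keeps the inline fallback.
  if (url === "") return { kind: "inline" };

  // Fall back to the default when the timeout is missing or not a positive number.
  const parsed = Number(env.VERIFIER_TIMEOUT_MS);
  const timeoutMs = isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
  return { kind: "service", url, timeoutMs };
}
```

Taking the environment as a parameter rather than reading `process.env` directly keeps the decision easy to unit-test.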

Local validation

```bash
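docker build --target verifier-production -t zeroauth-verifier:test .   # builds clean
docker run -d --rm -p 3099:3001 zeroauth-verifier:test
curl http://127.0.0.1:3099/health   # {"status":"ok","vkeyAvailable":true,"version":"0.1.0","uptimeSeconds":5}
# junk-proof.json is a placeholder for a structurally valid junk proof;
# the request schema is not shown in this PR.
curl -X POST http://127.0.0.1:3099/verify -H 'Content-Type: application/json' -d @junk-proof.json
# rejects the junk proof with structuralFallback:false (real Groth16 verify ran)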
```

Post-deploy verification

After this merges + the deploy workflow runs:

  • `scripts/deploy-remote.sh` does `docker compose --profile prod up -d --build --remove-orphans` — auto-picks-up the new service
  • API container's `src/services/zkp.ts` switches from inline → HTTP because `VERIFIER_URL` is now in its environment
  • I'll smoke a real `/v1/auth/zkp/verify` call afterward and confirm the API logs show `"ZKP: verifier service: PASS/FAIL"` with a `verifierAuditId` (the service path's log shape) instead of `"ZKP: inline Groth16: …"` (the legacy path)
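
For the log check in the last bullet, a small helper could classify API log lines by path. This is a sketch under assumptions: the two prefixes are the log shapes quoted above, while `classifyZkpLogLine` and `cutoverLooksComplete` are hypothetical helpers that do not exist in the repo.

```typescript
// The two prefixes are the log shapes quoted in this PR; the helpers are hypothetical.
type ZkpPath = "service" | "inline" | "unknown";

function classifyZkpLogLine(line: string): ZkpPath {
  if (line.indexOf("ZKP: verifier service:") !== -1) return "service";
  if (line.indexOf("ZKP: inline Groth16:") !== -1) return "inline";
  return "unknown";
}

// True only if at least one service-path line appears and no inline-path line does.
function cutoverLooksComplete(lines: string[]): boolean {
  const paths = lines.map(classifyZkpLogLine);
  const sawService = paths.some((p) => p === "service");
  const sawInline = paths.some((p) => p === "inline");
  return sawService && !sawInline;
}
```

Feeding it the API container's log lines captured after the smoke request would give a yes/no answer on whether the legacy path was hit.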

Test plan

  • `npx tsc --noEmit` clean
  • `npm test` — 228 passing (no change)
  • `docker build --target verifier-production` succeeds
  • Standalone smoke of the built image — `/health` ok, `/verify` runs real Groth16
  • CI green on this PR
  • After merge: deploy completes, both containers healthy
  • After deploy: `/v1/auth/zkp/verify` smoke shows the API hits the verifier service path

Out of scope (separate PRs today)

  • SQLite audit log + hash chain in the verifier (task 3 — next)
  • ADR-0008 (task 4)
  • Promote governance/docs/threat-model/verifier.md from stub (task 5)
  • Inline-fallback retirement in zkp.ts (next week — keeping the safety net while the verifier soaks)

🤖 Generated with Claude Code

pulkitpareek18 and others added 2 commits May 15, 2026 12:43
Plan B (TS workspace) chosen yesterday. Yesterday's PR #29 landed the
verifier package in the repo. Today's PR Phase 2 flips production to
actually use it instead of the inline-snarkjs fallback that's been
serving since v0.

Dockerfile:

- Adds a `verifier-build` stage that npm-ci's the verifier workspace
  against the root lockfile (reproducible), compiles src/ → dist/.
- Adds a `verifier-production` stage: slim alpine image, non-root user
  uid 1001, flat `npm install --omit=dev` (verifier has 4 prod deps:
  express, snarkjs, winston, uuid — workspace-aware ci complicates a
  per-package prod install; trade-off accepted per ADR-0005). Copies
  the compiled JS + the production vkey. Healthcheck on /health.
  Binds 0.0.0.0:3001 inside the container so the Docker network can reach
  it; no host port is published, so it is unreachable from outside that network.

docker-compose.yml:

- New `zeroauth-verifier` service in BOTH the `dev` and `prod`
  profiles. `expose: 3001` (no `ports:` — no host binding).
  Healthcheck wired.
- `zeroauth-prod` gains:
    VERIFIER_URL=http://zeroauth-verifier:3001
    VERIFIER_TIMEOUT_MS=2000
  in its `environment:` block (not via .env — wired directly so a
  hand-edited prod .env can't drift from the compose intent).
- `zeroauth-prod.depends_on` now requires `zeroauth-verifier` to be
  service_healthy before starting. Deploy fails loud if the verifier
  can't load its vkey or bind its port.
- `zeroauth-dev` gets the same wiring so local dev exercises the
  service path by default. Developers who want the inline-snarkjs
  fallback can override with `VERIFIER_URL=` in their .env.

Local validation:

- `docker build --target verifier-production` builds clean
- `docker run` + `curl /health` returns
    {"status":"ok","version":"0.1.0","vkeyAvailable":true,"uptimeSeconds":5}
- `POST /verify` with a structurally-valid junk proof exercises the
  real Groth16 verifier against the real vkey (structuralFallback:false)
  and rejects it correctly (verified:false, 543ms — first call cost
  includes snarkjs init; subsequent calls are <50ms in the same image).

Post-deploy verification plan (after this PR merges):

- The deploy workflow runs scripts/deploy-remote.sh which does
  `docker compose --profile prod up -d --build --remove-orphans` — that
  auto-picks-up the new zeroauth-verifier service.
- After healthchecks pass, the API container's
  src/services/zkp.ts switches from the inline path to the HTTP
  path because VERIFIER_URL is now set in its environment.
- I'll smoke-test by POSTing a /v1/auth/zkp/verify and confirming
  the API logs show "ZKP: verifier service: FAIL/PASS" with a
  verifierAuditId (the service path's signature in zkp.ts:212) instead
  of "ZKP: inline Groth16: …" (the legacy path's signature).

Out of scope (separate follow-ups today):

- SQLite append-only audit log + hash chain in the verifier (task 3)
- ADR-0008 capturing the TS-vs-Rust decision formally (task 4)
- Promotion of governance/docs/threat-model/verifier.md from stub →
  full with A-V01 through A-V05 entries (task 5)
- Retirement of the inline-fallback code path in zkp.ts (next week)

Tests: 228 passing (no change). Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 07:19
@pulkitpareek18 pulkitpareek18 merged commit 2b7f3bf into main May 15, 2026
2 of 3 checks passed
@pulkitpareek18 pulkitpareek18 deleted the dev branch May 15, 2026 07:20
pulkitpareek18 added a commit that referenced this pull request May 15, 2026

The deploy after PR #35 succeeded in building both containers but the
verifier never became 'healthy' from Docker's perspective:

  Connecting to localhost:3001 ([::1]:3001)
  wget: can't connect to remote host: Connection refused

Root cause: alpine ships busybox wget. Busybox wget resolves `localhost`
to ::1 (IPv6) first and does NOT fall back to 127.0.0.1 (IPv4) on
refusal. The verifier binds 0.0.0.0 (IPv4-only). Connection refused
on every healthcheck, container marked unhealthy after 3 retries,
zeroauth-prod (which depends on it via depends_on: service_healthy)
never started.

Result: prod was 502 for ~3 minutes between 07:21 UTC and 07:25 UTC
until I manually started zeroauth-prod with --no-deps via SSH. That
restored service. The verifier was running and responding to requests
fine the whole time — only the healthcheck command was wrong.

Fix: use the literal 127.0.0.1 in both the Dockerfile HEALTHCHECK
and the compose-level healthcheck. The two are redundant by design:
compose-level wins for `docker compose` orchestration; Dockerfile
HEALTHCHECK wins for `docker run` outside compose. Both need to be
correct.
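
To keep the `localhost` form from creeping back in, a check along these lines could run against both healthcheck commands. It is a hypothetical sketch: `usesLocalhost` and `pinToIpv4Loopback` are illustrative names, and the command string in the usage note is modelled on the `wget` call under "Verified locally", not copied from the Dockerfile.

```typescript
// Hypothetical guard: detect and rewrite healthcheck commands that probe `localhost`.

// Matches `localhost` used as a URL host, e.g. http://localhost:3001/health
function usesLocalhost(healthcheckCommand: string): boolean {
  return /\/\/localhost(?=[:\/\s]|$)/i.test(healthcheckCommand);
}

// Pins the probe to the IPv4 loopback address, which a 0.0.0.0 bind accepts.
function pinToIpv4Loopback(healthcheckCommand: string): string {
  return healthcheckCommand.replace(/\/\/localhost(?=[:\/\s]|$)/gi, "//127.0.0.1");
}
```

For example, `pinToIpv4Loopback("wget -qO- http://localhost:3001/health")` yields `wget -qO- http://127.0.0.1:3001/health`, and `usesLocalhost` on that result returns `false`.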

Comment added in both places explaining why localhost is wrong, so
the next operator doesn't revert.

Production state right now: zeroauth-prod is up + healthy via the
manual --no-deps recovery. The verifier is up + responding but
marked unhealthy by Docker (cosmetic — it doesn't block anything
since prod is now running without the dependency wait). After this
hotfix deploys, both will be healthy and the dependency edge
reactivates on next restart.

Verified locally:
  docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health
  → {"status":"ok","version":"0.1.0","vkeyAvailable":true,...}

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pulkitpareek18 pulkitpareek18 review requested due to automatic review settings May 15, 2026 07:40
pulkitpareek18 added a commit that referenced this pull request May 15, 2026
Task 4 of today. Formally records the decision Pulkit made yesterday
when he picked Plan B over Plan A. Captures the three reasons
single-engineer velocity beat the brainstorm's Rust spec, what we
gave up (reproducible-build provenance, smaller transitive surface,
unsafe-discipline) and what we kept (cross-repo HTTP shape stays
Rust-compatible if we ever swap).

Also pins the inline-fallback retirement plan:
- 2026-05-15: verifier shipped, inline path unused but compiled-in
- 2026-05-16 → 2026-06-06: 3-week soak in prod
- 2026-06-08: PR to delete verifyInline + snarkjs from root deps +
  refuse-to-start when VERIFIER_URL is unset
- 2026-06-09: prod runs verifier-only

References the three shipping PRs (#35 cutover, #36 healthcheck hotfix,
#37 SQLite audit log) + the plan-mode design doc + the B02 build
prompt that we rejected.
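
The refuse-to-start step planned for 2026-06-08 could be as small as the guard below. This is a sketch only: `requireVerifierUrl` and the error text are hypothetical, and the real change has not been written yet.

```typescript
// Hypothetical startup guard for the planned verifier-only mode.
// Only the VERIFIER_URL name comes from the plan; the function name and message are assumed.
function requireVerifierUrl(env: Record<string, string | undefined>): string {
  const url = (env.VERIFIER_URL ?? "").trim();
  if (url === "") {
    // With the inline path deleted there is nothing to fall back to, so fail at boot.
    throw new Error("VERIFIER_URL is unset; refusing to start without a verifier service");
  }
  return url;
}
```

Calling this once during boot, before the HTTP server starts listening, would make a misconfigured deploy fail immediately instead of failing on the first verify request.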

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pulkitpareek18 added a commit that referenced this pull request May 15, 2026
* QA log — 2026-05-15 (HOLD, surrogate green, email NOW delivering)

* B02 Phase 2: ship verifier as its own container in prod

pulkitpareek18 added a commit that referenced this pull request May 28, 2026
Delivers the A35-W3-Mon outline + A35-W4-Mon full script combined into a
single 898-line operator runbook for the 22-minute Anchor Bank demo
defined in docs/plan/bfsi-v1/02-bank-demo.md.

Twelve sections cover the entire room-time:

  1. Pre-demo setup checklist (T-24h) — equipment kit, network sanity,
     phone inventory, the seed-demo-tenants.ts live-key handling, dashboard
     and Basescan tab prep, dry-run, sleep.
  2. Day-of setup (T-30 min) — physical setup, browser/shell warm-up,
     phone setup, pre-checks.
  3. Opening 30-second pitch (verbatim from 02-bank-demo.md operator
     script).
  4-9. Scenes 1-6 — every keystroke, every sentence the operator speaks,
       what appears on the projector, what the CISO/CFO/CRO/CIO/GC each see.
       Scene 3 includes the substitution-attack demonstration. Scene 4
       includes the \d users + SELECT * FROM users + DPDP 2(t) reading
       moment. Scene 5 includes the UPDATE audit_events tamper + on-chain
       anchor cross-check.
  10. Q&A bank — 13 questions sourced from 02-bank-demo.md with prepared
      2-3 sentence operator answers.
  11. Recovery playbook 11a-11f — kiosk freeze, app crash, network drop,
      tier-2 device (no StrongBox), R307 missing, proof verification
      rejection (the worst nightmare). Each has a calm-recovery script.
  12. Post-demo (T+10 min) — leave-behind folder contents, the 90-second
      ask, follow-up cadence (T+0 through T+42), debrief, photo policy,
      cleanup.

Two appendices: operator wallet-card contact list + timing reference.

References docs/plan/bfsi-v1/02-bank-demo.md as the canonical demo spec,
docs/plan/bfsi-v1/01-pain-points.md for the P1-P10 cross-references, and
scripts/seed-demo-tenants.ts for the exact tenant + API-key format.

Owner: Agent #35 (writer-compliance) + Agent #45 (solutions architect).

[no-test] markdown-only.