| status | done | |
|---|---|---|
| depends | | |
| specs | | |
| issues | | |
| pr | 35 | |
Containerization, Helm chart, CI/CD wiring, secret management, and bucket provisioning. Stand up a staging environment that the team can hit from outside their laptops. Production deploy follows the same template, pointed at production secrets.
Can start in parallel with the API/web work since this plan exercises just the boot path, but should land before cutover-prep so the actual cutover is just a config flip.
Out of scope: cutover orchestration (cutover-prep); production data (laddr-import).
- architecture.md — the Deploy section's env-var table, the "single Docker image bundles API + static web" claim, the k8s Helm chart story, the bucket-versioning requirement, the data repo's working-tree-on-startup pattern
Dockerfile at the repo root, multi-stage:
- Build stage: `FROM node:22-alpine`, copy lockfile, `npm ci`, copy source, `npm run build`
- Runtime stage: `FROM node:22-alpine`, copy `dist/`, `node_modules` (production only via `npm prune --production`), `package.json`. Install `git` (the API shells out for `git push` via the gitsheets push daemon). Install `ca-certificates`.
- Entrypoint: a small shell script that:
  - Clones `CFP_DATA_REMOTE` to `CFP_DATA_REPO_PATH` (or pulls if already present)
  - `exec node apps/api/dist/index.js`
Single image serves both API (port 3001) and static apps/web/dist/ via @fastify/static.
deploy/charts/codeforphilly/ — modeled on the existing legacy laddr Helm chart but trimmed:
deploy/charts/codeforphilly/
├── Chart.yaml
├── values.yaml
├── values.staging.yaml
├── values.production.yaml
└── templates/
    ├── deployment.yaml # single replica per architecture.md
    ├── service.yaml
    ├── ingress.yaml # TLS via cert-manager
    ├── pvc.yaml # for the data repo working tree
    ├── configmap.yaml # non-secret env
    └── secrets.yaml # sealed-secrets
The Deployment specifies:
- `replicas: 1` (hard constraint per architecture.md)
- `strategy.type: Recreate` (no rolling: single replica + write mutex means concurrent old/new replicas would corrupt state)
- Volume mount for the data repo PVC
- Init container or entrypoint command that clones/pulls the data repo
All secrets go through sealed-secrets per the legacy cluster's existing pattern. Required secrets:
- `CFP_JWT_SIGNING_KEY`: generated with `openssl rand -base64 64`
- `GITHUB_OAUTH_CLIENT_SECRET`: from the GitHub OAuth App (one app for staging, one for production)
- `SAML_PRIVATE_KEY` + `SAML_CERTIFICATE`: generated with the openssl recipe in laddr's `docs/operations/update-saml2-certificate.md`
- `S3_ACCESS_KEY_ID` + `S3_SECRET_ACCESS_KEY`: from whatever bucket provider we end up on
- A deploy key (SSH) for pushing to the data repo (write access only to one branch)
Document each secret's generation + rotation procedure in docs/operations/secrets.md.
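A minimal sketch of a boot-time check that the env-delivered secrets listed above are actually present. The plan validates env via Zod; this dependency-free version is only a stand-in to illustrate the idea. `missingSecrets` and `Env` are hypothetical names, and the deploy key is assumed to be mounted as a file rather than passed as an env var, so it is not checked here.

```typescript
// Names taken from the required-secrets list in this plan.
// Assumption: the SSH deploy key is mounted as a file, so it is not an env var.
const REQUIRED_SECRETS = [
  "CFP_JWT_SIGNING_KEY",
  "GITHUB_OAUTH_CLIENT_SECRET",
  "SAML_PRIVATE_KEY",
  "SAML_CERTIFICATE",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the names of required secrets that are unset or blank.
function missingSecrets(env: Env): string[] {
  return REQUIRED_SECRETS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In the API this could run against `process.env` before any store is loaded, exiting non-zero when the returned list is not empty so that a sealed-secret that failed to inject surfaces as a crash rather than a half-working pod.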
The deploy plan picks the production bucket provider. Options:
- MinIO inside the cluster — zero outside dependency; uses cluster's existing storage; cheapest. Adds a small MinIO Helm chart.
- Cloudflare R2 — zero egress fees; uses the existing Cloudflare account if Code for Philly has one; pennies per month at our scale.
- Backblaze B2 — similar.
- AWS S3 — standard; slightly more expensive but most familiar.
Pick at the start of this plan and document the choice. Whichever provider is chosen:
- Enable bucket versioning (per behaviors/private-storage.md)
- IAM policy scoped to the bucket only
- Lifecycle rule deleting non-current versions after 365 days
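As an illustration of the lifecycle requirement, the sketch below builds a configuration object in the shape used by S3's bucket lifecycle API (`Rules`, `NoncurrentVersionExpiration`, `NoncurrentDays`). The local interfaces, the `noncurrentExpiryRule` helper, and the rule ID are hypothetical. Support for versioning and lifecycle rules differs between S3-compatible providers, so this should be verified against whichever one is picked.

```typescript
// Minimal local types mirroring only the S3 lifecycle fields used here.
interface LifecycleRule {
  ID: string;
  Status: "Enabled" | "Disabled";
  Filter: { Prefix: string };
  NoncurrentVersionExpiration: { NoncurrentDays: number };
}

interface LifecycleConfiguration {
  Rules: LifecycleRule[];
}

// Hypothetical builder; the rule ID is an illustrative name, not one from the plan.
function noncurrentExpiryRule(days: number): LifecycleConfiguration {
  if (!Number.isInteger(days) || days <= 0) {
    throw new RangeError("days must be a positive integer");
  }
  return {
    Rules: [
      {
        ID: "expire-noncurrent-versions",
        Status: "Enabled",
        Filter: { Prefix: "" }, // empty prefix: the rule applies to every object
        NoncurrentVersionExpiration: { NoncurrentDays: days },
      },
    ],
  };
}

// 365 days, per the requirement above.
const lifecycle = noncurrentExpiryRule(365);
```

Current object versions are untouched by this rule; only versions that have been superseded age out.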
.github/workflows/:
- `ci.yml` (exists from `workspace`): runs on PRs (lint, type-check, build, test)
- `deploy-staging.yml`: runs on merge to `main` (build image, push to ghcr.io, update the staging Helm release via `helm upgrade`)
- `deploy-production.yml`: runs on git tag push (same flow against production values)
Use actions/checkout@v4, docker/login-action@v3, azure/setup-helm@v4 (or azure/setup-kubectl@v4 + raw kubectl). Pin each action version per CLAUDE.md tooling rule (check the action's repo first via gh-axi repo view).
/api/health (already from api-skeleton) is the liveness probe. Readiness probe checks /api/health/ready — added in this plan, returns 200 only after both stores have loaded.
1. K8s deployment starts; init container clones CFP_DATA_REMOTE
2. Main container starts; entrypoint validates env via Zod
3. API loads public gitsheets data into memory
4. API loads private store from S3 into memory
5. API builds FTS index
6. API starts the push daemon
7. API binds to port 3001; readiness probe returns 200
8. K8s routes traffic
If any step fails: the container exits, k8s restarts it, alert fires.
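The readiness rule in the boot sequence above can be sketched as a small gate that the store loaders report into and the `/api/health/ready` handler reads from. `ReadinessGate` and `Store` are hypothetical names, and returning 503 before both stores are loaded is an assumption: the plan only states when the probe returns 200.

```typescript
type Store = "public" | "private";

// Hypothetical tracker for the two in-memory stores named in the boot sequence.
class ReadinessGate {
  private readonly loaded = new Set<Store>();

  // Called by each loader once its data is fully in memory.
  markLoaded(store: Store): void {
    this.loaded.add(store);
  }

  // Ready only once both the public gitsheets store and the private store have loaded.
  isReady(): boolean {
    return this.loaded.has("public") && this.loaded.has("private");
  }

  // Status code for the readiness handler: 200 when ready, 503 (assumed) otherwise.
  statusCode(): 200 | 503 {
    return this.isReady() ? 200 : 503;
  }
}
```

Keeping liveness (`/api/health`) separate from this gate means a slow initial load holds traffic back without causing k8s to restart the container.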
Same image; different values. Staging:
- `codeforphilly-rewrite-staging.k8s.phl.io` (or similar)
- Separate GitHub OAuth App registered with the staging callback
- Separate bucket (or a separate prefix in the same bucket)
- Separate SAML cert (or shared with a different entityID)
- Data repo: a staging-only branch or a separate repo with anonymized data
Pino's structured JSON logs go to stdout, where the k8s log aggregator captures them. Prometheus scrape config is left out of v1 (pino-prometheus-style or a /metrics endpoint can be added when there's a specific metric we want to alert on).
- `docker build .` produces an image; `docker run` boots the API
- The same image serves both `/api/*` and the static SPA
- `helm install` to a staging namespace boots the deployment cleanly
- Ingress + TLS works (verified by hitting `https://codeforphilly-rewrite-staging.k8s.phl.io/api/health` from outside)
- The data repo PVC persists across pod restarts (verify by killing the pod and observing the API comes back without re-cloning)
- The push daemon successfully pushes a test commit to the data remote (using the deploy key)
- The S3-backed PrivateStore reads/writes against the production bucket; bucket versioning works (verify a PUT increments the version)
- Readiness probe returns 200 only after both stores load (verify by intentionally pointing at an empty bucket; readiness fails until populated)
- CI workflows pass and produce deployable artifacts
- Sealed-secrets in the cluster decrypt and inject correctly
- Operational docs in `docs/operations/`: secrets management, runbook for "API won't boot", cert rotation
- PVC sizing. The data repo working tree's size depends on history; estimate at ~100MB initial post-import, ~1GB after a few years of activity. 5GB PVC gives plenty of headroom.
- Deploy key vs GitHub App for push auth. Deploy key is simpler; GitHub App is easier to rotate. Either works. Probably deploy key for v1.
- Init container vs entrypoint clone. Init container is k8s-idiomatic but adds a layer. Entrypoint clone is simpler. Either works.
- Helm chart drift from legacy. The legacy CFP Helm chart is the reference for cluster conventions. Don't reinvent — copy + adapt.
- Cluster + bucket stand-up are not closeable from a dev workstation. Six of the validation criteria need a live k8s cluster, a real bucket, or both (`docker run` smoke, `helm install`, ingress/TLS, PVC persistence, push daemon, S3 PrivateStore, sealed-secrets injection). They're left unchecked and tracked by #36 so they close out when a human operator actually stands up staging.
- Single notFoundHandler, even without an SPA bundled. `apps/api/src/plugins/static-web.ts` installs a not-found handler unconditionally: when `CFP_WEB_DIST_PATH` is unset (dev/tests) it still returns the JSON envelope for unknown paths. Avoids drift between dev and prod 404 behavior on `/api/*`.
- index.html is read into memory at boot. fastify-static's per-file cache-control headers stamped the SPA entrypoint with `immutable max-age=1y`, which is wrong for the file that decides which hashed assets the browser loads next. The notFoundHandler serves a cached buffer with `cache-control: no-cache` instead. Hashed assets in `/assets/*` keep the long cache.
- `Recreate` over `RollingUpdate`. A rolling update would temporarily run two pods, each holding the gitsheets write mutex against the same PVC working tree. Old + new pods committing concurrently would interleave commits and corrupt state. Recreate forces the old pod down before the new one starts; brief downtime is the price.
- Entrypoint clone, not init container. Simpler: one container in the pod, one log stream, no separate ServiceAccount semantics. The plan's "init container vs entrypoint" unknown resolved to entrypoint.
- Filesystem private-store on staging until a bucket exists. `values.staging.yaml` sets `storage.backend=filesystem` against a small PVC so the chart can stand up before the bucket is provisioned. Flipping to S3 once a real bucket exists is values-only (no code change, no schema migration).
- `helm` pinned to 4.1.0 via asdf. Workflow actions install their own helm (v3.16.2 via `azure/setup-helm@v4`); local validation uses the asdf pin. v4 lints v3 charts cleanly.
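The caching split between the SPA entrypoint and hashed assets can be sketched as a single path-based decision. This is an illustration, not the code in `static-web.ts`: `cacheControlFor` is a hypothetical helper, and the exact long-cache header value is an assumption (the notes only say index.html gets `no-cache` and `/assets/*` keeps a long cache).

```typescript
// Assumed header values: one year expressed in seconds for hashed assets.
const LONG_CACHE = "public, max-age=31536000, immutable";
const NO_CACHE = "no-cache";

// Hypothetical helper: choose the cache-control header for a request path.
// Hashed files under /assets/ never change, so they can be cached long-term;
// everything else falls through to index.html, which must be revalidated.
function cacheControlFor(urlPath: string): string {
  return urlPath.startsWith("/assets/") ? LONG_CACHE : NO_CACHE;
}
```

Revalidating index.html on every load is what lets a new deploy take effect immediately: the browser fetches the fresh entrypoint, which then points at the new hashed asset names.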
- Issue #36: stand up staging cluster + bucket, generate per-environment secrets, run `deploy-staging.yml`, verify external `curl` to `/api/health` + `/api/health/ready`, verify PVC persistence + push-daemon push (closes the six unchecked validation criteria).
- Tracked as: bucket provider choice (R2 / B2 / S3 / MinIO) deferred to whoever stands up staging. The decision is deliberately left to the operator, with the bucket-provisioning checklist in docs/operations/deploy.md. Until decided, staging runs on filesystem.
- Tracked as: production cluster stand-up is the same template with `values.production.yaml`; a separate issue should be filed once staging is green, not now.