Merged
14 changes: 14 additions & 0 deletions app.yaml
@@ -6,11 +6,25 @@ build_env_variables:
instance_class: F4
automatic_scaling:
  max_concurrent_requests: 100
  # Cost cap: bounds worst-case fan-out if a render storm or crash loop drives
  # autoscaling (issue #694). 8 F4 instances is far above routine load; raise
  # deliberately if organic traffic ever needs it. Mirror this into
  # .app.prod.yaml -- scripts/validate-app-prod-config.mjs enforces it there.
  max_instances: 8

handlers:
- url: /static
  static_dir: public/static
  secure: always
  http_headers:
    # The embeddable web component (sd-component.js) is hotlinked from
    # third-party origins (issue #688); its engine worker's WASM fetch (and
    # any @font-face/CSS asset loads) are cross-origin CORS requests against
    # these assets. They are public, immutable, content-hashed files served
    # without credentials, so a wildcard origin is appropriate and adds no
    # risk. Mirror this into .app.prod.yaml --
    # scripts/validate-app-prod-config.mjs enforces it there.
    Access-Control-Allow-Origin: "*"

- url: /$
  static_files: public/index.html
16 changes: 11 additions & 5 deletions docs/dev/deploy.md
@@ -1,6 +1,6 @@
# Deploying to production

-**Last reviewed:** 2026-05-08
+**Last reviewed:** 2026-07-01

The web app at `app.simlin.com` runs on Google App Engine standard. GAE serves the static React SPA built from `src/app` and runs the Express backend in `src/server` (Firebase Auth, models persisted in Firestore as protobuf). `@simlin/mcp`, `@simlin/serve`, `pysimlin`, and `simlin-cli` are released separately to npm/PyPI -- they aren't part of this deploy.

@@ -27,6 +27,7 @@ export NODE_ENV=production
pnpm clean # cargo clean + each package's clean script
pnpm build # pnpm -r run build: Rust+WASM, then every TS package
pnpm --filter @simlin/app run deploy:assemble # copy build/ and build-component/ into public/; drop symlinks
scripts/check-upload-file-count.sh . # abort if the upload set would trip GAE's 10,000-file cap
gcloud app deploy ./.app.prod.yaml # upload the repo minus .gcloudignore, switch traffic
# A bash trap runs the cleanup below on EXIT/INT/TERM, even if any step above fails:
pnpm --filter @simlin/app run deploy:clean # git checkout the symlinks and index.html; rm build artifacts
@@ -103,6 +104,8 @@ On the instance GAE runs `pnpm install`, then the root `start` script: `node src

- `runtime`: `nodejs24`. (`nodejs18` is EOL on GAE; `nodejs16`, the runtime of the last deploy before May 2026, is gone entirely -- which is why there's no "redeploy the old commit" rollback.)
- `build_env_variables.GOOGLE_NODE_RUN_SCRIPTS`: `''`. The local deploy script already runs the monorepo build and stages the exact artifact to upload; this prevents App Engine's Node buildpack from running the root `build` script again during staging.
- `automatic_scaling.max_instances`: `8`. Cost cap so a render storm or crash loop can't fan out F4 instances without bound (issue #694). **Mirror this into `.app.prod.yaml`** -- both deploy scripts run `scripts/validate-app-prod-config.mjs`, which fails the deploy if it's missing or not a positive integer. Raise it deliberately if organic traffic ever needs more instances.
- The `/static` handler's `http_headers` with `Access-Control-Allow-Origin: "*"`. Third-party pages hotlink `sd-component.js`, so its engine worker's WASM fetch and any font/CSS asset loads are cross-origin requests against `/static` that the browser checks under CORS (issue #688); without this header, embeds load their data but the engine never initializes. The assets are public, immutable, and served without credentials, so the wildcard adds no risk. **Mirror this into `.app.prod.yaml`** -- `scripts/validate-app-prod-config.mjs` fails the deploy if the header is missing.
- The handler list: `/static`, `/`, `/new`, `/legal*`, `/privacy`, `robots.txt`, `ads.txt`, favicon, `manifest.json`, then `/.*` -> `script: auto`. The `/` and `/new` static HTML handlers carry CSP/HSTS headers because they bypass Express Helmet. The SPA's dynamic routes like `/:username/:projectName` fall through to `/.*`, i.e. the Express server.
- `env_variables`: the committed `app.yaml` has none; the server needs a couple (next section). They live in `.app.prod.yaml`.
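
The two mirrored settings above can be spot-checked by hand before a deploy. The following is a hypothetical shell approximation, not the logic of `scripts/validate-app-prod-config.mjs` (whose source is not shown here): the function names are invented for illustration, and the header check is a plain text match rather than a YAML parse.

```bash
# Hypothetical approximation of the rules described above; the real
# validator is scripts/validate-app-prod-config.mjs and may differ.

# Succeeds only for a positive integer in plain decimal (e.g. 8).
# Rejects the empty string, anything with a non-digit, and a leading 0.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Reads YAML text on stdin and succeeds if some non-comment line sets the
# wildcard CORS header. Text match only: it does not confirm that the
# header sits under the /static handler.
has_wildcard_cors() {
  grep -Eq '^[[:space:]]*Access-Control-Allow-Origin:[[:space:]]*"\*"[[:space:]]*$'
}
```

For example, `has_wildcard_cors < .app.prod.yaml` exits non-zero when the header line is absent.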

@@ -122,19 +125,22 @@ The server loads `config/default.json`, then `config/production.json` when `NODE
## Pre-deploy checklist

- [ ] CI is green on the commit you're deploying.
-- [ ] `git status` is clean.
+- [ ] `git status` is clean. Note this only makes the tracked-`public/` mutation and cleanup reliable -- it does NOT bound what gets uploaded. The upload set is whatever `.gcloudignore` leaves in, so gitignored/untracked junk (venvs, nested cargo `target/` dirs, scratch checkouts) still counts toward GAE Standard's hard 10,000-file cap (issue #695). Both deploy scripts run [`scripts/check-upload-file-count.sh`](/scripts/check-upload-file-count.sh) immediately before `gcloud app deploy` and abort with a per-top-level-directory breakdown if the cap would be hit; fix by deleting the offending dirs or adding them to `.gcloudignore`.
- [ ] `gcloud config get-value project` is the production project.
- [ ] `gcloud app versions list --service=default` shows a known-good current version -- note its ID, that's your rollback target. (If GAE has garbage-collected it, there is no rollback; see below.)
-- [ ] `.app.prod.yaml` reconciled against `app.yaml` (`runtime: nodejs24`, `build_env_variables.GOOGLE_NODE_RUN_SCRIPTS: ''`, handlers, `authentication__seshcookie__key` set to the value already in use).
+- [ ] `.app.prod.yaml` reconciled against `app.yaml` (`runtime: nodejs24`, `build_env_variables.GOOGLE_NODE_RUN_SCRIPTS: ''`, `automatic_scaling.max_instances: 8`, handlers including the `/static` `Access-Control-Allow-Origin: "*"` header, `authentication__seshcookie__key` set to the value already in use).
- [ ] `wasm-opt --version` works; `rustup show` lists the `wasm32-unknown-unknown` target.
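
The per-top-level-directory breakdown mentioned in the checklist can be reproduced by hand on any list of relative paths. A minimal sketch, assuming one path per line on stdin (as `gcloud meta list-files-for-upload .` prints when run from the directory being uploaded); the function name `top_level_breakdown` is made up for illustration:

```bash
# Summarise a newline-separated list of relative paths by their first
# path component, largest count first, top 10 only. This is the same
# cut/sort/uniq pipeline the gate script applies to its captured list.
top_level_breakdown() {
  cut -d/ -f1 | sort | uniq -c | sort -rn | head -10
}

# Usage (needs gcloud, run from the upload root):
#   gcloud meta list-files-for-upload . | top_level_breakdown
```

Files that sit directly in the upload root have no `/`, so each one shows up as its own entry with a count of 1.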

## Post-deploy smoke test

Against the `--no-promote` version URL, then again on production:

- `curl -sI https://<host>/` -> 200 HTML. View source: it links a hashed `/static/js/index.<hash>.js` (literal `<%= PUBLIC_URL %>` means the build was skipped) and `/static/css/index.<hash>.css`.
- `curl -s https://<host>/healthz` -> 200 `ok`. This is the only check that exercises the Node server: `/` is a GAE static handler and stays green even when every Express instance is crash-looping (e.g. `ServerInitError`). A WASM preload failure aborts boot before the route mounts, so it shows up as a non-responding instance (connection failure / GAE 5xx), not a 503 -- treat any non-200 here as down. (The route's 503 branch is defense-in-depth, not the expected failure signal.)
- `curl -sI https://<host>/static/js/sd-component.js` -> 200 -- the embeddable web component; external sites `<script src>` this exact path.
- `curl -H "Origin: https://example.com" -sI https://<host>/static/js/sd-component.js` -> response includes `access-control-allow-origin: *`. Cross-origin embeds need this on everything under `/static` (worker chunk, WASM -- issue #688). GAE emits `http_headers` unconditionally (the request `Origin` header doesn't matter), so any curl shows it; the earlier smoke checks missed its absence only because they never asserted on response headers. Without it, embeds load data but never initialize the engine.
- `curl -sI https://<host>/static/wasm/<hash>.module.wasm` -> 200, `content-type: application/wasm`.
- Full embed check (the header curl only proves CORS, not the blob-trampoline worker boot): serve `<script src="https://<host>/static/js/sd-component.js"></script><sd-model username="..." projectName="..."></sd-model>` from a different origin (e.g. `python3 -m http.server` on localhost) and confirm the diagram renders and simulates with no console errors.
- `curl -sI` on `/robots.txt`, `/manifest.json`, `/favicon.ico`, `/legal/`, `/privacy/` -> 200; `curl -I http://<host>/` -> 301 to https.
- Browser: log in with Google, land on Home, no console errors.
- New-user flow: sign in with a fresh account, claim a username, confirm the example projects appear and one opens and simulates.
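
The header and health assertions in the list above lend themselves to scripting. A minimal sketch, assuming response headers from `curl -sI` arrive on stdin and the status code is passed as an argument; the helper names are hypothetical and not part of the repo:

```bash
# Succeeds if the response headers on stdin include a wildcard
# Access-Control-Allow-Origin. Header names are case-insensitive and
# HTTP header lines end in CR, so strip CRs before matching.
has_cors_wildcard() {
  tr -d '\r' | grep -Eiq '^access-control-allow-origin:[[:space:]]*\*[[:space:]]*$'
}

# Per the /healthz guidance above: anything other than 200 counts as
# down, including the 000 curl reports when the connection fails.
is_healthy() {
  [ "$1" = "200" ]
}

# Usage (replace <host> as in the checks above):
#   curl -sI "https://<host>/static/js/sd-component.js" | has_cors_wildcard
#   is_healthy "$(curl -s -o /dev/null -w '%{http_code}' "https://<host>/healthz")"
```

Both helpers only report success or failure through their exit status, so they can be chained with `&&` in a smoke-test script or used in an `if`.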
@@ -161,5 +167,5 @@ The `frontend` job in [`.github/workflows/ci.yaml`](/.github/workflows/ci.yaml)
Things to know that don't have a clean fix yet:

- `pnpm deploy:web` deploys from the workspace root, so GAE's Node buildpack installs the *whole workspace's* dependency set on the instance -- `@rsbuild/*`, `jest`, `slate`, `radix`, rspress, vite, and every other package's deps (~590 MB / 1171 packages), none needed by the server at runtime. App Engine standard always reinstalls from the deployed `package.json` + lockfile and has no vendored-`node_modules` escape hatch, so the only lever is the deployed manifest. The smaller-deploy fix is implemented as **`pnpm deploy:web:staged`** (see below); it is locally proven but still pending a real `gcloud --no-promote` test, so `deploy:web` remains the default. Tracked in [docs/tech-debt.md](/docs/tech-debt.md) "Web deploy uploads the whole monorepo and GAE installs the full dep set".
-- Server-side PNG preview (`src/server/render.ts`) parses and rasterizes user-uploaded models in-process with no size cap beyond the 10 MB request body limit and no timeout.
-- There's no error reporting or alerting. Cloud Logging and the GAE metrics dashboard are it.
+- Server-side PNG preview (`src/server/render.ts`) renders user-uploaded models in per-request `worker_threads` workers (each with its own WASM instance) with a 10 s total wall-clock budget per request (queue wait included) and at most 2 concurrent renders -- restoring the isolation the 2022 deploy had (issue #694). What remains rough: there's no model-complexity cap below the 10 MB request body limit, so a pathological model still costs a bounded 10 s worker per attempt before failing with a 500.
+- There's no error reporting or alerting. Cloud Logging and the GAE metrics dashboard are it. The Express `/healthz` route exists as an uptime-check target (see the smoke test above), but no Cloud Monitoring notification channel, uptime check, or alerting policy points at it yet -- that ops-side setup is tracked in [issue #693](https://github.com/bpowers/simlin/issues/693).
6 changes: 6 additions & 0 deletions scripts/build-deploy-staging.mjs
@@ -240,6 +240,12 @@ function verify(stagingDir) {
};

check(fs.existsSync(path.join(stagingDir, 'lib/index.js')), 'lib/index.js missing');
// render.ts spawns this sibling via __dirname at runtime (issue #694); a
// deploy without it 500s every preview while everything else looks healthy.
check(
  fs.existsSync(path.join(stagingDir, 'lib/render-worker.js')),
  'lib/render-worker.js missing (preview renders would 500 at runtime)',
);
check(fs.existsSync(path.join(stagingDir, 'config/production.json')), 'config/production.json missing');
check(
  fs.existsSync(path.join(stagingDir, 'default_projects')) &&
57 changes: 57 additions & 0 deletions scripts/check-upload-file-count.sh
@@ -0,0 +1,57 @@
#!/usr/bin/env bash
#
# Pre-flight gate: fail if the gcloud upload set for DIR meets or exceeds
# App Engine Standard's hard per-deploy file cap (10,000 files).
#
# Why this exists (issue #695): `gcloud app deploy` uploads everything that
# .gcloudignore does not exclude, which is INDEPENDENT of git tracking
# status. A `git status`-clean tree can still hold gitignored artifacts
# (cargo target dirs, Python venvs, third_party checkouts) or brand-new
# untracked scratch dirs that blow the cap -- and without this gate the
# failure only surfaces inside `gcloud app deploy`, after the expensive
# clean+build. `gcloud meta list-files-for-upload` is the only check that
# reflects the real upload set, so we run it once here, immediately before
# the deploy, and turn a late upload failure into a fast, actionable error.
#
# Usage: check-upload-file-count.sh DIR [MAX_FILES]
# MAX_FILES defaults to 10000 (the GAE Standard cap). It is overridable
# so the failure path can be exercised by hand against a small directory
# without building a 10k-file fixture.

set -euo pipefail

if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
  echo "usage: $0 DIR [MAX_FILES]" >&2
  exit 2
fi

DIR="$1"
MAX_FILES="${2:-10000}"

UPLOAD_LIST="$(mktemp)"
trap 'rm -f "$UPLOAD_LIST"' EXIT

# Enumerate exactly once: on a polluted tree this walk covers 100k+ files
# and is the slow part, so both the count and the per-directory breakdown
# below are derived from this single capture. Running from inside DIR makes
# gcloud emit paths relative to DIR, which the breakdown's `cut` relies on.
(cd "$DIR" && gcloud meta list-files-for-upload .) > "$UPLOAD_LIST"

UPLOAD_COUNT="$(wc -l < "$UPLOAD_LIST" | tr -d '[:space:]')"

if [ "$UPLOAD_COUNT" -ge "$MAX_FILES" ]; then
  {
    echo "ERROR: gcloud would upload $UPLOAD_COUNT files from $DIR, but App Engine"
    echo " Standard rejects deploys of $MAX_FILES files or more."
    echo ""
    echo "The upload set is whatever .gcloudignore leaves in -- a clean 'git status'"
    echo "does NOT bound it (gitignored and untracked files still upload). Largest"
    echo "top-level directories in the upload set; delete the junk ones or add them"
    echo "to .gcloudignore:"
    echo ""
    cut -d/ -f1 "$UPLOAD_LIST" | sort | uniq -c | sort -rn | head -10
  } >&2
  exit 1
fi

echo " upload set: $UPLOAD_COUNT files (cap: $MAX_FILES)"
7 changes: 7 additions & 0 deletions scripts/deploy-web-staged.sh
@@ -77,6 +77,13 @@ bash "$REPO_ROOT/scripts/verify-deploy-build.sh"
echo "==> Assembling self-contained server staging dir (scripts/build-deploy-staging.mjs)"
node "$REPO_ROOT/scripts/build-deploy-staging.mjs" "$STAGING_DIR" "$REPO_ROOT/.app.prod.yaml"

# The staging dir is bounded by construction (build-deploy-staging.mjs copies
# an explicit file list), so this gate is cheap here -- it exists to catch a
# regression in the staging assembly (e.g. accidentally vendoring a
# node_modules tree) before the upload starts. See issue #695.
echo "==> Checking upload file count against the GAE 10k cap (scripts/check-upload-file-count.sh)"
bash "$REPO_ROOT/scripts/check-upload-file-count.sh" "$STAGING_DIR"

echo "==> gcloud app deploy $STAGING_DIR/app.yaml"
gcloud app deploy "$STAGING_DIR/app.yaml" "$@"

8 changes: 8 additions & 0 deletions scripts/deploy-web.sh
@@ -70,6 +70,14 @@ pnpm build
echo "==> Staging app build into public/ (pnpm --filter @simlin/app run deploy:assemble)"
pnpm --filter @simlin/app run deploy:assemble

# Gate on the real upload set right before the deploy: this deploy uploads
# from the repo root, where the upload set is whatever .gcloudignore leaves
# in -- independent of git status, and including files the build steps above
# just created. Failing here (instead of inside gcloud app deploy) names the
# offending directories and still runs the cleanup trap. See issue #695.
echo "==> Checking upload file count against the GAE 10k cap (scripts/check-upload-file-count.sh)"
bash "$REPO_ROOT/scripts/check-upload-file-count.sh" "$REPO_ROOT"

echo "==> gcloud app deploy ./.app.prod.yaml"
gcloud app deploy "$REPO_ROOT/.app.prod.yaml" "$@"
