Skip to content

Pre-deploy user-impact fixes: healthz, preview render isolation, cross-origin embeds, upload gate#826

Merged
bpowers merged 4 commits into
mainfrom
deploy-user-impact-fixes
Jul 1, 2026
Merged

Pre-deploy user-impact fixes: healthz, preview render isolation, cross-origin embeds, upload gate#826
bpowers merged 4 commits into
mainfrom
deploy-user-impact-fixes

Conversation

@bpowers

@bpowers bpowers commented Jul 1, 2026

Copy link
Copy Markdown
Owner

Fixes #688
Fixes #694
Fixes #695

Triage pass over epic #733 focused on issues most likely to impact users of the redeployed App Engine app, skipping compat corner cases (MDL/period-ident items, local-tool issues). Each fix was implemented and then adversarially reviewed by an independent pass; all review findings were resolved before commit.

What's here (one commit per fix)

/healthz route + exit-on-failed-boot (in-repo half of #693). GAE serves / as a static file, so production can be completely down while / stays green. The new unauthenticated Express route is the uptime-check target; it mounts ahead of the request logger and session middleware. A rejected main() now logs and exits non-zero instead of leaving a zombie process that never binds the port. #693 stays open for the ops-side Cloud Monitoring setup (channel, uptime checks, alert policies).

Preview render isolation (#694). Server-side PNG previews ran the whole protobuf-to-SVG-to-PNG pipeline synchronously on the Express event loop through the shared WASM backend -- one pathological model stalled every request on the instance (GAE multiplexes 100). Restored the 2022-style worker_threads isolation: per-request worker with its own WASM instance, a 2-slot FIFO limiter, and a 10s total wall-clock budget that includes queue wait (a request whose deadline lapses while queued rejects without spawning a doomed worker). app.yaml gains automatic_scaling.max_instances: 8 as a cost cap, the config validator refuses to deploy if .app.prod.yaml lacks it, and the build-verification scripts assert lib/render-worker.js ships.

Cross-origin embeds (#688). The embeddable web component was broken on third-party pages: the engine worker chunk resolves to an absolute app.simlin.com URL and the Worker constructor enforces same-origin regardless of CORS. The engine now boots the worker through a same-origin blob trampoline when the resolved chunk URL is cross-origin, and the /static GAE handler serves Access-Control-Allow-Origin: * so the trampoline's import and the worker's WASM fetch pass CORS. The same-origin main-app path constructs through the native Worker with identical arguments (verified against a rebuilt bundle during review). The remaining browser-only semantics are covered by a new mandatory two-origin manual smoke check in deploy.md; #825 tracks an automated Playwright harness.

Upload-count gate (#695). gcloud app deploy uploads whatever .gcloudignore leaves in, independent of git status; a clean tree could still trip GAE's hard 10,000-file cap after the expensive clean+build. Both deploy scripts now enumerate the real upload set immediately before deploying and abort with a per-directory breakdown. Together with the .gcloudignore hardening from #809, this completes #695.

Also done alongside

bpowers added 4 commits July 1, 2026 11:42
GAE serves / as a static file, so production can be fully down (every
Express instance crash-looping) while / stays green. /healthz is an
Express-routed, unauthenticated uptime-check target that reflects the
WASM engine preload via isReady(). It mounts ahead of the request
logger and session middleware so constant polling stays cheap and out
of the logs. Because initializeServerDependencies() throws before the
route mounts, a preload failure surfaces as a non-responding instance,
not a 503; the 503 branch is defense-in-depth.

Also make a rejected main() log and exit(1) instead of falling into
the unhandledRejection logger, which left a zombie process that never
bound the port; on GAE that hung instances until the port-bind timeout
instead of recycling them with a clear error. This is the in-repo half
of issue 693 (the Cloud Monitoring channel/uptime-check/alert setup
remains ops-side).
…dline

Preview PNG rendering ran the whole protobuf->SVG->PNG pipeline
synchronously on the Express event loop through the process-wide WASM
DirectBackend, so one slow or pathological model stalled every
concurrently-multiplexed request on the instance (GAE allows 100) --
a regression from the 2022 per-request worker_threads design. Restore
that isolation: each render runs in its own worker thread with its own
WASM instance, behind a 2-slot FIFO limiter, under a 10s total
wall-clock budget that includes queue wait (a request whose deadline
lapses while queued rejects without spawning a doomed worker). A WASM
panic or OOM now kills a disposable thread, not the server. Failures
stay per-request: the preview route's existing 500 path is unchanged.

Cap automatic_scaling at 8 instances in app.yaml so a render storm or
crash loop cannot fan out cost without bound, and make
validate-app-prod-config.mjs refuse to deploy if the operator-local
.app.prod.yaml lacks max_instances. verify-deploy-build.sh and the
staged-deploy verify step now assert lib/render-worker.js ships,
because a build that dropped it would 500 every preview while looking
healthy. Addresses the isolation, timeout, and instance-cap facets of
issue 694; a model-complexity cap below the 10 MB body limit remains
open there.
…nd /static CORS

The embeddable web component was broken on third-party pages: the
engine runs in a Web Worker whose chunk URL resolves (via the
component build's assetPrefix 'auto') to an absolute app.simlin.com
URL, and the Worker constructor enforces same-origin regardless of
CORS, so embeds loaded project data but never initialized the engine.

Boot the worker through a same-origin blob: trampoline when the
resolved chunk URL is cross-origin: worker-trampoline.ts holds the
pure origin-decision and trampoline-source logic plus an injectable
spawn shell that briefly intercepts the global Worker constructor to
observe the URL the bundler resolved (the new Worker(new URL(...))
expression must stay inline or rspack stops bundling the worker
chunk). The imported chunk keeps its https URL identity, so its
internal WASM fetch resolves against app.simlin.com; engine-worker.ts
overrides the runtime publicPath in the trampolined (classic,
blob-origin) context where self.location would otherwise poison it.
The same-origin path -- the main app -- constructs through the native
Worker with identical arguments, verified against a rebuilt bundle.

Serve Access-Control-Allow-Origin: * on the /static GAE handler so
the trampoline's cross-origin import and the worker's WASM fetch pass
CORS; everything under /static is public build output.
validate-app-prod-config.mjs now requires the header in the
operator-local .app.prod.yaml, and the deploy smoke list gains a
header curl plus a real two-origin embed check (the remaining
browser-only semantics are uncoverable by unit tests; issue 825
tracks an automated Playwright harness).
gcloud app deploy uploads whatever .gcloudignore leaves in, entirely
independent of git tracking status, so a git-status-clean tree can
still trip App Engine Standard's hard 10,000-file cap on gitignored
artifacts (cargo target dirs, venvs, scratch checkouts) -- and the
failure only surfaced inside gcloud, after the expensive clean+build.
check-upload-file-count.sh enumerates the real upload set once via
gcloud meta list-files-for-upload immediately before each deploy and
aborts with a per-top-level-directory breakdown when the cap would be
hit. Both deploy paths run it: the root-path deploy because that is
where the unbounded exposure lives, the staged path as a cheap guard
against staging-assembly regressions. The cap parameter is
overridable only so the failure path can be exercised by hand against
a tiny fixture. With the .gcloudignore hardening that landed earlier,
this completes issue 695.
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.87%. Comparing base (242dbde) to head (b9586a7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #826      +/-   ##
==========================================
- Coverage   90.87%   90.87%   -0.01%     
==========================================
  Files         224      224              
  Lines      143133   143133              
==========================================
- Hits       130073   130071       -2     
- Misses      13060    13062       +2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9586a71ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/server/render.ts
}
settled = true;
clearTimeout(timer);
void worker.terminate();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Await worker termination before releasing render slots

When a preview times out or a worker reports an error under a burst of render requests, this ignores the Promise returned by worker.terminate() and settles the task immediately, so the limiter releases its slot before the old thread has actually emitted exit. Because Node worker termination is asynchronous, queued requests can start new workers while timed-out CPU/WASM workers are still shutting down, defeating the intended MAX_CONCURRENT_RENDERS cap and reintroducing the worker fan-out/memory pressure this change is trying to bound.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review

Looked at the four fix bundles carefully — code, tests, and cross-file interactions. No blocking issues found.

Areas I verified as correct:

  • render.ts limiter/timeout: FIFO active++/active-- transfer is clean; settled + clearTimeout + void worker.terminate() in settle() correctly dedupes across message/error/messageerror/exit/timeout paths. The deadline-in-queue check before spawning a worker is the right shape.
  • engine-worker.ts publicPath override: The reassignment happens at module top-level before the import('./worker-server') dynamic import that triggers the WASM fetch. typeof __webpack_public_path__ === 'string' correctly guards the assignment under bundlers that don't rewrite the identifier.
  • worker-trampoline.ts blob cleanup on failure: Synchronous try/catch around new NativeWorker(blobUrl, options) revokes on construction failure; finally restores the global constructor even when spawn() throws.
  • resolveWorkerScript(): __dirname works in the CJS server build; the lib/ fallback covers ts-jest.
  • /healthz middleware ordering: Intentional and documented — GAE's edge handles HTTP→HTTPS, the response is text/plain so no CSP/CORS/session middleware needed.
  • validate-app-prod-config.mjs: Fixture reordering covers both new checks; the Number.isInteger guard correctly rejects string/float/negative values.

Overall correctness verdict: correct

@bpowers bpowers merged commit 6df1618 into main Jul 1, 2026
17 checks passed
@bpowers bpowers deleted the deploy-user-impact-fixes branch July 1, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment