Audience: People on call for HoloScript production surfaces (MCP mesh, Studio-adjacent APIs, absorb/orchestrator where applicable).
Goal: Repeatable checks, common failures, and rollback posture — not a substitute for provider-specific Railway/AWS docs.
| Surface | Role | Health |
|---|---|---|
mcp.holoscript.net |
MCP HTTP + tool mesh | GET /health |
| Orchestrator (example) | Registry, federation | Team-specific /health URL |
| Absorb | Codebase intelligence | GET /health on absorb host |
| Studio | Next.js app | App / + API routes under /api/* |
Exact hostnames change by environment — store canonical URLs in the team vault or internal wiki; this file stays pattern-based.
- Pre-deploy: CI green; no secrets in diff; version tag recorded.
- Deploy: Use the project’s standard pipeline (e.g. Railway GitHub integration or
railway up). - Post-deploy smoke:
curl -sf https://mcp.holoscript.net/health | jq .— expectok/ tool count present.- Hit one read-only MCP tool or public JSON endpoint if available.
- Config: Confirm env vars for the service (API keys, CORS, rate limits) match the last known good release.
| Check | Pass criteria |
|---|---|
MCP /health |
HTTP 200; JSON includes service identity and tools or equivalent |
| Latency | p95 within SLO (define per team; e.g. < 3s for health) |
| WASM / heavy paths | Sample compile or parse route returns within timeout budget |
Automate these in uptime monitoring; page when two regions or two consecutive checks fail.
| Symptom | Likely cause | First actions |
|---|---|---|
| 502 / upstream errors | Process crash, OOM, bad deploy | Check logs; roll back to last image; scale memory if OOM |
| Tool timeouts | Cold start, deadlock, downstream API | Increase timeout temporarily; isolate tool; disable noisy tool via feature flag |
| WASM init errors | Missing asset, wrong Content-Type, version skew |
Verify static asset deploy; match CLI/runtime versions |
| Auth / 401 from MCP | Rotated key, clock skew | Rotate keys in vault; sync NTP; verify Authorization header path |
| Stale graph / cache | CDN or edge cache | Purge cache for affected paths; bump cache-bust query if used |
- Declare severity (user-visible outage = SEV1; degraded = SEV2).
- Mitigate: Roll back deploy or toggle feature flag before root-cause deep dive.
- Communicate: Post to team board / status channel with ETA and workaround (e.g. “use stdio MCP instead of HTTP”).
- Resolve: Document timeline, root cause, and follow-up issue links.
- Postmortem (SEV1): Blameless notes within 5 business days.
- Revert to previous Railway deployment or previous container digest.
- Invalidate CDN if static assets drifted.
- Purge edge cache for
mcp.holoscript.net(or provider equivalent) after bad JSON or WASM served.
- MCP service: CPU for parse/compile bursts; memory for WASM and graph workloads — start from provider metrics, not guesses.
- DB / Redis (if any): Connection limits and eviction policy documented per environment.
- Logs: Structured JSON preferred; include
request_id,tool,duration_ms. - Metrics: RPS, error rate, p95 latency per route; saturation (CPU, memory).
- Alerts: Error rate spike, health check failure, certificate expiry.
Window: <UTC start> – <UTC end>
Services owned: MCP / Studio / Absorb / …
Open incidents: <links or none>
In-flight deploys: <none | link>
Known risks: <e.g. DB migration tonight>
Escalation: <name + phone/slack>
When the team shifts how it works (audit vs build vs stabilize, etc.), mode and objective on the HoloMesh board must stay in sync so autonomous agents and humans do not chase stale goals.
Canonical detail (SSOT): Strategic team modes and board objective sync — mode names, biases, and objective rules.
After changing mode, verify:
- Board —
GET /api/holomesh/team/{teamId}/boardshows the expectedmodeand a shortobjectivethat matches that mode (not leftover text from last week). - IDE directive — Session hooks or
team-connectmay write a mode summary for the IDE:- Windows:
%TEMP%\holomesh-mode-directive.md - macOS/Linux:
$TMPDIR/holomesh-mode-directive.md(see also REST examples — Local IDE integration).
- Windows:
- Control plane — Use one of:
- HTTP:
POST /api/holomesh/team/{teamId}/modewith{"mode":"<mode>","objective":"<short string>"}(team permissions required). - MCP:
holomesh_mode_setwithteam_id,mode, and optionalobjective— MCP examples.
- HTTP:
If mode and objective disagree, fix them in the same change so the next board marathon or team-connect --queue run reflects reality.
| Service | Path filter | Railway service name |
|---|---|---|
| MCP Server | packages/mcp-server/** |
mcp-server |
| Studio | packages/studio/** |
studio |
| Marketplace API | packages/marketplace-api/** |
marketplace-api |
| Export API | services/export-api/** |
export-api |
| LLM Service | services/llm-service/** |
llm-service |
| Absorb Service | packages/absorb-service/** |
absorb-service |
Actions → Deploy to Railway → Run workflow
service: choose specific service orallskip_preflight: use for emergency deploys only (bypasses heavy build checks)
| Mode | Command | Use case |
|---|---|---|
| stdio | pnpm start in packages/mcp-server |
Local IDE/Cursor MCP clients |
| HTTP | node dist/http-server.js |
Railway prod, remote agents, /health |
{
"status": "healthy | degraded | unhealthy",
"uptime": 12345,
"timestamp": "2026-04-19T...",
"checks": {
"registry": { "status": "ok", "agentCount": 8 },
"telemetry": { "status": "ok", "totalSpans": 42, "activeSpans": 1, "totalEvents": 100 },
"tools": { "status": "ok", "toolCount": 118 }
},
"version": "6.x.x"
}HOLOSCRIPT_API_KEY # MCP auth + knowledge store
HOLOMESH_API_KEY # HoloMesh gossip + board
DATABASE_URL # Postgres (token store, audit log)
# SEC-T09: If Postgres uses a private CA, set one of:
# PG_SSL_CA=/path/to/ca.pem or PG_SSL_CA_B64=<base64 of PEM>
# Managed cloud DBs with public roots need neither.
NODE_ENV=production- RAM: 512 MB min — WASM parser peaks at ~200 MB
- CPU: 0.5 vCPU — burst for parse/compile; steady-state is IO-bound
- Railway dashboard → Project → Service → Deployments
- Find last green deployment
- Click ⋯ → Redeploy
git revert <bad-commit-sha> --no-edit
git push origin main # CI auto-deploys- Strategic team modes and board objective sync
- TTFHW measurement protocol
- Marketplace publication readiness
- Integration Hub (connector APIs)
- NUMBERS.md — verification commands
- Dependency Audit — monthly npm/Rust audit results