Operator runbook — deployment, troubleshooting, incidents

Audience: People on call for HoloScript production surfaces (MCP mesh, Studio-adjacent APIs, absorb/orchestrator where applicable).
Goal: Repeatable checks, common failures, and rollback posture — not a substitute for provider-specific Railway/AWS docs.

1. System map (typical production)

Surface	Role	Health
`mcp.holoscript.net`	MCP HTTP + tool mesh	`GET /health`
Orchestrator (example)	Registry, federation	Team-specific `/health` URL
Absorb	Codebase intelligence	`GET /health` on absorb host
Studio	Next.js app	App `/` + API routes under `/api/*`

Exact hostnames change by environment — store canonical URLs in the team vault or internal wiki; this file stays pattern-based.

2. Deployment (Railway / container pattern)

Pre-deploy: CI green; no secrets in diff; version tag recorded.
Deploy: Use the project’s standard pipeline (e.g. Railway GitHub integration or `railway up`).
Post-deploy smoke:
- `curl -sf https://mcp.holoscript.net/health | jq .`: expect exit code 0, `ok` check statuses, and a tool count in the JSON.
- Hit one read-only MCP tool or public JSON endpoint if available.
Config: Confirm env vars for the service (API keys, CORS, rate limits) match the last known good release.

3. Health checks

Check	Pass criteria
MCP `/health`	HTTP 200; JSON includes service identity and `tools` or equivalent
Latency	p95 within SLO (define per team; e.g. < 3s for health)
WASM / heavy paths	Sample compile or parse route returns within timeout budget

Automate these in uptime monitoring; page when two regions or two consecutive checks fail.
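
The paging rule above can be sketched as a small decision function. This is a minimal illustration, not the team's monitoring configuration: the `CheckResult` type and region labels are hypothetical, and "two consecutive checks" is read here as the same region failing in the two most recent probe rounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    region: str  # hypothetical region label, e.g. "us-east"
    ok: bool     # True if the probe met its pass criteria


def should_page(rounds: list[list[CheckResult]]) -> bool:
    """Decide whether to page, given probe rounds ordered oldest to newest.

    Each round holds one result per region. Page when the latest round has
    failures in two or more regions, or when the same region failed in the
    two most recent rounds (assumed reading of "two consecutive checks").
    """
    if not rounds:
        return False
    latest_failed = {r.region for r in rounds[-1] if not r.ok}
    if len(latest_failed) >= 2:
        return True
    if len(rounds) >= 2:
        previous_failed = {r.region for r in rounds[-2] if not r.ok}
        return bool(latest_failed & previous_failed)
    return False
```

A single failure from one region does not page under this sketch, which filters out one-off network blips while still catching a sustained or multi-region outage.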

4. Common failures

Symptom	Likely cause	First actions
502 / upstream errors	Process crash, OOM, bad deploy	Check logs; roll back to the last known-good image; scale memory if OOM
Tool timeouts	Cold start, deadlock, downstream API	Increase timeout temporarily; isolate tool; disable noisy tool via feature flag
WASM init errors	Missing asset, wrong `Content-Type`, version skew	Verify static asset deploy; match CLI/runtime versions
Auth / 401 from MCP	Rotated key, clock skew	Rotate keys in vault; sync NTP; verify `Authorization` header path
Stale graph / cache	CDN or edge cache	Purge cache for affected paths; bump cache-bust query if used

5. Incident response

Declare severity (user-visible outage = SEV1; degraded = SEV2).
Mitigate: Roll back deploy or toggle feature flag before root-cause deep dive.
Communicate: Post to team board / status channel with ETA and workaround (e.g. “use stdio MCP instead of HTTP”).
Resolve: Document timeline, root cause, and follow-up issue links.
Postmortem (SEV1): Blameless notes within 5 business days.
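
For the postmortem deadline, a small helper can turn "5 business days" into a calendar date. This is a sketch that counts Monday to Friday only; it ignores public holidays, which is an assumption the team may need to adjust.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_due(incident_resolved: date) -> date:
    """Due date for SEV1 postmortem notes: 5 business days after resolution."""
    return add_business_days(incident_resolved, 5)
```

For an incident resolved on Monday 2026-04-20, `postmortem_due` returns the following Monday, 2026-04-27.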

Rollback

Revert to previous Railway deployment or previous container digest.
Invalidate CDN if static assets drifted.

Cache invalidate

Purge the edge cache for mcp.holoscript.net (or the provider equivalent) after bad JSON or WASM has been served.

6. Resource expectations (baseline)

MCP service: CPU for parse/compile bursts; memory for WASM and graph workloads — start from provider metrics, not guesses.
DB / Redis (if any): Connection limits and eviction policy documented per environment.

7. Monitoring

Logs: Structured JSON preferred; include request_id, tool, duration_ms.
Metrics: RPS, error rate, p95 latency per route; saturation (CPU, memory).
Alerts: Error rate spike, health check failure, certificate expiry.
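
A structured log line carrying the fields listed above might look like the following sketch. Only `request_id`, `tool`, and `duration_ms` come from this runbook; the `ts` field, the helper name, and the example tool name are illustrative assumptions.

```python
import json
import time
import uuid
from typing import Optional


def log_tool_call(tool: str, duration_ms: float,
                  request_id: Optional[str] = None, **extra: object) -> str:
    """Build one JSON log line with request_id, tool, and duration_ms.

    `ts` (Unix seconds) and any `extra` fields are additions for
    illustration, not a documented HoloScript log schema.
    """
    record = {
        "ts": time.time(),
        "request_id": request_id or str(uuid.uuid4()),
        "tool": tool,
        "duration_ms": round(duration_ms, 1),
        **extra,
    }
    # default=str keeps the logger from raising on non-JSON values.
    return json.dumps(record, separators=(",", ":"), default=str)
```

Calling `print(log_tool_call("example_tool", 41.27, request_id="req-1", status="ok"))` emits a single line that log pipelines can parse without regexes, and `duration_ms` can feed the per-route p95 metric directly.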

8. On-call handoff template

Window: <UTC start> – <UTC end>
Services owned: MCP / Studio / Absorb / …
Open incidents: <links or none>
In-flight deploys: <none | link>
Known risks: <e.g. DB migration tonight>
Escalation: <name + phone/slack>

9. Strategic team mode and board objective (operator checklist)

When the team shifts how it works (audit vs build vs stabilize, etc.), mode and objective on the HoloMesh board must stay in sync so autonomous agents and humans do not chase stale goals.

Canonical detail (SSOT): Strategic team modes and board objective sync — mode names, biases, and objective rules.

After changing mode, verify:

Board — GET /api/holomesh/team/{teamId}/board shows the expected mode and a short objective that matches that mode (not leftover text from last week).
IDE directive — Session hooks or team-connect may write a mode summary for the IDE:
- Windows: %TEMP%\holomesh-mode-directive.md
- macOS/Linux: $TMPDIR/holomesh-mode-directive.md (see also REST examples — Local IDE integration).
Control plane — Use one of:
- HTTP: POST /api/holomesh/team/{teamId}/mode with {"mode":"<mode>","objective":"<short string>"} (team permissions required).
- MCP: holomesh_mode_set with team_id, mode, and optional objective — MCP examples.

If mode and objective disagree, fix them in the same change so the next board marathon or team-connect --queue run reflects reality.
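
The HTTP path above can be sketched as follows. The endpoint and JSON body come from this section; the base URL, the `Bearer` authorization scheme, and the assumption that the board response exposes top-level `mode` and `objective` keys are hypothetical and should be checked against the REST examples.

```python
import json
import urllib.parse
import urllib.request


def build_mode_request(base_url: str, team_id: str, mode: str,
                       objective: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that sets mode and objective together."""
    body = json.dumps({"mode": mode, "objective": objective}).encode("utf-8")
    team = urllib.parse.quote(team_id, safe="")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/holomesh/team/{team}/mode",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


def board_matches(board: dict, mode: str, objective: str) -> bool:
    """Check a board payload for the expected mode and objective.

    Assumes top-level `mode` and `objective` keys in the board JSON.
    """
    return board.get("mode") == mode and board.get("objective") == objective
```

Sending the request with `urllib.request.urlopen(req, timeout=10)` and then checking the board with `board_matches` keeps the change and its verification in one step, so mode and objective cannot drift apart between calls.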

10. Railway-specific operations

Services managed via `deploy-railway.yml`

Service	Path filter	Railway service name
MCP Server	`packages/mcp-server/**`	`mcp-server`
Studio	`packages/studio/**`	`studio`
Marketplace API	`packages/marketplace-api/**`	`marketplace-api`
Export API	`services/export-api/**`	`export-api`
LLM Service	`services/llm-service/**`	`llm-service`
Absorb Service	`packages/absorb-service/**`	`absorb-service`

Manual deploy via GitHub Actions

Actions → Deploy to Railway → Run workflow

service: choose specific service or all
skip_preflight: use for emergency deploys only (bypasses heavy build checks)

MCP Server boot modes

Mode	Command	Use case
stdio	`pnpm start` in `packages/mcp-server`	Local IDE/Cursor MCP clients
HTTP	`node dist/http-server.js`	Railway prod, remote agents, `/health`

`/health` response schema

{
  "status": "healthy | degraded | unhealthy",
  "uptime": 12345,
  "timestamp": "2026-04-19T...",
  "checks": {
    "registry": { "status": "ok", "agentCount": 8 },
    "telemetry": { "status": "ok", "totalSpans": 42, "activeSpans": 1, "totalEvents": 100 },
    "tools": { "status": "ok", "toolCount": 118 }
  },
  "version": "6.x.x"
}
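
A monitoring probe could evaluate this payload along the following lines. This is a sketch built only from the field names in the schema above; the `min_tools` threshold and the decision to treat anything other than `healthy` as a problem are assumptions.

```python
import json


def evaluate_health(status_code: int, body: str, min_tools: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems: list[str] = []
    if status_code != 200:
        problems.append(f"HTTP {status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["body is not a JSON object"]

    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")

    checks = payload.get("checks")
    if not isinstance(checks, dict):
        checks = {}
    for name in ("registry", "telemetry", "tools"):
        check = checks.get(name)
        if not isinstance(check, dict) or check.get("status") != "ok":
            problems.append(f"check {name!r} is not ok")

    tools = checks.get("tools")
    tool_count = tools.get("toolCount") if isinstance(tools, dict) else None
    if not isinstance(tool_count, int) or tool_count < min_tools:
        problems.append(f"toolCount is {tool_count!r}, expected >= {min_tools}")
    return problems
```

Returning a list of problems rather than a boolean lets the alert message state which check failed, which shortens triage for the person on call.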

Required environment variables (mcp-server)

HOLOSCRIPT_API_KEY    # MCP auth + knowledge store
HOLOMESH_API_KEY      # HoloMesh gossip + board
DATABASE_URL          # Postgres (token store, audit log)
# SEC-T09: If Postgres uses a private CA, set one of:
#   PG_SSL_CA=/path/to/ca.pem   or   PG_SSL_CA_B64=<base64 of PEM>
# Managed cloud DBs with public roots need neither.
NODE_ENV=production
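
A pre-deploy check for these variables could be sketched as below. The variable names come from the list above; the helper itself, and the rule that `PG_SSL_CA_B64` must decode to PEM text containing a certificate header, are illustrative assumptions rather than how mcp-server validates its configuration.

```python
import base64
from typing import Mapping

REQUIRED_VARS = ("HOLOSCRIPT_API_KEY", "HOLOMESH_API_KEY", "DATABASE_URL", "NODE_ENV")


def check_env(env: Mapping[str, str]) -> list[str]:
    """Return configuration problems; an empty list means the env looks sane."""
    problems = [f"{name} is missing or empty"
                for name in REQUIRED_VARS if not env.get(name)]
    if env.get("NODE_ENV") and env["NODE_ENV"] != "production":
        problems.append(f"NODE_ENV is {env['NODE_ENV']!r}, expected 'production'")

    # Optional private-CA setting: only validated when present.
    ca_b64 = env.get("PG_SSL_CA_B64")
    if ca_b64:
        try:
            pem = base64.b64decode(ca_b64, validate=True).decode("ascii")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            problems.append("PG_SSL_CA_B64 is not valid base64 text")
        else:
            if "BEGIN CERTIFICATE" not in pem:
                problems.append("PG_SSL_CA_B64 does not decode to a PEM certificate")
    return problems
```

Running `check_env(os.environ)` in a deploy script (after `import os`) and failing the deploy on a non-empty result catches a missing key before the service boots.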

Resource baseline (mcp-server)

RAM: 512 MB min — WASM parser peaks at ~200 MB
CPU: 0.5 vCPU — burst for parse/compile; steady-state is IO-bound

Rollback via Railway UI

Railway dashboard → Project → Service → Deployments
Find last green deployment
Click ⋯ → Redeploy

Rollback via git revert

git revert <bad-commit-sha> --no-edit
git push origin main   # CI auto-deploys

Related

Strategic team modes and board objective sync
TTFHW measurement protocol
Marketplace publication readiness
Integration Hub (connector APIs)
NUMBERS.md — verification commands
Dependency Audit — monthly npm/Rust audit results
