Problem Statement
When running on Kubernetes, the OpenShell Gateway currently runs as a single-replica StatefulSet. This blocks rolling deployments (upgrades cause a full outage for existing supervisor connections), prevents horizontal scaling under load, and means any Gateway pod failure interrupts all active sandbox connections until the pod restarts.
Proposed Design
Multi-replica support requires changes across four areas, delivered in sequence. SQLite remains the supported backend for single-replica deployments (development, small, or cost-sensitive installs); PostgreSQL is required for any multi-replica deployment.
Blockers to resolve
1. Storage — SQLite cannot support multiple writers
SQLite uses per-process file locking, and the StatefulSet's `ReadWriteOnce` PVC physically prevents two pods from mounting the same volume on different nodes. The persistence layer already supports PostgreSQL (`crates/openshell-server/src/persistence/postgres.rs`). Multi-replica deployments must use PostgreSQL; the Helm chart should document this and warn when `replicaCount > 1` is set with a SQLite `dbUrl`. SQLite deployments remain on the current StatefulSet + PVC path unchanged.
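As an illustration only, and not in place of the Helm chart warning described above, the sketch below shows how a hypothetical server-side guard could reject a SQLite `dbUrl` when more than one replica is configured. `GatewayConfig`, its fields, and `validate_multi_replica` are invented for this sketch and do not exist in the codebase.

```rust
/// Hypothetical startup guard (not existing code): refuse to run more than
/// one replica against a SQLite URL. Field and type names are illustrative;
/// the chart-level warning described above is the real deliverable.
struct GatewayConfig {
    db_url: String,
    replica_count: u32,
}

fn validate_multi_replica(cfg: &GatewayConfig) -> Result<(), String> {
    let is_sqlite = cfg.db_url.starts_with("sqlite:");
    if is_sqlite && cfg.replica_count > 1 {
        return Err(format!(
            "replicaCount = {} requires a PostgreSQL dbUrl; SQLite supports a single replica only",
            cfg.replica_count
        ));
    }
    Ok(())
}

fn main() {
    let cfg = GatewayConfig {
        db_url: "sqlite:///data/openshell.db".to_string(),
        replica_count: 2,
    };
    // Expected to be rejected: SQLite cannot back more than one replica.
    if let Err(reason) = validate_multi_replica(&cfg) {
        eprintln!("configuration rejected: {reason}");
    }
}
```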
2. Reconciliation — concurrent loops cause data races
`reconcile_loop()` and `watch_loop()` run independently on every replica (`compute/mod.rs:533–542`). With multiple replicas, this causes double-deletes and conflicting updates on shared sandbox records. Replicas need a coordination mechanism so that only the relevant replica reconciles a given sandbox. The design for this is tracked separately.
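The coordination design is tracked separately; purely as one possible shape, the sketch below uses a rendezvous-style hash so that every replica can decide locally, from the sandbox ID alone, which single replica reconciles that sandbox. All names and parameters here are hypothetical, and a production version would need a hash that is stable across replicas and releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical ownership check: a sandbox is reconciled only by the replica
/// that "wins" a rendezvous hash over (sandbox_id, replica_index). Replica
/// identity and count would have to come from the deployment (for example a
/// pod ordinal or a lease table); none of these names exist in the codebase.
/// DefaultHasher is used for brevity; a real version needs a hash function
/// that is stable across replicas and Rust versions.
fn owns_sandbox(sandbox_id: &str, replica_index: u32, replica_count: u32) -> bool {
    let score = |replica: u32| {
        let mut h = DefaultHasher::new();
        sandbox_id.hash(&mut h);
        replica.hash(&mut h);
        h.finish()
    };
    (0..replica_count).max_by_key(|r| score(*r)) == Some(replica_index)
}

fn main() {
    // Each replica evaluates ownership independently and skips sandboxes it
    // does not own, avoiding double-deletes and conflicting updates.
    for sandbox in ["sb-1", "sb-2", "sb-3"] {
        let verdict = if owns_sandbox(sandbox, 0, 3) { "reconciles" } else { "skips" };
        println!("replica 0 {verdict} {sandbox}");
    }
}
```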
3. Supervisor sessions — in-memory state is not shared
The `SupervisorSessionRegistry` (`supervisor_session.rs:70`) is per-replica and in-memory. A supervisor reconnecting to a different replica after a pod restart will fail to find its session, breaking the SSH relay. Replicas need to either own a stable subset of supervisors or share session state. The design for this is tracked separately.
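As a hedged sketch of the "share session state" option only: persisting a small session-location record through the shared database would let a replica that receives a reconnect discover which replica last held the session instead of failing the lookup. `SupervisorSessionRecord` and `SharedSessionIndex` are invented names, and the in-process map below stands in for what would really be a table behind the existing persistence layer.

```rust
use std::collections::HashMap;

/// Hypothetical record describing where a supervisor session currently lives.
/// Rows like this, kept in the shared database rather than the per-replica
/// in-memory registry, would let any replica answer a reconnect by looking up
/// which replica holds (or last held) the live connection.
#[derive(Debug, Clone)]
struct SupervisorSessionRecord {
    supervisor_id: String,
    owning_replica: String,
    connected_at_unix: u64,
}

/// Minimal stand-in for a shared index; a real version would be a database
/// table, not an in-process map.
#[derive(Default)]
struct SharedSessionIndex {
    sessions: HashMap<String, SupervisorSessionRecord>,
}

impl SharedSessionIndex {
    fn register(&mut self, rec: SupervisorSessionRecord) {
        self.sessions.insert(rec.supervisor_id.clone(), rec);
    }

    /// A replica receiving a reconnect can discover the previous owner
    /// instead of failing the session lookup outright.
    fn lookup(&self, supervisor_id: &str) -> Option<&SupervisorSessionRecord> {
        self.sessions.get(supervisor_id)
    }
}

fn main() {
    let mut index = SharedSessionIndex::default();
    index.register(SupervisorSessionRecord {
        supervisor_id: "sup-42".into(),
        owning_replica: "gateway-1".into(),
        connected_at_unix: 1_700_000_000,
    });
    // A different replica can now answer "where is sup-42?" after a reconnect.
    println!("{:?}", index.lookup("sup-42"));
}
```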
4. SSH connection limits — not globally enforced
Connection slots are tracked in per-replica `Mutex<HashMap>` fields (`ssh_tunnel.rs:38`), making per-token and per-sandbox limits "per replica" rather than global. Once replica ownership is established, these limits can be enforced locally; global enforcement will require the shared database.
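For reference, a minimal sketch of the per-replica accounting pattern described above, with hypothetical names (`ConnectionSlots`, `try_acquire`); it illustrates why a per-token cap enforced this way scales with the number of replicas until accounting moves to the shared database.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative sketch of per-replica slot accounting. Field and limit names
/// are hypothetical; the real fields live on ServerState in ssh_tunnel.rs.
struct ConnectionSlots {
    per_token: Mutex<HashMap<String, u32>>,
    max_per_token: u32,
}

impl ConnectionSlots {
    fn try_acquire(&self, token: &str) -> bool {
        let mut map = self.per_token.lock().unwrap();
        let count = map.entry(token.to_string()).or_insert(0);
        if *count >= self.max_per_token {
            return false; // limit reached on this replica only
        }
        *count += 1;
        true
    }
}

fn main() {
    // With two replicas each running this code, a token capped at 4
    // connections can in fact hold up to 8 cluster-wide, because each
    // replica counts only the connections it terminates itself.
    let slots = ConnectionSlots {
        per_token: Mutex::new(HashMap::new()),
        max_per_token: 4,
    };
    let granted = (0..6).filter(|_| slots.try_acquire("token-a")).count();
    println!("granted {granted} of 6 on this replica"); // prints 4
}
```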
Phased delivery
| Phase | Change | Unblocks |
|---|---|---|
| 1 | PostgreSQL backend + Deployment (replacing StatefulSet) | everything else |
| 2 | Replica ownership for reconciliation and session routing | safe rolling deploys |
| 3 | Persistent supervisor session state | transparent pod failure recovery |
| 4 | Shared connection limit accounting | correct global per-token limits |
Phases 1 and 2 are the minimum for safe rolling deployments. Phases 3 and 4 are required for full HA.
Alternatives Considered
ReadWriteMany PVC with SQLite over NFS: Avoids the PostgreSQL dependency, but SQLite over NFS is unreliable — WAL lock propagation is slow and network failures can corrupt the database. Not recommended.
Kubernetes Lease-based leader election for reconciliation: Solves the reconciliation race but ties multi-replica behavior to Kubernetes, breaking Docker and Podman deployments. A deployment-agnostic coordination mechanism is preferred.
Agent Investigation
- PostgreSQL persistence is fully implemented at `crates/openshell-server/src/persistence/postgres.rs` — no new persistence code is needed for Phase 1.
- `reconcile_loop()` and `watch_loop()` are spawned unconditionally per replica at `compute/mod.rs:533–542`. The `sync_lock` mutex at line 1015 is process-local only.
- `SupervisorSessionRegistry` at `supervisor_session.rs:70–94` is purely in-memory with no cross-replica sharing.
- SSH connection slots are tracked in two `Mutex<HashMap>` fields on `ServerState` at `ssh_tunnel.rs:38–64`.
- The StatefulSet PVC uses `accessModes: ["ReadWriteOnce"]` (`templates/statefulset.yaml:179`), physically preventing multi-node pod scheduling with SQLite.
Checklist