2 changes: 1 addition & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
@@ -182,7 +182,7 @@ Expected port: `30051/tcp` (mapped to configurable host port, default 8080; set

Component images (server, sandbox) can reach kubelet via two paths:

**Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).
**Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task rebuilds the local gateway image before tagging and pushing it, so a fresh bootstrap should not reuse stale `openshell/gateway:dev` bits from a previous source revision.

Gateway image builds now stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate (including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`) is copied into the staged workspace there.

12 changes: 6 additions & 6 deletions architecture/README.md
@@ -38,7 +38,7 @@ flowchart TB
end

CLI -- "gRPC / HTTPS" --> SERVER
CLI -- "SSH over HTTP CONNECT" --> SERVER
CLI -- "SSH over gRPC ForwardTcp" --> SERVER
SERVER -- "CRUD + Watch" --> DB
SERVER -- "Create / Delete Pods" --> SBX
SUPERVISOR -- "Fetch Policy + Credentials + Inference Bundle" --> SERVER
@@ -129,17 +129,17 @@ The first command installs the CLI. The second command bootstraps the cluster on

For more detail, see [Cluster Bootstrap Architecture](cluster-single-node.md).

### Sandbox Connect (SSH Tunneling)
### Sandbox Connect (SSH Forwarding)

Users can open interactive terminal sessions into running sandboxes. SSH traffic is tunneled through the gateway rather than exposing sandbox pods directly on the network.

The connection flow works as follows:

1. The CLI requests a session token from the gateway.
2. The CLI opens an HTTP CONNECT tunnel to the gateway's SSH tunnel endpoint, passing the token and sandbox identifier.
3. The gateway validates the token, confirms the sandbox is running, resolves the pod's network address, and establishes a TCP connection to the sandbox's embedded SSH server.
4. A cryptographic handshake (HMAC-verified) confirms the gateway's identity to the sandbox.
5. The CLI and sandbox exchange SSH traffic bidirectionally through the tunnel.
2. The CLI opens a bidirectional gRPC `ForwardTcp` stream with `target.ssh`, passing the token and sandbox identifier.
3. The gateway validates the token, confirms the sandbox is ready, and asks the already-connected supervisor to open an SSH-targeted relay.
4. The supervisor connects the relay to the sandbox's embedded SSH server over the local Unix socket.
5. The CLI and sandbox exchange SSH traffic bidirectionally through the gRPC stream and supervisor relay.

This design provides several benefits:

21 changes: 11 additions & 10 deletions architecture/gateway-security.md
@@ -234,16 +234,17 @@ The sandbox calls two RPCs over this authenticated channel:
- `GetSandboxSettings` -- fetches the YAML policy that governs the sandbox's behavior.
- `GetSandboxProviderEnvironment` -- fetches provider credentials as environment variables.

## SSH Tunnel Authentication
## SSH Forward Authentication

SSH connections into sandboxes pass through the gateway's HTTP CONNECT tunnel at `/connect/ssh`. This adds a second authentication layer on top of mTLS.
SSH connections into sandboxes pass through the gateway's bidirectional gRPC `ForwardTcp` stream with `target.ssh`. This adds a second authorization layer on top of gateway mTLS.

### Request Headers
### Forward Initialization

| Header | Purpose |
| Field | Purpose |
|---|---|
| `x-sandbox-id` | Identifies the target sandbox |
| `x-sandbox-token` | Session token (created via `CreateSshSession` RPC) |
| `sandbox_id` | Identifies the target sandbox |
| `target.ssh` | Requests the built-in SSH Unix-socket target |
| `authorization_token` | Session token created via `CreateSshSession` |
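
As an illustration of the three initialization fields, here is a minimal Rust sketch. The type and function names (`ForwardTarget`, `InitError`, `check_init`) are assumptions made for this example and are not taken from the repository; the real message is a protobuf `TcpForwardInit`, and the real gateway validates the token against the stored `SshSession` record rather than only checking for empty fields.

```rust
/// Hypothetical mirror of the forward target; only the SSH variant is
/// described in the text (`target.ssh`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ForwardTarget {
    /// Requests the built-in SSH Unix-socket target.
    Ssh,
}

/// Hypothetical plain-Rust mirror of the `TcpForwardInit` fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TcpForwardInit {
    pub sandbox_id: String,
    pub target: ForwardTarget,
    pub authorization_token: String,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum InitError {
    MissingSandboxId,
    MissingToken,
}

/// Cheap structural pre-check, run before any session lookup.
/// It does NOT replace validation against the stored session record.
pub fn check_init(init: &TcpForwardInit) -> Result<(), InitError> {
    if init.sandbox_id.trim().is_empty() {
        return Err(InitError::MissingSandboxId);
    }
    if init.authorization_token.trim().is_empty() {
        return Err(InitError::MissingToken);
    }
    Ok(())
}
```

Rejecting structurally empty requests first keeps the session lookup off the path for obviously malformed streams; whether the gateway actually orders its checks this way is not stated in the source.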

The gateway validates the token against the stored `SshSession` record and checks:

@@ -269,16 +270,16 @@ The gateway enforces two concurrent connection limits to bound the impact of cre
| Per-token | 10 concurrent tunnels | Limits damage from a single leaked token |
| Per-sandbox | 20 concurrent tunnels | Prevents bypass via creating many tokens for one sandbox |

These limits are tracked in-memory and decremented when tunnels close. Exceeding either limit returns HTTP 429 (Too Many Requests).
These limits are tracked in-memory and decremented when streams close. Exceeding either limit returns gRPC `ResourceExhausted`.
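
The in-memory tracking described here could look roughly like the following Rust sketch. The limits (10 per token, 20 per sandbox) come from the table; everything else (`ConnectionLimiter`, `LimitError`, the method names, the single-threaded `&mut self` API) is hypothetical. A real gateway would need a lock or atomic counters and would map the error to gRPC `ResourceExhausted`.

```rust
use std::collections::HashMap;

/// Limits taken from the table above.
const MAX_PER_TOKEN: usize = 10;
const MAX_PER_SANDBOX: usize = 20;

/// Hypothetical error; a gateway would translate either variant
/// into gRPC `ResourceExhausted`.
#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    TokenLimit,
    SandboxLimit,
}

/// Hypothetical in-memory tracker of open forward streams.
#[derive(Default)]
pub struct ConnectionLimiter {
    per_token: HashMap<String, usize>,
    per_sandbox: HashMap<String, usize>,
}

impl ConnectionLimiter {
    /// Reserve a slot. Both limits are checked before either counter
    /// is incremented, so a rejected request leaves no partial state.
    pub fn acquire(&mut self, token: &str, sandbox_id: &str) -> Result<(), LimitError> {
        let token_count = self.per_token.get(token).copied().unwrap_or(0);
        if token_count >= MAX_PER_TOKEN {
            return Err(LimitError::TokenLimit);
        }
        let sandbox_count = self.per_sandbox.get(sandbox_id).copied().unwrap_or(0);
        if sandbox_count >= MAX_PER_SANDBOX {
            return Err(LimitError::SandboxLimit);
        }
        *self.per_token.entry(token.to_string()).or_insert(0) += 1;
        *self.per_sandbox.entry(sandbox_id.to_string()).or_insert(0) += 1;
        Ok(())
    }

    /// Release a slot when the stream closes.
    pub fn release(&mut self, token: &str, sandbox_id: &str) {
        Self::decrement(&mut self.per_token, token);
        Self::decrement(&mut self.per_sandbox, sandbox_id);
    }

    /// Decrement a counter and drop the entry once it reaches zero,
    /// so the maps do not grow with every token ever seen.
    fn decrement(map: &mut HashMap<String, usize>, key: &str) {
        let now_zero = match map.get_mut(key) {
            Some(count) => {
                *count = count.saturating_sub(1);
                *count == 0
            }
            None => false,
        };
        if now_zero {
            map.remove(key);
        }
    }
}
```

In this sketch the caller is responsible for calling `release` exactly once per successful `acquire`; tying the release to a guard's `Drop` implementation would be a natural refinement.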

### Supervisor-Initiated Relay Model

The gateway never dials the sandbox. Instead, the sandbox supervisor opens an outbound `ConnectSupervisor` bidirectional gRPC stream to the gateway on startup and keeps it alive for the sandbox lifetime. SSH traffic for `/connect/ssh` (and exec traffic for `ExecSandbox`) rides this same TCP+TLS+HTTP/2 connection as separate multiplexed HTTP/2 streams. The gateway-side registry and `RelayStream` handler live in `crates/openshell-server/src/supervisor_session.rs`; the supervisor-side bridge lives in `crates/openshell-sandbox/src/supervisor_session.rs`.
The gateway never dials the sandbox. Instead, the sandbox supervisor opens an outbound `ConnectSupervisor` bidirectional gRPC stream to the gateway on startup and keeps it alive for the sandbox lifetime. SSH traffic for `ForwardTcp(target.ssh)` and exec traffic for `ExecSandbox` ride this same TCP+TLS+HTTP/2 connection as separate multiplexed HTTP/2 streams. The gateway-side registry and `RelayStream` handler live in `crates/openshell-server/src/supervisor_session.rs`; the supervisor-side bridge lives in `crates/openshell-sandbox/src/supervisor_session.rs`.

Per-connection flow:

1. CLI presents `x-sandbox-id` + `x-sandbox-token` at `/connect/ssh` and passes gateway token validation.
2. Gateway calls `SupervisorSessionRegistry::open_relay(sandbox_id, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout (30 s for `/connect/ssh`, 15 s for `ExecSandbox`).
1. CLI opens `ForwardTcp` with `TcpForwardInit { sandbox_id, target.ssh, authorization_token }` and passes gateway token validation.
2. Gateway calls `SupervisorSessionRegistry::open_relay_with_target(sandbox_id, SshRelayTarget, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout.
3. The supervisor opens a new `RelayStream` RPC on the same `Channel` — a new HTTP/2 stream, no new TCP connection and no new TLS handshake. The first `RelayFrame` is a `RelayInit { channel_id }` that claims the pending slot on the gateway.
4. `claim_relay` pairs the gateway-side waiter with the supervisor-side RPC via a `tokio::io::duplex(64 KiB)` pair. Subsequent `RelayFrame::data` frames carry raw SSH bytes in both directions. The supervisor is a dumb byte bridge: it has no protocol awareness of the SSH bytes flowing through.
5. Inside the sandbox pod, the supervisor connects the relay to sshd over a Unix domain socket at `/run/openshell/ssh.sock` (see `crates/openshell-driver-kubernetes/src/main.rs`).
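
The open/claim pairing in steps 2 to 4 can be sketched with standard-library channels. This is a simplified, synchronous model under stated assumptions: `PendingRelays`, `RelayEnd`, and the counter-based channel IDs are hypothetical, `std::sync::mpsc` stands in for the `tokio::io::duplex` pair, and the `RelayOpen` message, timeouts, and backoff polling are omitted.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// One end of a bidirectional byte pipe (stand-in for a duplex stream half).
pub struct RelayEnd {
    pub tx: Sender<Vec<u8>>,
    pub rx: Receiver<Vec<u8>>,
}

/// Build two connected ends: bytes sent on one arrive on the other.
fn duplex() -> (RelayEnd, RelayEnd) {
    let (a_tx, b_rx) = channel();
    let (b_tx, a_rx) = channel();
    (RelayEnd { tx: a_tx, rx: a_rx }, RelayEnd { tx: b_tx, rx: b_rx })
}

/// Hypothetical registry of relays that were opened but not yet claimed.
#[derive(Default)]
pub struct PendingRelays {
    next_id: u64,
    pending: HashMap<String, RelayEnd>,
}

impl PendingRelays {
    /// Gateway side: allocate a channel id and park the supervisor's end
    /// until a `RelayInit { channel_id }` arrives. The document describes
    /// UUID channel ids; a counter keeps this sketch dependency-free.
    pub fn open_relay(&mut self) -> (String, RelayEnd) {
        self.next_id += 1;
        let channel_id = format!("chan-{}", self.next_id);
        let (gateway_end, supervisor_end) = duplex();
        self.pending.insert(channel_id.clone(), supervisor_end);
        (channel_id, gateway_end)
    }

    /// Supervisor side: claim the pending slot exactly once.
    /// Unknown or already-claimed ids yield `None`.
    pub fn claim_relay(&mut self, channel_id: &str) -> Option<RelayEnd> {
        self.pending.remove(channel_id)
    }
}
```

Removing the entry on claim makes each `channel_id` single-use in this model, so a replayed `RelayInit` cannot attach to an existing relay.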