diff --git a/architecture/README.md b/architecture/README.md index 36b0a4978..8f7ba22ce 100644 --- a/architecture/README.md +++ b/architecture/README.md @@ -38,7 +38,7 @@ flowchart TB end CLI -- "gRPC / HTTPS" --> SERVER - CLI -- "SSH over HTTP CONNECT" --> SERVER + CLI -- "SSH over gRPC ForwardTcp" --> SERVER SERVER -- "CRUD + Watch" --> DB SERVER -- "Create / Delete Pods" --> SBX SUPERVISOR -- "Fetch Policy + Credentials + Inference Bundle" --> SERVER @@ -129,17 +129,17 @@ The first command installs the CLI. The second command bootstraps the cluster on For more detail, see [Cluster Bootstrap Architecture](cluster-single-node.md). -### Sandbox Connect (SSH Tunneling) +### Sandbox Connect (SSH Forwarding) Users can open interactive terminal sessions into running sandboxes. SSH traffic is tunneled through the gateway rather than exposing sandbox pods directly on the network. The connection flow works as follows: 1. The CLI requests a session token from the gateway. -2. The CLI opens an HTTP CONNECT tunnel to the gateway's SSH tunnel endpoint, passing the token and sandbox identifier. -3. The gateway validates the token, confirms the sandbox is running, resolves the pod's network address, and establishes a TCP connection to the sandbox's embedded SSH server. -4. A cryptographic handshake (HMAC-verified) confirms the gateway's identity to the sandbox. -5. The CLI and sandbox exchange SSH traffic bidirectionally through the tunnel. +2. The CLI opens a bidirectional gRPC `ForwardTcp` stream with `target.ssh`, passing the token and sandbox identifier. +3. The gateway validates the token, confirms the sandbox is ready, and asks the already-connected supervisor to open an SSH-targeted relay. +4. The supervisor connects the relay to the sandbox's embedded SSH server over the local Unix socket. +5. The CLI and sandbox exchange SSH traffic bidirectionally through the gRPC stream and supervisor relay. 
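The init handshake in step 2 can be sketched with simplified stand-ins for the generated proto types (the real `TcpForwardInit` / `TcpForwardFrame` messages live in `proto/openshell.proto`; the field shapes below are illustrative, not the actual `openshell.v1` definitions):

```rust
// Illustrative stand-ins for the generated proto types; shapes are
// simplified and hypothetical, not the real openshell.v1 definitions.
#[derive(Debug, Clone, PartialEq)]
enum ForwardTarget {
    Ssh,                             // target.ssh: built-in SSH Unix-socket target
    Tcp { host: String, port: u16 }, // target.tcp: loopback service target
}

#[derive(Debug, Clone, PartialEq)]
struct TcpForwardInit {
    sandbox_id: String,
    target: ForwardTarget,
    authorization_token: String, // session token from CreateSshSession
}

enum TcpForwardFrame {
    Init(TcpForwardInit),
    Data(Vec<u8>),
}

// The gateway requires the first frame on a ForwardTcp stream to be an
// init frame carrying a token; anything else is a protocol error.
fn take_init(first: TcpForwardFrame) -> Result<TcpForwardInit, &'static str> {
    match first {
        TcpForwardFrame::Init(init) if !init.authorization_token.is_empty() => Ok(init),
        TcpForwardFrame::Init(_) => Err("missing authorization_token"),
        TcpForwardFrame::Data(_) => Err("first frame must be TcpForwardInit"),
    }
}
```

After a successful `take_init`, subsequent `Data` frames carry raw SSH bytes in both directions, as in step 5.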
This design provides several benefits: diff --git a/architecture/gateway-security.md b/architecture/gateway-security.md index a32c3fb52..9eb29896d 100644 --- a/architecture/gateway-security.md +++ b/architecture/gateway-security.md @@ -234,16 +234,17 @@ The sandbox calls two RPCs over this authenticated channel: - `GetSandboxSettings` -- fetches the YAML policy that governs the sandbox's behavior. - `GetSandboxProviderEnvironment` -- fetches provider credentials as environment variables. -## SSH Tunnel Authentication +## SSH Forward Authentication -SSH connections into sandboxes pass through the gateway's HTTP CONNECT tunnel at `/connect/ssh`. This adds a second authentication layer on top of mTLS. +SSH connections into sandboxes pass through the gateway's bidirectional gRPC `ForwardTcp` stream with `target.ssh`. This adds a second authorization layer on top of gateway mTLS. -### Request Headers +### Forward Initialization -| Header | Purpose | +| Field | Purpose | |---|---| -| `x-sandbox-id` | Identifies the target sandbox | -| `x-sandbox-token` | Session token (created via `CreateSshSession` RPC) | +| `sandbox_id` | Identifies the target sandbox | +| `target.ssh` | Requests the built-in SSH Unix-socket target | +| `authorization_token` | Session token created via `CreateSshSession` | The gateway validates the token against the stored `SshSession` record and checks: @@ -269,16 +270,16 @@ The gateway enforces two concurrent connection limits to bound the impact of cre | Per-token | 10 concurrent tunnels | Limits damage from a single leaked token | | Per-sandbox | 20 concurrent tunnels | Prevents bypass via creating many tokens for one sandbox | -These limits are tracked in-memory and decremented when tunnels close. Exceeding either limit returns HTTP 429 (Too Many Requests). +These limits are tracked in-memory and decremented when streams close. Exceeding either limit returns gRPC `ResourceExhausted`. 
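The dual-limit accounting above can be sketched as a small in-memory counter pair (hypothetical type and method names; the real gateway tracks this in its own state, and the 10/20 values are the ones from the table above):

```rust
use std::collections::HashMap;

// Hypothetical sketch of per-token / per-sandbox stream accounting.
// Counters are decremented on release, mirroring "decremented when
// streams close"; a failed acquire maps to gRPC ResourceExhausted.
struct StreamLimits {
    per_token: HashMap<String, u32>,
    per_sandbox: HashMap<String, u32>,
    max_per_token: u32,
    max_per_sandbox: u32,
}

impl StreamLimits {
    fn new(max_per_token: u32, max_per_sandbox: u32) -> Self {
        Self {
            per_token: HashMap::new(),
            per_sandbox: HashMap::new(),
            max_per_token,
            max_per_sandbox,
        }
    }

    // Check both limits before bumping either counter, so a rejected
    // acquire leaves the counts unchanged.
    fn acquire(&mut self, token: &str, sandbox: &str) -> Result<(), &'static str> {
        let t = *self.per_token.get(token).unwrap_or(&0);
        let s = *self.per_sandbox.get(sandbox).unwrap_or(&0);
        if t >= self.max_per_token {
            return Err("per-token concurrent stream limit");
        }
        if s >= self.max_per_sandbox {
            return Err("per-sandbox concurrent stream limit");
        }
        *self.per_token.entry(token.to_string()).or_insert(0) += 1;
        *self.per_sandbox.entry(sandbox.to_string()).or_insert(0) += 1;
        Ok(())
    }

    fn release(&mut self, token: &str, sandbox: &str) {
        if let Some(t) = self.per_token.get_mut(token) { *t = t.saturating_sub(1); }
        if let Some(s) = self.per_sandbox.get_mut(sandbox) { *s = s.saturating_sub(1); }
    }
}
```

The per-sandbox cap is what stops a caller from sidestepping the per-token cap by minting many tokens for one sandbox.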
### Supervisor-Initiated Relay Model -The gateway never dials the sandbox. Instead, the sandbox supervisor opens an outbound `ConnectSupervisor` bidirectional gRPC stream to the gateway on startup and keeps it alive for the sandbox lifetime. SSH traffic for `/connect/ssh` (and exec traffic for `ExecSandbox`) rides this same TCP+TLS+HTTP/2 connection as separate multiplexed HTTP/2 streams. The gateway-side registry and `RelayStream` handler live in `crates/openshell-server/src/supervisor_session.rs`; the supervisor-side bridge lives in `crates/openshell-sandbox/src/supervisor_session.rs`. +The gateway never dials the sandbox. Instead, the sandbox supervisor opens an outbound `ConnectSupervisor` bidirectional gRPC stream to the gateway on startup and keeps it alive for the sandbox lifetime. SSH traffic for `ForwardTcp(target.ssh)` and exec traffic for `ExecSandbox` ride this same TCP+TLS+HTTP/2 connection as separate multiplexed HTTP/2 streams. The gateway-side registry and `RelayStream` handler live in `crates/openshell-server/src/supervisor_session.rs`; the supervisor-side bridge lives in `crates/openshell-sandbox/src/supervisor_session.rs`. Per-connection flow: -1. CLI presents `x-sandbox-id` + `x-sandbox-token` at `/connect/ssh` and passes gateway token validation. -2. Gateway calls `SupervisorSessionRegistry::open_relay(sandbox_id, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout (30 s for `/connect/ssh`, 15 s for `ExecSandbox`). +1. CLI opens `ForwardTcp` with `TcpForwardInit { sandbox_id, target.ssh, authorization_token }` and passes gateway token validation. +2. 
Gateway calls `SupervisorSessionRegistry::open_relay_with_target(sandbox_id, SshRelayTarget, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout. 3. The supervisor opens a new `RelayStream` RPC on the same `Channel` — a new HTTP/2 stream, no new TCP connection and no new TLS handshake. The first `RelayFrame` is a `RelayInit { channel_id }` that claims the pending slot on the gateway. 4. `claim_relay` pairs the gateway-side waiter with the supervisor-side RPC via a `tokio::io::duplex(64 KiB)` pair. Subsequent `RelayFrame::data` frames carry raw SSH bytes in both directions. The supervisor is a dumb byte bridge: it has no protocol awareness of the SSH bytes flowing through. 5. Inside the sandbox pod, the supervisor connects the relay to sshd over a Unix domain socket at `/run/openshell/ssh.sock` (see `crates/openshell-driver-kubernetes/src/main.rs`). diff --git a/architecture/gateway.md b/architecture/gateway.md index e83640a43..65f9e7a90 100644 --- a/architecture/gateway.md +++ b/architecture/gateway.md @@ -2,9 +2,9 @@ ## Overview -`openshell-server` is the gateway -- the central control plane for a cluster. It exposes two gRPC services (OpenShell and Inference) and HTTP endpoints on a single multiplexed port, manages sandbox lifecycle through a pluggable compute driver, persists state in SQLite or Postgres, and brokers SSH access into sandboxes through supervisor-initiated relay streams. The gateway coordinates all interactions between clients, the compute backend, and the persistence layer. +`openshell-server` is the gateway -- the central control plane for a cluster. 
It exposes two gRPC services (OpenShell and Inference) and HTTP endpoints on a single multiplexed port, manages sandbox lifecycle through a pluggable compute driver, persists state in SQLite or Postgres, and brokers SSH and service access into sandboxes through supervisor-initiated relay streams. The gateway coordinates all interactions between clients, the compute backend, and the persistence layer. -Each sandbox supervisor opens a persistent inbound gRPC session (`ConnectSupervisor`); the gateway multiplexes per-invocation `RelayStream` RPCs onto the same HTTP/2 connection to move bytes between clients and the in-sandbox SSH Unix socket. The gateway does not need to know, resolve, or reach the sandbox's network address. +Each sandbox supervisor opens a persistent inbound gRPC session (`ConnectSupervisor`); the gateway multiplexes per-invocation `RelayStream` RPCs onto the same HTTP/2 connection to move bytes between clients and explicit in-sandbox targets such as the SSH Unix socket or a loopback TCP service. The gateway does not need to know, resolve, or reach the sandbox's network address. ## Architecture Diagram @@ -21,8 +21,8 @@ graph TD NAV["OpenShellServer
(OpenShell service)"] INF["InferenceServer
(Inference service)"] HTTP["HTTP Router
(Axum)"] - HEALTH["Health Endpoints"] - SSH_TUNNEL["SSH Tunnel
(/connect/ssh)"] + AUTH["Browser Auth
(/auth/connect)"] + WS["WebSocket Tunnel
(/_ws_tunnel)"] SUP_REG["SupervisorSessionRegistry"] STORE["Store
(SQLite / Postgres)"] COMPUTE["ComputeRuntime"] @@ -40,13 +40,11 @@ graph TD MUX -->|"other"| HTTP GRPC_ROUTER -->|"/openshell.inference.v1.Inference/*"| INF GRPC_ROUTER -->|"all other paths"| NAV - HTTP --> HEALTH - HTTP --> SSH_TUNNEL + HTTP --> AUTH + HTTP --> WS NAV --> STORE NAV --> COMPUTE NAV --> SUP_REG - SSH_TUNNEL --> STORE - SSH_TUNNEL --> SUP_REG INF --> STORE COMPUTE --> DRIVER COMPUTE --> STORE @@ -65,12 +63,11 @@ graph TD | Gateway runtime | `crates/openshell-server/src/lib.rs` | `ServerState` struct, `run_server()` accept loop | | Protocol mux | `crates/openshell-server/src/multiplex.rs` | `MultiplexService`, `MultiplexedService`, `GrpcRouter`, `BoxBody`, HTTP/2 adaptive-window tuning | | gRPC: OpenShell | `crates/openshell-server/src/grpc/mod.rs` | `OpenShellService` trait impl -- dispatches to per-concern handlers | -| gRPC: Sandbox/Exec | `crates/openshell-server/src/grpc/sandbox.rs` | Sandbox CRUD, `ExecSandbox`, SSH session handlers, relay-backed exec proxy | +| gRPC: Sandbox/Exec/Forward | `crates/openshell-server/src/grpc/sandbox.rs` | Sandbox CRUD, `ExecSandbox`, SSH session handlers, relay-backed exec proxy, `ForwardTcp` streams for SSH (`target.ssh`) and service forwarding (`target.tcp`) | | gRPC: Inference | `crates/openshell-server/src/inference.rs` | `InferenceService` -- cluster inference config and sandbox bundle delivery | | Supervisor sessions | `crates/openshell-server/src/supervisor_session.rs` | `SupervisorSessionRegistry`, `handle_connect_supervisor`, `handle_relay_stream`, reaper | -| HTTP | `crates/openshell-server/src/http.rs` | Health endpoints, merged with SSH tunnel router | +| HTTP | `crates/openshell-server/src/http.rs` | HTTP router for browser auth and WebSocket tunnel endpoints; health and metrics routers for dedicated listeners | | Browser auth | `crates/openshell-server/src/auth.rs` | Cloudflare browser login relay at `/auth/connect` | -| SSH tunnel | `crates/openshell-server/src/ssh_tunnel.rs` | HTTP CONNECT 
handler at `/connect/ssh` backed by `open_relay` | | WS tunnel | `crates/openshell-server/src/ws_tunnel.rs` | WebSocket tunnel handler at `/_ws_tunnel` for Cloudflare-fronted clients | | TLS | `crates/openshell-server/src/tls.rs` | `TlsAcceptor` wrapping rustls with ALPN | | Persistence | `crates/openshell-server/src/persistence/mod.rs` | `Store` enum (SQLite/Postgres), generic object CRUD, protobuf codec | @@ -86,7 +83,7 @@ Proto definitions consumed by the gateway: | Proto file | Package | Defines | |------------|---------|---------| -| `proto/openshell.proto` | `openshell.v1` | `OpenShell` service, public sandbox resource model, provider/SSH/watch/policy messages, supervisor session messages (`ConnectSupervisor`, `RelayStream`, `RelayFrame`) | +| `proto/openshell.proto` | `openshell.v1` | `OpenShell` service, public sandbox resource model, provider/SSH/watch/policy messages, CLI service-forward messages (`ForwardTcp`, `TcpForwardFrame`), supervisor session messages (`ConnectSupervisor`, `RelayStream`, `RelayFrame`) | | `proto/compute_driver.proto` | `openshell.compute.v1` | Internal `ComputeDriver` service, driver-native sandbox observations, compute watch stream envelopes | | `proto/inference.proto` | `openshell.inference.v1` | `Inference` service: `SetClusterInference`, `GetClusterInference`, `GetInferenceBundle` | | `proto/datamodel.proto` | `openshell.datamodel.v1` | `Provider` | @@ -109,7 +106,7 @@ The gateway boots in `cli::run_cli` (`crates/openshell-server/src/cli.rs`) and p 3. Build `ServerState` (shared via `Arc` across all handlers), including a fresh `SupervisorSessionRegistry`. 4. **Spawn background tasks**: - `ComputeRuntime::spawn_watchers` -- consumes the compute-driver watch stream, republishes platform events, and runs a periodic `ListSandboxes` snapshot reconcile. - - `ssh_tunnel::spawn_session_reaper` -- sweeps expired or revoked SSH session tokens from the store hourly. 
+ - `ssh_sessions::spawn_session_reaper` -- sweeps expired or revoked SSH session tokens from the store hourly. - `supervisor_session::spawn_relay_reaper` -- sweeps orphaned pending relay channels every 30 seconds. 5. Create `MultiplexService`. 6. Bind `TcpListener` on `config.bind_address`. @@ -145,9 +142,8 @@ All configuration is via CLI flags with environment variable fallbacks. The `--d | `--vm-tls-key` | `OPENSHELL_VM_TLS_KEY` | None | Client private key copied into VM guests for gateway mTLS | | `--ssh-gateway-host` | `OPENSHELL_SSH_GATEWAY_HOST` | `127.0.0.1` | Public hostname returned in SSH session responses | | `--ssh-gateway-port` | `OPENSHELL_SSH_GATEWAY_PORT` | `8080` | Public port returned in SSH session responses | -| `--ssh-connect-path` | `OPENSHELL_SSH_CONNECT_PATH` | `/connect/ssh` | HTTP path for SSH CONNECT/upgrade | -The sandbox-side SSH listener is a Unix domain socket inside the sandbox. The path defaults to `/run/openshell/ssh.sock` and is configured on the compute driver (e.g. `openshell-driver-kubernetes --sandbox-ssh-socket-path`). The gateway never dials this socket itself; the supervisor bridges it onto a `RelayStream` when asked. +The sandbox-side SSH listener is a Unix domain socket inside the sandbox. The path defaults to `/run/openshell/ssh.sock` and is configured on the compute driver (e.g. `openshell-driver-kubernetes --sandbox-ssh-socket-path`). The gateway never dials this socket itself; the supervisor bridges it onto a `RelayStream` when `ForwardTcp` requests `target.ssh`. ## Shared State @@ -186,7 +182,7 @@ All traffic (gRPC and HTTP) shares a single TCP port. Multiplexing happens at th 1. Each accepted TCP stream (optionally TLS-wrapped) is passed to `hyper_util::server::conn::auto::Builder`, which auto-negotiates HTTP/1.1 or HTTP/2. 2. The HTTP/2 side is built with `adaptive_window(true)`. 
Hyper/h2 auto-sizes the per-stream flow-control window based on measured bandwidth-delay product, so bulk byte transfers on `RelayStream` (and `ExecSandbox` / `PushSandboxLogs`) are not throttled by the default 64 KiB window. Idle streams stay cheap; active streams grow as needed. -3. The builder calls `serve_connection_with_upgrades()`, which supports HTTP upgrades (needed for the SSH tunnel's CONNECT method). +3. The builder calls `serve_connection_with_upgrades()`, which supports HTTP upgrades used by WebSocket tunnel clients. 4. For each request, `MultiplexedService` inspects the `content-type` header: - **Starts with `application/grpc`** -- routes to `GrpcRouter`. - **Anything else** -- routes to the Axum HTTP router. @@ -218,17 +214,19 @@ When TLS is enabled (`crates/openshell-server/src/tls.rs`): The gateway brokers all byte-level access into a sandbox through a two-plane design on a single HTTP/2 connection initiated by the supervisor: -1. **Control plane** -- `ConnectSupervisor(stream SupervisorMessage) returns (stream GatewayMessage)`. Long-lived, one per sandbox. Carries `SupervisorHello`, `SessionAccepted`/`SessionRejected`, heartbeats, and `RelayOpen`/`RelayClose` control messages. -2. **Data plane** -- `RelayStream(stream RelayFrame) returns (stream RelayFrame)`. One short-lived call per SSH or exec invocation. The first inbound frame is a `RelayInit { channel_id }`; subsequent frames carry raw bytes in `RelayFrame.data` in either direction. +1. **Control plane** -- `ConnectSupervisor(stream SupervisorMessage) returns (stream GatewayMessage)`. Long-lived, one per sandbox. Carries `SupervisorHello`, `SessionAccepted`/`SessionRejected`, heartbeats, `RelayOpen`, `RelayOpenResult`, and `RelayClose` control messages. +2. **Data plane** -- `RelayStream(stream RelayFrame) returns (stream RelayFrame)`. One short-lived call per relay invocation. 
The first inbound frame is a `RelayInit { channel_id }`; subsequent frames carry raw bytes in `RelayFrame.data` in either direction. -Both RPCs are defined in `proto/openshell.proto` and ride the same TCP + TLS + HTTP/2 connection from the supervisor. No new TLS handshake, no reverse HTTP CONNECT, no direct gateway-to-pod dial. +Both RPCs are defined in `proto/openshell.proto` and ride the same TCP + TLS + HTTP/2 connection from the supervisor. No new TLS handshake, no reverse HTTP dialback, no direct gateway-to-pod dial. + +`RelayOpen` carries an optional explicit target (defined in `proto/openshell.proto`): `SshRelayTarget` for the built-in SSH socket or `TcpRelayTarget` for loopback TCP targets. Supervisors treat an absent target as SSH for wire compatibility with older gateways. ### `SupervisorSessionRegistry` `crates/openshell-server/src/supervisor_session.rs` defines `SupervisorSessionRegistry`, a single instance of which lives on `ServerState.supervisor_sessions`. It holds two maps guarded by `std::sync::Mutex`: - `sessions: HashMap` -- one entry per connected supervisor. Each `LiveSession` carries a unique `session_id`, the `mpsc::Sender` for the outbound stream, and a connection timestamp. - `pending_relays: HashMap` -- one entry per in-flight `open_relay` call awaiting the supervisor's `RelayStream` dial-back. Each `PendingRelay` wraps a `oneshot::Sender` and a creation timestamp. +- `pending_relays: HashMap` -- one entry per in-flight relay request awaiting the supervisor's `RelayStream` dial-back. Each `PendingRelay` stores the sandbox ID, the full `RelayOpen` message for reconnect replay, a creation timestamp, and a `oneshot::Sender<Result<DuplexStream, Status>>` so the waiter can receive either the paired stream or a supervisor-reported target-open failure. Core operations: @@ -236,8 +234,10 @@ Core operations: |--------|---------| | `register(sandbox_id, session_id, tx)` | Insert a live session; returns the previous session's sender (if any) so the caller can close it.
Used by `handle_connect_supervisor` when a supervisor reconnects. | | `remove_if_current(sandbox_id, session_id)` | Remove the session only if its `session_id` still matches. Guards against the supersede race where an old session's cleanup task fires after a newer session already registered. | -| `open_relay(sandbox_id, session_wait_timeout)` | Wait up to `session_wait_timeout` for a live session, allocate a fresh `channel_id` (UUID v4), insert the pending slot, send `RelayOpen { channel_id }` to the supervisor, and return `(channel_id, oneshot::Receiver)`. The receiver resolves once the supervisor's `RelayStream` arrives and `claim_relay` pairs them up. | -| `claim_relay(channel_id)` | Consume the pending slot, construct a `tokio::io::duplex(64 KiB)` pair, hand the gateway-side half to the waiter via the oneshot, and return the supervisor-side half to `handle_relay_stream`. | +| `open_relay(sandbox_id, session_wait_timeout)` | SSH-compatible wrapper around `open_relay_with_target`. It sends a `RelayOpen` with an explicit `SshRelayTarget` and returns `(channel_id, oneshot::Receiver<Result<DuplexStream, Status>>)`. | +| `open_relay_with_target(sandbox_id, target, service_id, session_wait_timeout)` | Wait up to `session_wait_timeout` for a live session, allocate a fresh `channel_id` (UUID v4), insert the pending slot, send the full `RelayOpen { channel_id, target, service_id }` to the supervisor, and return a receiver that resolves when `claim_relay` pairs a `RelayStream` or when `RelayOpenResult { success: false }` reports the target-open failure. The target can be `SshRelayTarget` or loopback-only `TcpRelayTarget`, the basis for future service-target relays. | +| `fail_pending_relay(channel_id, error)` | Remove a pending relay and complete its waiter with `Status::unavailable(error)`. Called when the supervisor sends a failed `RelayOpenResult`, so callers fail promptly instead of waiting for the 10 s relay timeout.
| +| `claim_relay(channel_id)` | Consume the pending slot, construct a `tokio::io::duplex(64 KiB)` pair, hand `Ok(gateway-side half)` to the waiter via the oneshot, and return the supervisor-side half to `handle_relay_stream`. | | `reap_expired_relays()` | Drop pending relays older than 10 s. Called by `spawn_relay_reaper` on a 30 s cadence. | Session wait uses exponential backoff from 100 ms to 2 s while polling the sessions map. Pending-relay expiry is fixed at `RELAY_PENDING_TIMEOUT = 10 s`. @@ -250,7 +250,7 @@ Lifecycle of a supervisor session: 2. Allocate a fresh `session_id` (UUID v4) and create an `mpsc::channel(64)` for the outbound stream. 3. Call `registry.register(...)`. If it returns a previous sender, log that the previous session was superseded (dropping the previous `tx` closes the old outbound stream). 4. Send `SessionAccepted { session_id, heartbeat_interval_secs: 15 }`. If the send fails, call `remove_if_current` (so a concurrent reconnect isn't evicted) and return `Internal`. -5. Spawn a session loop that `select!`s between inbound messages and a 15 s heartbeat timer. Inbound heartbeats are silent; `RelayOpenResult` is logged; `RelayClose` is logged; unknown payloads are logged as warnings. +5. Spawn a session loop that `select!`s between inbound messages and a 15 s heartbeat timer. Inbound heartbeats are silent; successful `RelayOpenResult` messages are logged; failed `RelayOpenResult` messages call `fail_pending_relay` before logging; `RelayClose` is logged; unknown payloads are logged as warnings. 6. When the loop exits (inbound EOF, inbound error, or outbound channel closed), `remove_if_current` drops the registration -- unless a newer session has already replaced it. ### `handle_relay_stream` @@ -264,50 +264,53 @@ Lifecycle of one relay call: - **Gateway → supervisor**: read up to `RELAY_STREAM_CHUNK_SIZE = 16 KiB` at a time from the duplex read-half and emit `RelayFrame { Data }` messages on an outbound `mpsc::channel(16)`. 4.
Return the outbound receiver as the RPC response stream. -### Connect Flow (SSH Tunnel) +### `ForwardTcp` Flow (SSH and Service Forwarding) + +`ForwardTcp` is the client-to-gateway byte stream for SSH and service forwarding. The first `TcpForwardFrame` must contain `TcpForwardInit`; all targets include the `authorization_token` issued by `CreateSshSession`. SSH connections use `target.ssh`, while service forwarding uses `target.tcp` with a loopback host and port. ```mermaid sequenceDiagram - participant Client as SSH client - participant GW as Gateway
(/connect/ssh) + participant Client as CLI + participant GW as Gateway
(ForwardTcp) participant Reg as SupervisorSessionRegistry participant Sup as Sandbox Supervisor - participant Daemon as In-sandbox sshd
(Unix socket) + participant Target as In-sandbox target
(SSH Unix socket or loopback TCP) - Client->>GW: CONNECT /connect/ssh
x-sandbox-id, x-sandbox-token - GW->>GW: validate session + sandbox Ready - GW->>Reg: open_relay(sandbox_id, 30s) - Reg->>Sup: GatewayMessage::RelayOpen { channel_id } + Client->>GW: ForwardTcp(TcpForwardInit { target, authorization_token }) + GW->>GW: validate sandbox Ready
validate token
validate loopback for target.tcp + GW->>Reg: open_relay_with_target(sandbox_id, target, service_id, 15s) + Reg->>Sup: GatewayMessage::RelayOpen { channel_id, target } Note over Reg: waits for RelayStream on channel_id - Sup->>Daemon: connect to Unix socket + Sup->>Target: dial SSH Unix socket or loopback TCP target + Sup-->>GW: RelayOpenResult { success/failure } + GW->>Reg: fail_pending_relay(channel_id) on failure Sup->>GW: RelayStream(RelayFrame::Init { channel_id }) GW->>Reg: claim_relay(channel_id) Reg-->>Sup: supervisor-side DuplexStream Reg-->>GW: gateway-side DuplexStream - GW-->>Client: 200 OK + HTTP upgrade - Client<<->>GW: copy_bidirectional(upgraded, duplex) + Client<<->>GW: TcpForwardFrame::Data in both directions GW<<->>Sup: RelayFrame::Data in both directions - Sup<<->>Daemon: raw SSH bytes + Sup<<->>Target: raw bytes ``` -Timeouts on the tunnel path: +Timeouts on the `ForwardTcp` path: -- `open_relay` session wait: **30 s**. A first `sandbox connect` immediately after `sandbox create` must cover the supervisor's initial TLS + gRPC handshake on a cold pod. -- `relay_rx` delivery timeout: 10 s. Covers the round-trip from the `RelayOpen` message to the supervisor's `RelayStream` dial-back. +- `open_relay_with_target` session wait: **15 s**. The gateway waits for a live supervisor session before sending `RelayOpen`. +- `relay_rx` delivery timeout: 10 s. Covers the round-trip from the `RelayOpen` message to the supervisor's `RelayOpenResult` and `RelayStream` dial-back. A failed `RelayOpenResult` completes the waiter immediately with `Unavailable`. -Per-token and per-sandbox concurrent-tunnel limits (3 and 20 respectively) are still enforced before the upgrade. +For all `ForwardTcp` targets, the gateway validates the `authorization_token` against the stored `SshSession`, rejects revoked, expired, or sandbox-mismatched tokens, and enforces per-token and per-sandbox concurrent connection limits (3 and 20 respectively). 
For `target.tcp`, the gateway also requires a loopback target host (`localhost`, `127.0.0.0/8`, or `::1`) and a port in `1..=65535`. ### Exec Flow `ExecSandbox` reuses the same machinery from `grpc/sandbox.rs`: 1. Validate the request (`sandbox_id`, `command`, env-key format, other field rules), fetch the sandbox, require `Ready` phase. -2. `state.supervisor_sessions.open_relay(&sandbox.id, 15s)` -- shorter timeout than SSH connect, because exec is typically called mid-lifetime after the supervisor session is already established. -3. Wait up to 10 s for the relay `DuplexStream`. +2. `state.supervisor_sessions.open_relay(&sandbox.id, 15s)` -- same session-wait timeout used by `ForwardTcp`, because exec is typically called mid-lifetime after the supervisor session is already established. +3. Wait up to 10 s for the relay `DuplexStream`; a failed `RelayOpenResult` returns the target-open error promptly instead of timing out. 4. `stream_exec_over_relay`: bind an ephemeral localhost TCP listener, bridge that single-use TCP socket to the relay duplex, and drive a `russh` client through the local port. The `russh` session opens a channel, executes the shell-escaped command, and streams `ExecSandboxStdout`/`ExecSandboxStderr` chunks to the caller. On completion, send `ExecSandboxExit { exit_code }`. 5. On timeout (if `timeout_seconds > 0`), emit exit code 124 (matching `timeout(1)`). -The supervisor-side SSH daemon is an SSH server bound to a Unix domain socket inside the sandbox's filesystem. Filesystem permissions on that socket are the only access-control boundary between the supervisor bridge and the daemon; all higher-level authorization is enforced at `CreateSshSession` / `ExecSandbox` in the gateway. +The supervisor-side SSH daemon is an SSH server bound to a Unix domain socket inside the sandbox's filesystem. 
Filesystem permissions on that socket are the only access-control boundary between the supervisor bridge and the daemon; all higher-level authorization is enforced by gateway RPCs (`CreateSshSession` + `ForwardTcp` for SSH, `ExecSandbox` for exec). ### Regression Coverage @@ -338,6 +341,7 @@ Defined in `proto/openshell.proto`, implemented in `crates/openshell-server/src/ | `DeleteSandbox` | Delete sandbox by name | Sets phase to `Deleting`, persists, notifies watch bus, then deletes via the compute driver. Cleans up store if the sandbox was already gone. | | `WatchSandbox` | Stream sandbox updates | Server-streaming RPC. See [Watch Sandbox Stream](#watch-sandbox-stream) below. | | `ExecSandbox` | Execute command in sandbox | Server-streaming RPC; data plane runs through `SupervisorSessionRegistry::open_relay`. See [Exec Flow](#exec-flow). | +| `ForwardTcp` | Forward one CLI-side TCP connection into a sandbox | Bidirectional stream; consumes a `CreateSshSession` token, with `target.ssh` for SSH and `target.tcp` for loopback TCP services. See [`ForwardTcp` Flow (SSH and Service Forwarding)](#forwardtcp-flow-ssh-and-service-forwarding). | #### Supervisor Session @@ -352,7 +356,7 @@ Neither RPC is called by end users. They are the private control/data plane betw | RPC | Description | |-----|-------------| -| `CreateSshSession` | Creates a session token for a `Ready` sandbox. Persists an `SshSession` record and returns gateway connection details (host, port, scheme, connect path). The resulting token is presented on the `/connect/ssh` HTTP CONNECT request. | +| `CreateSshSession` | Creates a session token for a `Ready` sandbox. Persists an `SshSession` record and returns gateway connection details (host, port, scheme) plus optional expiry. The resulting token is presented as `authorization_token` on a `ForwardTcp` stream. | | `RevokeSshSession` | Marks a session as revoked by setting `session.revoked = true` in the store. 
| #### Provider Management @@ -438,7 +442,7 @@ The `ClusterInferenceConfig` stored in the database contains only `provider_name ## HTTP Endpoints -The HTTP router (`crates/openshell-server/src/http.rs`) merges two sub-routers: +The main HTTP router (`crates/openshell-server/src/http.rs`) serves browser-auth and WebSocket tunnel endpoints on the multiplexed gateway port. Health and metrics routers are exposed on dedicated listeners when configured. ### Health Endpoints @@ -448,14 +452,6 @@ The HTTP router (`crates/openshell-server/src/http.rs`) merges two sub-routers: | `/healthz` | GET | `200 OK` (empty body) -- Kubernetes liveness probe | | `/readyz` | GET | `200 OK` with JSON `{"status": "healthy", "version": ""}` -- Kubernetes readiness probe | -### SSH Tunnel Endpoint - -| Path | Method | Response | -|------|--------|----------| -| `/connect/ssh` | CONNECT | Upgrades the connection to a bidirectional byte bridge tunneled through `SupervisorSessionRegistry::open_relay` | - -See [Connect Flow (SSH Tunnel)](#connect-flow-ssh-tunnel) for details. - ### Cloudflare Endpoints | Path | Method | Response | @@ -680,7 +676,7 @@ Supervisor session telemetry is currently emitted as plain `tracing` events from - `ResourceExhausted` for broadcast lag (missed messages). - `Cancelled` for closed broadcast channels. -- **HTTP errors**: The SSH tunnel handler returns HTTP status codes directly (`401`, `404`, `405`, `412`, `429`, `500`, `502`). `502` indicates the supervisor relay could not be opened; `429` indicates a per-token or per-sandbox concurrent-tunnel limit. +- **Forwarding errors**: `ForwardTcp` returns gRPC status codes. `Unauthenticated` indicates a missing or invalid session token; `ResourceExhausted` indicates a per-token or per-sandbox connection limit; `Unavailable` indicates the supervisor relay could not be opened. - **Connection errors**: Logged at `error` level but do not crash the gateway. 
TLS handshake failures and individual connection errors are caught and logged per-connection. diff --git a/architecture/podman-rootless-networking.md b/architecture/podman-rootless-networking.md index b267cfffa..de7e08b8c 100644 --- a/architecture/podman-rootless-networking.md +++ b/architecture/podman-rootless-networking.md @@ -250,9 +250,9 @@ A tmpfs is mounted at `/run/netns` in the container spec (`container.rs:458-463` ``` Client (CLI on user's machine) | - 1. gRPC: CreateSshSession -> gateway (returns token, connect_path) - 2. HTTP CONNECT /connect/ssh to gateway - (headers: x-sandbox-id, x-sandbox-token) + 1. gRPC: CreateSshSession -> gateway (returns token) + 2. gRPC: ForwardTcp(target.ssh) to gateway + (init: sandbox_id, authorization_token) | Gateway (host, port 8080) | @@ -364,9 +364,9 @@ Both drivers use the same reverse gRPC relay (`ConnectSupervisor` + `RelayStream | `crates/openshell-sandbox/src/sandbox/linux/netns.rs` | Inner network namespace: veth pair, IP addressing, iptables rules | | `crates/openshell-sandbox/src/proxy.rs` | HTTP CONNECT proxy: OPA policy, SSRF protection, L7 inspection | | `crates/openshell-sandbox/src/ssh.rs` | SSH daemon on Unix socket, shell process netns entry via `setns()` | -| `crates/openshell-sandbox/src/supervisor_session.rs` | gRPC ConnectSupervisor stream, RelayStream for SSH tunneling | +| `crates/openshell-sandbox/src/supervisor_session.rs` | gRPC ConnectSupervisor stream, RelayStream for SSH and TCP target bridging | | `crates/openshell-sandbox/src/grpc_client.rs` | gRPC channel to gateway (mTLS or plaintext, keep-alive, adaptive windowing) | -| `crates/openshell-server/src/ssh_tunnel.rs` | Gateway-side SSH tunnel: HTTP CONNECT endpoint, relay bridging | +| `crates/openshell-server/src/grpc/sandbox.rs` | Gateway-side `ForwardTcp` stream handling for SSH and TCP service forwarding | | `crates/openshell-server/src/supervisor_session.rs` | SupervisorSessionRegistry, relay claim/open lifecycle | | 
`crates/openshell-server/src/compute/mod.rs` | `ComputeRuntime::new_podman()` -- Podman compute driver initialization | | `crates/openshell-core/src/config.rs` | Default constants: ports, network name | diff --git a/architecture/sandbox-connect.md b/architecture/sandbox-connect.md index 499532fb9..2ab7c51ec 100644 --- a/architecture/sandbox-connect.md +++ b/architecture/sandbox-connect.md @@ -8,28 +8,47 @@ Sandbox connect provides secure remote access into running sandbox environments. 2. **Command execution** (`sandbox create -- `) -- runs a command over SSH with stdout/stderr piped back 3. **File sync** (`sandbox create --upload`) -- uploads local files into the sandbox before command execution -Gateway connectivity is **supervisor-initiated**: the gateway never dials the sandbox pod. On startup, each sandbox's supervisor opens a long-lived bidirectional gRPC stream (`ConnectSupervisor`) to the gateway and holds it for the sandbox's lifetime. **`CreateSshSession` → HTTP CONNECT and `ExecSandbox` both depend on that registration**: `open_relay` blocks until a live `ConnectSupervisor` entry exists for the `sandbox_id`; if the supervisor never registers (wrong endpoint, bad env, crash loop), the client hits the supervisor-session wait timeout instead of getting a relay. When a client asks the gateway for SSH, the gateway sends a `RelayOpen` message over that stream; the supervisor responds by initiating a `RelayStream` gRPC call that rides the same TCP+TLS+HTTP/2 connection as a new multiplexed stream. The supervisor bridges the bytes of that stream into a root-owned Unix socket where the embedded SSH daemon listens. **The in-container sshd is reached only on that local Unix socket** — the supervisor `UnixStream::connect`s to it. Do not assume the relay path terminates at a container-exposed TCP listener for sshd; any optional TCP surface is separate from the gateway relay bridge. 
+Gateway connectivity is **supervisor-initiated**: the gateway never dials the sandbox pod. On startup, each sandbox's supervisor opens a long-lived bidirectional gRPC stream (`ConnectSupervisor`) to the gateway and holds it for the sandbox's lifetime. **`CreateSshSession`, `ForwardTcp`, and `ExecSandbox` all depend on that registration**: `open_relay` blocks until a live `ConnectSupervisor` entry exists for the `sandbox_id`; if the supervisor never registers (wrong endpoint, bad env, crash loop), the client hits the supervisor-session wait timeout instead of getting a relay. When a client asks the gateway for SSH, the OpenSSH `ProxyCommand` runs `openshell ssh-proxy`, which opens a bidirectional gRPC `ForwardTcp` stream to the gateway. Its first frame is `TcpForwardInit { target.ssh, authorization_token }`, where the token comes from `CreateSshSession`. The gateway validates the token, then sends a `RelayOpen` message over `ConnectSupervisor` with an explicit `SshRelayTarget`; older targetless messages remain SSH-compatible. The supervisor validates and dials the requested target before reporting a successful `RelayOpenResult`, then initiates a `RelayStream` gRPC call that rides the same TCP+TLS+HTTP/2 connection as a new multiplexed stream. For SSH targets, the supervisor bridges the bytes of that stream into a root-owned Unix socket where the embedded SSH daemon listens. **The in-container sshd is reached only on that local Unix socket** — the supervisor `UnixStream::connect`s to it. Do not assume the relay path terminates at a container-exposed TCP listener for sshd; any optional TCP surface is separate from the gateway relay bridge. There is also a gateway-side `ExecSandbox` gRPC RPC that executes commands inside sandboxes without requiring an external SSH client. It uses the same relay mechanism. 
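The supervisor-side target handling described above — absent target treated as SSH, TCP targets validated before any dial — can be sketched as follows. Type and function names are illustrative, not the crate's actual API:

```rust
/// Illustrative stand-in for the RelayOpen target oneof described above.
#[derive(Debug, PartialEq)]
pub enum RelayTarget {
    Ssh,
    Tcp { host: String, port: u32 },
}

/// Sketch of supervisor-side resolution: a targetless RelayOpen stays
/// SSH-compatible, and TCP targets must be loopback-only with a port in
/// 1..=65535 before the supervisor dials anything.
pub fn resolve_target(target: Option<RelayTarget>) -> Result<RelayTarget, String> {
    match target {
        None => Ok(RelayTarget::Ssh), // compatibility with older gateways
        Some(RelayTarget::Ssh) => Ok(RelayTarget::Ssh),
        Some(RelayTarget::Tcp { host, port }) => {
            let loopback = matches!(host.as_str(), "127.0.0.1" | "::1" | "localhost");
            if !loopback {
                return Err(format!("non-loopback relay target rejected: {host}"));
            }
            if port == 0 || port > 65535 {
                return Err(format!("relay target port out of range: {port}"));
            }
            Ok(RelayTarget::Tcp { host, port })
        }
    }
}
```

On failure the real supervisor reports `RelayOpenResult { success: false }` with the error instead of starting a `RelayStream`.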
+The OS-88 forwarding path also carries arbitrary TCP services: `openshell forward service --target-port ` binds a local TCP listener, opens one `ForwardTcp` bidirectional gRPC stream per accepted local connection, and sends `TcpForwardInit { target.tcp, authorization_token }` with a `TcpRelayTarget` for the requested loopback port inside the sandbox. This avoids the SSH `direct-tcpip` transport and keeps gateway auth, typed routing, session-token authorization, and relay target validation in the OpenShell protocol. + ### Podman and relay environment -The **Podman** compute driver (`crates/openshell-driver-podman/src/container.rs`, `build_env` / `build_container_spec`) must inject the same **relay-critical** environment variables into the container as the Kubernetes driver: `OPENSHELL_ENDPOINT` (gateway gRPC), `OPENSHELL_SANDBOX_ID`, and `OPENSHELL_SSH_SOCKET_PATH` (Unix path the embedded sshd binds and the supervisor dials). Without `OPENSHELL_SSH_SOCKET_PATH`, the in-container `openshell-sandbox` process does not know where to create the socket; without `OPENSHELL_ENDPOINT` / `OPENSHELL_SANDBOX_ID`, the supervisor cannot complete `ConnectSupervisor`, so the gateway never has a session to target with `RelayOpen`. Driver-owned keys overwrite user spec/template env so these cannot be overridden. **Podman container readiness** (libpod `HealthConfig` in `build_container_spec`) treats the sandbox as ready when a sentinel file exists, **or** `test -S` passes on the configured `sandbox_ssh_socket_path` (**supervisor / Unix-socket path**), **or** a legacy TCP listen check on the published SSH port — so the `Ready` phase used by `CreateSshSession` and the SSH tunnel can reflect Unix-socket–based startup, not only a TCP listener. 
+The **Podman** compute driver (`crates/openshell-driver-podman/src/container.rs`, `build_env` / `build_container_spec`) must inject the same **relay-critical** environment variables into the container as the Kubernetes driver: `OPENSHELL_ENDPOINT` (gateway gRPC), `OPENSHELL_SANDBOX_ID`, and `OPENSHELL_SSH_SOCKET_PATH` (Unix path the embedded sshd binds and the supervisor dials). Without `OPENSHELL_SSH_SOCKET_PATH`, the in-container `openshell-sandbox` process does not know where to create the socket; without `OPENSHELL_ENDPOINT` / `OPENSHELL_SANDBOX_ID`, the supervisor cannot complete `ConnectSupervisor`, so the gateway never has a session to target with `RelayOpen`. Driver-owned keys overwrite user spec/template env so these cannot be overridden. **Podman container readiness** (libpod `HealthConfig` in `build_container_spec`) treats the sandbox as ready when a sentinel file exists, **or** `test -S` passes on the configured `sandbox_ssh_socket_path` (**supervisor / Unix-socket path**), **or** a legacy TCP listen check on the published SSH port — so the `Ready` phase used by `CreateSshSession` and `ForwardTcp` can reflect Unix-socket–based startup, not only a TCP listener. ## Two-Plane Architecture The supervisor and gateway maintain two logical planes over **one TCP+TLS connection**, multiplexed by HTTP/2 streams: -- **Control plane** -- the `ConnectSupervisor` bidirectional gRPC stream. Carries `SupervisorHello`, heartbeats, `RelayOpen`/`RelayClose` requests from the gateway, and `RelayOpenResult`/`RelayClose` replies from the supervisor. Lives for the lifetime of the sandbox supervisor process. -- **Data plane** -- one `RelayStream` bidirectional gRPC call per SSH connect or exec invocation. Each call is a new HTTP/2 stream on the same connection. Frames are opaque bytes except for the first frame from the supervisor, which is a typed `RelayInit { channel_id }` used to pair the stream with a pending relay slot on the gateway. 
+- **Control plane** -- the `ConnectSupervisor` bidirectional gRPC stream. Carries `SupervisorHello`, heartbeats, targetable `RelayOpen`/`RelayClose` requests from the gateway, and `RelayOpenResult`/`RelayClose` replies from the supervisor. Lives for the lifetime of the sandbox supervisor process. +- **Data plane** -- one `RelayStream` bidirectional gRPC call per accepted relay. Each call is a new HTTP/2 stream on the same connection. Frames are opaque bytes except for the first frame from the supervisor, which is a typed `RelayInit { channel_id }` used to pair the stream with a pending relay slot on the gateway. -Running both planes over one HTTP/2 connection means each relay avoids a fresh TLS handshake and benefits from a single authenticated transport boundary. Hyper/h2 `adaptive_window(true)` is enabled on both sides so bulk transfers (large file uploads, long exec stdout) aren't pinned to the default 64 KiB stream window. +Running both planes over one HTTP/2 connection means each relay avoids a fresh TLS handshake and benefits from a single authenticated transport boundary. Hyper/h2 adaptive windows are enabled on the gateway, the sandbox supervisor channel, and CLI gRPC channels so bulk transfers (large file uploads, long exec stdout) aren't pinned to the default 64 KiB stream window. The supervisor-initiated direction gives the model two properties: 1. The sandbox pod exposes no ingress surface. Network reachability is whatever the supervisor itself can reach outward. 2. Authentication reduces to one place: the existing gateway mTLS channel. There is no second application-layer handshake to design, rotate, or replay-protect. +### Targetable relay base + +`RelayOpen` is targetable but remains SSH-compatible by default. In `proto/openshell.proto`, `RelayOpen.target` is an optional `oneof` with: + +- `SshRelayTarget` -- the built-in SSH target. 
This is the explicit target used by the server-side `open_relay()` wrapper, so existing SSH connect and `ExecSandbox` flows continue to request SSH without each caller constructing the target. +- `TcpRelayTarget { host, port }` -- a supervisor-dialed TCP target inside the sandbox. The supervisor accepts only loopback hosts (`127.0.0.1`, `::1`, or `localhost`) and ports `1..=65535`. + +If `target` is absent, the supervisor treats the relay as `SshRelayTarget` for compatibility with older gateways or messages. The supervisor opens the target before sending `RelayOpenResult { success: true }`; if validation or dialing fails, it sends `success: false` with the error and does not start a `RelayStream`. + +### CLI forward service over gRPC + +**Files**: `proto/openshell.proto`, `crates/openshell-cli/src/run.rs`, `crates/openshell-server/src/grpc/sandbox.rs` + +`ForwardTcp` is a bidirectional gRPC stream between the CLI and gateway. The CLI sends `TcpForwardFrame::Init { sandbox_id, service_id, target.tcp, authorization_token }` as the first frame, followed by `TcpForwardFrame::Data` chunks from the accepted local TCP connection. The gateway validates that the sandbox exists and is `Ready`, validates the session token, validates that the target is loopback-only, calls `open_relay_with_target(TcpRelayTarget)`, waits for the supervisor's `RelayStream`, and bridges opaque bytes between the CLI stream and the relay stream. + +The spike command does not create persistent `SandboxService` records yet. It takes the target directly from CLI flags and uses the same loopback-only target restrictions that the supervisor enforces again at relay-open time. + ## Components ### CLI SSH module @@ -52,7 +71,7 @@ These helpers are re-exported from `crates/openshell-cli/src/run.rs` for backwar **File**: `crates/openshell-cli/src/main.rs` (`Commands::SshProxy`) -A top-level CLI subcommand (`ssh-proxy`) that the SSH `ProxyCommand` invokes. 
It receives `--gateway`, `--sandbox-id`, `--token`, and `--gateway-name` flags, then delegates to `sandbox_ssh_proxy()`. This process has no TTY of its own -- it pipes stdin/stdout directly to the gateway tunnel. +A top-level CLI subcommand (`ssh-proxy`) that the SSH `ProxyCommand` invokes. It receives `--gateway`, `--sandbox-id`, `--token`, and `--gateway-name` flags, then delegates to `sandbox_ssh_proxy()`. This process has no TTY of its own -- it pipes stdin/stdout directly to the gateway `ForwardTcp` stream. ### gRPC session bootstrap @@ -60,7 +79,7 @@ A top-level CLI subcommand (`ssh-proxy`) that the SSH `ProxyCommand` invokes. It Two RPCs manage SSH session tokens: -- `CreateSshSession(sandbox_id)` -- validates the sandbox exists and is `Ready`, generates a UUID token, persists an `SshSession` record, and returns the token plus gateway connection details (host, port, scheme, connect path, optional TTL). +- `CreateSshSession(sandbox_id)` -- validates the sandbox exists and is `Ready`, generates a UUID token, persists an `SshSession` record, and returns the token plus gateway connection details (host, port, scheme, optional TTL). - `RevokeSshSession(token)` -- marks the session's `revoked` flag to `true` in the persistence layer. ### Supervisor session registry @@ -76,11 +95,13 @@ Key operations: - `register(sandbox_id, session_id, tx)` -- inserts a new session and returns the previous sender if it superseded one. Used by `handle_connect_supervisor` to accept a new stream. - `remove_if_current(sandbox_id, session_id)` -- removes only if the stored `session_id` matches. Guards against the supersede race where an old session's cleanup runs after a newer session has already registered. -- `open_relay(sandbox_id, timeout)` -- called by the gateway tunnel and exec handlers. 
Waits up to `timeout` for a supervisor session to appear (with exponential backoff 100 ms → 2 s), registers a pending relay slot keyed by a fresh `channel_id`, sends `RelayOpen` to the supervisor, and returns a `oneshot::Receiver` that resolves when the supervisor claims the slot.
+- `open_relay(sandbox_id, timeout)` -- called by exec handlers. Wraps `open_relay_with_target()` with `SshRelayTarget`, waits up to `timeout` for a supervisor session to appear (with exponential backoff 100 ms → 2 s), registers a pending relay slot keyed by a fresh `channel_id`, sends `RelayOpen` to the supervisor, and returns a `oneshot::Receiver` that resolves when the supervisor claims the slot or reports target-open failure.
+- `open_relay_with_target(sandbox_id, target, service_id, timeout)` -- lower-level relay opener for explicit `RelayOpen.target` values. It stores the full `RelayOpen` in the pending slot so replay after supervisor supersede preserves the requested target.
+- `fail_pending_relay(channel_id, error)` -- removes a pending relay and wakes the caller with `Status::unavailable` when the supervisor sends `RelayOpenResult { success: false }`.
- `claim_relay(channel_id)` -- called by `handle_relay_stream` when the supervisor's first `RelayFrame::Init` arrives. Removes the pending entry, enforces a 10-second staleness bound (`RELAY_PENDING_TIMEOUT`), creates a 64 KiB `tokio::io::duplex` pair, hands the gateway-side half to the waiter, and returns the supervisor-side half to be bridged against the inbound/outbound `RelayFrame` streams.

- `reap_expired_relays()` -- bounds leaks from pending slots the supervisor never claimed (e.g., supervisor crashed between `RelayOpen` and `RelayStream`). Scheduled every 30 s by `spawn_relay_reaper()` during server startup. 
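The session-wait portion of the relay openers above can be sketched as a synchronous loop. This is a simplification of the async registry code: `probe` stands in for the registry lookup, and a waited-time counter replaces the real `tokio` sleeps:

```rust
use std::time::Duration;

/// Sketch of the open_relay wait described above: poll for a live supervisor
/// session with exponential backoff (100 ms doubling, capped at 2 s) until
/// `timeout` elapses.
pub fn wait_for_session(
    timeout: Duration,
    mut probe: impl FnMut() -> bool,
) -> Result<(), &'static str> {
    let mut waited = Duration::ZERO;
    let mut backoff = Duration::from_millis(100);
    loop {
        if probe() {
            return Ok(());
        }
        if waited >= timeout {
            return Err("supervisor session wait timed out");
        }
        // The real registry awaits an async sleep here; counting the waited
        // time keeps this sketch deterministic.
        waited += backoff;
        backoff = (backoff * 2).min(Duration::from_secs(2));
    }
}
```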
-The `ConnectSupervisor` handler (`handle_connect_supervisor`) validates `SupervisorHello`, assigns a fresh `session_id`, sends `SessionAccepted { heartbeat_interval_secs: 15 }`, spawns a loop that processes inbound messages (`Heartbeat`, `RelayOpenResult`, `RelayClose`), and emits a `GatewayHeartbeat` every 15 seconds. +The `ConnectSupervisor` handler (`handle_connect_supervisor`) validates `SupervisorHello`, assigns a fresh `session_id`, sends `SessionAccepted { heartbeat_interval_secs: 15 }`, spawns a loop that processes inbound messages (`Heartbeat`, `RelayOpenResult`, `RelayClose`), and emits a `GatewayHeartbeat` every 15 seconds. Successful `RelayOpenResult` values are informational; failed results wake the pending relay waiter via `fail_pending_relay()` instead of only being logged. ### RelayStream handler @@ -93,18 +114,19 @@ Accepts one inbound `RelayFrame` to extract `channel_id` from `RelayInit`, claim The first frame that isn't `RelayInit` is rejected (`invalid_argument`). Any non-data frame after init closes the relay. -### Gateway tunnel handler +### Gateway `ForwardTcp` handler -**File**: `crates/openshell-server/src/ssh_tunnel.rs` +**File**: `crates/openshell-server/src/grpc/sandbox.rs` (`handle_forward_tcp`) -An Axum route at `/connect/ssh` on the shared gateway port. Handles HTTP CONNECT requests by: +Handles one CLI-to-gateway bidirectional `ForwardTcp` stream by: -1. Validating the session token (present, not revoked, bound to the sandbox id in `X-Sandbox-Id`, not expired). +1. Reading the first `TcpForwardFrame` and requiring `TcpForwardInit`. 2. Confirming the sandbox is in `Ready` phase. -3. Enforcing per-token (max 3) and per-sandbox (max 20) concurrent connection limits. -4. Calling `supervisor_sessions.open_relay(sandbox_id, 30s)` -- the 30-second wait covers the supervisor's initial mTLS + `ConnectSupervisor` handshake on a freshly-scheduled pod. -5. 
Waiting up to 10 seconds for the supervisor to open its `RelayStream` and deliver the gateway-side `DuplexStream`. -6. Performing the HTTP CONNECT upgrade on the client connection and calling `copy_bidirectional` between the upgraded client socket and the relay stream. +3. Validating `authorization_token` against the `SshSession` row and enforcing per-token (max 3) and per-sandbox (max 20) concurrent connection limits. +4. For `target.tcp`, validating that the target host is loopback-only and the port is `1..=65535`. +5. Calling `supervisor_sessions.open_relay_with_target(...)` with the validated `SshRelayTarget` or `TcpRelayTarget`. +6. Waiting up to 10 seconds for the supervisor to open its `RelayStream` and deliver the gateway-side `DuplexStream`, or to report target-open failure. +7. Bridging opaque `TcpForwardFrame::Data` chunks between the CLI stream and the relay stream. There is no gateway-to-sandbox TCP dial, handshake preface, or pod-IP resolution in this path. @@ -112,23 +134,23 @@ There is no gateway-to-sandbox TCP dial, handshake preface, or pod-IP resolution **File**: `crates/openshell-server/src/multiplex.rs` -The gateway runs a single listener that multiplexes gRPC and HTTP on the same port. `MultiplexedService` routes based on the `content-type` header: requests with `application/grpc` go to the gRPC router; all others (including HTTP CONNECT) go to the HTTP router. The HTTP router (`crates/openshell-server/src/http.rs`) merges health endpoints with the SSH tunnel router. Hyper is configured with `http2().adaptive_window(true)` so the HTTP/2 stream windows grow under load rather than throttling `RelayStream` to the default 64 KiB window. +The gateway runs a single listener that multiplexes gRPC and HTTP on the same port. `MultiplexedService` routes based on the `content-type` header: requests with `application/grpc` go to the gRPC router; all others go to the HTTP router for health endpoints. 
Hyper is configured with `http2().adaptive_window(true)` so the HTTP/2 stream windows grow under load rather than throttling `ForwardTcp` or `RelayStream` to the default 64 KiB window. ### Sandbox supervisor session **File**: `crates/openshell-sandbox/src/supervisor_session.rs` -`spawn(endpoint, sandbox_id, ssh_socket_path)` starts a background task that: +`spawn(endpoint, sandbox_id, ssh_socket_path, netns_fd)` starts a background task that: 1. Opens a gRPC `Channel` to the gateway (`http2_adaptive_window(true)`). The same channel multiplexes the control stream and every relay. 2. Sends `SupervisorHello { sandbox_id, instance_id }` as the first outbound message. 3. Waits for `SessionAccepted` (or fails fast on `SessionRejected`). 4. Runs a loop that reads inbound `GatewayMessage` values and emits `SupervisorHeartbeat` at the accepted interval (min 5 s, usually 15 s). -5. On `RelayOpen`, spawns `handle_relay_open()` which opens a new `RelayStream` RPC on the existing channel, sends `RelayInit { channel_id }` as the first frame, dials the local SSH Unix socket, and bridges bytes in both directions in 16 KiB chunks. +5. On `RelayOpen`, spawns `handle_relay_open()` which resolves the target (`SshRelayTarget`, `TcpRelayTarget`, or targetless-as-SSH), validates loopback-only TCP targets, dials SSH through the Unix socket or TCP from the sandbox network namespace, sends `RelayOpenResult`, opens a new `RelayStream` RPC on the existing channel, sends `RelayInit { channel_id }` as the first frame, and bridges bytes in both directions in 16 KiB chunks. Reconnect policy: the session loop wraps `run_single_session()` with exponential backoff (1 s → 30 s) on any error. A `session_established` / `session_failed` OCSF event is emitted on each attempt. -The supervisor is a dumb byte bridge with no awareness of the SSH protocol flowing through it. +After target selection, the supervisor is a dumb byte bridge with no awareness of the SSH protocol flowing through it. 
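One direction of that 16 KiB byte bridge can be sketched with std I/O traits. This is a simplification: the real supervisor drives both directions concurrently over async relay-frame and socket streams:

```rust
use std::io::{Read, Write};

/// Sketch of the supervisor's per-direction bridge loop described above:
/// copy opaque bytes in 16 KiB chunks until EOF, returning the byte count.
pub fn bridge_one_direction(mut from: impl Read, mut to: impl Write) -> std::io::Result<u64> {
    let mut buf = [0u8; 16 * 1024];
    let mut total = 0u64;
    loop {
        let n = from.read(&mut buf)?;
        if n == 0 {
            // EOF: the real bridge shuts down the peer's write half here so
            // SSH sees a clean end-of-stream.
            return Ok(total);
        }
        to.write_all(&buf[..n])?;
        total += n as u64;
    }
}
```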
### Sandbox SSH daemon @@ -151,7 +173,7 @@ The `ExecSandbox` gRPC RPC provides programmatic command execution without requi 1. Validates `sandbox_id`, `command`, env keys, and field sizes; confirms the sandbox is `Ready`. 2. Calls `supervisor_sessions.open_relay(sandbox_id, 15s)` -- a shorter wait than connect because exec runs in steady state, not on cold start. 3. Waits up to 10 seconds for the relay `DuplexStream` to arrive. -4. Starts a single-use localhost TCP listener on `127.0.0.1:0` and spawns a task that bridges a single accept to the `DuplexStream` with `copy_bidirectional`. This adapts the `DuplexStream` to something `russh::client::connect_stream` can dial. +4. Starts a single-use localhost TCP listener on `127.0.0.1:0` and spawns a task that bridges a single accept to the `DuplexStream` with `copy_bidirectional`. This adapts the SSH-targeted `DuplexStream` to something `russh::client::connect_stream` can dial. 5. Connects `russh` to the local proxy, authenticates `none` as user `sandbox`, opens a channel, optionally requests a PTY, and executes the shell-escaped command. 6. Streams `stdout`/`stderr`/`exit` events back to the gRPC caller. @@ -180,27 +202,26 @@ sequenceDiagram User->>CLI: openshell sandbox connect foo CLI->>GW: GetSandbox(name) -> sandbox.id CLI->>GW: CreateSshSession(sandbox_id) - GW-->>CLI: token, gateway_host, gateway_port, scheme, connect_path + GW-->>CLI: token, gateway_host, gateway_port, scheme Note over CLI: Builds ProxyCommand string: exec()s ssh User->>CLI: ssh spawns ssh-proxy subprocess - CLI->>GW: CONNECT /connect/ssh
X-Sandbox-Id, X-Sandbox-Token - GW->>GW: Validate token + sandbox Ready - GW->>Reg: open_relay(sandbox_id, 30s) + CLI->>GW: ForwardTcp stream
TcpForwardInit{target.ssh, authorization_token} + GW->>GW: Validate authorization_token + sandbox Ready + GW->>Reg: open_relay_with_target(sandbox_id, target=ssh, 15s) Reg-->>GW: (channel_id, relay_rx) - GW->>Sup: RelayOpen{channel_id} (over ConnectSupervisor) + GW->>Sup: RelayOpen{channel_id, target=ssh} (over ConnectSupervisor) + Sup->>Sock: UnixStream::connect(/run/openshell/ssh.sock) + Sock-->>SSHD: connection accepted + Sup->>GW: RelayOpenResult{success=true} Sup->>GW: RelayStream RPC (new HTTP/2 stream) Sup->>GW: RelayFrame::Init{channel_id} GW->>Reg: claim_relay(channel_id) -> DuplexStream pair Reg-->>GW: gateway-side DuplexStream (via relay_rx) - Sup->>Sock: UnixStream::connect(/run/openshell/ssh.sock) - Sock-->>SSHD: connection accepted - GW-->>CLI: 200 OK (upgrade) - - Note over CLI,SSHD: SSH protocol over:
CLI↔GW (HTTP CONNECT) ↔ RelayStream frames ↔ Sup ↔ Unix socket ↔ SSHD + Note over CLI,SSHD: SSH protocol over:
CLI↔GW (ForwardTcp gRPC) ↔ RelayStream frames ↔ Sup ↔ Unix socket ↔ SSHD CLI->>SSHD: SSH handshake + auth_none SSHD-->>CLI: Auth accepted @@ -225,9 +246,9 @@ sequenceDiagram - `-o SetEnv=TERM=xterm-256color` - `sandbox` as the SSH user 4. If stdin is a terminal (interactive), the CLI calls `exec()` (Unix) to replace itself with the `ssh` process. Otherwise it spawns and waits. -5. `sandbox_ssh_proxy()` connects via TCP (plain) or TLS (mTLS) to the gateway, sends a raw HTTP CONNECT request with `X-Sandbox-Id` and `X-Sandbox-Token` headers, and on a 200 response spawns two tasks to copy bytes between stdin/stdout and the tunnel. -6. Gateway-side: `ssh_connect()` in `ssh_tunnel.rs` authorizes the request, opens a relay, waits for the supervisor's `RelayStream`, and bridges the upgraded HTTP connection to the relay with `tokio::io::copy_bidirectional`. -7. Supervisor-side: on `RelayOpen`, `handle_relay_open()` in `crates/openshell-sandbox/src/supervisor_session.rs` opens a `RelayStream` RPC, sends `RelayInit`, dials `/run/openshell/ssh.sock`, and bridges the frames to the Unix socket. +5. `sandbox_ssh_proxy()` opens a gRPC `ForwardTcp` stream, sends `TcpForwardInit { sandbox_id, service_id: "ssh-proxy:", target.ssh, authorization_token: token }`, and spawns two tasks to copy bytes between stdin/stdout and `TcpForwardFrame::Data` messages. +6. Gateway-side: `handle_forward_tcp()` authorizes the SSH target with `authorization_token`, opens an SSH-targeted relay through `SupervisorSessionRegistry::open_relay_with_target()`, waits for the supervisor's `RelayStream`, and bridges `TcpForwardFrame::Data` to the relay stream. +7. Supervisor-side: on `RelayOpen`, `handle_relay_open()` in `crates/openshell-sandbox/src/supervisor_session.rs` dials `/run/openshell/ssh.sock`, reports `RelayOpenResult { success: true }`, opens a `RelayStream` RPC, sends `RelayInit`, and bridges the frames to the Unix socket. 
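The gateway's first-frame requirement from steps 5-6 above can be sketched with illustrative stand-ins for the proto types (the field set is abbreviated here; the real `TcpForwardInit` also carries `service_id` and the `target` oneof):

```rust
/// Illustrative stand-in for the TcpForwardFrame oneof.
pub enum TcpForwardFrame {
    Init { sandbox_id: String, authorization_token: String },
    Data(Vec<u8>),
}

/// Sketch of the first-frame rule: the stream must open with TcpForwardInit,
/// anything else is invalid_argument, and a missing token is unauthenticated.
pub fn require_init(first: Option<&TcpForwardFrame>) -> Result<(&str, &str), &'static str> {
    match first {
        Some(TcpForwardFrame::Init { sandbox_id, authorization_token }) => {
            if sandbox_id.is_empty() {
                return Err("invalid_argument: empty sandbox_id");
            }
            if authorization_token.is_empty() {
                return Err("unauthenticated: missing authorization_token");
            }
            Ok((sandbox_id.as_str(), authorization_token.as_str()))
        }
        Some(TcpForwardFrame::Data(_)) => Err("invalid_argument: first frame must be TcpForwardInit"),
        None => Err("invalid_argument: empty ForwardTcp stream"),
    }
}
```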
### Command Execution (CLI) @@ -303,13 +324,14 @@ sequenceDiagram Client->>GW: ExecSandbox(sandbox_id, command, stdin, timeout) GW->>GW: Validate sandbox exists + Ready - GW->>Reg: open_relay(sandbox_id, 15s) + GW->>Reg: open_relay(sandbox_id, 15s) -> target=ssh Reg-->>GW: (channel_id, relay_rx) - GW->>Sup: RelayOpen{channel_id} + GW->>Sup: RelayOpen{channel_id, target=ssh} + Sup->>SSHD: connect /run/openshell/ssh.sock + Sup->>GW: RelayOpenResult{success=true} Sup->>GW: RelayStream + RelayInit{channel_id} GW->>Reg: claim_relay -> DuplexStream - Sup->>SSHD: connect /run/openshell/ssh.sock Note over GW: start_single_use_ssh_proxy_over_relay
(127.0.0.1:ephemeral -> DuplexStream)

@@ -412,7 +434,7 @@ Tests in `supervisor_session.rs` pin this behavior:

All gRPC traffic (control plane + data plane + other RPCs) rides one mTLS-authenticated TCP+TLS+HTTP/2 connection from the supervisor to the gateway. Client certificates prove the supervisor's identity; the server certificate proves the gateway's. Nothing sits between the supervisor and the SSH daemon except the Unix socket's filesystem permissions.

-The CLI continues to authenticate to the gateway with its own mTLS credentials (or Cloudflare bearer token in reverse-proxy deployments) and a per-session token returned by `CreateSshSession`. The session token is enforced at the gateway: token scope (sandbox id), revocation state, and optional expiry are all checked in `ssh_connect()` before `open_relay()` is called.
+The CLI continues to authenticate to the gateway with its own mTLS credentials (or Cloudflare bearer token in reverse-proxy deployments) and a per-session token returned by `CreateSshSession`. The session token is enforced at the gateway: token scope (sandbox id), revocation state, and optional expiry are all checked in `handle_forward_tcp()` before `open_relay_with_target()` is called for `target.ssh`.

### Unix socket access control

@@ -445,7 +467,7 @@ The sandbox generates a fresh Ed25519 host key on every startup. The CLI disable

## Sandbox Target Resolution

-The gateway does not resolve a sandbox's network address or port. The only identifier that matters is `sandbox_id`, which keys into the supervisor session registry.
+The gateway does not resolve a sandbox pod's network address or port. The `sandbox_id` keys into the supervisor session registry, and the optional `RelayOpen.target` tells the already-connected supervisor what local target to dial inside the sandbox. SSH callers use `SshRelayTarget`; targetless messages also resolve to SSH. 
TCP relay targets are valid only for loopback destinations and are rejected by the supervisor before any `RelayStream` starts. ## API and Persistence @@ -464,7 +486,6 @@ Response: - `gateway_host` (string) -- resolved from `Config::ssh_gateway_host` (defaults to bind address if empty) - `gateway_port` (uint32) -- resolved from `Config::ssh_gateway_port` (defaults to bind port if 0) - `gateway_scheme` (string) -- `"https"` if TLS is configured, otherwise `"http"` -- `connect_path` (string) -- from `Config::ssh_connect_path` (default: `/connect/ssh`) - `host_key_fingerprint` (string) -- currently unused (empty) - `expires_at_ms` (int64) -- session expiry; 0 disables expiry @@ -478,6 +499,18 @@ Response: - `revoked` (bool) -- true if a session was found and revoked +### ForwardTcp + +**Proto**: `proto/openshell.proto` -- `TcpForwardFrame` / `TcpForwardInit` + +`ForwardTcp(stream TcpForwardFrame) returns (stream TcpForwardFrame)` carries opaque bytes between the CLI and gateway. The first frame must be `TcpForwardInit`: + +- `sandbox_id` (string) -- sandbox to connect to +- `service_id` (string) -- optional audit/correlation identifier +- `target.ssh` -- SSH target used by `ssh-proxy` +- `target.tcp` -- loopback TCP target used by service forwarding +- `authorization_token` (string) -- short-lived session token from `CreateSshSession`, required for all targets + ### SshSession persistence **Proto**: `proto/openshell.proto` -- `SshSession` message @@ -512,8 +545,10 @@ Key messages: | `SessionRejected` | gw → sup | `reason` | | `SupervisorHeartbeat` | sup → gw | (empty) | | `GatewayHeartbeat` | gw → sup | (empty) | -| `RelayOpen` | gw → sup | `channel_id` (UUID) | -| `RelayOpenResult` | sup → gw | `channel_id`, `success`, `error` | +| `RelayOpen` | gw → sup | `channel_id` (UUID), optional `target` (`SshRelayTarget` or loopback-only `TcpRelayTarget`), `service_id` | +| `SshRelayTarget` | gw → sup | Empty built-in SSH target; absence of `target` is treated the same way | 
+| `TcpRelayTarget` | gw → sup | `host`, `port`; supervisor accepts only `127.0.0.1`, `::1`, or `localhost` and ports `1..=65535` |
+| `RelayOpenResult` | sup → gw | `channel_id`, `success`, `error`; failure wakes the pending gateway waiter |
| `RelayClose` | either | `channel_id`, `reason` |
| `RelayInit` | sup → gw (first `RelayFrame`) | `channel_id` |
| `RelayFrame` | either | `oneof { RelayInit init, bytes data }` |
@@ -554,7 +589,7 @@ This function is shared between the CLI and TUI via the `openshell-core::forward
| Stage | Duration | Where |
|---|---|---|
-| Supervisor session wait (SSH connect) | 30 s | `ssh_tunnel::ssh_connect` -> `open_relay` |
+| Supervisor session wait (`ForwardTcp`) | 15 s | `handle_forward_tcp` -> `open_relay_with_target` |
| Supervisor session wait (ExecSandbox) | 15 s | `handle_exec_sandbox` -> `open_relay` |
| Wait for supervisor to claim relay | 10 s | `relay_rx` wrapped in `tokio::time::timeout` |
| Pending-relay TTL (reaper) | 10 s | `RELAY_PENDING_TIMEOUT` in registry |
@@ -569,18 +604,18 @@ This function is shared between the CLI and TUI via the `openshell-core::forward
| Scenario | Status / Behavior | Source |
|---|---|---|
-| Missing `X-Sandbox-Id` or `X-Sandbox-Token` header | `401 Unauthorized` | `ssh_tunnel.rs` -- `header_value()` |
-| Empty header value | `400 Bad Request` | `ssh_tunnel.rs` -- `header_value()` |
-| Non-CONNECT method on `/connect/ssh` | `405 Method Not Allowed` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Token not found in persistence | `401 Unauthorized` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Token revoked or sandbox ID mismatch | `401 Unauthorized` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Token expired | `401 Unauthorized` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Sandbox not found | `404 Not Found` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Sandbox not in `Ready` phase | `412 Precondition Failed` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Per-token or per-sandbox concurrency limit hit | `429 Too Many Requests` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Supervisor session not connected after 30 s | `502 Bad Gateway` | `ssh_tunnel.rs` -- `ssh_connect()` |
-| Supervisor failed to claim relay within 10 s | Tunnel closed; `"relay open timed out"` logged | `ssh_tunnel.rs` -- spawned tunnel task |
-| Relay channel oneshot dropped | Tunnel closed; `"relay channel dropped"` logged | `ssh_tunnel.rs` -- spawned tunnel task |
+| Empty `ForwardTcp` stream or first frame is not `TcpForwardInit` | `invalid_argument` | `handle_forward_tcp` |
+| Missing `authorization_token` for `ForwardTcp` | `unauthenticated` | `acquire_forward_connection_guard` |
+| Token not found in persistence | `unauthenticated` | `validate_ssh_forward_token` |
+| Token revoked or sandbox ID mismatch | `unauthenticated` | `validate_ssh_forward_token` |
+| Token expired | `unauthenticated` | `validate_ssh_forward_token` |
+| Sandbox not found | `not_found` | `handle_forward_tcp` |
+| Sandbox not in `Ready` phase | `failed_precondition` | `handle_forward_tcp` |
+| Per-token or per-sandbox concurrency limit hit | `resource_exhausted` | `acquire_ssh_connection_slots` |
+| Supervisor session not connected after 15 s | `unavailable` | `handle_forward_tcp` |
+| Supervisor rejects relay target or cannot dial it | `ForwardTcp` stream returns the supervisor error, or `ExecSandbox` returns `unavailable`; pending relay waiter is woken with the supervisor error | `handle_relay_open`, `fail_pending_relay` |
+| Supervisor failed to claim relay within 10 s | `deadline_exceeded`; `"ForwardTcp: relay open timed out"` logged | `handle_forward_tcp` spawned task |
+| Relay channel oneshot dropped | `unavailable`; `"ForwardTcp: relay channel dropped"` logged | `handle_forward_tcp` spawned task |
| First `RelayFrame` not `RelayInit` or empty `channel_id` | `invalid_argument` on `RelayStream` | `supervisor_session.rs` -- `handle_relay_stream` |
| `RelayStream` arrives after pending entry expired (>10 s) | `deadline_exceeded` | `supervisor_session.rs` -- `claim_relay` |
| Gateway restart during live relay | CLI SSH detects via keepalive within ~45 s; relays are torn down with the TCP connection | CLI `ServerAliveInterval=15`, `ServerAliveCountMax=3` |
@@ -591,9 +626,9 @@ This function is shared between the CLI and TUI via the `openshell-core::forward

## Graceful Shutdown

-### Gateway tunnel teardown
+### Gateway forward teardown

-After `copy_bidirectional` completes on either side, `ssh_connect()` calls `AsyncWriteExt::shutdown()` on the upgraded client connection so SSH sees a clean EOF and can read any remaining protocol data (e.g., exit-status) before exiting.
+When the CLI-to-gateway stream ends, `bridge_forward_tcp_stream()` shuts down the relay write half so SSH sees a clean EOF and can read any remaining protocol data (e.g., exit-status) before exiting.

### RelayStream teardown

@@ -617,7 +652,6 @@ The sandbox SSH daemon's exit thread waits for the reader thread to finish forwa
|---|---|---|
| `ssh_gateway_host` | `127.0.0.1` | Public hostname/IP advertised in `CreateSshSessionResponse` |
| `ssh_gateway_port` | `8080` | Public port for gateway connections (0 = use bind port) |
-| `ssh_connect_path` | `/connect/ssh` | HTTP path for CONNECT requests |
| `sandbox_ssh_socket_path` | `/run/openshell/ssh.sock` | Path the supervisor binds its Unix socket on; passed to the sandbox as `OPENSHELL_SSH_SOCKET_PATH` |
| `ssh_session_ttl_secs` | (default in code) | Default TTL applied to new `SshSession` rows; 0 disables expiry |

diff --git a/architecture/system-architecture.md b/architecture/system-architecture.md
index 5c7fcdcf7..b21b22331 100644
--- a/architecture/system-architecture.md
+++ b/architecture/system-architecture.md
@@ -94,7 +94,7 @@ graph TB
    CLI -- "gRPC over HTTPS (mTLS)<br/>:30051 NodePort" --> Gateway
    TUI -- "gRPC polling (mTLS)<br/>every 2s" --> Gateway
    SDK -- "gRPC over HTTPS (mTLS)" --> Gateway
-    CLI -- "HTTP CONNECT upgrade<br/>/connect/ssh (mTLS)" --> Gateway
+    CLI -- "gRPC ForwardTcp<br/>target.ssh (mTLS)" --> Gateway
    CLI -. "reads mTLS certs" .-> LocalConfig

    %% ============================================================
    %% CONNECTIONS: Supervisor session (inbound from sandbox)
    %% ============================================================
-    RelayBridge -- "ConnectSupervisor<br/>(persistent bidi stream)" --> SupRegistry
-    RelayBridge -- "RelayStream<br/>(per-invocation byte bridge,<br/>same HTTP/2 connection)" --> SupRegistry
-    RelayBridge -- "Unix socket<br/>SSH bytes" --> SSHServer
+    RelayBridge -- "ConnectSupervisor<br/>(persistent bidi stream,<br/>targetable RelayOpen)" --> SupRegistry
+    RelayBridge -- "RelayStream<br/>(per-accepted-relay byte bridge,<br/>same HTTP/2 connection)" --> SupRegistry
+    RelayBridge -- "Unix socket<br/>SSH target bytes" --> SSHServer

    %% ============================================================
    %% CONNECTIONS: CRD Controller
@@ -153,7 +153,7 @@ graph TB
    %% ============================================================
    %% CLIENT SSH / EXEC (bytes tunneled via supervisor relay)
    %% ============================================================
-    CLI -- "HTTP CONNECT /connect/ssh<br/>+ tar-over-SSH file sync<br/>(bytes bridged through<br/>SupervisorSessionRegistry)" --> Gateway
+    CLI -- "gRPC ForwardTcp(target.ssh)<br/>+ tar-over-SSH file sync<br/>(bytes bridged through<br/>SupervisorSessionRegistry)" --> Gateway

    %% ============================================================
    %% STYLES
    %% ============================================================
@@ -195,9 +195,9 @@ graph TB

1. **CLI/SDK to Gateway**: All control-plane traffic uses gRPC over HTTPS with mutual TLS (mTLS). Single multiplexed port (8080 inside cluster, 30051 NodePort).

-2. **Supervisor Session (inbound from sandbox)**: Each sandbox supervisor opens a persistent `ConnectSupervisor` bidi gRPC stream to the gateway over mTLS. The gateway tracks these in `SupervisorSessionRegistry`. When SSH or exec access is needed, the gateway sends `RelayOpen { channel_id }` on that stream; the supervisor responds by initiating a `RelayStream` RPC on the same HTTP/2 connection whose first frame is a `RelayInit { channel_id }`. Subsequent frames carry raw bytes in both directions. The gateway never dials the sandbox pod.
+2. **Supervisor Session (inbound from sandbox)**: Each sandbox supervisor opens a persistent `ConnectSupervisor` bidi gRPC stream to the gateway over mTLS. The gateway tracks these in `SupervisorSessionRegistry`. When SSH or exec access is needed, the gateway sends `RelayOpen { channel_id, target = SshRelayTarget }` on that stream; targetless relay requests remain SSH-compatible, and TCP targets are supervisor-validated as loopback-only. The supervisor dials the target before reporting a successful `RelayOpenResult`, then initiates a `RelayStream` RPC on the same HTTP/2 connection whose first frame is a `RelayInit { channel_id }`. Subsequent frames carry raw bytes in both directions. The gateway never dials the sandbox pod.

-3. **SSH / Exec Access**: CLI connects via HTTP CONNECT upgrade at `/connect/ssh` (or calls `ExecSandbox` gRPC). The gateway authenticates, calls `open_relay`, and bridges the client bytes through the supervisor's `RelayStream` to the supervisor's in-sandbox SSH daemon, which binds to a Unix socket (`/run/openshell/ssh.sock`) rather than a TCP port.
+3. **SSH / Exec Access**: CLI connects via the bidirectional gRPC `ForwardTcp` stream with `TcpForwardInit.target = SshRelayTarget` (or calls `ExecSandbox` gRPC). The gateway authenticates the SSH target with the short-lived session token, calls `open_relay_with_target(SshRelayTarget)`, and bridges the client bytes through the supervisor's `RelayStream` to the supervisor's in-sandbox SSH daemon, which binds to a Unix socket (`/run/openshell/ssh.sock`) rather than a TCP port.

4. **File Sync**: tar archives streamed over the relay-tunneled SSH session (no rsync dependency).

diff --git a/crates/openshell-cli/Cargo.toml b/crates/openshell-cli/Cargo.toml
index b3a006fdd..bf6065194 100644
--- a/crates/openshell-cli/Cargo.toml
+++ b/crates/openshell-cli/Cargo.toml
@@ -63,6 +63,7 @@ tokio-tungstenite = { workspace = true }

# Streams
futures = { workspace = true }
+tokio-stream = { workspace = true }
nix = { workspace = true }

# URL parsing
diff --git a/crates/openshell-cli/src/main.rs b/crates/openshell-cli/src/main.rs
index 3502c2b07..38e61c279 100644
--- a/crates/openshell-cli/src/main.rs
+++ b/crates/openshell-cli/src/main.rs
@@ -8,6 +8,7 @@ use clap_complete::engine::ArgValueCompleter;
use clap_complete::env::CompleteEnv;
use miette::Result;
use owo_colors::OwoColorize;
+use std::collections::HashMap;
use std::io::Write;

use openshell_bootstrap::{
@@ -234,6 +235,7 @@ const FORWARD_EXAMPLES: &str = "\x1b[1mALIAS\x1b[0m
\x1b[1mEXAMPLES\x1b[0m
  $ openshell forward start 8080
  $ openshell forward start 3000 my-sandbox
+  $ openshell forward service my-sandbox --target-port 8000 --local 8000
  $ openshell forward stop 8080
  $ openshell forward list
";
@@ -1667,6 +1669,26 @@ enum ForwardCommands {
    /// List active port forwards.
    #[command(help_template = LEAF_HELP_TEMPLATE, next_help_heading = "FLAGS")]
    List,
+
+    /// Forward a local TCP port to a loopback service inside a sandbox over gRPC.
+    #[command(help_template = LEAF_HELP_TEMPLATE, next_help_heading = "FLAGS")]
+    Service {
+        /// Sandbox name (defaults to last-used sandbox).
+        #[arg(add = ArgValueCompleter::new(completers::complete_sandbox_names))]
+        name: Option<String>,
+
+        /// Target service port inside the sandbox.
+        #[arg(long)]
+        target_port: u16,
+
+        /// Target service host inside the sandbox. Phase 1 accepts loopback only.
+        #[arg(long, default_value = "127.0.0.1")]
+        target_host: String,
+
+        /// Local bind address and port: [bind_address:]port. Defaults to the target port. Use port 0 for dynamic assignment.
+        #[arg(long)]
+        local: Option<String>,
+    },
}

#[tokio::main]
@@ -1954,6 +1976,27 @@ async fn main() -> Result<()> {
                }
            }
        }
+        ForwardCommands::Service {
+            name,
+            target_port,
+            target_host,
+            local,
+        } => {
+            let ctx = resolve_gateway(&cli.gateway, &cli.gateway_endpoint)?;
+            let mut tls = tls.with_gateway_name(&ctx.name);
+            apply_edge_auth(&mut tls, &ctx.name);
+            let name = resolve_sandbox_name(name, &ctx.name)?;
+            let local = local.unwrap_or_else(|| target_port.to_string());
+            run::service_forward_tcp(
+                &ctx.endpoint,
+                &name,
+                Some(&local),
+                &target_host,
+                target_port,
+                &tls,
+            )
+            .await?;
+        }
        ForwardCommands::Start {
            port,
            name,
@@ -2350,7 +2393,7 @@ async fn main() -> Result<()> {
    };

    // Parse --label flags into a HashMap.
-    let mut labels_map = std::collections::HashMap::new();
+    let mut labels_map = HashMap::new();
    for label_str in &labels {
        let parts: Vec<&str> = label_str.splitn(2, '=').collect();
        if parts.len() != 2 {
diff --git a/crates/openshell-cli/src/run.rs b/crates/openshell-cli/src/run.rs
index 87489014a..442e45f1a 100644
--- a/crates/openshell-cli/src/run.rs
+++ b/crates/openshell-cli/src/run.rs
@@ -24,14 +24,16 @@ use openshell_bootstrap::{
};
use openshell_core::proto::{
    ApproveAllDraftChunksRequest, ApproveDraftChunkRequest, ClearDraftChunksRequest,
-    CreateProviderRequest, CreateSandboxRequest, DeleteProviderRequest, DeleteSandboxRequest,
-    ExecSandboxRequest, GetClusterInferenceRequest, GetDraftHistoryRequest, GetDraftPolicyRequest,
-    GetGatewayConfigRequest, GetProviderRequest, GetSandboxConfigRequest, GetSandboxLogsRequest,
-    GetSandboxPolicyStatusRequest, GetSandboxRequest, HealthRequest, ListProvidersRequest,
-    ListSandboxPoliciesRequest, ListSandboxesRequest, PolicySource, PolicyStatus, Provider,
-    RejectDraftChunkRequest, Sandbox, SandboxPhase, SandboxPolicy, SandboxSpec, SandboxTemplate,
-    SetClusterInferenceRequest, SettingScope, SettingValue, UpdateConfigRequest,
-    UpdateProviderRequest, WatchSandboxRequest, exec_sandbox_event, setting_value,
+    CreateProviderRequest, CreateSandboxRequest, CreateSshSessionRequest, DeleteProviderRequest,
+    DeleteSandboxRequest, ExecSandboxRequest, GetClusterInferenceRequest, GetDraftHistoryRequest,
+    GetDraftPolicyRequest, GetGatewayConfigRequest, GetProviderRequest, GetSandboxConfigRequest,
+    GetSandboxLogsRequest, GetSandboxPolicyStatusRequest, GetSandboxRequest, HealthRequest,
+    ListProvidersRequest, ListSandboxPoliciesRequest, ListSandboxesRequest, PolicySource,
+    PolicyStatus, Provider, RejectDraftChunkRequest, RevokeSshSessionRequest, Sandbox,
+    SandboxPhase, SandboxPolicy, SandboxSpec, SandboxTemplate, SetClusterInferenceRequest,
+    SettingScope, SettingValue, TcpForwardFrame, TcpForwardInit, TcpRelayTarget,
+    UpdateConfigRequest, UpdateProviderRequest, WatchSandboxRequest, exec_sandbox_event,
+    setting_value, tcp_forward_init,
};
use openshell_core::settings::{self, SettingValueKind};
use openshell_core::{ObjectId, ObjectName};
@@ -1964,7 +1966,7 @@ pub async fn sandbox_create_with_bootstrap(
        tty_override,
        Some(false),
        auto_providers_override,
-        &std::collections::HashMap::new(),
+        &HashMap::new(),
        &tls,
    )
    .await
@@ -2020,7 +2022,7 @@ pub async fn sandbox_create(
    tty_override: Option<bool>,
    bootstrap_override: Option<bool>,
    auto_providers_override: Option<bool>,
-    labels: &std::collections::HashMap<String, String>,
+    labels: &HashMap<String, String>,
    tls: &TlsOptions,
) -> Result<()> {
    if editor.is_some() && !command.is_empty() {
@@ -2134,7 +2136,7 @@ pub async fn sandbox_create(
            status.message()
        ));
    }
-        Err(status) => return Err(status).into_diagnostic(),
+        Err(status) => return Err(miette::miette!(status.to_string())),
    };
    let sandbox = response
        .into_inner()
        .sandbox
@@ -2967,6 +2969,295 @@ pub async fn sandbox_exec_grpc(
    Ok(exit_code)
}

+pub async fn service_forward_tcp(
+    server: &str,
+    name: &str,
+    local: Option<&str>,
+    target_host: &str,
+    target_port: u16,
+    tls: &TlsOptions,
+) -> Result<()> {
+    let (bind_addr, bind_port) = parse_tcp_forward_spec(local, target_port)?;
+    let mut client = grpc_client(server, tls).await?;
+
+    let sandbox = fetch_ready_sandbox_for_forward(&mut client, name).await?;
+
+    let listener = tokio::net::TcpListener::bind((bind_addr.as_str(), bind_port))
+        .await
+        .into_diagnostic()
+        .wrap_err_with(|| format!("failed to bind local forward on {bind_addr}:{bind_port}"))?;
+    let local_addr = listener
+        .local_addr()
+        .into_diagnostic()
+        .wrap_err("failed to read local forward address")?;
+    eprintln!(
+        "{} Forwarding {} -> {}:{} in sandbox {} via gRPC",
+        "✓".green().bold(),
+        local_addr,
+        target_host,
+        target_port,
+        name,
+    );
+
+    let sandbox_id = sandbox.object_id().to_string();
+    let (fatal_tx, mut fatal_rx) = tokio::sync::mpsc::channel::<String>(1);
+    let mut health_check = tokio::time::interval(Duration::from_secs(2));
+    health_check.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
+    loop {
+        tokio::select! {
+            Some(reason) = fatal_rx.recv() => {
+                return Err(miette::miette!("service forward stopped: {reason}"));
+            }
+
+            _ = health_check.tick() => {
+                fetch_ready_sandbox_for_forward(&mut client, name).await?;
+            }
+
+            accepted = listener.accept() => {
+                let (socket, peer) = accepted
+                    .into_diagnostic()
+                    .wrap_err("failed to accept local forward connection")?;
+                let mut client = client.clone();
+                let sandbox_id = sandbox_id.clone();
+                let target_host = target_host.to_string();
+                let service_id = format!("service-forward:{name}:{target_host}:{target_port}");
+                let fatal_tx = fatal_tx.clone();
+                tokio::spawn(async move {
+                    let token = match create_forward_session_token(&mut client, &sandbox_id).await {
+                        Ok(token) => token,
+                        Err(err) => {
+                            tracing::warn!(peer = %peer, error = %err, "service forward session creation failed");
+                            if err.fatal {
+                                let _ = fatal_tx.send(err.message).await;
+                            }
+                            return;
+                        }
+                    };
+                    if let Err(err) = forward_one_tcp_connection(
+                        &mut client,
+                        socket,
+                        sandbox_id,
+                        target_host,
+                        target_port,
+                        service_id,
+                        token.clone(),
+                    )
+                    .await
+                    {
+                        tracing::warn!(peer = %peer, error = %err, "service forward connection failed");
+                        if err.fatal {
+                            let _ = fatal_tx.send(err.message).await;
+                        }
+                    }
+                    let _ = client
+                        .revoke_ssh_session(RevokeSshSessionRequest { token })
+                        .await;
+                });
+            }
+        }
+    }
+}
+
+async fn create_forward_session_token(
+    client: &mut crate::tls::GrpcClient,
+    sandbox_id: &str,
+) -> std::result::Result<String, ForwardTcpConnectionError> {
+    let response = client
+        .create_ssh_session(CreateSshSessionRequest {
+            sandbox_id: sandbox_id.to_string(),
+        })
+        .await
+        .map_err(ForwardTcpConnectionError::from_status)?;
+    Ok(response.into_inner().token)
+}
+
+async fn fetch_ready_sandbox_for_forward(
+    client: &mut crate::tls::GrpcClient,
+    name: &str,
+) -> Result<Sandbox> {
+    let response = match client
+        .get_sandbox(GetSandboxRequest {
+            name: name.to_string(),
+        })
+        .await
+    {
+        Ok(response) => response,
+        Err(status) if status.code() == Code::NotFound => {
+            return Err(miette::miette!(
+                "sandbox '{name}' no longer exists; stopping service forward"
+            ));
+        }
+        Err(status) => return Err(status).into_diagnostic(),
+    };
+
+    let sandbox = response
+        .into_inner()
+        .sandbox
+        .ok_or_else(|| miette::miette!("sandbox '{name}' not found"))?;
+
+    if SandboxPhase::try_from(sandbox.phase) != Ok(SandboxPhase::Ready) {
+        return Err(miette::miette!(
+            "sandbox '{}' is no longer ready (phase: {}); stopping service forward",
+            name,
+            phase_name(sandbox.phase)
+        ));
+    }
+
+    Ok(sandbox)
+}
+
+#[derive(Debug)]
+struct ForwardTcpConnectionError {
+    message: String,
+    fatal: bool,
+}
+
+impl ForwardTcpConnectionError {
+    fn transient(message: impl Into<String>) -> Self {
+        Self {
+            message: message.into(),
+            fatal: false,
+        }
+    }
+
+    fn from_status(status: Status) -> Self {
+        let fatal = matches!(status.code(), Code::NotFound | Code::FailedPrecondition);
+        Self {
+            message: status.to_string(),
+            fatal,
+        }
+    }
+}
+
+impl std::fmt::Display for ForwardTcpConnectionError {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        f.write_str(&self.message)
+    }
+}
+
+impl std::error::Error for ForwardTcpConnectionError {}
+
+fn parse_tcp_forward_spec(local: Option<&str>, default_port: u16) -> Result<(String, u16)> {
+    let Some(spec) = local else {
+        return Ok(("127.0.0.1".to_string(), default_port));
+    };
+
+    if let Some(pos) = spec.rfind(':') {
+        let addr = &spec[..pos];
+        let port_str = &spec[pos + 1..];
+        if let Ok(port) = port_str.parse::<u16>() {
+            if addr.is_empty() {
+                return Err(miette::miette!("bind address is required before ':'"));
+            }
+            return Ok((addr.to_string(), port));
+        }
+    }
+
+    let port: u16 = spec.parse().map_err(|_| {
+        miette::miette!("invalid local forward spec '{spec}': expected [bind_address:]port")
+    })?;
+    Ok(("127.0.0.1".to_string(), port))
+}
+
+async fn forward_one_tcp_connection(
+    client: &mut crate::tls::GrpcClient,
+    socket: tokio::net::TcpStream,
+    sandbox_id: String,
+    target_host: String,
+    target_port: u16,
+    service_id: String,
+    authorization_token: String,
+) -> std::result::Result<(), ForwardTcpConnectionError> {
+    use tokio::io::{AsyncReadExt, AsyncWriteExt};
+    use tokio_stream::wrappers::ReceiverStream;
+
+    let (tx, rx) = tokio::sync::mpsc::channel::<TcpForwardFrame>(16);
+    tx.send(TcpForwardFrame {
+        payload: Some(openshell_core::proto::tcp_forward_frame::Payload::Init(
+            TcpForwardInit {
+                sandbox_id,
+                service_id,
+                target: Some(tcp_forward_init::Target::Tcp(TcpRelayTarget {
+                    host: target_host,
+                    port: u32::from(target_port),
+                })),
+                authorization_token,
+            },
+        )),
+    })
+    .await
+    .map_err(|_| ForwardTcpConnectionError::transient("failed to initialize forward stream"))?;
+
+    let mut response = match client.forward_tcp(ReceiverStream::new(rx)).await {
+        Ok(response) => response.into_inner(),
+        Err(status) => {
+            let err = ForwardTcpConnectionError::from_status(status);
+            drain_and_shutdown_local_socket(socket).await;
+            return Err(err);
+        }
+    };
+
+    let (mut local_read, mut local_write) = socket.into_split();
+
+    let to_gateway = tokio::spawn(async move {
+        let mut buf = vec![0u8; 64 * 1024];
+        loop {
+            let n = local_read.read(&mut buf).await?;
+            if n == 0 {
+                break;
+            }
+            if tx
+                .send(TcpForwardFrame {
+                    payload: Some(openshell_core::proto::tcp_forward_frame::Payload::Data(
+                        buf[..n].to_vec(),
+                    )),
+                })
+                .await
+                .is_err()
+            {
+                break;
+            }
+        }
+        Ok::<(), std::io::Error>(())
+    });
+
+    while let Some(frame) = response
+        .message()
+        .await
+        .map_err(ForwardTcpConnectionError::from_status)?
+    {
+        let Some(openshell_core::proto::tcp_forward_frame::Payload::Data(data)) = frame.payload
+        else {
+            continue;
+        };
+        if data.is_empty() {
+            continue;
+        }
+        local_write
+            .write_all(&data)
+            .await
+            .map_err(|err| ForwardTcpConnectionError::transient(err.to_string()))?;
+    }
+
+    let _ = local_write.shutdown().await;
+    to_gateway.abort();
+    Ok(())
+}
+
+async fn drain_and_shutdown_local_socket(mut socket: tokio::net::TcpStream) {
+    use tokio::io::{AsyncReadExt, AsyncWriteExt};
+
+    let mut buf = [0u8; 4096];
+    loop {
+        match tokio::time::timeout(Duration::from_millis(25), socket.read(&mut buf)).await {
+            Ok(Ok(0)) | Err(_) => break,
+            Ok(Ok(_)) => continue,
+            Ok(Err(_)) => break,
+        }
+    }
+    let _ = socket.shutdown().await;
+}
+
/// Print a single YAML line with dimmed keys and regular values.
fn print_yaml_line(line: &str) {
    // Find leading whitespace
@@ -3371,7 +3662,7 @@ async fn auto_create_provider(
            id: String::new(),
            name: exact_name.to_string(),
            created_at_ms: 0,
-            labels: std::collections::HashMap::new(),
+            labels: HashMap::new(),
        }),
        r#type: provider_type.to_string(),
        credentials: discovered.credentials.clone(),
@@ -3411,7 +3702,7 @@ async fn auto_create_provider(
            id: String::new(),
            name: name.clone(),
            created_at_ms: 0,
-            labels: std::collections::HashMap::new(),
+            labels: HashMap::new(),
        }),
        r#type: provider_type.to_string(),
        credentials: discovered.credentials.clone(),
@@ -3569,7 +3860,7 @@ pub async fn provider_create(
            id: String::new(),
            name: name.to_string(),
            created_at_ms: 0,
-            labels: std::collections::HashMap::new(),
+            labels: HashMap::new(),
        }),
        r#type: provider_type,
        credentials: credential_map,
@@ -3759,7 +4050,7 @@ pub async fn provider_update(
            id: String::new(),
            name: name.to_string(),
            created_at_ms: 0,
-            labels: std::collections::HashMap::new(),
+            labels: HashMap::new(),
        }),
        r#type: String::new(),
        credentials: credential_map,
diff --git a/crates/openshell-cli/src/ssh.rs b/crates/openshell-cli/src/ssh.rs
index f883d9684..830dfc19c 100644
--- a/crates/openshell-cli/src/ssh.rs
+++ b/crates/openshell-cli/src/ssh.rs
@@ -3,30 +3,30 @@

//! SSH connection and proxy utilities.

-use crate::tls::{TlsOptions, build_rustls_config, grpc_client, require_tls_materials};
+use crate::tls::{TlsOptions, grpc_client};
use miette::{IntoDiagnostic, Result, WrapErr};
#[cfg(unix)]
use nix::sys::signal::{SaFlags, SigAction, SigHandler, SigSet, Signal, sigaction};
use openshell_core::ObjectId;
use openshell_core::forward::{
-    build_proxy_command, find_ssh_forward_pid, resolve_ssh_gateway, shell_escape,
-    validate_ssh_session_response, write_forward_pid,
+    build_proxy_command, find_ssh_forward_pid, format_gateway_url, resolve_ssh_gateway,
+    shell_escape, validate_ssh_session_response, write_forward_pid,
+};
+use openshell_core::proto::{
+    CreateSshSessionRequest, GetSandboxRequest, SshRelayTarget, TcpForwardFrame, TcpForwardInit,
+    tcp_forward_init,
};
-use openshell_core::proto::{CreateSshSessionRequest, GetSandboxRequest};
use owo_colors::OwoColorize;
-use rustls::pki_types::ServerName;
use std::fs;
use std::io::IsTerminal;
#[cfg(unix)]
use std::os::unix::process::CommandExt;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
-use std::sync::Arc;
use std::time::Duration;
-use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt, BufReader};
-use tokio::net::TcpStream;
+use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::Command as TokioCommand;
-use tokio_rustls::TlsConnector;
+use tokio_stream::wrappers::ReceiverStream;

const FOREGROUND_FORWARD_STARTUP_GRACE_PERIOD: Duration = Duration::from_secs(2);

@@ -100,8 +100,7 @@ async fn ssh_session_config(
    // external tunnel endpoint (the cluster URL), not the server's internal
    // scheme/host/port which may be plaintext HTTP on 127.0.0.1.
    let gateway_url = if tls.is_bearer_auth() {
-        let base = server.trim_end_matches('/');
-        format!("{base}{}", session.connect_path)
+        server.trim_end_matches('/').to_string()
    } else {
        // If the server returned a loopback gateway address, override it with the
        // cluster endpoint's host. This handles the case where the server defaults
@@ -110,10 +109,7 @@ async fn ssh_session_config(
        let gateway_port_u16 = session.gateway_port as u16;
        let (gateway_host, gateway_port) =
            resolve_ssh_gateway(&session.gateway_host, gateway_port_u16, server);
-        format!(
-            "{}://{}:{}{}",
-            session.gateway_scheme, gateway_host, gateway_port, session.connect_path
-        )
+        format_gateway_url(&session.gateway_scheme, &gateway_host, gateway_port)
    };
    let gateway_name = tls
        .gateway_name()
@@ -793,11 +789,86 @@ pub async fn sandbox_ssh_proxy(
    token: &str,
    tls: &TlsOptions,
) -> Result<()> {
+    let server = grpc_server_from_ssh_gateway_url(gateway_url)?;
+    let mut client = grpc_client(&server, tls).await?;
+
+    let (tx, rx) = tokio::sync::mpsc::channel::<TcpForwardFrame>(16);
+    tx.send(TcpForwardFrame {
+        payload: Some(openshell_core::proto::tcp_forward_frame::Payload::Init(
+            TcpForwardInit {
+                sandbox_id: sandbox_id.to_string(),
+                service_id: format!("ssh-proxy:{sandbox_id}"),
+                target: Some(tcp_forward_init::Target::Ssh(SshRelayTarget {})),
+                authorization_token: token.to_string(),
+            },
+        )),
+    })
+    .await
+    .map_err(|_| miette::miette!("failed to initialize SSH forward stream"))?;
+
+    let mut response = client
+        .forward_tcp(ReceiverStream::new(rx))
+        .await
+        .into_diagnostic()?
+        .into_inner();
+
+    let stdin = tokio::io::stdin();
+    let stdout = tokio::io::stdout();
+
+    let to_remote = tokio::spawn(async move {
+        let mut stdin = stdin;
+        let mut buf = vec![0u8; 64 * 1024];
+        loop {
+            let Ok(n) = stdin.read(&mut buf).await else {
+                break;
+            };
+            if n == 0 {
+                break;
+            }
+            if tx
+                .send(TcpForwardFrame {
+                    payload: Some(openshell_core::proto::tcp_forward_frame::Payload::Data(
+                        buf[..n].to_vec(),
+                    )),
+                })
+                .await
+                .is_err()
+            {
+                break;
+            }
+        }
+    });
+    let from_remote = tokio::spawn(async move {
+        let mut stdout = stdout;
+        loop {
+            let frame = match response.message().await {
+                Ok(Some(frame)) => frame,
+                Ok(None) | Err(_) => break,
+            };
+            let Some(openshell_core::proto::tcp_forward_frame::Payload::Data(data)) = frame.payload
+            else {
+                continue;
+            };
+            if data.is_empty() {
+                continue;
+            }
+            if stdout.write_all(&data).await.is_err() {
+                break;
+            }
+            let _ = stdout.flush().await;
+        }
+    });
+    let _ = from_remote.await;
+    to_remote.abort();
+
+    Ok(())
+}
+
+fn grpc_server_from_ssh_gateway_url(gateway_url: &str) -> Result<String> {
    let url: url::Url = gateway_url
        .parse()
        .into_diagnostic()
        .wrap_err("invalid gateway URL")?;
-
    let scheme = url.scheme();
    let gateway_host = url
        .host_str()
@@ -805,76 +876,7 @@ pub async fn sandbox_ssh_proxy(
    let gateway_port = url
        .port_or_known_default()
        .ok_or_else(|| miette::miette!("gateway URL missing port"))?;
-    let connect_path = url.path();
-
-    let request = format!(
-        "CONNECT {connect_path} HTTP/1.1\r\nHost: {gateway_host}\r\nX-Sandbox-Id: {sandbox_id}\r\nX-Sandbox-Token: {token}\r\n\r\n"
-    );
-
-    // The gateway returns 412 (Precondition Failed) when the sandbox pod
-    // exists but hasn't reached Ready phase yet. This is a transient state
-    // after sandbox allocation — retry with backoff instead of failing
-    // immediately.
-    const MAX_CONNECT_WAIT: Duration = Duration::from_secs(60);
-    const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
-
-    let start = std::time::Instant::now();
-    let mut backoff = INITIAL_BACKOFF;
-    let mut buf_stream;
-
-    loop {
-        let mut stream: Box<dyn ProxyStream> =
-            connect_gateway(scheme, gateway_host, gateway_port, tls).await?;
-        stream
-            .write_all(request.as_bytes())
-            .await
-            .into_diagnostic()?;
-
-        // Wrap in a BufReader **before** reading the HTTP response. The gateway
-        // may send the 200 OK response and the first SSH protocol bytes in the
-        // same TCP segment / WebSocket frame. A plain `read()` would consume
-        // those SSH bytes into our buffer and discard them, causing SSH to see a
-        // truncated protocol banner and exit with code 255. BufReader ensures
-        // any bytes read past the `\r\n\r\n` header boundary stay buffered and
-        // are returned by subsequent reads during the bidirectional copy phase.
-        buf_stream = BufReader::new(stream);
-        let status = read_connect_status(&mut buf_stream).await?;
-        if status == 200 {
-            break;
-        }
-        if status == 412 && start.elapsed() < MAX_CONNECT_WAIT {
-            tracing::debug!(
-                elapsed = ?start.elapsed(),
-                "sandbox not yet ready (HTTP 412), retrying in {backoff:?}"
-            );
-            tokio::time::sleep(backoff).await;
-            backoff = (backoff * 2).min(Duration::from_secs(8));
-            continue;
-        }
-        return Err(miette::miette!(
-            "gateway CONNECT failed with status {status}"
-        ));
-    }
-
-    let (reader, writer) = tokio::io::split(buf_stream);
-    let stdin = tokio::io::stdin();
-    let stdout = tokio::io::stdout();
-
-    // Spawn both copy directions as independent tasks. Using separate spawned
-    // tasks (instead of try_join!/select!) ensures that when one direction
-    // completes or errors, the other continues independently until it also
-    // finishes. This is critical: when the remote side closes the connection,
-    // we must keep the stdin→gateway copy alive so SSH can finish sending its
-    // protocol-close packets, and vice-versa.
-    let to_remote = tokio::spawn(copy_ignoring_errors(stdin, writer));
-    let from_remote = tokio::spawn(copy_ignoring_errors(reader, stdout));
-    let _ = from_remote.await;
-    // Once the remote→stdout direction is done, SSH has received all the data
-    // it needs. Drop the stdin→gateway task – SSH will close its pipe when
-    // it's done regardless.
-    to_remote.abort();
-
-    Ok(())
+    Ok(format_gateway_url(scheme, gateway_host, gateway_port))
}

/// Run the SSH proxy in "name mode": create a session on the fly, then proxy.
@@ -1095,97 +1097,6 @@ pub fn print_ssh_config(gateway: &str, name: &str) {
    print!("{}", render_ssh_config(gateway, name));
}

-/// Copy all bytes from `reader` to `writer`, flushing on completion.
-/// Errors are intentionally discarded – connection teardown errors are
-/// expected during normal SSH session shutdown.
-async fn copy_ignoring_errors<R, W>(mut reader: R, mut writer: W)
-where
-    R: AsyncRead + Unpin,
-    W: AsyncWrite + Unpin,
-{
-    let _ = tokio::io::copy(&mut reader, &mut writer).await;
-    let _ = AsyncWriteExt::flush(&mut writer).await;
-    let _ = AsyncWriteExt::shutdown(&mut writer).await;
-}
-
-async fn connect_gateway(
-    scheme: &str,
-    host: &str,
-    port: u16,
-    tls: &TlsOptions,
-) -> Result<Box<dyn ProxyStream>> {
-    // When using edge bearer auth, route through the WebSocket tunnel proxy
-    // regardless of the origin scheme. The proxy handles edge auth headers
-    // and TLS termination at the edge; the origin may be plaintext HTTP
-    // behind the tunnel.
-    if tls.is_bearer_auth() {
-        let token = tls
-            .edge_token
-            .as_deref()
-            .ok_or_else(|| miette::miette!("edge token required for tunnel"))?;
-        let gateway_url = format!("https://{host}:{port}");
-        let proxy = crate::edge_tunnel::start_tunnel_proxy(&gateway_url, token).await?;
-        let tcp = TcpStream::connect(proxy.local_addr)
-            .await
-            .into_diagnostic()?;
-        tcp.set_nodelay(true).into_diagnostic()?;
-        return Ok(Box::new(tcp));
-    }
-
-    let tcp = TcpStream::connect((host, port)).await.into_diagnostic()?;
-    tcp.set_nodelay(true).into_diagnostic()?;
-    if scheme.eq_ignore_ascii_case("https") {
-        let materials = require_tls_materials(&format!("https://{host}:{port}"), tls)?;
-        let config = build_rustls_config(&materials)?;
-        let connector = TlsConnector::from(Arc::new(config));
-        let server_name = ServerName::try_from(host.to_string())
-            .map_err(|_| miette::miette!("invalid server name: {host}"))?;
-        let tls = connector
-            .connect(server_name, tcp)
-            .await
-            .into_diagnostic()?;
-        Ok(Box::new(tls))
-    } else {
-        Ok(Box::new(tcp))
-    }
-}
-
-/// Read exactly the HTTP response status line and headers up to `\r\n\r\n`.
-///
-/// Uses byte-at-a-time reads so that the caller's `BufReader` retains any
-/// bytes that arrived after the header boundary (e.g. the SSH protocol
-/// banner that the gateway may send in the same TCP segment).
-async fn read_connect_status<R: AsyncRead + Unpin>(stream: &mut R) -> Result<u16> {
-    let mut buf = Vec::new();
-    let mut byte = [0u8; 1];
-    loop {
-        let n = stream.read(&mut byte).await.into_diagnostic()?;
-        if n == 0 {
-            break;
-        }
-        buf.push(byte[0]);
-        if buf.len() >= 4 && &buf[buf.len() - 4..] == b"\r\n\r\n" {
-            break;
-        }
-        if buf.len() > 8192 {
-            break;
-        }
-    }
-    let text = String::from_utf8_lossy(&buf);
-    let line = text.lines().next().unwrap_or("");
-    let status = line
-        .split_whitespace()
-        .nth(1)
-        .unwrap_or("0")
-        .parse::<u16>()
-        .unwrap_or(0);
-    Ok(status)
-}
-
-trait ProxyStream: AsyncRead + AsyncWrite + Unpin + Send {}
-
-impl<T> ProxyStream for T where T: AsyncRead + AsyncWrite + Unpin + Send {}
-
#[cfg(test)]
mod tests {
    use super::*;
diff --git a/crates/openshell-cli/src/tls.rs b/crates/openshell-cli/src/tls.rs
index cd6483530..dcb282d83 100644
--- a/crates/openshell-cli/src/tls.rs
+++ b/crates/openshell-cli/src/tls.rs
@@ -253,6 +253,7 @@ pub async fn build_channel(server: &str, tls: &TlsOptions) -> Result<Channel> {
        let endpoint = Endpoint::from_shared(server.to_string())
            .into_diagnostic()?
            .connect_timeout(Duration::from_secs(10))
+            .http2_adaptive_window(true)
            .http2_keep_alive_interval(Duration::from_secs(10))
            .keep_alive_while_idle(true);
        return endpoint.connect().await.into_diagnostic();
@@ -272,6 +273,7 @@ pub async fn build_channel(server: &str, tls: &TlsOptions) -> Result<Channel> {
        let endpoint = Endpoint::from_shared(local_url)
            .into_diagnostic()?
            .connect_timeout(Duration::from_secs(10))
+            .http2_adaptive_window(true)
            .http2_keep_alive_interval(Duration::from_secs(10))
            .keep_alive_while_idle(true);
        return endpoint.connect().await.into_diagnostic();
@@ -280,6 +282,7 @@ pub async fn build_channel(server: &str, tls: &TlsOptions) -> Result<Channel> {
    let mut endpoint = Endpoint::from_shared(server.to_string())
        .into_diagnostic()?
.connect_timeout(Duration::from_secs(10)) + .http2_adaptive_window(true) .http2_keep_alive_interval(Duration::from_secs(10)) .keep_alive_while_idle(true); diff --git a/crates/openshell-cli/tests/ensure_providers_integration.rs b/crates/openshell-cli/tests/ensure_providers_integration.rs index 34485377d..2760e20bc 100644 --- a/crates/openshell-cli/tests/ensure_providers_integration.rs +++ b/crates/openshell-cli/tests/ensure_providers_integration.rs @@ -459,6 +459,17 @@ impl OpenShell for TestOpenShell { ) -> Result, tonic::Status> { Err(tonic::Status::unimplemented("not implemented in test")) } + + type ForwardTcpStream = tokio_stream::wrappers::ReceiverStream< + Result, + >; + + async fn forward_tcp( + &self, + _request: tonic::Request>, + ) -> Result, tonic::Status> { + Err(tonic::Status::unimplemented("not implemented in test")) + } } // ── TLS helpers ────────────────────────────────────────────────────── diff --git a/crates/openshell-cli/tests/mtls_integration.rs b/crates/openshell-cli/tests/mtls_integration.rs index e78c91578..307e339ce 100644 --- a/crates/openshell-cli/tests/mtls_integration.rs +++ b/crates/openshell-cli/tests/mtls_integration.rs @@ -346,6 +346,17 @@ impl OpenShell for TestOpenShell { ) -> Result, tonic::Status> { Err(tonic::Status::unimplemented("not implemented in test")) } + + type ForwardTcpStream = tokio_stream::wrappers::ReceiverStream< + Result, + >; + + async fn forward_tcp( + &self, + _request: tonic::Request>, + ) -> Result, tonic::Status> { + Err(tonic::Status::unimplemented("not implemented in test")) + } } fn build_ca() -> (Certificate, KeyPair) { diff --git a/crates/openshell-cli/tests/provider_commands_integration.rs b/crates/openshell-cli/tests/provider_commands_integration.rs index 9bda696c1..f9332142e 100644 --- a/crates/openshell-cli/tests/provider_commands_integration.rs +++ b/crates/openshell-cli/tests/provider_commands_integration.rs @@ -409,6 +409,17 @@ impl OpenShell for TestOpenShell { ) -> Result, tonic::Status> { 
Err(tonic::Status::unimplemented("not implemented in test")) } + + type ForwardTcpStream = tokio_stream::wrappers::ReceiverStream< + Result<TcpForwardFrame, tonic::Status>, + >; + + async fn forward_tcp( + &self, + _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>, + ) -> Result<tonic::Response<Self::ForwardTcpStream>, tonic::Status> { + Err(tonic::Status::unimplemented("not implemented in test")) + } } fn install_rustls_provider() { diff --git a/crates/openshell-cli/tests/sandbox_create_lifecycle_integration.rs b/crates/openshell-cli/tests/sandbox_create_lifecycle_integration.rs index 79d482fdb..9ffd36834 100644 --- a/crates/openshell-cli/tests/sandbox_create_lifecycle_integration.rs +++ b/crates/openshell-cli/tests/sandbox_create_lifecycle_integration.rs @@ -198,7 +198,6 @@ impl OpenShell for TestOpenShell { gateway_scheme: "https".to_string(), gateway_host: "localhost".to_string(), gateway_port: 443, - connect_path: "/connect/ssh".to_string(), ..CreateSshSessionResponse::default() })) } @@ -433,6 +432,17 @@ impl OpenShell for TestOpenShell { ) -> Result, tonic::Status> { Err(tonic::Status::unimplemented("not implemented in test")) } + + type ForwardTcpStream = tokio_stream::wrappers::ReceiverStream< + Result<TcpForwardFrame, tonic::Status>, + >; + + async fn forward_tcp( + &self, + _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>, + ) -> Result<tonic::Response<Self::ForwardTcpStream>, tonic::Status> { + Err(tonic::Status::unimplemented("not implemented in test")) + } } fn install_rustls_provider() { @@ -732,6 +742,9 @@ async fn sandbox_create_keeps_sandbox_with_forwarding() { let _env = test_env(&fake_ssh_dir, &xdg_dir); let tls = test_tls(&server); install_fake_ssh(&fake_ssh_dir); + let listener = TcpListener::bind("127.0.0.1:0").await.unwrap(); + let forward_port = listener.local_addr().unwrap().port(); + drop(listener); run::sandbox_create( &server.endpoint, @@ -746,7 +759,7 @@ None, &[], None, - Some(openshell_core::forward::ForwardSpec::new(8080)), + Some(openshell_core::forward::ForwardSpec::new(forward_port)), &["echo".to_string(), "OK".to_string()], Some(false), Some(false), diff
--git a/crates/openshell-cli/tests/sandbox_name_fallback_integration.rs b/crates/openshell-cli/tests/sandbox_name_fallback_integration.rs index 5c463de9e..9e9bb26b1 100644 --- a/crates/openshell-cli/tests/sandbox_name_fallback_integration.rs +++ b/crates/openshell-cli/tests/sandbox_name_fallback_integration.rs @@ -370,6 +370,17 @@ impl OpenShell for TestOpenShell { ) -> Result, tonic::Status> { Err(tonic::Status::unimplemented("not implemented in test")) } + + type ForwardTcpStream = tokio_stream::wrappers::ReceiverStream< + Result<TcpForwardFrame, tonic::Status>, + >; + + async fn forward_tcp( + &self, + _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>, + ) -> Result<tonic::Response<Self::ForwardTcpStream>, tonic::Status> { + Err(tonic::Status::unimplemented("not implemented in test")) + } } // ── helpers ─────────────────────────────────────────────────────────── diff --git a/crates/openshell-core/src/config.rs b/crates/openshell-core/src/config.rs index 40a87fc41..a08f9bed4 100644 --- a/crates/openshell-core/src/config.rs +++ b/crates/openshell-core/src/config.rs @@ -151,10 +151,6 @@ pub struct Config { #[serde(default = "default_ssh_gateway_port")] pub ssh_gateway_port: u16, - /// Path for SSH CONNECT/upgrade requests. - #[serde(default = "default_ssh_connect_path")] - pub ssh_connect_path: String, - /// SSH listen port inside sandbox containers that expose a TCP endpoint. #[serde(default = "default_sandbox_ssh_port")] pub sandbox_ssh_port: u16, @@ -240,7 +236,6 @@ impl Config { grpc_endpoint: String::new(), ssh_gateway_host: default_ssh_gateway_host(), ssh_gateway_port: default_ssh_gateway_port(), - ssh_connect_path: default_ssh_connect_path(), sandbox_ssh_port: default_sandbox_ssh_port(), sandbox_ssh_socket_path: default_sandbox_ssh_socket_path(), ssh_handshake_secret: String::new(), @@ -336,13 +331,6 @@ impl Config { self } - /// Create a new configuration with the SSH connect path.
- #[must_use] - pub fn with_ssh_connect_path(mut self, path: impl Into<String>) -> Self { - self.ssh_connect_path = path.into(); - self - } - /// Create a new configuration with the sandbox SSH port. #[must_use] pub const fn with_sandbox_ssh_port(mut self, port: u16) -> Self { @@ -414,10 +402,6 @@ const fn default_ssh_gateway_port() -> u16 { DEFAULT_SERVER_PORT } -fn default_ssh_connect_path() -> String { - "/connect/ssh".to_string() -} - fn default_sandbox_ssh_socket_path() -> String { "/run/openshell/ssh.sock".to_string() } diff --git a/crates/openshell-core/src/forward.rs b/crates/openshell-core/src/forward.rs index e6c7d2f7c..655184465 100644 --- a/crates/openshell-core/src/forward.rs +++ b/crates/openshell-core/src/forward.rs @@ -470,6 +470,20 @@ pub fn resolve_ssh_gateway( (gateway_host.to_string(), gateway_port) } +/// Format a gateway URL, bracketing IPv6 literals when needed. +pub fn format_gateway_url(scheme: &str, host: &str, port: u16) -> String { + let host = if host + .parse::<std::net::IpAddr>() + .is_ok_and(|ip| ip.is_ipv6()) + && !host.starts_with('[') + { + format!("[{host}]") + } else { + host.to_string() + }; + format!("{scheme}://{host}:{port}") +} + /// Shell-escape a value for use inside a `ProxyCommand` string. pub fn shell_escape(value: &str) -> String { if value.is_empty() { @@ -526,14 +540,11 @@ pub enum SshSessionResponseError { InvalidScheme, #[error("gateway_port must be in range 1..=65535")] InvalidPort, - #[error("connect_path must start with '/'")] - ConnectPathNotAbsolute, } const MAX_SANDBOX_ID_LEN: usize = 128; const MAX_TOKEN_LEN: usize = 4096; const MAX_GATEWAY_HOST_LEN: usize = 253; -const MAX_CONNECT_PATH_LEN: usize = 2048; const MAX_FINGERPRINT_LEN: usize = 256; fn is_sandbox_id_byte(b: u8) -> bool { @@ -552,33 +563,6 @@ fn is_gateway_host_byte(b: u8) -> bool { b.is_ascii_alphanumeric() || matches!(b, b'.'
| b'-' | b':' | b'[' | b']') } -fn is_connect_path_byte(b: u8) -> bool { - // RFC 3986 path charset (pchar) without `?`, `#`, space, backtick, or - // backslash. `%` is permitted so percent-encoded segments round-trip. - b.is_ascii_alphanumeric() - || matches!( - b, - b'-' | b'.' - | b'_' - | b'~' - | b'!' - | b'$' - | b'&' - | b'\'' - | b'(' - | b')' - | b'*' - | b'+' - | b',' - | b';' - | b'=' - | b':' - | b'@' - | b'/' - | b'%' - ) -} - fn is_fingerprint_byte(b: u8) -> bool { b.is_ascii_alphanumeric() || matches!(b, b':' | b'+' | b'/' | b'=' | b'-') } @@ -613,25 +597,6 @@ pub fn validate_ssh_session_response( if resp.gateway_port == 0 || resp.gateway_port > u32::from(u16::MAX) { return Err(SshSessionResponseError::InvalidPort); } - if resp.connect_path.is_empty() { - return Err(SshSessionResponseError::Empty { - field: "connect_path", - }); - } - if !resp.connect_path.starts_with('/') { - return Err(SshSessionResponseError::ConnectPathNotAbsolute); - } - if resp.connect_path.len() > MAX_CONNECT_PATH_LEN { - return Err(SshSessionResponseError::TooLong { - field: "connect_path", - max: MAX_CONNECT_PATH_LEN, - }); - } - if !resp.connect_path.bytes().all(is_connect_path_byte) { - return Err(SshSessionResponseError::InvalidChars { - field: "connect_path", - }); - } if !resp.host_key_fingerprint.is_empty() { if resp.host_key_fingerprint.len() > MAX_FINGERPRINT_LEN { return Err(SshSessionResponseError::TooLong { @@ -736,6 +701,26 @@ mod tests { assert_eq!(port, 8080); } + #[test] + fn format_gateway_url_brackets_ipv6_literals() { + assert_eq!( + format_gateway_url("https", "::1", 8080), + "https://[::1]:8080" + ); + } + + #[test] + fn format_gateway_url_leaves_dns_and_bracketed_ipv6_unchanged() { + assert_eq!( + format_gateway_url("https", "gateway.example.com", 443), + "https://gateway.example.com:443" + ); + assert_eq!( + format_gateway_url("https", "[::1]", 8080), + "https://[::1]:8080" + ); + } + #[test] fn shell_escape_empty() { assert_eq!(shell_escape(""), "''"); 
@@ -758,7 +743,6 @@ mod tests { gateway_scheme: "https".to_string(), gateway_host: "gateway.example.com".to_string(), gateway_port: 443, - connect_path: "/connect/ssh".to_string(), host_key_fingerprint: String::new(), expires_at_ms: 0, } @@ -858,33 +842,6 @@ mod tests { } } - #[test] - fn validate_ssh_session_response_rejects_connect_path_without_leading_slash() { - let mut r = valid_session_response(); - r.connect_path = "connect/ssh".to_string(); - assert!(matches!( - validate_ssh_session_response(&r), - Err(SshSessionResponseError::ConnectPathNotAbsolute) - )); - } - - #[test] - fn validate_ssh_session_response_rejects_injected_connect_path() { - // `$`, `(`, `)` are valid RFC 3986 sub-delims (pchar) so the validator - // permits them; shell_escape is the second defensive layer. The - // following characters are rejected at the validator boundary because - // they are either unambiguously hostile in a shell context or invalid - // per RFC 3986 in the path component. - for bad in ["/x`id`y", "/x y", "/x\nb", "/x\\b", "/x?q=1", "/x#frag"] { - let mut r = valid_session_response(); - r.connect_path = bad.to_string(); - assert!( - validate_ssh_session_response(&r).is_err(), - "expected reject for connect_path={bad:?}" - ); - } - } - #[test] fn build_proxy_command_escapes_shell_metacharacters() { // Attacker-controlled values in every escapable position. diff --git a/crates/openshell-ocsf/src/format/shorthand.rs b/crates/openshell-ocsf/src/format/shorthand.rs index 7e2296de9..85424cb67 100644 --- a/crates/openshell-ocsf/src/format/shorthand.rs +++ b/crates/openshell-ocsf/src/format/shorthand.rs @@ -62,6 +62,7 @@ pub fn severity_tag(severity_id: u8) -> &'static str { /// Max length for the reason text in `[reason:...]` before truncation. const MAX_REASON_LEN: usize = 80; +const MAX_MESSAGE_LEN: usize = 120; /// Format a `[reason:...]` tag from `status_detail` (or `message` fallback) /// for denied events. Returns an empty string if neither field is set. 
@@ -81,6 +82,19 @@ fn reason_tag(base: &BaseEventData) -> String { } } +fn message_tag(base: &BaseEventData) -> String { + let text = base.message.as_deref().unwrap_or(""); + if text.is_empty() { + return String::new(); + } + let text = text.replace(['\n', '\r'], " "); + if text.len() > MAX_MESSAGE_LEN { + format!(" [msg:{}...]", &text[..MAX_MESSAGE_LEN]) + } else { + format!(" [msg:{text}]") + } +} + impl OcsfEvent { /// Produce the single-line shorthand for `openshell.log` and gRPC log push. /// @@ -141,7 +155,13 @@ impl OcsfEvent { (false, true) => format!(" {action}"), (false, false) => format!(" {action}{arrow}"), }; - format!("NET:{activity} {sev}{detail}{rule_ctx}{reason_ctx}") + let message_ctx = + if detail.is_empty() && rule_ctx.is_empty() && reason_ctx.is_empty() { + message_tag(&e.base) + } else { + String::new() + }; + format!("NET:{activity} {sev}{detail}{rule_ctx}{reason_ctx}{message_ctx}") } Self::HttpActivity(e) => { @@ -542,6 +562,33 @@ mod tests { ); } + #[test] + fn test_network_activity_shorthand_shows_message_when_no_key_fields() { + let event = OcsfEvent::NetworkActivity(NetworkActivityEvent { + base: { + let mut b = base(4001, "Network Activity", 4, "Network Activity", 1, "Open"); + b.set_message("relay open (channel_id=ch-42)"); + b + }, + src_endpoint: None, + dst_endpoint: None, + proxy_endpoint: None, + actor: None, + firewall_rule: None, + connection_info: None, + action: None, + disposition: None, + observation_point_id: None, + is_src_dst_assignment_known: None, + }); + + let shorthand = event.format_shorthand(); + assert_eq!( + shorthand, + "NET:OPEN [INFO] [msg:relay open (channel_id=ch-42)]" + ); + } + #[test] fn test_http_activity_shorthand_denied_shows_reason() { let mut b = base(4002, "HTTP Activity", 4, "Network Activity", 99, "Other"); diff --git a/crates/openshell-sandbox/src/lib.rs b/crates/openshell-sandbox/src/lib.rs index 34ee80bb5..e6092c537 100644 --- a/crates/openshell-sandbox/src/lib.rs +++ 
b/crates/openshell-sandbox/src/lib.rs @@ -685,7 +685,7 @@ pub async fn run_sandbox( sandbox_id.as_ref(), ssh_socket_path.as_ref(), ) { - supervisor_session::spawn(endpoint.clone(), id.clone(), socket.clone()); + supervisor_session::spawn(endpoint.clone(), id.clone(), socket.clone(), ssh_netns_fd); info!("supervisor session task spawned"); } diff --git a/crates/openshell-sandbox/src/sandbox/linux/netns.rs b/crates/openshell-sandbox/src/sandbox/linux/netns.rs index bbd02255f..0ac8b88b0 100644 --- a/crates/openshell-sandbox/src/sandbox/linux/netns.rs +++ b/crates/openshell-sandbox/src/sandbox/linux/netns.rs @@ -11,7 +11,7 @@ use miette::{IntoDiagnostic, Result}; use std::net::IpAddr; use std::os::unix::io::RawFd; use std::process::Command; -use tracing::{debug, info, warn}; +use tracing::{debug, warn}; use uuid::Uuid; /// Default subnet for sandbox networking. diff --git a/crates/openshell-sandbox/src/supervisor_session.rs b/crates/openshell-sandbox/src/supervisor_session.rs index 490a0cba7..49c52f9c2 100644 --- a/crates/openshell-sandbox/src/supervisor_session.rs +++ b/crates/openshell-sandbox/src/supervisor_session.rs @@ -4,24 +4,28 @@ //! Persistent supervisor-to-gateway session. //! //! Maintains a long-lived `ConnectSupervisor` bidirectional gRPC stream to the -//! gateway. When the gateway sends `RelayOpen`, the supervisor initiates a -//! `RelayStream` gRPC call (a new HTTP/2 stream multiplexed over the same -//! TCP+TLS connection as the control stream) and bridges it to the local SSH -//! daemon. The supervisor is a dumb byte bridge — it has no protocol awareness -//! of the SSH or NSSH1 bytes flowing through. - +//! gateway. When the gateway sends `RelayOpen`, the supervisor dials the +//! requested local target, initiates a `RelayStream` gRPC call (a new HTTP/2 +//! stream multiplexed over the same TCP+TLS connection as the control stream), +//! and bridges bytes. The supervisor is a dumb byte bridge after target +//! 
selection — it has no protocol awareness of the bytes flowing through. + +use std::net::IpAddr; +#[cfg(target_os = "linux")] +use std::os::fd::RawFd; use std::time::Duration; use openshell_core::proto::open_shell_client::OpenShellClient; use openshell_core::proto::{ - GatewayMessage, RelayFrame, RelayInit, SupervisorHeartbeat, SupervisorHello, SupervisorMessage, - gateway_message, supervisor_message, + GatewayMessage, RelayFrame, RelayInit, RelayOpen, RelayOpenResult, SupervisorHeartbeat, + SupervisorHello, SupervisorMessage, TcpRelayTarget, gateway_message, relay_open, + supervisor_message, }; use openshell_ocsf::{ - ActivityId, Endpoint, NetworkActivityBuilder, OcsfEvent, SandboxContext, SeverityId, StatusId, - ocsf_emit, + ActivityId, ConnectionInfo, Endpoint, NetworkActivityBuilder, OcsfEvent, SandboxContext, + SeverityId, StatusId, ocsf_emit, }; -use tokio::io::{AsyncReadExt, AsyncWriteExt}; +use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt}; use tokio::sync::mpsc; use tokio_stream::StreamExt; use tonic::transport::Channel; @@ -91,33 +95,103 @@ fn session_failed_event( .build() } -fn relay_open_event(ctx: &SandboxContext, channel_id: &str) -> OcsfEvent { - NetworkActivityBuilder::new(ctx) +fn relay_target_endpoint(open: &RelayOpen) -> Option<Endpoint> { + let relay_open::Target::Tcp(target) = open.target.as_ref()?
else { + return None; + }; + let host = target.host.trim(); + let port = u16::try_from(target.port).ok()?; + if let Ok(ip) = host.parse() { + Some(Endpoint::from_ip(ip, port)) + } else { + Some(Endpoint::from_domain(host, port)) + } +} + +fn relay_target_kind(open: &RelayOpen) -> &'static str { + match open.target.as_ref() { + Some(relay_open::Target::Tcp(_)) => "tcp relay", + Some(relay_open::Target::Ssh(_)) | None => "ssh relay", + } +} + +fn relay_target_message( + open: &RelayOpen, + state: &str, + ssh_socket_path: &std::path::Path, +) -> String { + let target = match open.target.as_ref() { + Some(relay_open::Target::Tcp(target)) => { + format!("{}:{}", target.host.trim(), target.port) + } + Some(relay_open::Target::Ssh(_)) | None => { + format!("unix:{}", ssh_socket_path.display()) + } + }; + + format!( + "{} {state} (channel_id={}, target={target})", + relay_target_kind(open), + open.channel_id + ) +} + +fn relay_open_event( + ctx: &SandboxContext, + open: &RelayOpen, + ssh_socket_path: &std::path::Path, +) -> OcsfEvent { + let mut builder = NetworkActivityBuilder::new(ctx) .activity(ActivityId::Open) .severity(SeverityId::Informational) .status(StatusId::Success) - .message(format!("relay open (channel_id={channel_id})")) - .build() + .message(relay_target_message(open, "open", ssh_socket_path)); + if let Some(endpoint) = relay_target_endpoint(open) { + builder = builder + .dst_endpoint(endpoint) + .connection_info(ConnectionInfo::new("tcp")); + } + builder.build() } -fn relay_closed_event(ctx: &SandboxContext, channel_id: &str) -> OcsfEvent { - NetworkActivityBuilder::new(ctx) +fn relay_closed_event( + ctx: &SandboxContext, + open: &RelayOpen, + ssh_socket_path: &std::path::Path, +) -> OcsfEvent { + let mut builder = NetworkActivityBuilder::new(ctx) .activity(ActivityId::Close) .severity(SeverityId::Informational) .status(StatusId::Success) - .message(format!("relay closed (channel_id={channel_id})")) - .build() + .message(relay_target_message(open, 
"closed", ssh_socket_path)); + if let Some(endpoint) = relay_target_endpoint(open) { + builder = builder + .dst_endpoint(endpoint) + .connection_info(ConnectionInfo::new("tcp")); + } + builder.build() } -fn relay_failed_event(ctx: &SandboxContext, channel_id: &str, error: &str) -> OcsfEvent { - NetworkActivityBuilder::new(ctx) +fn relay_failed_event( + ctx: &SandboxContext, + open: &RelayOpen, + ssh_socket_path: &std::path::Path, + error: &str, +) -> OcsfEvent { + let mut builder = NetworkActivityBuilder::new(ctx) .activity(ActivityId::Fail) .severity(SeverityId::Low) .status(StatusId::Failure) .message(format!( - "relay bridge failed (channel_id={channel_id}): {error}" - )) - .build() + "{}: {error}", + relay_target_message(open, "bridge failed", ssh_socket_path) + )); + if let Some(endpoint) = relay_target_endpoint(open) { + builder = builder + .dst_endpoint(endpoint) + .connection_info(ConnectionInfo::new("tcp")); + } + builder.build() } fn relay_close_from_gateway_event( @@ -139,6 +213,10 @@ fn relay_close_from_gateway_event( /// HTTP/2 frame size so each `RelayFrame::data` fits in one frame. 
const RELAY_CHUNK_SIZE: usize = 16 * 1024; +trait TargetStream: AsyncRead + AsyncWrite + Send + Unpin {} + +impl<T> TargetStream for T where T: AsyncRead + AsyncWrite + Send + Unpin {} + fn map_stream_message( message: Result, tonic::Status>, eof_error: &'static str, @@ -158,14 +236,21 @@ pub fn spawn( endpoint: String, sandbox_id: String, ssh_socket_path: std::path::PathBuf, + netns_fd: Option<RawFd>, ) -> tokio::task::JoinHandle<()> { - tokio::spawn(run_session_loop(endpoint, sandbox_id, ssh_socket_path)) + tokio::spawn(run_session_loop( + endpoint, + sandbox_id, + ssh_socket_path, + netns_fd, + )) } async fn run_session_loop( endpoint: String, sandbox_id: String, ssh_socket_path: std::path::PathBuf, + netns_fd: Option<RawFd>, ) { let mut backoff = INITIAL_BACKOFF; let mut attempt: u64 = 0; @@ -173,7 +258,7 @@ loop { attempt += 1; - match run_single_session(&endpoint, &sandbox_id, &ssh_socket_path).await { + match run_single_session(&endpoint, &sandbox_id, &ssh_socket_path, netns_fd).await { Ok(()) => { let event = session_closed_event(crate::ocsf_ctx(), &endpoint, &sandbox_id); ocsf_emit!(event); @@ -194,6 +279,7 @@ async fn run_single_session( endpoint: &str, sandbox_id: &str, ssh_socket_path: &std::path::Path, + netns_fd: Option<RawFd>, ) -> Result<(), Box> { // Connect to the gateway.
The same `Channel` is used for both the // long-lived control stream and all data-plane `RelayStream` calls, so @@ -262,7 +348,9 @@ async fn run_single_session( &msg, sandbox_id, ssh_socket_path, + netns_fd, &channel, + &tx, ); } _ = heartbeat_interval.tick() => { @@ -283,7 +371,9 @@ fn handle_gateway_message( msg: &GatewayMessage, sandbox_id: &str, ssh_socket_path: &std::path::Path, + netns_fd: Option<RawFd>, channel: &Channel, + tx: &mpsc::Sender<SupervisorMessage>, ) { match &msg.payload { Some(gateway_message::Payload::Heartbeat(_)) => { @@ -291,22 +381,30 @@ } Some(gateway_message::Payload::RelayOpen(open)) => { let channel_id = open.channel_id.clone(); + let relay_open = open.clone(); let sandbox_id = sandbox_id.to_string(); let channel = channel.clone(); let ssh_socket_path = ssh_socket_path.to_path_buf(); + let tx = tx.clone(); - let event = relay_open_event(crate::ocsf_ctx(), &channel_id); + let event = relay_open_event(crate::ocsf_ctx(), &relay_open, &ssh_socket_path); ocsf_emit!(event); tokio::spawn(async move { + let event_open = relay_open.clone(); - match handle_relay_open(&channel_id, &ssh_socket_path, channel).await { + match handle_relay_open(relay_open, &ssh_socket_path, netns_fd, channel, tx).await { Ok(()) => { - let event = relay_closed_event(crate::ocsf_ctx(), &channel_id); + let event = + relay_closed_event(crate::ocsf_ctx(), &event_open, &ssh_socket_path); ocsf_emit!(event); } Err(e) => { - let event = - relay_failed_event(crate::ocsf_ctx(), &channel_id, &e.to_string()); + let event = relay_failed_event( + crate::ocsf_ctx(), + &event_open, + &ssh_socket_path, + &e.to_string(), + ); ocsf_emit!(event); warn!( sandbox_id = %sandbox_id, @@ -336,10 +434,23 @@ fn handle_gateway_message( /// TLS handshake. The first `RelayFrame` we send is a `RelayInit`; subsequent /// frames carry raw SSH bytes in `data`.
async fn handle_relay_open( - channel_id: &str, + relay_open: RelayOpen, ssh_socket_path: &std::path::Path, + netns_fd: Option<RawFd>, channel: Channel, + tx: mpsc::Sender<SupervisorMessage>, ) -> Result<(), Box> { + let channel_id = relay_open.channel_id.clone(); + let target = match open_target(&relay_open, ssh_socket_path, netns_fd).await { + Ok(target) => target, + Err(err) => { + send_relay_open_result(&tx, &channel_id, false, err.to_string()).await; + return Err(err); + } + }; + + send_relay_open_result(&tx, &channel_id, true, String::new()).await; + let mut client = OpenShellClient::new(channel); // Outbound chunks to the gateway. @@ -351,7 +462,7 @@ async fn handle_relay_open( .send(RelayFrame { payload: Some(openshell_core::proto::relay_frame::Payload::Init( RelayInit { - channel_id: channel_id.to_string(), + channel_id: channel_id.clone(), }, )), }) @@ -366,21 +477,19 @@ let mut inbound = response.into_inner(); // Connect to the local SSH daemon on its Unix socket. - let ssh = tokio::net::UnixStream::connect(ssh_socket_path).await?; - let (mut ssh_r, mut ssh_w) = ssh.into_split(); + let (mut target_r, mut target_w) = tokio::io::split(target); debug!( channel_id = %channel_id, - socket = %ssh_socket_path.display(), - "relay bridge: connected to local SSH daemon" + "relay bridge: connected to local target" ); - // SSH → gRPC (out_tx): read local SSH, forward as `RelayFrame::data`. + // Target → gRPC (out_tx): read local target, forward as `RelayFrame::data`. let out_tx_writer = out_tx.clone(); - let ssh_to_grpc = tokio::spawn(async move { + let target_to_grpc = tokio::spawn(async move { let mut buf = vec![0u8; RELAY_CHUNK_SIZE]; loop { - match ssh_r.read(&mut buf).await { + match target_r.read(&mut buf).await { Ok(0) | Err(_) => break, Ok(n) => { let chunk = RelayFrame { @@ -396,7 +505,7 @@ } }); - // gRPC (inbound) → SSH: drain inbound chunks into the local SSH socket.
+ // gRPC (inbound) → target: drain inbound chunks into the local target socket. let mut inbound_err: Option<String> = None; while let Some(next) = inbound.next().await { match next { @@ -409,8 +518,8 @@ if data.is_empty() { continue; } - if let Err(e) = ssh_w.write_all(&data).await { - inbound_err = Some(format!("write to ssh failed: {e}")); + if let Err(e) = target_w.write_all(&data).await { + inbound_err = Some(format!("write to target failed: {e}")); break; } } @@ -421,13 +530,13 @@ } } - // Half-close the SSH socket's write side so the daemon sees EOF. - let _ = ssh_w.shutdown().await; + // Half-close the target socket's write side so the service sees EOF. + let _ = target_w.shutdown().await; // Dropping out_tx closes the outbound gRPC stream, letting the gateway // observe EOF on its side too. drop(out_tx); - let _ = ssh_to_grpc.await; + let _ = target_to_grpc.await; if let Some(e) = inbound_err { return Err(e.into()); @@ -435,6 +544,165 @@ Ok(()) } +async fn send_relay_open_result( + tx: &mpsc::Sender<SupervisorMessage>, + channel_id: &str, + success: bool, + error: String, +) { + let _ = tx + .send(SupervisorMessage { + payload: Some(supervisor_message::Payload::RelayOpenResult( + RelayOpenResult { + channel_id: channel_id.to_string(), + success, + error, + }, + )), + }) + .await; +} + +async fn open_target( + relay_open: &RelayOpen, + ssh_socket_path: &std::path::Path, + netns_fd: Option<RawFd>, +) -> Result<Box<dyn TargetStream>, Box<dyn std::error::Error + Send + Sync>> { + match relay_open.target.as_ref() { + Some(relay_open::Target::Tcp(target)) => open_tcp_target(target, netns_fd).await, + Some(relay_open::Target::Ssh(_)) | None => { + let stream = tokio::net::UnixStream::connect(ssh_socket_path).await?; + Ok(Box::new(stream)) + } + } +} + +async fn open_tcp_target( + target: &TcpRelayTarget, + netns_fd: Option<RawFd>, +) -> Result<Box<dyn TargetStream>, Box<dyn std::error::Error + Send + Sync>> { + let host = normalize_tcp_target_host(target)?; + let port = u16::try_from(target.port).map_err(|_| "tcp target port must fit in
u16")?; + let stream = connect_tcp_target(host, port, netns_fd).await?; + Ok(Box::new(stream)) +} + +#[cfg(target_os = "linux")] +async fn connect_tcp_target( + host: String, + port: u16, + netns_fd: Option<RawFd>, +) -> Result<tokio::net::TcpStream, Box<dyn std::error::Error + Send + Sync>> { + if let Some(fd) = netns_fd { + let (tx, rx) = tokio::sync::oneshot::channel(); + std::thread::spawn(move || { + let result = (|| -> std::io::Result<std::net::TcpStream> { + #[allow(unsafe_code)] + let rc = unsafe { libc::setns(fd, libc::CLONE_NEWNET) }; + if rc != 0 { + return Err(std::io::Error::last_os_error()); + } + std::net::TcpStream::connect((host.as_str(), port)) + })(); + let _ = tx.send(result); + }); + + let stream = rx + .await + .map_err(|_| "netns tcp connect thread panicked")??; + stream.set_nonblocking(true)?; + return Ok(tokio::net::TcpStream::from_std(stream)?); + } + + Ok(tokio::net::TcpStream::connect((host.as_str(), port)).await?) +} + +#[cfg(not(target_os = "linux"))] +async fn connect_tcp_target( + host: String, + port: u16, + _netns_fd: Option<RawFd>, +) -> Result<tokio::net::TcpStream, Box<dyn std::error::Error + Send + Sync>> { + Ok(tokio::net::TcpStream::connect((host.as_str(), port)).await?)
+} + +#[cfg(test)] +fn validate_tcp_target(target: &TcpRelayTarget) -> Result<(), String> { + normalize_tcp_target_host(target).map(|_| ()) +} + +fn normalize_tcp_target_host(target: &TcpRelayTarget) -> Result<String, String> { + if target.port == 0 || target.port > u32::from(u16::MAX) { + return Err("tcp target port must be between 1 and 65535".to_string()); + } + + let host = target.host.trim(); + if host.is_empty() { + return Err("tcp target host is required".to_string()); + } + if host.eq_ignore_ascii_case("localhost") { + return Ok("127.0.0.1".to_string()); + } + + let ip: IpAddr = host + .parse() + .map_err(|_| "tcp target host must be loopback".to_string())?; + if ip.is_loopback() { + Ok(ip.to_string()) + } else { + Err("tcp target host must be loopback".to_string()) + } +} + +#[cfg(test)] +mod target_tests { + use super::*; + + fn tcp(host: &str, port: u32) -> TcpRelayTarget { + TcpRelayTarget { + host: host.to_string(), + port, + } + } + + #[test] + fn tcp_target_allows_loopback_hosts() { + validate_tcp_target(&tcp("127.0.0.1", 8080)).expect("ipv4 loopback"); + validate_tcp_target(&tcp("::1", 8080)).expect("ipv6 loopback"); + validate_tcp_target(&tcp("localhost", 8080)).expect("localhost"); + } + + #[test] + fn tcp_target_normalizes_localhost_before_dialing() { + assert_eq!( + normalize_tcp_target_host(&tcp("localhost", 8080)).expect("localhost"), + "127.0.0.1" + ); + assert_eq!( + normalize_tcp_target_host(&tcp("LOCALHOST", 8080)).expect("localhost"), + "127.0.0.1" + ); + } + + #[test] + fn tcp_target_rejects_non_loopback_hosts() { + let err = validate_tcp_target(&tcp("10.0.0.1", 8080)).expect_err("private ip rejected"); + assert_eq!(err, "tcp target host must be loopback"); + + let err = validate_tcp_target(&tcp("example.com", 8080)).expect_err("hostname rejected"); + assert_eq!(err, "tcp target host must be loopback"); + } + + #[test] + fn tcp_target_rejects_invalid_ports() { + let err = validate_tcp_target(&tcp("127.0.0.1", 0)).expect_err("zero rejected");
assert_eq!(err, "tcp target port must be between 1 and 65535"); + + let err = validate_tcp_target(&tcp("127.0.0.1", 70000)).expect_err("too large rejected"); + assert_eq!(err, "tcp target port must be between 1 and 65535"); + } +} + #[cfg(test)] mod ocsf_event_tests { use super::*; @@ -479,6 +747,29 @@ mod ocsf_event_tests { } } + fn ssh_relay_open(channel_id: &str) -> RelayOpen { + RelayOpen { + channel_id: channel_id.to_string(), + target: Some(relay_open::Target::Ssh(Default::default())), + service_id: String::new(), + } + } + + fn tcp_relay_open(channel_id: &str, host: &str, port: u32) -> RelayOpen { + RelayOpen { + channel_id: channel_id.to_string(), + target: Some(relay_open::Target::Tcp(TcpRelayTarget { + host: host.to_string(), + port, + })), + service_id: String::new(), + } + } + + fn ssh_socket_path() -> &'static std::path::Path { + std::path::Path::new("/run/openshell/ssh.sock") + } + #[test] fn session_established_emits_network_open_success() { let event = session_established_event(&ctx(), "https://gw:443", "sess-1", 30); @@ -518,22 +809,43 @@ mod ocsf_event_tests { #[test] fn relay_open_emits_network_open_success() { - let event = relay_open_event(&ctx(), "ch-42"); + let event = relay_open_event(&ctx(), &ssh_relay_open("ch-42"), ssh_socket_path()); let na = network_activity(&event); assert_eq!(na.base.activity_id, ActivityId::Open.as_u8()); assert_eq!(na.base.severity, SeverityId::Informational); + let msg = na.base.message.as_deref().unwrap_or_default(); + assert!(msg.contains("ch-42"), "message: {msg}"); assert!( - na.base - .message - .as_deref() - .unwrap_or_default() - .contains("ch-42") + msg.contains("target=unix:/run/openshell/ssh.sock"), + "message: {msg}" + ); + } + + #[test] + fn tcp_relay_open_emits_target_endpoint() { + let event = relay_open_event( + &ctx(), + &tcp_relay_open("ch-42", "127.0.0.1", 8765), + ssh_socket_path(), + ); + let na = network_activity(&event); + assert_eq!(na.base.activity_id, ActivityId::Open.as_u8()); + 
assert_eq!( + na.dst_endpoint.as_ref().and_then(|e| e.ip.as_deref()), + Some("127.0.0.1") + ); + assert_eq!(na.dst_endpoint.as_ref().and_then(|e| e.port), Some(8765)); + assert_eq!( + na.connection_info + .as_ref() + .map(|c| c.protocol_name.as_str()), + Some("tcp") ); } #[test] fn relay_closed_emits_network_close_success() { - let event = relay_closed_event(&ctx(), "ch-42"); + let event = relay_closed_event(&ctx(), &ssh_relay_open("ch-42"), ssh_socket_path()); let na = network_activity(&event); assert_eq!(na.base.activity_id, ActivityId::Close.as_u8()); assert_eq!(na.base.status, Some(StatusId::Success)); @@ -541,7 +853,12 @@ mod ocsf_event_tests { #[test] fn relay_failed_emits_network_fail_low() { - let event = relay_failed_event(&ctx(), "ch-42", "write to ssh failed"); + let event = relay_failed_event( + &ctx(), + &ssh_relay_open("ch-42"), + ssh_socket_path(), + "write to ssh failed", + ); let na = network_activity(&event); assert_eq!(na.base.activity_id, ActivityId::Fail.as_u8()); assert_eq!(na.base.severity, SeverityId::Low); diff --git a/crates/openshell-server/src/cli.rs b/crates/openshell-server/src/cli.rs index 2df0b06f4..8b40b93f8 100644 --- a/crates/openshell-server/src/cli.rs +++ b/crates/openshell-server/src/cli.rs @@ -96,14 +96,6 @@ struct Args { #[arg(long, env = "OPENSHELL_SSH_GATEWAY_PORT", default_value_t = DEFAULT_SERVER_PORT)] ssh_gateway_port: u16, - /// HTTP path for SSH CONNECT/upgrade. - #[arg( - long, - env = "OPENSHELL_SSH_CONNECT_PATH", - default_value = "/connect/ssh" - )] - ssh_connect_path: String, - /// SSH port inside sandbox pods. 
#[arg(long, env = "OPENSHELL_SANDBOX_SSH_PORT", default_value_t = DEFAULT_SSH_PORT)] sandbox_ssh_port: u16, @@ -303,7 +295,6 @@ async fn run_from_args(args: Args) -> Result<()> { .with_sandbox_namespace(args.sandbox_namespace) .with_ssh_gateway_host(args.ssh_gateway_host) .with_ssh_gateway_port(args.ssh_gateway_port) - .with_ssh_connect_path(args.ssh_connect_path) .with_sandbox_ssh_port(args.sandbox_ssh_port) .with_ssh_handshake_skew_secs(args.ssh_handshake_skew_secs); diff --git a/crates/openshell-server/src/grpc/mod.rs b/crates/openshell-server/src/grpc/mod.rs index 969204dad..df5aa6a01 100644 --- a/crates/openshell-server/src/grpc/mod.rs +++ b/crates/openshell-server/src/grpc/mod.rs @@ -25,8 +25,9 @@ use openshell_core::proto::{ RejectDraftChunkRequest, RejectDraftChunkResponse, RelayFrame, ReportPolicyStatusRequest, ReportPolicyStatusResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse, SandboxStreamEvent, ServiceStatus, SubmitPolicyAnalysisRequest, SubmitPolicyAnalysisResponse, - SupervisorMessage, UndoDraftChunkRequest, UndoDraftChunkResponse, UpdateConfigRequest, - UpdateConfigResponse, UpdateProviderRequest, WatchSandboxRequest, open_shell_server::OpenShell, + SupervisorMessage, TcpForwardFrame, UndoDraftChunkRequest, UndoDraftChunkResponse, + UpdateConfigRequest, UpdateConfigResponse, UpdateProviderRequest, WatchSandboxRequest, + open_shell_server::OpenShell, }; use serde::{Deserialize, Serialize}; use std::collections::BTreeMap; @@ -211,6 +212,16 @@ impl OpenShell for OpenShellService { sandbox::handle_exec_sandbox(&self.state, request).await } + type ForwardTcpStream = + Pin<Box<dyn Stream<Item = Result<TcpForwardFrame, Status>> + Send + 'static>>; + + async fn forward_tcp( + &self, + request: Request<Streaming<TcpForwardFrame>>, + ) -> Result<Response<Self::ForwardTcpStream>, Status> { + sandbox::handle_forward_tcp(&self.state, request).await + } + // --- SSH sessions --- async fn create_ssh_session( diff --git a/crates/openshell-server/src/grpc/sandbox.rs b/crates/openshell-server/src/grpc/sandbox.rs index 60bca2a65..e0dd3a8d4 100644 ---
a/crates/openshell-server/src/grpc/sandbox.rs +++ b/crates/openshell-server/src/grpc/sandbox.rs @@ -12,15 +12,19 @@ use crate::ServerState; use crate::persistence::{ObjectType, generate_name}; use futures::future; +use openshell_core::ObjectId; use openshell_core::proto::{ CreateSandboxRequest, CreateSshSessionRequest, CreateSshSessionResponse, DeleteSandboxRequest, DeleteSandboxResponse, ExecSandboxEvent, ExecSandboxExit, ExecSandboxRequest, ExecSandboxStderr, ExecSandboxStdout, GetSandboxRequest, ListSandboxesRequest, ListSandboxesResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse, - SandboxStreamEvent, WatchSandboxRequest, + SandboxStreamEvent, TcpForwardFrame, TcpForwardInit, TcpRelayTarget, WatchSandboxRequest, + relay_open, tcp_forward_init, }; use openshell_core::proto::{Sandbox, SandboxPhase, SandboxTemplate, SshSession}; use prost::Message; +use std::net::IpAddr; +use std::pin::Pin; use std::sync::Arc; use tokio::net::{TcpListener, TcpStream}; use tokio::sync::mpsc; @@ -38,6 +42,8 @@ use super::validation::{ }; use super::{MAX_PAGE_SIZE, clamp_limit, current_time_ms}; +const TCP_FORWARD_CHUNK_SIZE: usize = 64 * 1024; + // --------------------------------------------------------------------------- // Sandbox lifecycle handlers // --------------------------------------------------------------------------- @@ -467,9 +473,8 @@ pub(super) async fn handle_exec_sandbox( } // Open a relay channel through the supervisor session. Use a 15s - // session-wait timeout — enough to cover a transient supervisor - // reconnect, but shorter than `/connect/ssh` since `ExecSandbox` is - // typically called during normal operation (not right after create). + // session-wait timeout, enough to cover a transient supervisor reconnect + // while still failing quickly during normal operation. 
    let (channel_id, relay_rx) = state
        .supervisor_sessions
        .open_relay(sandbox.object_id(), std::time::Duration::from_secs(15))
@@ -491,7 +496,12 @@
        let relay_stream = match tokio::time::timeout(std::time::Duration::from_secs(10), relay_rx)
            .await
        {
-            Ok(Ok(stream)) => stream,
+            Ok(Ok(Ok(stream))) => stream,
+            Ok(Ok(Err(status))) => {
+                warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, error = %status.message(), "ExecSandbox: relay target open failed");
+                let _ = tx.send(Err(status)).await;
+                return;
+            }
            Ok(Err(_)) => {
                warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, "ExecSandbox: relay channel dropped");
                let _ = tx
@@ -528,6 +538,329 @@
    Ok(Response::new(ReceiverStream::new(rx)))
}

+pub(super) async fn handle_forward_tcp(
+    state: &Arc<ServerState>,
+    request: Request<tonic::Streaming<TcpForwardFrame>>,
+) -> Result<
+    Response<
+        Pin<Box<dyn futures::Stream<Item = Result<TcpForwardFrame, Status>> + Send + 'static>>,
+    >,
+    Status,
+> {
+    let mut inbound = request.into_inner();
+    let first = inbound
+        .message()
+        .await?
+        .ok_or_else(|| Status::invalid_argument("empty ForwardTcp stream"))?;
+    let init = match first.payload {
+        Some(openshell_core::proto::tcp_forward_frame::Payload::Init(init)) => init,
+        _ => {
+            return Err(Status::invalid_argument(
+                "first TcpForwardFrame must be init",
+            ));
+        }
+    };
+
+    let target = validate_tcp_forward_init(&init)?;
+
+    let sandbox = state
+        .store
+        .get_message::<Sandbox>(&init.sandbox_id)
+        .await
+        .map_err(|e| Status::internal(format!("fetch sandbox failed: {e}")))?
+        .ok_or_else(|| Status::not_found("sandbox not found"))?;
+
+    if SandboxPhase::try_from(sandbox.phase).ok() != Some(SandboxPhase::Ready) {
+        return Err(Status::failed_precondition("sandbox is not ready"));
+    }
+
+    let connection_guard = acquire_forward_connection_guard(state, &init, &sandbox).await?;
+    let (channel_id, relay_rx) = state
+        .supervisor_sessions
+        .open_relay_with_target(
+            sandbox.object_id(),
+            target,
+            init.service_id.clone(),
+            std::time::Duration::from_secs(15),
+        )
+        .await
+        .map_err(|e| Status::unavailable(format!("supervisor relay failed: {e}")))?;
+
+    let sandbox_id = sandbox.object_id().to_string();
+    let (tx, rx) = mpsc::channel::<Result<TcpForwardFrame, Status>>(256);
+    tokio::spawn(async move {
+        let _connection_guard = connection_guard;
+        let relay_stream = match tokio::time::timeout(std::time::Duration::from_secs(10), relay_rx)
+            .await
+        {
+            Ok(Ok(Ok(stream))) => stream,
+            Ok(Ok(Err(status))) => {
+                warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, error = %status.message(), "ForwardTcp: relay target open failed");
+                let _ = tx.send(Err(status)).await;
+                return;
+            }
+            Ok(Err(_)) => {
+                warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, "ForwardTcp: relay channel dropped");
+                let _ = tx
+                    .send(Err(Status::unavailable("relay channel dropped")))
+                    .await;
+                return;
+            }
+            Err(_) => {
+                warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, "ForwardTcp: relay open timed out");
+                let _ = tx
+                    .send(Err(Status::deadline_exceeded("relay open timed out")))
+                    .await;
+                return;
+            }
+        };
+
+        bridge_forward_tcp_stream(inbound, relay_stream, tx, &sandbox_id, &channel_id).await;
+    });
+
+    let stream: Pin<
+        Box<dyn futures::Stream<Item = Result<TcpForwardFrame, Status>> + Send + 'static>,
+    > = Box::pin(ReceiverStream::new(rx));
+    Ok(Response::new(stream))
+}
+
+struct ForwardConnectionGuard {
+    state: Arc<ServerState>,
+    token: Option<String>,
+    sandbox_id: String,
+}
+
+impl Drop for ForwardConnectionGuard {
+    fn drop(&mut self) {
+        if let Some(token) = self.token.as_deref() {
+
decrement_ssh_connection_count(&self.state.ssh_connections_by_token, token);
+            decrement_ssh_connection_count(
+                &self.state.ssh_connections_by_sandbox,
+                &self.sandbox_id,
+            );
+        }
+    }
+}
+
+async fn acquire_forward_connection_guard(
+    state: &Arc<ServerState>,
+    init: &TcpForwardInit,
+    sandbox: &Sandbox,
+) -> Result<ForwardConnectionGuard, Status> {
+    let sandbox_id = sandbox.object_id().to_string();
+    let token = init.authorization_token.trim();
+    if token.is_empty() {
+        return Err(Status::unauthenticated(
+            "authorization_token is required for ForwardTcp",
+        ));
+    }
+
+    validate_ssh_forward_token(state, token, &sandbox_id).await?;
+    acquire_ssh_connection_slots(
+        &state.ssh_connections_by_token,
+        &state.ssh_connections_by_sandbox,
+        token,
+        &sandbox_id,
+    )?;
+
+    Ok(ForwardConnectionGuard {
+        state: state.clone(),
+        token: Some(token.to_string()),
+        sandbox_id,
+    })
+}
+
+async fn validate_ssh_forward_token(
+    state: &Arc<ServerState>,
+    token: &str,
+    sandbox_id: &str,
+) -> Result<(), Status> {
+    let session = state
+        .store
+        .get_message::<SshSession>(token)
+        .await
+        .map_err(|e| Status::internal(format!("fetch SSH session failed: {e}")))?
+        .ok_or_else(|| Status::unauthenticated("SSH session token not found"))?;
+
+    if session.revoked || session.sandbox_id != sandbox_id {
+        return Err(Status::unauthenticated("SSH session token is not valid"));
+    }
+
+    if session.expires_at_ms > 0 {
+        let now_ms = current_time_ms()
+            .map_err(|e| Status::internal(format!("timestamp generation failed: {e}")))?;
+        if now_ms > session.expires_at_ms {
+            return Err(Status::unauthenticated("SSH session token expired"));
+        }
+    }
+
+    Ok(())
+}
+
+fn acquire_ssh_connection_slots(
+    token_counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
+    sandbox_counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
+    token: &str,
+    sandbox_id: &str,
+) -> Result<(), Status> {
+    const MAX_CONNECTIONS_PER_TOKEN: u32 = 3;
+    const MAX_CONNECTIONS_PER_SANDBOX: u32 = 20;
+
+    {
+        let mut counts = token_counts.lock().unwrap();
+        let count = counts.entry(token.to_string()).or_insert(0);
+        if *count >= MAX_CONNECTIONS_PER_TOKEN {
+            return Err(Status::resource_exhausted(
+                "SSH session connection limit reached",
+            ));
+        }
+        *count += 1;
+    }
+
+    {
+        let mut counts = sandbox_counts.lock().unwrap();
+        let count = counts.entry(sandbox_id.to_string()).or_insert(0);
+        if *count >= MAX_CONNECTIONS_PER_SANDBOX {
+            decrement_ssh_connection_count(token_counts, token);
+            return Err(Status::resource_exhausted(
+                "sandbox SSH connection limit reached",
+            ));
+        }
+        *count += 1;
+    }
+
+    Ok(())
+}
+
+fn decrement_ssh_connection_count(
+    counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
+    key: &str,
+) {
+    let mut counts = counts.lock().unwrap();
+    if let Some(count) = counts.get_mut(key) {
+        *count = count.saturating_sub(1);
+        if *count == 0 {
+            counts.remove(key);
+        }
+    }
+}
+
+fn validate_tcp_forward_init(init: &TcpForwardInit) -> Result<relay_open::Target, Status> {
+    if init.sandbox_id.is_empty() {
+        return Err(Status::invalid_argument("sandbox_id is required"));
+    }
+
+    if let Some(target) = init.target.as_ref() {
+        return match target {
+            tcp_forward_init::Target::Ssh(_) => Ok(relay_open::Target::Ssh(Default::default())),
+
tcp_forward_init::Target::Tcp(target) => Ok(relay_open::Target::Tcp(
+                validate_tcp_forward_target(target)?,
+            )),
+        };
+    }
+
+    Err(Status::invalid_argument("tcp forward target is required"))
+}
+
+fn validate_tcp_forward_target(target: &TcpRelayTarget) -> Result<TcpRelayTarget, Status> {
+    if target.port == 0 || target.port > u32::from(u16::MAX) {
+        return Err(Status::invalid_argument(
+            "tcp target port must be between 1 and 65535",
+        ));
+    }
+
+    validate_tcp_target_parts(target.host.trim(), target.port).map(|host| TcpRelayTarget {
+        host,
+        port: target.port,
+    })
+}
+
+fn validate_tcp_target_parts(host: &str, _port: u32) -> Result<String, Status> {
+    if host.is_empty() {
+        return Err(Status::invalid_argument("tcp target host is required"));
+    }
+    if host.eq_ignore_ascii_case("localhost") {
+        return Ok("127.0.0.1".to_string());
+    }
+
+    let ip: IpAddr = host
+        .parse()
+        .map_err(|_| Status::invalid_argument("tcp target host must be loopback"))?;
+    if ip.is_loopback() {
+        Ok(ip.to_string())
+    } else {
+        Err(Status::invalid_argument("tcp target host must be loopback"))
+    }
+}
+
+async fn bridge_forward_tcp_stream(
+    mut inbound: tonic::Streaming<TcpForwardFrame>,
+    relay_stream: tokio::io::DuplexStream,
+    tx: mpsc::Sender<Result<TcpForwardFrame, Status>>,
+    sandbox_id: &str,
+    channel_id: &str,
+) {
+    let (mut relay_read, mut relay_write) = tokio::io::split(relay_stream);
+
+    let sandbox_id_in = sandbox_id.to_string();
+    let channel_id_in = channel_id.to_string();
+    tokio::spawn(async move {
+        loop {
+            match inbound.message().await {
+                Ok(Some(frame)) => {
+                    let Some(openshell_core::proto::tcp_forward_frame::Payload::Data(data)) =
+                        frame.payload
+                    else {
+                        warn!(sandbox_id = %sandbox_id_in, channel_id = %channel_id_in, "ForwardTcp: received non-data frame after init");
+                        break;
+                    };
+                    if data.is_empty() {
+                        continue;
+                    }
+                    if let Err(err) =
+                        tokio::io::AsyncWriteExt::write_all(&mut relay_write, &data).await
+                    {
+                        warn!(sandbox_id = %sandbox_id_in, channel_id = %channel_id_in, error = %err, "ForwardTcp: write to relay failed");
+                        break;
+                    }
+                }
+
Ok(None) => break, + Err(err) => { + warn!(sandbox_id = %sandbox_id_in, channel_id = %channel_id_in, error = %err, "ForwardTcp: inbound stream failed"); + break; + } + } + } + let _ = tokio::io::AsyncWriteExt::shutdown(&mut relay_write).await; + }); + + let mut buf = vec![0u8; TCP_FORWARD_CHUNK_SIZE]; + loop { + match tokio::io::AsyncReadExt::read(&mut relay_read, &mut buf).await { + Ok(0) => break, + Ok(n) => { + let frame = TcpForwardFrame { + payload: Some(openshell_core::proto::tcp_forward_frame::Payload::Data( + buf[..n].to_vec(), + )), + }; + if tx.send(Ok(frame)).await.is_err() { + break; + } + } + Err(err) => { + warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, error = %err, "ForwardTcp: read from relay failed"); + let _ = tx + .send(Err(Status::unavailable(format!( + "relay read failed: {err}" + )))) + .await; + break; + } + } + } +} + // --------------------------------------------------------------------------- // SSH session handlers // --------------------------------------------------------------------------- @@ -595,7 +928,6 @@ pub(super) async fn handle_create_ssh_session( gateway_host, gateway_port: gateway_port.into(), gateway_scheme: scheme.to_string(), - connect_path: state.config.ssh_connect_path.clone(), host_key_fingerprint: String::new(), expires_at_ms, })) @@ -704,8 +1036,7 @@ fn build_remote_exec_command(req: &ExecSandboxRequest) -> Result /// /// This is the relay equivalent of `stream_exec_over_ssh`. Instead of dialing a /// sandbox endpoint directly, the SSH transport runs over a `DuplexStream` that -/// is bridged to the supervisor's local SSH daemon via a reverse HTTP CONNECT -/// tunnel. +/// is bridged to the supervisor's local SSH daemon via `RelayStream`. 
 #[allow(clippy::too_many_arguments)]
 async fn stream_exec_over_relay(
     tx: mpsc::Sender<Result<ExecSandboxEvent, Status>>,
@@ -1034,6 +1365,87 @@ mod tests {
        assert!(build_remote_exec_command(&req).is_err());
    }

+    #[test]
+    fn tcp_forward_init_allows_loopback_targets() {
+        for host in ["127.0.0.1", "::1", "localhost"] {
+            let init = TcpForwardInit {
+                sandbox_id: "sbx".to_string(),
+                service_id: String::new(),
+                target: Some(tcp_forward_init::Target::Tcp(TcpRelayTarget {
+                    host: host.to_string(),
+                    port: 8080,
+                })),
+                authorization_token: String::new(),
+            };
+            validate_tcp_forward_init(&init).expect("loopback target should pass");
+        }
+    }
+
+    #[test]
+    fn tcp_forward_init_allows_ssh_target() {
+        let init = TcpForwardInit {
+            sandbox_id: "sbx".to_string(),
+            target: Some(tcp_forward_init::Target::Ssh(Default::default())),
+            ..Default::default()
+        };
+        match validate_tcp_forward_init(&init).expect("ssh target should pass") {
+            relay_open::Target::Ssh(_) => {}
+            other => panic!("expected SSH target, got {other:?}"),
+        }
+    }
+
+    #[test]
+    fn tcp_forward_init_rejects_non_loopback_targets() {
+        let init = TcpForwardInit {
+            sandbox_id: "sbx".to_string(),
+            service_id: String::new(),
+            target: Some(tcp_forward_init::Target::Tcp(TcpRelayTarget {
+                host: "example.com".to_string(),
+                port: 8080,
+            })),
+            authorization_token: String::new(),
+        };
+        assert_eq!(
+            validate_tcp_forward_init(&init)
+                .expect_err("hostname rejected")
+                .message(),
+            "tcp target host must be loopback"
+        );
+    }
+
+    #[test]
+    fn tcp_forward_init_rejects_invalid_port() {
+        let init = TcpForwardInit {
+            sandbox_id: "sbx".to_string(),
+            service_id: String::new(),
+            target: Some(tcp_forward_init::Target::Tcp(TcpRelayTarget {
+                host: "127.0.0.1".to_string(),
+                port: 0,
+            })),
+            authorization_token: String::new(),
+        };
+        assert_eq!(
+            validate_tcp_forward_init(&init)
+                .expect_err("zero port rejected")
+                .message(),
+            "tcp target port must be between 1 and 65535"
+        );
+    }
+
+    #[test]
+    fn tcp_forward_init_requires_target() {
+        let init =
TcpForwardInit {
+            sandbox_id: "sbx".to_string(),
+            ..Default::default()
+        };
+        assert_eq!(
+            validate_tcp_forward_init(&init)
+                .expect_err("missing target rejected")
+                .message(),
+            "tcp forward target is required"
+        );
+    }
+
    // ---- petname / generate_name ----

    #[test]
diff --git a/crates/openshell-server/src/http.rs b/crates/openshell-server/src/http.rs
index 7650c2339..7ca9cb8bf 100644
--- a/crates/openshell-server/src/http.rs
+++ b/crates/openshell-server/src/http.rs
@@ -59,7 +59,5 @@ async fn render_metrics(State(handle): State) -> impl IntoResponse
 /// Create the HTTP router.
 pub fn http_router(state: Arc<ServerState>) -> Router {
-    crate::ssh_tunnel::router(state.clone())
-        .merge(crate::ws_tunnel::router(state.clone()))
-        .merge(crate::auth::router(state))
+    crate::ws_tunnel::router(state.clone()).merge(crate::auth::router(state))
 }
diff --git a/crates/openshell-server/src/lib.rs b/crates/openshell-server/src/lib.rs
index 979bf4c1d..89935f614 100644
--- a/crates/openshell-server/src/lib.rs
+++ b/crates/openshell-server/src/lib.rs
@@ -30,7 +30,7 @@ mod persistence;
 pub(crate) mod policy_store;
 mod sandbox_index;
 mod sandbox_watch;
-mod ssh_tunnel;
+mod ssh_sessions;
 pub mod supervisor_session;
 mod tls;
 pub mod tracing_bus;
@@ -190,7 +190,7 @@ pub async fn run_server(
     }
 
     state.compute.spawn_watchers();
-    ssh_tunnel::spawn_session_reaper(store.clone(), Duration::from_secs(3600));
+    ssh_sessions::spawn_session_reaper(store.clone(), Duration::from_secs(3600));
     supervisor_session::spawn_relay_reaper(state.clone(), Duration::from_secs(30));
 
     // Create the multiplexed service
diff --git a/crates/openshell-server/src/multiplex.rs b/crates/openshell-server/src/multiplex.rs
index e0c159958..f2632cdf3 100644
--- a/crates/openshell-server/src/multiplex.rs
+++ b/crates/openshell-server/src/multiplex.rs
@@ -286,7 +286,6 @@ fn grpc_status_from_response(res: &Response) -> String {
 fn normalize_http_path(path: &str) -> &'static str {
     match path {
-        p if p.starts_with("/connect/ssh") =>
"/connect/ssh", p if p.starts_with("/_ws_tunnel") => "/_ws_tunnel", p if p.starts_with("/auth/") => "/auth", _ => "unknown", @@ -353,19 +352,6 @@ mod tests { assert_eq!(grpc_method_from_path(""), ""); } - #[test] - fn normalize_ssh_path() { - assert_eq!(normalize_http_path("/connect/ssh"), "/connect/ssh"); - } - - #[test] - fn normalize_ssh_path_with_trailing_segments() { - assert_eq!( - normalize_http_path("/connect/ssh?token=abc"), - "/connect/ssh" - ); - } - #[test] fn normalize_ws_tunnel() { assert_eq!(normalize_http_path("/_ws_tunnel"), "/_ws_tunnel"); diff --git a/crates/openshell-server/src/ssh_sessions.rs b/crates/openshell-server/src/ssh_sessions.rs new file mode 100644 index 000000000..f8d85033d --- /dev/null +++ b/crates/openshell-server/src/ssh_sessions.rs @@ -0,0 +1,185 @@ +// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + +//! SSH session token storage and cleanup. + +use openshell_core::ObjectId; +use openshell_core::proto::SshSession; +use prost::Message; +use std::sync::Arc; +use std::time::Duration; +use tracing::{info, warn}; + +use crate::persistence::{ObjectType, Store}; + +impl ObjectType for SshSession { + fn object_type() -> &'static str { + "ssh_session" + } +} + +/// Spawn a background task that periodically reaps expired and revoked SSH sessions. 
+pub fn spawn_session_reaper(store: Arc<Store>, interval: Duration) {
+    tokio::spawn(async move {
+        tokio::time::sleep(interval).await;
+
+        loop {
+            if let Err(e) = reap_expired_sessions(&store).await {
+                warn!(error = %e, "SSH session reaper sweep failed");
+            }
+            tokio::time::sleep(interval).await;
+        }
+    });
+}
+
+async fn reap_expired_sessions(store: &Store) -> Result<(), String> {
+    let now_ms = std::time::SystemTime::now()
+        .duration_since(std::time::UNIX_EPOCH)
+        .unwrap_or_default()
+        .as_millis() as i64;
+
+    let records = store
+        .list(SshSession::object_type(), 1000, 0)
+        .await
+        .map_err(|e| e.to_string())?;
+
+    let mut reaped = 0u32;
+    for record in records {
+        let session: SshSession = match Message::decode(record.payload.as_slice()) {
+            Ok(s) => s,
+            Err(_) => continue,
+        };
+
+        let should_delete =
+            (session.expires_at_ms > 0 && now_ms > session.expires_at_ms) || session.revoked;
+
+        if should_delete {
+            if let Err(e) = store
+                .delete(SshSession::object_type(), session.object_id())
+                .await
+            {
+                warn!(session_id = %session.object_id(), error = %e, "Failed to reap SSH session");
+            } else {
+                reaped += 1;
+            }
+        }
+    }
+
+    if reaped > 0 {
+        info!(count = reaped, "SSH session reaper: cleaned up sessions");
+    }
+    Ok(())
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::collections::HashMap;
+
+    fn make_session(id: &str, sandbox_id: &str, expires_at_ms: i64, revoked: bool) -> SshSession {
+        SshSession {
+            metadata: Some(openshell_core::proto::datamodel::v1::ObjectMeta {
+                id: id.to_string(),
+                name: format!("session-{id}"),
+                created_at_ms: 1000,
+                labels: HashMap::new(),
+            }),
+            sandbox_id: sandbox_id.to_string(),
+            token: id.to_string(),
+            expires_at_ms,
+            revoked,
+        }
+    }
+
+    fn now_ms() -> i64 {
+        std::time::SystemTime::now()
+            .duration_since(std::time::UNIX_EPOCH)
+            .unwrap()
+            .as_millis() as i64
+    }
+
+    #[tokio::test]
+    async fn reaper_deletes_expired_sessions() {
+        let store = Store::connect("sqlite::memory:?cache=shared")
+            .await
+            .unwrap();
+
+        let expired = make_session("expired1", "sbx1", now_ms() - 60_000, false);
+        store.put_message(&expired).await.unwrap();
+
+        let valid = make_session("valid1", "sbx1", now_ms() + 3_600_000, false);
+        store.put_message(&valid).await.unwrap();
+
+        reap_expired_sessions(&store).await.unwrap();
+
+        assert!(
+            store
+                .get_message::<SshSession>("expired1")
+                .await
+                .unwrap()
+                .is_none(),
+            "expired session should be reaped"
+        );
+        assert!(
+            store
+                .get_message::<SshSession>("valid1")
+                .await
+                .unwrap()
+                .is_some(),
+            "valid session should be kept"
+        );
+    }
+
+    #[tokio::test]
+    async fn reaper_deletes_revoked_sessions() {
+        let store = Store::connect("sqlite::memory:?cache=shared")
+            .await
+            .unwrap();
+
+        let revoked = make_session("revoked1", "sbx1", 0, true);
+        store.put_message(&revoked).await.unwrap();
+
+        let active = make_session("active1", "sbx1", 0, false);
+        store.put_message(&active).await.unwrap();
+
+        reap_expired_sessions(&store).await.unwrap();
+
+        assert!(
+            store
+                .get_message::<SshSession>("revoked1")
+                .await
+                .unwrap()
+                .is_none(),
+            "revoked session should be reaped"
+        );
+        assert!(
+            store
+                .get_message::<SshSession>("active1")
+                .await
+                .unwrap()
+                .is_some(),
+            "active session should be kept"
+        );
+    }
+
+    #[tokio::test]
+    async fn reaper_preserves_zero_expiry_sessions() {
+        let store = Store::connect("sqlite::memory:?cache=shared")
+            .await
+            .unwrap();
+
+        let no_expiry = make_session("noexpiry1", "sbx1", 0, false);
+        store.put_message(&no_expiry).await.unwrap();
+
+        reap_expired_sessions(&store).await.unwrap();
+
+        assert!(
+            store
+                .get_message::<SshSession>("noexpiry1")
+                .await
+                .unwrap()
+                .is_some(),
+            "session with no expiry should be preserved"
+        );
+    }
+}
diff --git a/crates/openshell-server/src/ssh_tunnel.rs b/crates/openshell-server/src/ssh_tunnel.rs
deleted file mode 100644
index 6b0232fa0..000000000
--- a/crates/openshell-server/src/ssh_tunnel.rs
+++ /dev/null
@@ -1,532 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved. -// SPDX-License-Identifier: Apache-2.0 - -//! SSH tunnel handler for the multiplexed gateway. - -use axum::{Router, extract::State, http::Method, response::IntoResponse, routing::any}; -use http::StatusCode; -use hyper::Request; -use hyper_util::rt::TokioIo; -use openshell_core::proto::{Sandbox, SandboxPhase, SshSession}; -use prost::Message; -use std::sync::Arc; -use std::time::Duration; -use tokio::io::AsyncWriteExt; -use tracing::{info, warn}; - -use crate::ServerState; -use crate::persistence::{ObjectType, Store}; - -const HEADER_SANDBOX_ID: &str = "x-sandbox-id"; -const HEADER_TOKEN: &str = "x-sandbox-token"; - -/// Maximum concurrent SSH tunnel connections per session token. -const MAX_CONNECTIONS_PER_TOKEN: u32 = 3; - -/// Redact a bearer token for safe logging — show only the last 4 characters. -fn redact_token(token: &str) -> String { - if token.len() <= 4 { - "****".to_string() - } else { - format!("****{}", &token[token.len() - 4..]) - } -} - -/// Maximum concurrent SSH tunnel connections per sandbox. 
-const MAX_CONNECTIONS_PER_SANDBOX: u32 = 20;
-
-fn acquire_connection_slots(
-    token_counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
-    sandbox_counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
-    token: &str,
-    sandbox_id: &str,
-) -> Result<(), ConnectionLimit> {
-    {
-        let mut counts = token_counts.lock().unwrap();
-        let count = counts.entry(token.to_string()).or_insert(0);
-        if *count >= MAX_CONNECTIONS_PER_TOKEN {
-            return Err(ConnectionLimit::PerToken);
-        }
-        *count += 1;
-    }
-
-    {
-        let mut counts = sandbox_counts.lock().unwrap();
-        let count = counts.entry(sandbox_id.to_string()).or_insert(0);
-        if *count >= MAX_CONNECTIONS_PER_SANDBOX {
-            decrement_connection_count(token_counts, token);
-            return Err(ConnectionLimit::PerSandbox);
-        }
-        *count += 1;
-    }
-
-    Ok(())
-}
-
-enum ConnectionLimit {
-    PerToken,
-    PerSandbox,
-}
-
-pub fn router(state: Arc<ServerState>) -> Router {
-    Router::new()
-        .route("/connect/ssh", any(ssh_connect))
-        .with_state(state)
-}
-
-async fn ssh_connect(
-    State(state): State<Arc<ServerState>>,
-    req: Request,
-) -> impl IntoResponse {
-    if req.method() != Method::CONNECT {
-        return StatusCode::METHOD_NOT_ALLOWED.into_response();
-    }
-
-    let sandbox_id = match header_value(req.headers(), HEADER_SANDBOX_ID) {
-        Ok(value) => value,
-        Err(status) => return status.into_response(),
-    };
-    let token = match header_value(req.headers(), HEADER_TOKEN) {
-        Ok(value) => value,
-        Err(status) => return status.into_response(),
-    };
-
-    let session = match state.store.get_message::<SshSession>(&token).await {
-        Ok(Some(session)) => session,
-        Ok(None) => return StatusCode::UNAUTHORIZED.into_response(),
-        Err(err) => {
-            warn!(error = %err, "Failed to fetch SSH session");
-            return StatusCode::INTERNAL_SERVER_ERROR.into_response();
-        }
-    };
-
-    if session.revoked || session.sandbox_id != sandbox_id {
-        return StatusCode::UNAUTHORIZED.into_response();
-    }
-
-    // Check token expiry (0 means no expiry for backward compatibility).
-    if session.expires_at_ms > 0 {
-        let now_ms = std::time::SystemTime::now()
-            .duration_since(std::time::UNIX_EPOCH)
-            .unwrap_or_default()
-            .as_millis() as i64;
-        if now_ms > session.expires_at_ms {
-            return StatusCode::UNAUTHORIZED.into_response();
-        }
-    }
-
-    let sandbox = match state.store.get_message::<Sandbox>(&sandbox_id).await {
-        Ok(Some(sandbox)) => sandbox,
-        Ok(None) => return StatusCode::NOT_FOUND.into_response(),
-        Err(err) => {
-            warn!(error = %err, "Failed to fetch sandbox");
-            return StatusCode::INTERNAL_SERVER_ERROR.into_response();
-        }
-    };
-
-    if SandboxPhase::try_from(sandbox.phase).ok() != Some(SandboxPhase::Ready) {
-        return StatusCode::PRECONDITION_FAILED.into_response();
-    }
-
-    // Enforce connection caps *before* opening a relay — otherwise denied
-    // calls churn pending relay slots and wake the supervisor until the relay
-    // timeout elapses.
-    if let Err(limit) = acquire_connection_slots(
-        &state.ssh_connections_by_token,
-        &state.ssh_connections_by_sandbox,
-        &token,
-        &sandbox_id,
-    ) {
-        match limit {
-            ConnectionLimit::PerToken => {
-                warn!(token = %redact_token(&token), "SSH tunnel: per-token connection limit reached");
-            }
-            ConnectionLimit::PerSandbox => {
-                warn!(sandbox_id = %sandbox_id, "SSH tunnel: per-sandbox connection limit reached");
-            }
-        }
-        return StatusCode::TOO_MANY_REQUESTS.into_response();
-    }
-
-    // Open a relay channel through the supervisor session. Use a generous
-    // 30s session-wait timeout because `/connect/ssh` is typically called
-    // immediately after `sandbox create`, so we need to cover the supervisor's
-    // initial TLS + gRPC handshake on a cold-started pod. The old
-    // direct-connect path tolerated ~34s here for similar reasons.
- let (channel_id, relay_rx) = match state - .supervisor_sessions - .open_relay(&sandbox_id, Duration::from_secs(30)) - .await - { - Ok(pair) => pair, - Err(status) => { - warn!(sandbox_id = %sandbox_id, error = %status.message(), "SSH tunnel: supervisor session not available"); - decrement_connection_count(&state.ssh_connections_by_token, &token); - decrement_connection_count(&state.ssh_connections_by_sandbox, &sandbox_id); - return StatusCode::BAD_GATEWAY.into_response(); - } - }; - - let sandbox_id_clone = sandbox_id.clone(); - let token_clone = token.clone(); - let state_clone = state.clone(); - - let upgrade = hyper::upgrade::on(req); - tokio::spawn(async move { - // Wait for the supervisor to open its `RelayStream` and deliver the - // bridge half of the relay. - let mut relay = match tokio::time::timeout(Duration::from_secs(10), relay_rx).await { - Ok(Ok(stream)) => stream, - Ok(Err(_)) => { - warn!(sandbox_id = %sandbox_id_clone, channel_id = %channel_id, "SSH tunnel: relay channel dropped"); - decrement_connection_count(&state_clone.ssh_connections_by_token, &token_clone); - decrement_connection_count( - &state_clone.ssh_connections_by_sandbox, - &sandbox_id_clone, - ); - return; - } - Err(_) => { - warn!(sandbox_id = %sandbox_id_clone, channel_id = %channel_id, "SSH tunnel: relay open timed out"); - decrement_connection_count(&state_clone.ssh_connections_by_token, &token_clone); - decrement_connection_count( - &state_clone.ssh_connections_by_sandbox, - &sandbox_id_clone, - ); - return; - } - }; - - info!(sandbox_id = %sandbox_id_clone, channel_id = %channel_id, "SSH tunnel: relay established, bridging client"); - - match upgrade.await { - Ok(upgraded) => { - let mut upgraded = TokioIo::new(upgraded); - let _ = tokio::io::copy_bidirectional(&mut upgraded, &mut relay).await; - let _ = AsyncWriteExt::shutdown(&mut upgraded).await; - } - Err(err) => { - warn!(error = %err, "SSH upgrade failed"); - } - } - - // Decrement connection counts on tunnel completion. 
decrement_connection_count(&state_clone.ssh_connections_by_token, &token_clone);
-        decrement_connection_count(&state_clone.ssh_connections_by_sandbox, &sandbox_id_clone);
-    });
-
-    StatusCode::OK.into_response()
-}
-
-fn header_value(headers: &http::HeaderMap, name: &str) -> Result<String, StatusCode> {
-    let value = headers
-        .get(name)
-        .ok_or(StatusCode::UNAUTHORIZED)?
-        .to_str()
-        .map_err(|_| StatusCode::BAD_REQUEST)?
-        .trim()
-        .to_string();
-    if value.is_empty() {
-        return Err(StatusCode::BAD_REQUEST);
-    }
-    Ok(value)
-}
-
-impl ObjectType for SshSession {
-    fn object_type() -> &'static str {
-        "ssh_session"
-    }
-}
-
-/// Decrement a connection count entry, removing it if it reaches zero.
-fn decrement_connection_count(
-    counts: &std::sync::Mutex<std::collections::HashMap<String, u32>>,
-    key: &str,
-) {
-    let mut map = counts.lock().unwrap();
-    if let Some(count) = map.get_mut(key) {
-        *count = count.saturating_sub(1);
-        if *count == 0 {
-            map.remove(key);
-        }
-    }
-}
-
-/// Spawn a background task that periodically reaps expired and revoked SSH sessions.
-pub fn spawn_session_reaper(store: Arc<Store>, interval: Duration) {
-    tokio::spawn(async move {
-        // Initial delay to let startup settle.
-        tokio::time::sleep(interval).await;
-
-        loop {
-            if let Err(e) = reap_expired_sessions(&store).await {
-                warn!(error = %e, "SSH session reaper sweep failed");
-            }
-            tokio::time::sleep(interval).await;
-        }
-    });
-}
-
-async fn reap_expired_sessions(store: &Store) -> Result<(), String> {
-    let now_ms = std::time::SystemTime::now()
-        .duration_since(std::time::UNIX_EPOCH)
-        .unwrap_or_default()
-        .as_millis() as i64;
-
-    let records = store
-        .list(SshSession::object_type(), 1000, 0)
-        .await
-        .map_err(|e| e.to_string())?;
-
-    let mut reaped = 0u32;
-    for record in records {
-        let session: SshSession = match Message::decode(record.payload.as_slice()) {
-            Ok(s) => s,
-            Err(_) => continue,
-        };
-
-        let should_delete =
-            // Expired sessions (expires_at_ms > 0 means expiry is set).
-            (session.expires_at_ms > 0 && now_ms > session.expires_at_ms)
-            // Revoked sessions — already invalidated, just cleaning up storage.
-            || session.revoked;
-
-        if should_delete {
-            use openshell_core::ObjectId;
-            if let Err(e) = store
-                .delete(SshSession::object_type(), session.object_id())
-                .await
-            {
-                warn!(session_id = %session.object_id(), error = %e, "Failed to reap SSH session");
-            } else {
-                reaped += 1;
-            }
-        }
-    }
-
-    if reaped > 0 {
-        info!(count = reaped, "SSH session reaper: cleaned up sessions");
-    }
-    Ok(())
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-    use crate::persistence::Store;
-    use std::collections::HashMap;
-    use std::sync::Mutex;
-
-    fn make_session(id: &str, sandbox_id: &str, expires_at_ms: i64, revoked: bool) -> SshSession {
-        SshSession {
-            metadata: Some(openshell_core::proto::datamodel::v1::ObjectMeta {
-                id: id.to_string(),
-                name: format!("session-{}", id),
-                created_at_ms: 1000,
-                labels: HashMap::new(),
-            }),
-            sandbox_id: sandbox_id.to_string(),
-            token: id.to_string(),
-            expires_at_ms,
-            revoked,
-        }
-    }
-
-    fn now_ms() -> i64 {
-        std::time::SystemTime::now()
-            .duration_since(std::time::UNIX_EPOCH)
-            .unwrap()
-            .as_millis() as i64
-    }
-
-    // ---- Connection limit tests ----
-
-    #[test]
-    fn decrement_removes_entry_at_zero() {
-        let counts: Mutex<HashMap<String, u32>> = Mutex::new(HashMap::new());
-        counts.lock().unwrap().insert("tok1".to_string(), 1);
-        decrement_connection_count(&counts, "tok1");
-        assert!(counts.lock().unwrap().is_empty());
-    }
-
-    #[test]
-    fn decrement_reduces_count() {
-        let counts: Mutex<HashMap<String, u32>> = Mutex::new(HashMap::new());
-        counts.lock().unwrap().insert("tok1".to_string(), 5);
-        decrement_connection_count(&counts, "tok1");
-        assert_eq!(*counts.lock().unwrap().get("tok1").unwrap(), 4);
-    }
-
-    #[test]
-    fn decrement_missing_key_is_noop() {
-        let counts: Mutex<HashMap<String, u32>> = Mutex::new(HashMap::new());
-        decrement_connection_count(&counts, "nonexistent");
-        assert!(counts.lock().unwrap().is_empty());
-    }
-
-    #[test]
-    fn
per_token_connection_limit_enforced() { - let counts: Mutex> = Mutex::new(HashMap::new()); - counts - .lock() - .unwrap() - .insert("tok1".to_string(), MAX_CONNECTIONS_PER_TOKEN); - let current = *counts.lock().unwrap().get("tok1").unwrap(); - assert!(current >= MAX_CONNECTIONS_PER_TOKEN); - } - - #[test] - fn per_sandbox_connection_limit_enforced() { - let counts: Mutex> = Mutex::new(HashMap::new()); - counts - .lock() - .unwrap() - .insert("sbx1".to_string(), MAX_CONNECTIONS_PER_SANDBOX); - let current = *counts.lock().unwrap().get("sbx1").unwrap(); - assert!(current >= MAX_CONNECTIONS_PER_SANDBOX); - } - - #[test] - fn acquire_connection_slots_rejects_per_token_limit_without_touching_sandbox() { - let token_counts: Mutex> = Mutex::new(HashMap::new()); - let sandbox_counts: Mutex> = Mutex::new(HashMap::new()); - token_counts - .lock() - .unwrap() - .insert("tok1".to_string(), MAX_CONNECTIONS_PER_TOKEN); - - let result = acquire_connection_slots(&token_counts, &sandbox_counts, "tok1", "sbx1"); - - assert!(matches!(result, Err(ConnectionLimit::PerToken))); - assert!(sandbox_counts.lock().unwrap().is_empty()); - } - - #[test] - fn acquire_connection_slots_rolls_back_token_increment_on_sandbox_limit() { - let token_counts: Mutex> = Mutex::new(HashMap::new()); - let sandbox_counts: Mutex> = Mutex::new(HashMap::new()); - sandbox_counts - .lock() - .unwrap() - .insert("sbx1".to_string(), MAX_CONNECTIONS_PER_SANDBOX); - - let result = acquire_connection_slots(&token_counts, &sandbox_counts, "tok1", "sbx1"); - - assert!(matches!(result, Err(ConnectionLimit::PerSandbox))); - assert!(token_counts.lock().unwrap().is_empty()); - } - - // ---- Session reaper tests ---- - - #[tokio::test] - async fn reaper_deletes_expired_sessions() { - let store = Store::connect("sqlite::memory:?cache=shared") - .await - .unwrap(); - - let expired = make_session("expired1", "sbx1", now_ms() - 60_000, false); - store.put_message(&expired).await.unwrap(); - - let valid = make_session("valid1", 
"sbx1", now_ms() + 3_600_000, false); - store.put_message(&valid).await.unwrap(); - - reap_expired_sessions(&store).await.unwrap(); - - assert!( - store - .get_message::("expired1") - .await - .unwrap() - .is_none(), - "expired session should be reaped" - ); - assert!( - store - .get_message::("valid1") - .await - .unwrap() - .is_some(), - "valid session should be kept" - ); - } - - #[tokio::test] - async fn reaper_deletes_revoked_sessions() { - let store = Store::connect("sqlite::memory:?cache=shared") - .await - .unwrap(); - - let revoked = make_session("revoked1", "sbx1", 0, true); - store.put_message(&revoked).await.unwrap(); - - let active = make_session("active1", "sbx1", 0, false); - store.put_message(&active).await.unwrap(); - - reap_expired_sessions(&store).await.unwrap(); - - assert!( - store - .get_message::("revoked1") - .await - .unwrap() - .is_none(), - "revoked session should be reaped" - ); - assert!( - store - .get_message::("active1") - .await - .unwrap() - .is_some(), - "active session should be kept" - ); - } - - #[tokio::test] - async fn reaper_preserves_zero_expiry_sessions() { - let store = Store::connect("sqlite::memory:?cache=shared") - .await - .unwrap(); - - // expires_at_ms = 0 means no expiry (backward compatible). 
- let no_expiry = make_session("noexpiry1", "sbx1", 0, false); - store.put_message(&no_expiry).await.unwrap(); - - reap_expired_sessions(&store).await.unwrap(); - - assert!( - store - .get_message::("noexpiry1") - .await - .unwrap() - .is_some(), - "session with no expiry should be preserved" - ); - } - - // ---- Expiry validation logic tests ---- - - #[test] - fn expired_session_is_detected() { - let session = make_session("tok1", "sbx1", now_ms() - 1000, false); - let is_expired = session.expires_at_ms > 0 && now_ms() > session.expires_at_ms; - assert!(is_expired, "session in the past should be expired"); - } - - #[test] - fn future_session_is_not_expired() { - let session = make_session("tok1", "sbx1", now_ms() + 3_600_000, false); - let is_expired = session.expires_at_ms > 0 && now_ms() > session.expires_at_ms; - assert!(!is_expired, "session in the future should not be expired"); - } - - #[test] - fn zero_expiry_is_not_expired() { - let session = make_session("tok1", "sbx1", 0, false); - let is_expired = session.expires_at_ms > 0 && now_ms() > session.expires_at_ms; - assert!( - !is_expired, - "session with zero expiry should never be expired" - ); - } -} diff --git a/crates/openshell-server/src/supervisor_session.rs b/crates/openshell-server/src/supervisor_session.rs index cd250459a..7b73befa4 100644 --- a/crates/openshell-server/src/supervisor_session.rs +++ b/crates/openshell-server/src/supervisor_session.rs @@ -13,8 +13,8 @@ use tracing::{info, warn}; use uuid::Uuid; use openshell_core::proto::{ - GatewayMessage, RelayFrame, RelayInit, RelayOpen, Sandbox, SessionAccepted, SupervisorMessage, - gateway_message, supervisor_message, + GatewayMessage, RelayFrame, RelayInit, RelayOpen, Sandbox, SessionAccepted, SshRelayTarget, + SupervisorMessage, gateway_message, relay_open, supervisor_message, }; use crate::ServerState; @@ -58,8 +58,9 @@ struct LiveSession { connected_at: Instant, } -/// Holds a oneshot sender that will deliver the upgraded relay stream. 
-type RelayStreamSender = oneshot::Sender<tokio::io::DuplexStream>;
+/// Holds a oneshot sender that will deliver the upgraded relay stream or a
+/// target-open failure reported by the supervisor.
+type RelayStreamSender = oneshot::Sender<Result<tokio::io::DuplexStream, Status>>;
 
 impl openshell_driver_docker::SupervisorReadiness for SupervisorSessionRegistry {
     fn is_supervisor_connected(&self, sandbox_id: &str) -> bool {
@@ -79,6 +80,7 @@ pub struct SupervisorSessionRegistry {
 struct PendingRelay {
     sender: RelayStreamSender,
     sandbox_id: String,
+    relay_open: RelayOpen,
     created_at: Instant,
 }
 
@@ -234,12 +236,45 @@ impl SupervisorSessionRegistry {
         &self,
         sandbox_id: &str,
         session_wait_timeout: Duration,
-    ) -> Result<(String, oneshot::Receiver<tokio::io::DuplexStream>), Status> {
+    ) -> Result<
+        (
+            String,
+            oneshot::Receiver<Result<tokio::io::DuplexStream, Status>>,
+        ),
+        Status,
+    > {
+        self.open_relay_with_target(
+            sandbox_id,
+            relay_open::Target::Ssh(SshRelayTarget {}),
+            "".to_string(),
+            session_wait_timeout,
+        )
+        .await
+    }
+
+    pub async fn open_relay_with_target(
+        &self,
+        sandbox_id: &str,
+        target: relay_open::Target,
+        service_id: String,
+        session_wait_timeout: Duration,
+    ) -> Result<
+        (
+            String,
+            oneshot::Receiver<Result<tokio::io::DuplexStream, Status>>,
+        ),
+        Status,
+    > {
         let tx = self
             .wait_for_session(sandbox_id, session_wait_timeout)
             .await?;
         let channel_id = Uuid::new_v4().to_string();
+        let relay_open = RelayOpen {
+            channel_id: channel_id.clone(),
+            target: Some(target),
+            service_id,
+        };
 
         // Register the pending relay before sending RelayOpen to avoid a race.
         // Both caps are checked and the insert happens under a single lock hold
@@ -267,15 +302,14 @@ impl SupervisorSessionRegistry {
                 PendingRelay {
                     sender: relay_tx,
                     sandbox_id: sandbox_id.to_string(),
+                    relay_open: relay_open.clone(),
                     created_at: Instant::now(),
                 },
             );
         }
 
         let msg = GatewayMessage {
-            payload: Some(gateway_message::Payload::RelayOpen(RelayOpen {
-                channel_id: channel_id.clone(),
-            })),
+            payload: Some(gateway_message::Payload::RelayOpen(relay_open)),
         };
 
         if tx.send(msg).await.is_err() {
@@ -287,6 +321,16 @@ impl SupervisorSessionRegistry {
         Ok((channel_id, relay_rx))
     }
 
+    pub fn fail_pending_relay(&self, channel_id: &str, error: String) -> bool {
+        let pending = self.pending_relays.lock().unwrap().remove(channel_id);
+        if let Some(pending) = pending {
+            let _ = pending.sender.send(Err(Status::unavailable(error)));
+            true
+        } else {
+            false
+        }
+    }
+
     /// Claim a pending relay channel. Called by the /relay/{channel_id} HTTP handler
     /// when the supervisor's reverse CONNECT arrives.
     ///
@@ -306,8 +350,8 @@ impl SupervisorSessionRegistry {
         // the supervisor HTTP CONNECT handler.
         let (gateway_stream, supervisor_stream) = tokio::io::duplex(64 * 1024);
 
-        // Send the gateway-side stream to the waiter (ssh_tunnel or exec handler).
-        if pending.sender.send(gateway_stream).is_err() {
+        // Send the gateway-side stream to the waiter (exec handler or forward handler).
+        if pending.sender.send(Ok(gateway_stream)).is_err() {
             return Err(Status::internal("relay requester dropped"));
         }
 
@@ -327,10 +371,17 @@ impl SupervisorSessionRegistry {
     pub async fn replay_pending_relays(&self, sandbox_id: &str, tx: &mpsc::Sender<GatewayMessage>) {
         for channel_id in self.pending_channel_ids(sandbox_id) {
+            let relay_open = {
+                let pending = self.pending_relays.lock().unwrap();
+                pending
+                    .get(&channel_id)
+                    .map(|pending| pending.relay_open.clone())
+            };
+            let Some(relay_open) = relay_open else {
+                continue;
+            };
             let msg = GatewayMessage {
-                payload: Some(gateway_message::Payload::RelayOpen(RelayOpen {
-                    channel_id: channel_id.clone(),
-                })),
+                payload: Some(gateway_message::Payload::RelayOpen(relay_open)),
             };
             if tx.send(msg).await.is_err() {
                 warn!(sandbox_id = %sandbox_id, channel_id = %channel_id, "supervisor session: failed to replay pending relay to superseding session");
@@ -623,7 +674,7 @@ pub async fn handle_connect_supervisor(
 }
 
 async fn run_session_loop(
-    _state: &Arc<ServerState>,
+    state: &Arc<ServerState>,
     sandbox_id: &str,
     session_id: &str,
     tx: &mpsc::Sender<GatewayMessage>,
@@ -644,7 +695,7 @@
             msg = inbound.message() => {
                 match msg {
                     Ok(Some(msg)) => {
-                        handle_supervisor_message(sandbox_id, session_id, msg);
+                        handle_supervisor_message(state, sandbox_id, session_id, msg);
                     }
                     Ok(None) => {
                         info!(sandbox_id = %sandbox_id, session_id = %session_id, "supervisor session: stream closed by supervisor");
@@ -671,7 +722,12 @@
     }
 }
 
-fn handle_supervisor_message(sandbox_id: &str, session_id: &str, msg: SupervisorMessage) {
+fn handle_supervisor_message(
+    state: &Arc<ServerState>,
+    sandbox_id: &str,
+    session_id: &str,
+    msg: SupervisorMessage,
+) {
     match msg.payload {
         Some(supervisor_message::Payload::Heartbeat(_)) => {
            // Heartbeat received — nothing to do for now.
@@ -685,11 +741,15 @@ fn handle_supervisor_message(sandbox_id: &str, session_id: &str, msg: Supervisor
                 "supervisor session: relay opened successfully"
             );
         } else {
+            let failed = state
+                .supervisor_sessions
+                .fail_pending_relay(&result.channel_id, result.error.clone());
             warn!(
                 sandbox_id = %sandbox_id,
                 session_id = %session_id,
                 channel_id = %result.channel_id,
                 error = %result.error,
+                pending_relay_failed = failed,
                 "supervisor session: relay open failed"
             );
         }
@@ -742,6 +802,23 @@ mod tests {
         }
     }
 
+    fn pending_relay(
+        sandbox_id: &str,
+        relay_tx: RelayStreamSender,
+        created_at: Instant,
+    ) -> PendingRelay {
+        PendingRelay {
+            sender: relay_tx,
+            sandbox_id: sandbox_id.to_string(),
+            relay_open: RelayOpen {
+                channel_id: "ch-test".to_string(),
+                target: Some(relay_open::Target::Ssh(SshRelayTarget {})),
+                service_id: String::new(),
+            },
+            created_at,
+        }
+    }
+
     // ---- registry: register / remove ----
 
     #[test]
@@ -860,6 +937,7 @@
         match msg.payload {
             Some(gateway_message::Payload::RelayOpen(open)) => {
                 assert_eq!(open.channel_id, channel_id);
+                assert!(matches!(open.target, Some(relay_open::Target::Ssh(_))));
             }
             other => panic!("expected RelayOpen, got {other:?}"),
         }
@@ -941,11 +1019,7 @@
                 let sandbox_id = if i % 2 == 0 { "sbx-a" } else { "sbx-b" };
                 pending.insert(
                     format!("channel-{i}"),
-                    PendingRelay {
-                        sender: oneshot_tx,
-                        sandbox_id: sandbox_id.to_string(),
-                        created_at: Instant::now(),
-                    },
+                    pending_relay(sandbox_id, oneshot_tx, Instant::now()),
                 );
             }
         }
@@ -970,11 +1044,7 @@
                 let (oneshot_tx, _) = oneshot::channel();
                 pending.insert(
                     format!("channel-{i}"),
-                    PendingRelay {
-                        sender: oneshot_tx,
-                        sandbox_id: "sbx".to_string(),
-                        created_at: Instant::now(),
-                    },
+                    pending_relay("sbx", oneshot_tx, Instant::now()),
                );
             }
         }
@@ -1170,11 +1240,7 @@
         let (relay_tx, _relay_rx) = oneshot::channel();
         registry.pending_relays.lock().unwrap().insert(
             "ch-1".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now(),
-            },
+            pending_relay("sbx-test", relay_tx, Instant::now()),
         );
 
         let result = registry.claim_relay("ch-1");
@@ -1182,17 +1248,41 @@
         assert!(!registry.pending_relays.lock().unwrap().contains_key("ch-1"));
     }
 
+    #[tokio::test]
+    async fn relay_open_failure_completes_pending_waiter() {
+        let registry = SupervisorSessionRegistry::new();
+        let (relay_tx, relay_rx) = oneshot::channel();
+        registry.pending_relays.lock().unwrap().insert(
+            "ch-fail".to_string(),
+            pending_relay("sbx-test", relay_tx, Instant::now()),
+        );
+
+        assert!(registry.fail_pending_relay("ch-fail", "target refused".to_string()));
+        assert!(
+            !registry
+                .pending_relays
+                .lock()
+                .unwrap()
+                .contains_key("ch-fail")
+        );
+
+        let result = relay_rx.await.expect("failure should wake waiter");
+        let status = result.expect_err("waiter should receive status failure");
+        assert_eq!(status.code(), tonic::Code::Unavailable);
+        assert_eq!(status.message(), "target refused");
+    }
+
     #[test]
     fn claim_relay_expired_returns_deadline_exceeded() {
         let registry = SupervisorSessionRegistry::new();
         let (relay_tx, _relay_rx) = oneshot::channel();
         registry.pending_relays.lock().unwrap().insert(
             "ch-old".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now() - Duration::from_secs(60),
-            },
+            pending_relay(
+                "sbx-test",
+                relay_tx,
+                Instant::now() - Duration::from_secs(60),
+            ),
         );
 
         let err = registry
@@ -1212,15 +1302,11 @@
     #[test]
     fn claim_relay_receiver_dropped_returns_internal() {
         let registry = SupervisorSessionRegistry::new();
-        let (relay_tx, relay_rx) = oneshot::channel::<tokio::io::DuplexStream>();
+        let (relay_tx, relay_rx) = oneshot::channel::<Result<tokio::io::DuplexStream, Status>>();
         drop(relay_rx); // Gateway-side waiter has given up already.
         registry.pending_relays.lock().unwrap().insert(
             "ch-1".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now(),
-            },
+            pending_relay("sbx-test", relay_tx, Instant::now()),
         );
 
         let err = registry
@@ -1232,18 +1318,17 @@
     #[tokio::test]
     async fn claim_relay_connects_both_ends() {
         let registry = SupervisorSessionRegistry::new();
-        let (relay_tx, relay_rx) = oneshot::channel::<tokio::io::DuplexStream>();
+        let (relay_tx, relay_rx) = oneshot::channel::<Result<tokio::io::DuplexStream, Status>>();
         registry.pending_relays.lock().unwrap().insert(
             "ch-io".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now(),
-            },
+            pending_relay("sbx-test", relay_tx, Instant::now()),
         );
 
         let mut supervisor_side = registry.claim_relay("ch-io").expect("claim should succeed");
-        let mut gateway_side = relay_rx.await.expect("gateway side should receive stream");
+        let mut gateway_side = relay_rx
+            .await
+            .expect("gateway side should receive result")
+            .expect("gateway side should receive stream");
 
         // Supervisor side writes → gateway side reads.
         supervisor_side.write_all(b"hello").await.unwrap();
@@ -1266,11 +1351,11 @@
         let (relay_tx, _relay_rx) = oneshot::channel();
         registry.pending_relays.lock().unwrap().insert(
             "ch-old".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now() - Duration::from_secs(60),
-            },
+            pending_relay(
+                "sbx-test",
+                relay_tx,
+                Instant::now() - Duration::from_secs(60),
+            ),
         );
 
         registry.reap_expired_relays();
@@ -1289,11 +1374,7 @@
         let (relay_tx, _relay_rx) = oneshot::channel();
         registry.pending_relays.lock().unwrap().insert(
             "ch-fresh".to_string(),
-            PendingRelay {
-                sender: relay_tx,
-                sandbox_id: "sbx-test".to_string(),
-                created_at: Instant::now(),
-            },
+            pending_relay("sbx-test", relay_tx, Instant::now()),
         );
 
         registry.reap_expired_relays();
diff --git a/crates/openshell-server/tests/auth_endpoint_integration.rs b/crates/openshell-server/tests/auth_endpoint_integration.rs
index 12f302b63..25a02049d 100644
--- a/crates/openshell-server/tests/auth_endpoint_integration.rs
+++ b/crates/openshell-server/tests/auth_endpoint_integration.rs
@@ -684,6 +684,21 @@ impl openshell_core::proto::open_shell_server::OpenShell for TestOpenShell {
     ) -> Result<tonic::Response<Self::RelayStreamStream>, tonic::Status> {
         Err(tonic::Status::unimplemented("not implemented in test"))
     }
+
+    type ForwardTcpStream = std::pin::Pin<
+        Box<
+            dyn tokio_stream::Stream<
+                Item = Result<openshell_core::proto::TcpForwardFrame, tonic::Status>,
+            > + Send,
+        >,
+    >;
+
+    async fn forward_tcp(
+        &self,
+        _request: tonic::Request<tonic::Streaming<openshell_core::proto::TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, tonic::Status> {
+        Err(tonic::Status::unimplemented("not implemented in test"))
+    }
 }
 
 /// Test 7: Plaintext server (no TLS) accepts both gRPC and HTTP.
diff --git a/crates/openshell-server/tests/edge_tunnel_auth.rs b/crates/openshell-server/tests/edge_tunnel_auth.rs
index e8c7e0038..4a2f6cb11 100644
--- a/crates/openshell-server/tests/edge_tunnel_auth.rs
+++ b/crates/openshell-server/tests/edge_tunnel_auth.rs
@@ -42,9 +42,9 @@ use openshell_core::proto::{
     GetSandboxConfigResponse, GetSandboxProviderEnvironmentRequest,
     GetSandboxProviderEnvironmentResponse, GetSandboxRequest, HealthRequest, HealthResponse,
     ListProvidersRequest, ListProvidersResponse, ListSandboxesRequest, ListSandboxesResponse,
-    ProviderResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse,
-    SandboxStreamEvent, ServiceStatus, SupervisorMessage, UpdateProviderRequest,
-    WatchSandboxRequest,
+    ProviderResponse, RelayFrame, RevokeSshSessionRequest, RevokeSshSessionResponse,
+    SandboxResponse, SandboxStreamEvent, ServiceStatus, SupervisorMessage, TcpForwardFrame,
+    UpdateProviderRequest, WatchSandboxRequest,
     open_shell_client::OpenShellClient,
     open_shell_server::{OpenShell, OpenShellServer},
 };
@@ -317,15 +325,23 @@ impl OpenShell for TestOpenShell {
         Err(Status::unimplemented("not implemented in test"))
     }
 
-    type RelayStreamStream = tokio_stream::wrappers::ReceiverStream<
-        Result<openshell_core::proto::RelayFrame, tonic::Status>,
-    >;
+    type RelayStreamStream = ReceiverStream<Result<RelayFrame, Status>>;
 
     async fn relay_stream(
         &self,
-        _request: tonic::Request<tonic::Streaming<openshell_core::proto::RelayFrame>>,
-    ) -> Result<tonic::Response<Self::RelayStreamStream>, tonic::Status> {
-        Err(tonic::Status::unimplemented("not implemented in test"))
+        _request: tonic::Request<tonic::Streaming<RelayFrame>>,
+    ) -> Result<tonic::Response<Self::RelayStreamStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
+    }
+
+    type ForwardTcpStream =
+        std::pin::Pin<Box<dyn tokio_stream::Stream<Item = Result<TcpForwardFrame, Status>> + Send>>;
+
+    async fn forward_tcp(
+        &self,
+        _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
     }
 }
diff --git a/crates/openshell-server/tests/multiplex_integration.rs b/crates/openshell-server/tests/multiplex_integration.rs
index 561ea2ba7..a230d958a 100644
--- a/crates/openshell-server/tests/multiplex_integration.rs
+++ b/crates/openshell-server/tests/multiplex_integration.rs
@@ -16,9 +16,9 @@ use openshell_core::proto::{
     GetSandboxConfigResponse, GetSandboxProviderEnvironmentRequest,
     GetSandboxProviderEnvironmentResponse, GetSandboxRequest, HealthRequest, HealthResponse,
     ListProvidersRequest, ListProvidersResponse, ListSandboxesRequest, ListSandboxesResponse,
-    ProviderResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse,
-    SandboxStreamEvent, ServiceStatus, SupervisorMessage, UpdateProviderRequest,
-    WatchSandboxRequest,
+    ProviderResponse, RelayFrame, RevokeSshSessionRequest, RevokeSshSessionResponse,
+    SandboxResponse, SandboxStreamEvent, ServiceStatus, SupervisorMessage, TcpForwardFrame,
+    UpdateProviderRequest, WatchSandboxRequest,
     open_shell_client::OpenShellClient,
     open_shell_server::{OpenShell, OpenShellServer},
 };
@@ -285,15 +285,23 @@ impl OpenShell for TestOpenShell {
         Err(Status::unimplemented("not implemented in test"))
     }
 
-    type RelayStreamStream = tokio_stream::wrappers::ReceiverStream<
-        Result<openshell_core::proto::RelayFrame, tonic::Status>,
-    >;
+    type RelayStreamStream = ReceiverStream<Result<RelayFrame, Status>>;
 
     async fn relay_stream(
         &self,
-        _request: tonic::Request<tonic::Streaming<openshell_core::proto::RelayFrame>>,
-    ) -> Result<tonic::Response<Self::RelayStreamStream>, tonic::Status> {
-        Err(tonic::Status::unimplemented("not implemented in test"))
+        _request: tonic::Request<tonic::Streaming<RelayFrame>>,
+    ) -> Result<tonic::Response<Self::RelayStreamStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
+    }
+
+    type ForwardTcpStream =
+        std::pin::Pin<Box<dyn tokio_stream::Stream<Item = Result<TcpForwardFrame, Status>> + Send>>;
+
+    async fn forward_tcp(
+        &self,
+        _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
     }
 }
diff --git a/crates/openshell-server/tests/multiplex_tls_integration.rs b/crates/openshell-server/tests/multiplex_tls_integration.rs
index dc51e6118..95c9beca4 100644
--- a/crates/openshell-server/tests/multiplex_tls_integration.rs
+++ b/crates/openshell-server/tests/multiplex_tls_integration.rs
@@ -18,9 +18,9 @@ use openshell_core::proto::{
     GetSandboxConfigResponse, GetSandboxProviderEnvironmentRequest,
     GetSandboxProviderEnvironmentResponse, GetSandboxRequest, HealthRequest, HealthResponse,
     ListProvidersRequest, ListProvidersResponse, ListSandboxesRequest, ListSandboxesResponse,
-    ProviderResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse,
-    SandboxStreamEvent, ServiceStatus, SupervisorMessage, UpdateProviderRequest,
-    WatchSandboxRequest,
+    ProviderResponse, RelayFrame, RevokeSshSessionRequest, RevokeSshSessionResponse,
+    SandboxResponse, SandboxStreamEvent, ServiceStatus, SupervisorMessage, TcpForwardFrame,
+    UpdateProviderRequest, WatchSandboxRequest,
     open_shell_client::OpenShellClient,
     open_shell_server::{OpenShell, OpenShellServer},
 };
@@ -298,15 +298,23 @@ impl OpenShell for TestOpenShell {
         Err(Status::unimplemented("not implemented in test"))
     }
 
-    type RelayStreamStream = tokio_stream::wrappers::ReceiverStream<
-        Result<openshell_core::proto::RelayFrame, tonic::Status>,
-    >;
+    type RelayStreamStream = ReceiverStream<Result<RelayFrame, Status>>;
 
     async fn relay_stream(
         &self,
-        _request: tonic::Request<tonic::Streaming<openshell_core::proto::RelayFrame>>,
-    ) -> Result<tonic::Response<Self::RelayStreamStream>, tonic::Status> {
-        Err(tonic::Status::unimplemented("not implemented in test"))
+        _request: tonic::Request<tonic::Streaming<RelayFrame>>,
+    ) -> Result<tonic::Response<Self::RelayStreamStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
+    }
+
+    type ForwardTcpStream =
+        std::pin::Pin<Box<dyn tokio_stream::Stream<Item = Result<TcpForwardFrame, Status>> + Send>>;
+
+    async fn forward_tcp(
+        &self,
+        _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
     }
 }
diff --git a/crates/openshell-server/tests/supervisor_relay_integration.rs b/crates/openshell-server/tests/supervisor_relay_integration.rs
index 85d263223..6c8bab316 100644
--- a/crates/openshell-server/tests/supervisor_relay_integration.rs
+++ b/crates/openshell-server/tests/supervisor_relay_integration.rs
@@ -23,7 +23,7 @@ use hyper_util::{
     server::conn::auto::Builder,
 };
 use openshell_core::proto::{
-    GatewayMessage, RelayFrame, RelayInit, SupervisorMessage,
+    GatewayMessage, RelayFrame, RelayInit, SupervisorMessage, TcpForwardFrame,
     open_shell_client::OpenShellClient,
     open_shell_server::{OpenShell, OpenShellServer},
 };
@@ -87,6 +87,15 @@ impl OpenShell for RelayGateway {
         Err(Status::unimplemented("unused"))
     }
 
+    type ForwardTcpStream =
+        std::pin::Pin<Box<dyn tokio_stream::Stream<Item = Result<TcpForwardFrame, Status>> + Send>>;
+    async fn forward_tcp(
+        &self,
+        _: tonic::Request<tonic::Streaming<TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, Status> {
+        Err(Status::unimplemented("unused"))
+    }
+
     async fn health(
         &self,
         _: tonic::Request<HealthRequest>,
@@ -385,7 +394,7 @@ async fn relay_round_trips_bytes() {
 
     tokio::spawn(run_echo_supervisor(channel, channel_id));
 
-    let relay = relay_rx.await.expect("relay duplex");
+    let relay = relay_rx.await.expect("relay result").expect("relay duplex");
     let (mut read_half, mut write_half) = tokio::io::split(relay);
 
     write_half.write_all(b"hello relay").await.expect("write");
@@ -410,7 +419,7 @@
     let supervisor = tokio::spawn(run_echo_supervisor(channel, channel_id));
 
-    let relay = relay_rx.await.expect("relay duplex");
+    let relay = relay_rx.await.expect("relay result").expect("relay duplex");
     drop(relay);
 
     // The supervisor's inbound stream should terminate shortly after the
@@ -455,7 +464,7 @@
         })
     };
 
-    let relay = relay_rx.await.expect("relay duplex");
+    let relay = relay_rx.await.expect("relay result").expect("relay duplex");
     let (mut read_half, _write_half) = tokio::io::split(relay);
     let mut buf = [0u8; 16];
     let n = tokio::time::timeout(Duration::from_secs(5), read_half.read(&mut buf))
@@ -501,8 +510,8 @@
     tokio::spawn(run_echo_supervisor(channel.clone(), id_a));
     tokio::spawn(run_echo_supervisor(channel, id_b));
 
-    let relay_a = rx_a.await.expect("relay a");
-    let relay_b = rx_b.await.expect("relay b");
+    let relay_a = rx_a.await.expect("relay a result").expect("relay a");
+    let relay_b = rx_b.await.expect("relay b result").expect("relay b");
 
     let (mut ra, mut wa) = tokio::io::split(relay_a);
     let (mut rb, mut wb) = tokio::io::split(relay_b);
diff --git a/crates/openshell-server/tests/ws_tunnel_integration.rs b/crates/openshell-server/tests/ws_tunnel_integration.rs
index 584f09281..72861c0a0 100644
--- a/crates/openshell-server/tests/ws_tunnel_integration.rs
+++ b/crates/openshell-server/tests/ws_tunnel_integration.rs
@@ -45,9 +45,9 @@ use openshell_core::proto::{
     GetSandboxConfigResponse, GetSandboxProviderEnvironmentRequest,
     GetSandboxProviderEnvironmentResponse, GetSandboxRequest, HealthRequest, HealthResponse,
     ListProvidersRequest, ListProvidersResponse, ListSandboxesRequest, ListSandboxesResponse,
-    ProviderResponse, RevokeSshSessionRequest, RevokeSshSessionResponse, SandboxResponse,
-    SandboxStreamEvent, ServiceStatus, SupervisorMessage, UpdateProviderRequest,
-    WatchSandboxRequest,
+    ProviderResponse, RelayFrame, RevokeSshSessionRequest, RevokeSshSessionResponse,
+    SandboxResponse, SandboxStreamEvent, ServiceStatus, SupervisorMessage, TcpForwardFrame,
+    UpdateProviderRequest, WatchSandboxRequest,
     open_shell_client::OpenShellClient,
     open_shell_server::{OpenShell, OpenShellServer},
 };
@@ -311,15 +311,23 @@ impl OpenShell for TestOpenShell {
         Err(Status::unimplemented("not implemented in test"))
     }
 
-    type RelayStreamStream = tokio_stream::wrappers::ReceiverStream<
-        Result<openshell_core::proto::RelayFrame, tonic::Status>,
-    >;
+    type RelayStreamStream = ReceiverStream<Result<RelayFrame, Status>>;
 
     async fn relay_stream(
         &self,
-        _request: tonic::Request<tonic::Streaming<openshell_core::proto::RelayFrame>>,
-    ) -> Result<tonic::Response<Self::RelayStreamStream>, tonic::Status> {
-        Err(tonic::Status::unimplemented("not implemented in test"))
+        _request: tonic::Request<tonic::Streaming<RelayFrame>>,
+    ) -> Result<tonic::Response<Self::RelayStreamStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
+    }
+
+    type ForwardTcpStream =
+        std::pin::Pin<Box<dyn tokio_stream::Stream<Item = Result<TcpForwardFrame, Status>> + Send>>;
+
+    async fn forward_tcp(
+        &self,
+        _request: tonic::Request<tonic::Streaming<TcpForwardFrame>>,
+    ) -> Result<tonic::Response<Self::ForwardTcpStream>, Status> {
+        Err(Status::unimplemented("not implemented in test"))
     }
 }
diff --git a/crates/openshell-tui/src/lib.rs b/crates/openshell-tui/src/lib.rs
index 0a8caf675..9f5b08146 100644
--- a/crates/openshell-tui/src/lib.rs
+++ b/crates/openshell-tui/src/lib.rs
@@ -842,10 +842,7 @@ async fn handle_shell_connect(
     let gateway_port_u16 = session.gateway_port as u16;
     let (gateway_host, gateway_port) =
         resolve_ssh_gateway(&session.gateway_host, gateway_port_u16, &app.endpoint);
-    let gateway_url = format!(
-        "{}://{}:{gateway_port}{}",
-        session.gateway_scheme, gateway_host, session.connect_path
-    );
+    let gateway_url = format_gateway_url(&session.gateway_scheme, &gateway_host, gateway_port);
 
     // Step 4: Build the ProxyCommand using our own binary.
     let exe = match std::env::current_exe() {
@@ -990,10 +987,7 @@ async fn handle_exec_command(
     let gateway_port_u16 = session.gateway_port as u16;
     let (gateway_host, gateway_port) =
         resolve_ssh_gateway(&session.gateway_host, gateway_port_u16, &app.endpoint);
-    let gateway_url = format!(
-        "{}://{}:{gateway_port}{}",
-        session.gateway_scheme, gateway_host, session.connect_path
-    );
+    let gateway_url = format_gateway_url(&session.gateway_scheme, &gateway_host, gateway_port);
 
     let exe = match std::env::current_exe() {
         Ok(p) => p,
@@ -1082,7 +1076,8 @@
 
 // SSH utility functions are shared via openshell_core::forward.
 use openshell_core::forward::{
-    build_proxy_command, resolve_ssh_gateway, shell_escape, validate_ssh_session_response,
+    build_proxy_command, format_gateway_url, resolve_ssh_gateway, shell_escape,
+    validate_ssh_session_response,
 };
 
 /// Convert a `SandboxPolicy` proto into styled ratatui lines for the policy viewer.
@@ -1428,10 +1423,7 @@ async fn start_port_forwards(
     let gateway_port_u16 = session.gateway_port as u16;
     let (gateway_host, gateway_port) =
         resolve_ssh_gateway(&session.gateway_host, gateway_port_u16, endpoint);
-    let gateway_url = format!(
-        "{}://{}:{gateway_port}{}",
-        session.gateway_scheme, gateway_host, session.connect_path
-    );
+    let gateway_url = format_gateway_url(&session.gateway_scheme, &gateway_host, gateway_port);
 
     // Build ProxyCommand.
     let exe = match std::env::current_exe() {
diff --git a/docs/sandboxes/manage-sandboxes.mdx b/docs/sandboxes/manage-sandboxes.mdx
index fb24bae9b..33ac289cd 100644
--- a/docs/sandboxes/manage-sandboxes.mdx
+++ b/docs/sandboxes/manage-sandboxes.mdx
@@ -180,6 +180,14 @@ openshell forward list
 openshell forward stop 8000 my-sandbox
 ```
 
+Use gRPC service forwarding when you want to exercise the OS-88 service relay path without SSH port forwarding:
+
+```shell
+openshell forward service my-sandbox --target-port 8000 --local 8000
+```
+
+This binds a local listener and opens one authenticated gRPC stream to the gateway for each accepted local TCP connection. The target must be a loopback TCP service inside the sandbox. Use `--local 127.0.0.1:0` to let OpenShell choose a free local port.
+
 You can also forward a port at creation time with `--forward`:
 
diff --git a/proto/openshell.proto b/proto/openshell.proto
index 1d7eba218..91570bf84 100644
--- a/proto/openshell.proto
+++ b/proto/openshell.proto
@@ -42,6 +42,9 @@ service OpenShell {
   // Execute a command in a ready sandbox and stream output.
   rpc ExecSandbox(ExecSandboxRequest) returns (stream ExecSandboxEvent);
 
+  // Forward one CLI-side TCP connection to a loopback TCP target in a sandbox.
+  rpc ForwardTcp(stream TcpForwardFrame) returns (stream TcpForwardFrame);
+
   // Create a provider.
   rpc CreateProvider(CreateProviderRequest) returns (ProviderResponse);
 
@@ -95,8 +98,9 @@ service OpenShell {
   //
   // The supervisor opens this stream at startup and keeps it alive for the
   // sandbox lifetime. The gateway uses it to coordinate relay channels for
-  // SSH connect and ExecSandbox. Raw SSH bytes flow over RelayStream calls
-  // (separate HTTP/2 streams on the same connection), not over this stream.
+  // SSH connect, ExecSandbox, and targetable sandbox services. Raw service
+  // bytes flow over RelayStream calls (separate HTTP/2 streams on the same
+  // connection), not over this stream.
   rpc ConnectSupervisor(stream SupervisorMessage) returns (stream GatewayMessage);
 
   // Raw byte relay between supervisor and gateway.
@@ -105,8 +109,8 @@ service OpenShell {
   // on its ConnectSupervisor stream. The first RelayFrame carries a
   // RelayInit with the channel_id to associate the new HTTP/2 stream with
   // the pending relay slot on the gateway. Subsequent frames carry raw bytes in either
-  // direction between the gateway-side waiter (ssh_tunnel / exec handler)
-  // and the supervisor-side local SSH daemon bridge.
+  // direction between the gateway-side waiter (ForwardTcp / exec handler)
+  // and the supervisor-side target bridge.
   //
   // This rides the same TCP+TLS+HTTP/2 connection as ConnectSupervisor —
   // no new TLS handshake, no reverse HTTP CONNECT.
@@ -363,11 +367,6 @@ message CreateSshSessionResponse {
   // Gateway scheme. Must be exactly "http" or "https".
   string gateway_scheme = 5;
 
-  // HTTP path for the CONNECT/upgrade endpoint. Must begin with `/`. RFC
-  // 3986 path charset only ([A-Za-z0-9._~!$&'()*+,;=:@/-] plus %HH).
-  // Must not contain `?`, `#`, whitespace, backtick, or backslash.
-  string connect_path = 6;
-
   // Optional host key fingerprint. If non-empty, [A-Za-z0-9:+/=-] only.
   string host_key_fingerprint = 7;
 
@@ -435,6 +434,30 @@ message ExecSandboxEvent {
   }
 }
 
+// Initial frame for one TCP forward stream.
+message TcpForwardInit {
+  // Sandbox id.
+  string sandbox_id = 1;
+  // Optional service identifier for audit/correlation.
+  string service_id = 4;
+  // Target the gateway should request from the supervisor.
+  oneof target {
+    SshRelayTarget ssh = 5;
+    TcpRelayTarget tcp = 6;
+  }
+  // Optional target-specific authorization token. SSH targets use this as the
+  // short-lived SSH session token issued by CreateSshSession.
+  string authorization_token = 7;
+}
+
+// A single frame on the CLI-to-gateway TCP forward stream.
+message TcpForwardFrame {
+  oneof payload {
+    TcpForwardInit init = 1;
+    bytes data = 2;
+  }
+}
+
 // SSH session record stored in persistence.
 message SshSession {
   // Kubernetes-style metadata (id, name, labels, timestamps, resource version).
@@ -835,10 +858,29 @@ message GatewayHeartbeat {}
 // On receiving this, the supervisor should initiate a RelayStream RPC to
 // the gateway, sending a RelayInit in the first RelayFrame to associate
 // the new HTTP/2 stream with the pending relay slot. The supervisor
-// bridges that stream to the local SSH daemon.
+// bridges that stream to the requested local target.
 message RelayOpen {
   // Gateway-allocated channel identifier (UUID).
   string channel_id = 1;
+  // Target the supervisor should dial inside the sandbox.
+  // If absent, supervisors treat the relay as SSH for compatibility.
+  oneof target {
+    SshRelayTarget ssh = 2;
+    TcpRelayTarget tcp = 3;
+  }
+  // Optional service identifier for audit/correlation.
+  string service_id = 5;
+}
+
+// Built-in SSH relay target.
+message SshRelayTarget {}
+
+// TCP target dialed by the supervisor from inside the sandbox.
+message TcpRelayTarget {
+  // Phase 1 accepts loopback only: 127.0.0.1, ::1, or localhost.
+  string host = 1;
+  // Target port. Must fit in u16 and be non-zero.
+  uint32 port = 2;
 }
 
 // Initial RelayStream frame sent by the supervisor to claim a pending relay.