# feat(docker): support GPU sandboxes #1076
```diff
@@ -297,7 +297,8 @@ When environment variables are set, the entrypoint modifies the HelmChart manife
 GPU support is part of the single-node gateway bootstrap path rather than a separate architecture.
 
-- `openshell gateway start --gpu` threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
+- `openshell gateway start` auto-detects GPU support and threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`. Users can force passthrough with `--gpu` or disable auto-detection with `--no-gpu`.
+- Auto-detection enables passthrough when Docker reports NVIDIA CDI devices. For local non-CDI hosts, it also enables passthrough when `/dev/nvidia*` devices exist and Docker reports the NVIDIA runtime. Remote legacy-runtime hosts still require explicit `--gpu`.
 - When enabled, the cluster container is created with Docker `DeviceRequests`. The injection mechanism is selected based on whether CDI is enabled on the daemon (`SystemInfo.CDISpecDirs` via `GET /info`):
   - **CDI enabled** (daemon reports non-empty `CDISpecDirs`): CDI device injection — `driver="cdi"` with `nvidia.com/gpu=all`. Specs are expected to be pre-generated on the host (e.g. automatically by the `nvidia-cdi-refresh.service` or manually via `nvidia-ctk generate`).
   - **CDI not enabled**: `--gpus all` device request — `driver="nvidia"`, `count=-1`, which relies on the NVIDIA Container Runtime hook.
```
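The two injection modes above can be sketched as follows. This is a simplified stand-in mirroring the fields of Docker's `DeviceRequest` JSON (`Driver`, `Count`, `DeviceIDs`), not OpenShell's actual types:

```rust
// Simplified stand-in for Docker's DeviceRequest; OpenShell's real code
// builds the equivalent structure through its Docker Engine API client.
#[derive(Debug, PartialEq)]
pub struct DeviceRequest {
    pub driver: String,
    pub count: i64,
    pub device_ids: Vec<String>,
}

// Select the injection mechanism based on whether the daemon has CDI enabled.
pub fn gpu_device_request(cdi_enabled: bool) -> DeviceRequest {
    if cdi_enabled {
        // CDI injection: name the CDI device directly.
        DeviceRequest {
            driver: "cdi".to_string(),
            count: 0,
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
        }
    } else {
        // Legacy path, equivalent to `docker run --gpus all`.
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1, // -1 requests all GPUs
            device_ids: vec![],
        }
    }
}
```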
|
|
````diff
@@ -317,9 +318,11 @@ Host GPU drivers & NVIDIA Container Toolkit
 └─ Pods: request nvidia.com/gpu in resource limits (CDI injection — no runtimeClassName needed)
 ```
 
-### `--gpu` flag
+### GPU flags
 
-The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.
+`gateway start` enables GPU passthrough automatically when it detects NVIDIA GPU support. The `--gpu` flag forces GPU passthrough even when auto-detection does not find a device. The `--no-gpu` flag disables auto-detection.
+
+OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.
 
 Device injection uses CDI (`deviceListStrategy: cdi-cri`): the device plugin injects devices via direct CDI device requests in the CRI. Sandbox pods only need `nvidia.com/gpu: 1` in their resource limits, and GPU pods do not set `runtimeClassName`.
````

> **Member** commented on the `### GPU flags` section:
>
> Question: Does the gateway still need GPU support? Does the new architecture not delegate this to the driver?
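The auto-detection rule described above can be sketched as a pure decision function. This is a hedged illustration; the actual logic lives in `crates/openshell-bootstrap/src/docker.rs`, and the function and parameter names here are assumptions:

```rust
/// Illustrative decision rule for GPU auto-detection. Explicit `--gpu`
/// bypasses this entirely, and `--no-gpu` prevents it from running.
pub fn should_enable_gpu_passthrough(
    daemon_reports_nvidia_cdi: bool, // NVIDIA CDI devices in `GET /info`
    is_remote_host: bool,
    local_nvidia_devices: bool,      // `/dev/nvidia*` entries exist
    nvidia_runtime_present: bool,    // daemon lists the `nvidia` runtime
) -> bool {
    if daemon_reports_nvidia_cdi {
        return true;
    }
    // The non-CDI fallback applies only to local hosts; remote
    // legacy-runtime hosts still require an explicit `--gpu`.
    !is_remote_host && local_nvidia_devices && nvidia_runtime_present
}
```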
|
|
|
|
```diff
@@ -119,6 +119,9 @@ pub struct DeployOptions {
     /// - `["auto"]` — resolved at deploy time: CDI if enabled on the daemon, else the non-CDI fallback
     /// - `[cdi-ids…]` — CDI DeviceRequest with the given device IDs
     pub gpu: Vec<String>,
+    /// Detect NVIDIA GPU support during deploy and enable passthrough when no
+    /// explicit GPU device IDs were supplied.
+    pub gpu_auto_detect: bool,
     /// When true, destroy any existing gateway resources before deploying.
     /// When false, an existing gateway is left as-is and deployment is
     /// skipped (the caller is responsible for prompting the user first).
```

> **Member** commented on lines +122 to +124:
>
> Question: Is there a reason for a separate value? Does an "auto" element in the gpu list not already do this?
|
```diff
@@ -138,6 +141,7 @@ impl DeployOptions {
             registry_username: None,
             registry_token: None,
             gpu: vec![],
+            gpu_auto_detect: false,
             recreate: false,
         }
     }
```
|
|
```diff
@@ -202,6 +206,13 @@ impl DeployOptions {
         self
     }
 
+    /// Enable or disable automatic GPU passthrough detection.
+    #[must_use]
+    pub fn with_gpu_auto_detect(mut self, auto_detect: bool) -> Self {
+        self.gpu_auto_detect = auto_detect;
+        self
+    }
+
     /// Set whether to destroy and recreate existing gateway resources.
     #[must_use]
     pub fn with_recreate(mut self, recreate: bool) -> Self {
```
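A minimal sketch of how a caller composes the new builder method, using a trimmed stand-in for `DeployOptions` limited to the fields this PR touches (the real struct has many more fields and a different constructor):

```rust
// Trimmed stand-in for DeployOptions with just the fields this PR touches.
#[derive(Default)]
pub struct DeployOptions {
    pub gpu: Vec<String>,
    pub gpu_auto_detect: bool,
    pub recreate: bool,
}

impl DeployOptions {
    /// Enable or disable automatic GPU passthrough detection.
    #[must_use]
    pub fn with_gpu_auto_detect(mut self, auto_detect: bool) -> Self {
        self.gpu_auto_detect = auto_detect;
        self
    }

    /// Set whether to destroy and recreate existing gateway resources.
    #[must_use]
    pub fn with_recreate(mut self, recreate: bool) -> Self {
        self.recreate = recreate;
        self
    }
}
```

In CLI terms, `--no-gpu` would map to `with_gpu_auto_detect(false)`, while the default path passes `true` and leaves `gpu` empty so that detection can fill it in.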
|
|
```diff
@@ -270,7 +281,8 @@ where
     let disable_gateway_auth = options.disable_gateway_auth;
     let registry_username = options.registry_username;
     let registry_token = options.registry_token;
-    let gpu = options.gpu;
+    let mut gpu = options.gpu;
+    let gpu_auto_detect = options.gpu_auto_detect;
     let recreate = options.recreate;
 
     // Wrap on_log in Arc<Mutex<>> so we can share it with pull_remote_image
```
|
```diff
@@ -296,17 +308,22 @@ where
         (preflight.docker, None)
     };
 
-    // CDI is considered enabled when the daemon reports at least one CDI spec
-    // directory via `GET /info` (`SystemInfo.CDISpecDirs`). An empty list or
-    // missing field means CDI is not configured and we fall back to the legacy
-    // NVIDIA `DeviceRequest` (driver="nvidia"). Detection is best-effort —
-    // failure to query daemon info is non-fatal.
-    let cdi_supported = target_docker
-        .info()
-        .await
-        .ok()
-        .and_then(|info| info.cdi_spec_dirs)
-        .is_some_and(|dirs| !dirs.is_empty());
+    // GPU discovery is best-effort. Explicit `--gpu` still uses the legacy
+    // CDI-enabled check below, while auto-detection only enables GPU when the
+    // daemon reports NVIDIA CDI devices or the local host has NVIDIA devices
+    // plus the NVIDIA Docker runtime.
+    let docker_info = target_docker.info().await.ok();
+    let cdi_supported = docker::docker_info_cdi_enabled(docker_info.as_ref());
+    if gpu_auto_detect && gpu.is_empty() {
+        let detected_gpu = docker::auto_detect_gpu_device_ids(
+            docker_info.as_ref(),
+            remote_opts.is_none() && docker::local_nvidia_devices_present(),
+        );
+        if !detected_gpu.is_empty() {
+            log("[status] Detected NVIDIA GPU support".to_string());
+            gpu = detected_gpu;
+        }
+    }
 
     // If an existing gateway is found, decide how to proceed:
     // - recreate: destroy everything and start fresh
```
|