feat(docker): support GPU sandboxes#1076

Closed
drew wants to merge 1 commit into main from docker-gpu-support

Collaborator

drew commented Apr 30, 2026

Summary

Add GPU support for Docker compute-driver sandboxes and enable gateway GPU passthrough auto-detection on capable Docker hosts.

Related Issue

None

Changes

  • Detect NVIDIA CDI devices or the NVIDIA Docker runtime in the Docker compute driver and map GPU sandbox requests to Docker DeviceRequests.
  • Auto-detect gateway GPU passthrough on gateway start, with --gpu to force and --no-gpu to disable.
  • Add focused unit coverage and update CLI/docs/architecture references.
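
The detection-to-request mapping in the first bullet could look roughly like the sketch below. This is not the PR's code: `DaemonInfo`, `DeviceRequest`, and `sketch_gpu_device_request` are hypothetical, simplified stand-ins written in plain Rust, and the concrete field values (the `cdi` driver with `nvidia.com/gpu=all`, and a count of `-1` with the `gpu` capability for the runtime path) are assumptions modeled on how CDI and `--gpus all` requests are commonly expressed.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified view of the daemon info this sketch reads.
#[derive(Debug, Default)]
struct DaemonInfo {
    /// CDI device names visible to the daemon, e.g. "nvidia.com/gpu=all".
    cdi_devices: Vec<String>,
    /// Names of registered container runtimes, e.g. "runc", "nvidia".
    runtimes: BTreeSet<String>,
}

/// Hypothetical stand-in whose fields mirror Docker's DeviceRequest.
#[derive(Debug, PartialEq, Eq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
    capabilities: Vec<Vec<String>>,
}

/// Prefer CDI when an NVIDIA CDI device is visible, otherwise fall back to
/// the NVIDIA runtime path; return None when neither is present.
fn sketch_gpu_device_request(info: &DaemonInfo) -> Option<DeviceRequest> {
    let has_nvidia_cdi = info
        .cdi_devices
        .iter()
        .any(|device| device.starts_with("nvidia.com/gpu="));
    if has_nvidia_cdi {
        return Some(DeviceRequest {
            driver: "cdi".to_string(),
            count: 0,
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
            capabilities: Vec::new(),
        });
    }
    if info.runtimes.contains("nvidia") {
        return Some(DeviceRequest {
            // Assumption: the driver is left empty so the daemon selects one
            // by capability, as with `--gpus all`.
            driver: String::new(),
            count: -1, // -1 requests all available GPUs
            device_ids: Vec::new(),
            capabilities: vec![vec!["gpu".to_string()]],
        });
    }
    None
}
```

Returning `Option` keeps the "no GPU support found" case explicit, so a GPU sandbox request on an incapable host can be rejected rather than silently started without a device.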

Testing

  • cargo fmt
  • cargo test -p openshell-driver-docker
  • cargo test -p openshell-bootstrap auto_detect_gpu
  • cargo test -p openshell-cli gateway_start
  • mise run pre-commit failed on an unrelated local port conflict: sandbox_create_keeps_sandbox_with_forwarding found port 8080 in use by com.docke.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Signed-off-by: Drew Newberry <anewberry@nvidia.com>
drew requested a review from a team as a code owner April 30, 2026 07:05
Member

elezar commented Apr 30, 2026

It seems that we're trying to do the same thing. I had already created #1036 yesterday and have also verified that the landlocked changes from #608 work when added on top.

GPU support is part of the single-node gateway bootstrap path rather than a separate architecture.

- `openshell gateway start --gpu` threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
- `openshell gateway start` auto-detects GPU support and threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`. Users can force passthrough with `--gpu` or disable auto-detection with `--no-gpu`.
Member

Question: Does the gateway still need GPU support? Does the new architecture not delegate this to the driver?

### GPU flags

The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.
`gateway start` enables GPU passthrough automatically when it detects NVIDIA GPU support. The `--gpu` flag forces GPU passthrough even when auto-detection does not find a device. The `--no-gpu` flag disables auto-detection.
Member

Question: Does the `--no-gpu` flag disable auto-detection or disable GPU support?
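
One possible reading of the documented flag semantics, sketched as a three-state mode. The names `GpuMode`, `gpu_mode_from_flags`, and `passthrough_enabled` are hypothetical, and the sketch assumes that `--no-gpu` results in no passthrough at all and that combining both flags is rejected; the thread does not confirm either point.

```rust
/// Hypothetical three-state GPU mode derived from the two CLI flags.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuMode {
    /// Neither flag given: enable passthrough only if detection succeeds.
    Auto,
    /// `--gpu`: enable passthrough regardless of detection.
    Force,
    /// `--no-gpu`: assumed here to mean no passthrough at all.
    Disabled,
}

/// Map the two boolean flags to a mode, rejecting the contradictory pair.
fn gpu_mode_from_flags(gpu: bool, no_gpu: bool) -> Result<GpuMode, String> {
    match (gpu, no_gpu) {
        (true, true) => Err("--gpu and --no-gpu are mutually exclusive".to_string()),
        (true, false) => Ok(GpuMode::Force),
        (false, true) => Ok(GpuMode::Disabled),
        (false, false) => Ok(GpuMode::Auto),
    }
}

/// Decide whether passthrough is enabled given the mode and detection result.
fn passthrough_enabled(mode: GpuMode, detected: bool) -> bool {
    match mode {
        GpuMode::Force => true,
        GpuMode::Disabled => false,
        GpuMode::Auto => detected,
    }
}
```

Under these assumptions the two readings in the question coincide: with auto-detection off and no `--gpu`, nothing else would turn passthrough on.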

Comment on lines +122 to +124
/// Detect NVIDIA GPU support during deploy and enable passthrough when no
/// explicit GPU device IDs were supplied.
pub gpu_auto_detect: bool,
Member

Question: Is there a reason for a separate value? Does an "auto" element in the gpu list not already do this?
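
The alternative raised here, folding auto-detection into the GPU list itself instead of a separate `gpu_auto_detect` flag, might be sketched as follows. `GpuSelection` and `parse_gpu_selection` are hypothetical names, and treating `auto` as a sentinel that overrides explicit device IDs is an assumption; the existing handling of an `auto` element is not shown in this thread.

```rust
/// Hypothetical single value that replaces a device list plus a boolean.
#[derive(Debug, PartialEq, Eq)]
enum GpuSelection {
    /// No GPU entries were supplied.
    None,
    /// The list contained the "auto" sentinel.
    Auto,
    /// Explicit device IDs were supplied.
    Devices(Vec<String>),
}

/// Interpret a raw GPU list, treating "auto" (case-insensitive) as a sentinel.
fn parse_gpu_selection(values: &[&str]) -> GpuSelection {
    if values.is_empty() {
        return GpuSelection::None;
    }
    if values.iter().any(|value| value.eq_ignore_ascii_case("auto")) {
        // Assumption: "auto" wins over any explicit IDs in the same list.
        return GpuSelection::Auto;
    }
    GpuSelection::Devices(values.iter().map(|value| value.to_string()).collect())
}
```

A single enum makes the states mutually exclusive by construction, whereas a list plus a boolean allows combinations such as explicit IDs together with auto-detection that then need a documented precedence.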

Comment on lines +205 to +210
let gpu_device_request = docker
    .info()
    .await
    .ok()
    .as_ref()
    .and_then(docker_gpu_device_request_from_info);
Member

Question: Is there ONE DockerComputeDriver for all sandboxes, or is there one per sandbox?
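
The question matters because the snippet probes the daemon where the driver is built. If a single driver instance serves every sandbox, the probe runs once and its result is shared; if a driver is built per sandbox, it runs once per sandbox. A minimal sketch of the shared case, using a hypothetical `SharedDriver` type and a caller-supplied probe closure in place of the real `docker.info()` call:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical driver shared by all sandboxes; caches the GPU probe result.
struct SharedDriver {
    gpu_supported: OnceLock<bool>,
    /// Counts how often the probe actually ran (for illustration only).
    probes: AtomicUsize,
}

impl SharedDriver {
    fn new() -> Self {
        Self {
            gpu_supported: OnceLock::new(),
            probes: AtomicUsize::new(0),
        }
    }

    /// Run `probe` on first use only; later calls return the cached answer
    /// and never invoke their closure.
    fn gpu_supported(&self, probe: impl FnOnce() -> bool) -> bool {
        *self.gpu_supported.get_or_init(|| {
            self.probes.fetch_add(1, Ordering::Relaxed);
            probe()
        })
    }

    fn probe_count(&self) -> usize {
        self.probes.load(Ordering::Relaxed)
    }
}
```

The trade-off in this sketch is staleness: a cached result would not notice GPU support being added to or removed from the host while the driver is running.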

drew closed this Apr 30, 2026