Adding qemu vm driver support with GPU pass-through #992
vince-brisebois wants to merge 2 commits into main
Conversation
7a747ab to 98c8eca
```rust
/// Target a specific GPU by PCI address (e.g. "0000:2d:00.0") or index (e.g. "0", "1").
/// Only valid with --gpu. When omitted with --gpu, the first available GPU is assigned.
#[arg(long, requires = "gpu")]
gpu_device: Option<String>,
```
Just to clarify, this is not specific to the VM driver and could be mapped to requests in k8s, Docker, or Podman?
As a follow-up question: does it make sense to allow gpu_device to be specified multiple times to allow for multiple devices, or should validation (e.g. of a comma-separated list) be delegated to the driver?
Good question on both points. Yes, --gpu and --gpu-device are intentionally driver-agnostic — the proto defines them on CreateSandboxRequest and DriverSandboxSpec, so k8s/Docker/Podman drivers can map them to their native GPU request mechanisms. For multi-device: today the proto field is a single string, so multi-GPU per sandbox would need a proto change (repeated string gpu_devices) plus inventory updates. I propose to update this in a follow-up PR.
I'm fine with a follow-up. Would an issue to discuss how users are expected to request GPUs be a good place to have a follow-up discussion? Some of the basic use cases that I can see are:
- A user wants a sandbox with any GPU (count == 1).
- A user wants a sandbox with a specific number of GPUs (count > 1).
- A user wants a sandbox with a SPECIFIC set of GPUs (specified by driver-specific IDs).
A more advanced use case that one could also start discussing is when a user wants a sandbox with access to one or more GPUs with specific properties. I would assume that this could also be reduced to a set of driver-specific IDs though, so maybe it is sufficient to demonstrate this transform.
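The three request shapes above, plus the "reduce to driver-specific IDs" transform mentioned at the end, can be sketched in Rust. This is a minimal illustration, not the actual proto or driver code; `GpuRequest` and `resolve` are hypothetical names, and the inventory is just a list of BDF strings.

```rust
use std::collections::HashSet;

/// Hypothetical model of the three request shapes from the discussion.
#[derive(Debug, Clone, PartialEq)]
enum GpuRequest {
    /// Any single GPU (equivalent to Count(1)).
    Any,
    /// A specific number of GPUs; the driver picks which ones.
    Count(u32),
    /// A specific set of driver-specific device IDs (e.g. PCI BDFs).
    Devices(Vec<String>),
}

impl GpuRequest {
    /// Reduce every variant to a concrete set of device IDs drawn from the
    /// driver's inventory, demonstrating the transform the thread describes.
    /// Returns None when the request cannot be satisfied.
    fn resolve(&self, inventory: &[String]) -> Option<Vec<String>> {
        match self {
            GpuRequest::Any => inventory.first().map(|d| vec![d.clone()]),
            GpuRequest::Count(n) => {
                let n = *n as usize;
                if inventory.len() >= n {
                    Some(inventory[..n].to_vec())
                } else {
                    None
                }
            }
            GpuRequest::Devices(ids) => {
                let have: HashSet<&String> = inventory.iter().collect();
                ids.iter().all(|id| have.contains(id)).then(|| ids.clone())
            }
        }
    }
}

fn main() {
    let inv = vec!["0000:2d:00.0".to_string(), "0000:2e:00.0".to_string()];
    println!("{:?}", GpuRequest::Count(2).resolve(&inv));
}
```

The property-based use case ("one or more GPUs with specific properties") would be another variant that filters the inventory before reducing to IDs, which supports the point that a single ID-set representation may be sufficient at the driver boundary.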
```rust
driver_version: openshell_core::VERSION.to_string(),
default_image: self.config.default_image.clone(),
supports_gpu: self.has_gpu_capacity().await.unwrap_or(false),
gpu_count: 0,
```
Question: is a raw int rich enough here? Should a driver expose the valid names of devices that are available, for example?
Agreed, a raw int is limited. A richer repeated GpuDeviceInfo message (with BDF, device name, availability) on GetCapabilitiesResponse would let the CLI show available devices and validate --gpu-device client-side. I propose to address this, along with the previous one, in a follow-up PR.
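A sketch of what that richer capability data could look like, and how the CLI could use it to validate --gpu-device client-side. The struct fields and the `validate_gpu_device` helper are assumptions for illustration, not the actual proto message.

```rust
/// Hypothetical per-device capability entry; in the real proposal this
/// would be a repeated GpuDeviceInfo message on GetCapabilitiesResponse.
#[derive(Debug, Clone)]
struct GpuDeviceInfo {
    /// PCI bus/device/function address, e.g. "0000:2d:00.0".
    bdf: String,
    /// Human-readable device name.
    name: String,
    /// Whether the device is currently unassigned.
    available: bool,
}

/// Client-side validation of a --gpu-device argument (index or BDF)
/// against the devices a driver reports.
fn validate_gpu_device(arg: &str, devices: &[GpuDeviceInfo]) -> Result<String, String> {
    // Mirror the CLI semantics: accept either an index ("0", "1") or a full BDF.
    let found = if let Ok(idx) = arg.parse::<usize>() {
        devices.get(idx)
    } else {
        devices.iter().find(|d| d.bdf == arg)
    };
    match found {
        Some(d) if d.available => Ok(d.bdf.clone()),
        Some(d) => Err(format!("GPU {} ({}) is not available", d.bdf, d.name)),
        None => Err(format!("no GPU matching '{}'", arg)),
    }
}

fn main() {
    let devices = vec![GpuDeviceInfo {
        bdf: "0000:2d:00.0".to_string(),
        name: "example GPU".to_string(),
        available: true,
    }];
    println!("{:?}", validate_gpu_device("0", &devices));
}
```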
```rust
supports_gpu: self.gpu_inventory.is_some(),
gpu_count: self.gpu_count,
```
Why is gpu_count not just the length of gpu_inventory? Is there a chance that self.gpu_inventory and self.gpu_count get "out of sync"?
Good catch — removed the separate gpu_count field from VmDriver entirely. capabilities() now derives it on demand by locking the inventory and calling gpu_count(). This eliminates any possibility of the two getting out of sync.
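The derive-on-demand shape described above looks roughly like the following. This is a simplified sketch: the real driver's inventory type and its async locking are not shown in this thread, so a plain `std::sync::Mutex` stands in here.

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the driver's GPU inventory.
struct GpuInventory {
    devices: Vec<String>,
}

impl GpuInventory {
    fn gpu_count(&self) -> usize {
        self.devices.len()
    }
}

struct VmDriver {
    // No cached gpu_count field: the count is derived on demand,
    // so it can never drift from the inventory's contents.
    gpu_inventory: Option<Arc<Mutex<GpuInventory>>>,
}

impl VmDriver {
    /// What capabilities() would call: lock the inventory and ask it.
    fn gpu_count(&self) -> usize {
        self.gpu_inventory
            .as_ref()
            .map(|inv| inv.lock().unwrap().gpu_count())
            .unwrap_or(0)
    }
}

fn main() {
    let driver = VmDriver {
        gpu_inventory: Some(Arc::new(Mutex::new(GpuInventory {
            devices: vec!["0000:2d:00.0".to_string()],
        }))),
    };
    println!("gpu_count = {}", driver.gpu_count());
}
```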
```rust
command
    .arg("--vm-krun-log-level")
    .arg(self.config.krun_log_level.to_string());
```
Question: Why is --vm-krun-log-level set both here and in the non-GPU branch?
You're right, no reason for it to be duplicated. Hoisted --vm-krun-log-level out of both the GPU and non-GPU branches — it's now set once after the if/else block since it's common to both backends.
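The hoisting described above follows this shape. A sketch only: the binary name and backend flags are placeholders, and only the --vm-krun-log-level flag is taken from the thread.

```rust
use std::process::Command;

fn build_command(gpu: bool, krun_log_level: u32) -> Command {
    let mut command = Command::new("vm-runner"); // hypothetical binary name
    if gpu {
        command.arg("--backend").arg("qemu"); // QEMU/GPU-specific args go here
    } else {
        command.arg("--backend").arg("libkrun"); // libkrun-specific args go here
    }
    // Common to both backends, so it is set exactly once after the
    // if/else instead of being duplicated in each branch.
    command
        .arg("--vm-krun-log-level")
        .arg(krun_log_level.to_string());
    command
}

fn main() {
    let cmd = build_command(true, 3);
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("{:?}", args);
}
```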
```rust
// there is a single OPENSHELL_ENDPOINT value in the env list.
let endpoint_override = if gpu_bdf.is_some() {
    let subnet = match self
        .subnet_allocator
```
The name subnet_allocator does not make it clear that this is required for GPU injection. If they're dependent on each other, maybe there's a better way to indicate this relationship.
Good point — TAP subnet allocation is exclusively a GPU concern. I suggest a follow-up where I can move SubnetAllocator into GpuInventory (or a new GpuNetworking wrapper) so the dependency is structurally explicit rather than relying on naming alone. That'll also let us wrap both behind the existing Option<Arc<Mutex<...>>> gate and skip initialization when GPUs are disabled.
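The proposed structural coupling could look like this. All type names here (`GpuNetworking`, the fields, the `allocate_gpu_subnet` helper, and the subnet scheme) are illustrative assumptions, not the real openshell types.

```rust
use std::sync::{Arc, Mutex};

/// Illustrative allocator; the real SubnetAllocator is not shown in this thread.
struct SubnetAllocator {
    next: u8,
}

impl SubnetAllocator {
    fn allocate(&mut self) -> String {
        let subnet = format!("192.168.{}.0/24", self.next);
        self.next += 1;
        subnet
    }
}

/// Wrapping the allocator inside the GPU state makes the dependency
/// structurally explicit instead of relying on naming alone.
struct GpuNetworking {
    inventory: Vec<String>,
    subnets: SubnetAllocator,
}

struct VmDriver {
    // One Option gates both: when GPUs are disabled, neither the
    // inventory nor the allocator is ever initialized.
    gpu: Option<Arc<Mutex<GpuNetworking>>>,
}

impl VmDriver {
    fn allocate_gpu_subnet(&self) -> Option<String> {
        self.gpu
            .as_ref()
            .map(|g| g.lock().unwrap().subnets.allocate())
    }
}

fn main() {
    let driver = VmDriver {
        gpu: Some(Arc::new(Mutex::new(GpuNetworking {
            inventory: vec!["0000:2d:00.0".to_string()],
            subnets: SubnetAllocator { next: 100 },
        }))),
    };
    println!("{:?}", driver.allocate_gpu_subnet());
}
```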
> TAP subnet allocation is exclusively a GPU concern
Not knowing enough about why this is required, a naive question I would have is whether this is always the case, or only the case for the current vm driver feature set? Is it realistic that a user would expect to be able to configure something like this in the future?
A follow up to make the dependency structurally explicit sounds good though.
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
b052bde to 76e54d8
…upervisor reliability issues discovered during GPU VM bring-up. Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
76e54d8 to 38f069e
Summary
Add QEMU backend support to the VM compute driver with VFIO GPU passthrough, enabling GPU-accelerated sandboxes on hosts without libkrun support. Includes a new openshell-vfio crate for safe GPU bind/unbind lifecycle, TAP networking with RAII cleanup, guest init GPU initialization, and automatic gateway registration in start.sh.
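The "TAP networking with RAII cleanup" mentioned above follows the standard Rust guard pattern: teardown lives in `Drop`, so the interface is removed on every exit path. This is a minimal sketch of the pattern only; `TapDevice` and the `cleaned` test hook are hypothetical, and the real openshell-vfio code would shell out to (or ioctl for) actual TAP creation and deletion.

```rust
use std::cell::Cell;
use std::rc::Rc;

struct TapDevice {
    name: String,
    // Test hook standing in for the real interface deletion
    // (e.g. `ip tuntap del`), so cleanup is observable here.
    cleaned: Rc<Cell<bool>>,
}

impl TapDevice {
    fn create(name: &str, cleaned: Rc<Cell<bool>>) -> std::io::Result<TapDevice> {
        // The real implementation would create the TAP interface here.
        Ok(TapDevice {
            name: name.to_string(),
            cleaned,
        })
    }
}

impl Drop for TapDevice {
    fn drop(&mut self) {
        // Runs on every exit path, including early returns and unwinding,
        // so the interface is never leaked when VM startup fails midway.
        self.cleaned.set(true);
        eprintln!("removed tap device {}", self.name);
    }
}

fn main() {
    let cleaned = Rc::new(Cell::new(false));
    {
        let _tap = TapDevice::create("ostap0", Rc::clone(&cleaned)).unwrap();
        // ... VM runs while the guard is alive ...
    } // guard dropped here, interface removed
    assert!(cleaned.get());
}
```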
Related Issue
Changes
Testing
`mise run pre-commit` passes
Checklist