
Adding qemu vm driver support with GPU pass-through#992

Open
vince-brisebois wants to merge 2 commits into main from
vcauxbrisebo/vm-gpu-support-driver

Conversation

vince-brisebois (Collaborator):

Summary

Add QEMU backend support to the VM compute driver with VFIO GPU passthrough, enabling GPU-accelerated sandboxes on hosts without libkrun support. Includes a new openshell-vfio crate for safe GPU bind/unbind lifecycle, TAP networking with RAII cleanup, guest init GPU initialization, and automatic gateway registration in start.sh.

Related Issue

Changes

  • New openshell-vfio crate: Safe VFIO GPU bind/unbind with GpuBindGuard RAII, IOMMU group companion device handling, crash recovery via reconcile_stale_bindings, and atomic state persistence
  • QEMU launch path (runtime.rs): Q35/KVM with virtiofs, TAP networking, vhost-vsock, PCIe root ports for GPU passthrough; TapGuard RAII for leak-free TAP/iptables cleanup; procguard integration for virtiofsd and QEMU child processes
  • GPU inventory and subnet management (gpu.rs): GpuInventory for tracking GPU assignments, SubnetAllocator for per-VM TAP subnets, vsock CID allocation
  • Driver integration (driver.rs): GPU assignment/release in create_sandbox/delete_sandbox/monitor_sandbox, build_guest_environment with endpoint override for TAP path, GPU release on all error paths and abnormal VM exit
  • Guest init GPU support (openshell-vm-sandbox-init.sh): Kernel cmdline parsing for GPU_ENABLED, firmware staging to tmpfs, nvidia module loading, nvidia-smi validation, TAP static networking with DNS from kernel cmdline
  • Proto changes: Added supports_gpu, gpu_count to GetCapabilitiesResponse; gpu, gpu_device to DriverSandboxSpec and CreateSandboxRequest
  • CLI: Plumbed --gpu and --gpu-device flags through to CreateSandboxRequest
  • Gateway auto-registration: start.sh now runs gateway add before starting the server (using sudo -u $SUDO_USER for correct config ownership), eliminating the manual registration step
  • Documentation: Updated architecture/vm-gpu-sandbox-guide.md, crates/openshell-driver-vm/README.md
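
The GpuBindGuard mentioned above follows Rust's standard RAII pattern: the guard unbinds the device on Drop unless it is explicitly disarmed after a successful launch. A minimal sketch of that lifecycle (everything except the GpuBindGuard name is illustrative; the real openshell-vfio crate also handles IOMMU companion devices and persisted state):

```rust
// Minimal RAII sketch of the bind/unbind lifecycle. The real crate
// writes to the vfio-pci sysfs bind/unbind paths; this only shows
// the Drop-based cleanup idea.
struct GpuBindGuard {
    bdf: String,    // PCI address, e.g. "0000:2d:00.0"
    released: bool, // set once the VM successfully owns the device
}

impl GpuBindGuard {
    fn bind(bdf: &str) -> Self {
        // Real implementation: bind the device to vfio-pci here.
        GpuBindGuard { bdf: bdf.to_string(), released: false }
    }

    /// Disarm the guard once the VM owns the device.
    fn release(mut self) -> String {
        self.released = true;
        self.bdf.clone()
    }
}

impl Drop for GpuBindGuard {
    fn drop(&mut self) {
        if !self.released {
            // Real implementation: unbind from vfio-pci and restore
            // the host driver so a failed launch never leaks the GPU.
            eprintln!("unbinding {} after failed launch", self.bdf);
        }
    }
}
```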

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

copy-pr-bot (Bot) commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vince-brisebois vince-brisebois marked this pull request as ready for review April 27, 2026 22:03
@vince-brisebois vince-brisebois requested a review from a team as a code owner April 27, 2026 22:03
@vince-brisebois vince-brisebois requested a review from drew April 27, 2026 22:03
Comment thread architecture/podman-rootless-networking.md
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch from 7a747ab to 98c8eca on April 27, 2026 22:07
Comment thread crates/openshell-driver-vm/src/main.rs
Comment thread crates/openshell-driver-vm/start.sh
Comment on lines +1141 to +1144
/// Target a specific GPU by PCI address (e.g. "0000:2d:00.0") or index (e.g. "0", "1").
/// Only valid with --gpu. When omitted with --gpu, the first available GPU is assigned.
#[arg(long, requires = "gpu")]
gpu_device: Option<String>,
Member:

Just to clarify, this is not specific to the VM driver and could be mapped to requests in k8s, Docker, or Podman?

As a follow up question: Does it make sense to allow gpu_device to be specified multiple times to allow for multiple devices, or should validation (e.g. a comma-separated list) be delegated to the driver?

Collaborator (Author):

Good question on both points. Yes, --gpu and --gpu-device are intentionally driver-agnostic — the proto defines them on CreateSandboxRequest and DriverSandboxSpec, so k8s/Docker/Podman drivers can map them to their native GPU request mechanisms. For multi-device: today the proto field is a single string, so multi-GPU per sandbox would need a proto change (repeated string gpu_devices) plus inventory updates. I propose to update this in a follow-up PR.

Member:

I'm fine with a follow-up. Would an issue to discuss how users are expected to request GPUs be a good place to have a follow-up discussion? Some of the basic use cases that I can see are:

  1. A user wants a sandbox with any GPU. (count == 1)
  2. A user wants a sandbox with a specific number of GPUs. (count > 1).
  3. A user wants a sandbox with a SPECIFIC set of GPUs. (Specified by driver-specific IDs).

A more advanced use case that one could also start discussing is when a user wants a sandbox with access to one or more GPUs with specific properties. I would assume that this could also be reduced to a set of driver-specific IDs though, so maybe it is sufficient to demonstrate this transform.
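
The three request shapes above reduce to two cases: "any N GPUs" versus "exactly these device IDs". A hypothetical sketch of that model, purely for discussion (none of these types exist in the PR):

```rust
// Hypothetical request model for the follow-up discussion.
// Case 1 is Any { count: 1 }, case 2 is Any { count: n },
// case 3 is Specific with driver-specific device IDs.
enum GpuRequest {
    Any { count: u32 },
    Specific(Vec<String>),
}

/// Resolve a request against the driver's available devices,
/// returning the concrete assignment or None if unsatisfiable.
fn normalize(req: &GpuRequest, available: &[String]) -> Option<Vec<String>> {
    match req {
        GpuRequest::Any { count } => {
            let n = *count as usize;
            if available.len() >= n {
                Some(available[..n].to_vec())
            } else {
                None
            }
        }
        GpuRequest::Specific(ids) => {
            if ids.iter().all(|id| available.contains(id)) {
                Some(ids.clone())
            } else {
                None
            }
        }
    }
}
```

The "GPUs with specific properties" case would then be a driver-side transform from a property filter down to a Specific set of IDs, as suggested above.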

driver_version: openshell_core::VERSION.to_string(),
default_image: self.config.default_image.clone(),
supports_gpu: self.has_gpu_capacity().await.unwrap_or(false),
gpu_count: 0,
Member:

Question: is a raw int rich enough here? Should a driver expose the valid names of devices that are available, for example?

Collaborator (Author):

Agreed, a raw int is limited. A richer repeated GpuDeviceInfo message (with BDF, device name, availability) on GetCapabilitiesResponse would let the CLI show available devices and validate --gpu-device client-side. I propose to address this, along with the previous one, in a follow-up PR.

Comment on lines +258 to +259
supports_gpu: self.gpu_inventory.is_some(),
gpu_count: self.gpu_count,
Member:

Why is gpu_count not just the length of gpu_inventory? Is there a chance that self.gpu_inventory and self.gpu_count get "out of sync"?

Collaborator (Author):

Good catch — removed the separate gpu_count field from VmDriver entirely. capabilities() now derives it on demand by locking the inventory and calling gpu_count(). This eliminates any possibility of the two getting out of sync.
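
A minimal sketch of the derive-on-demand shape described here (field and method names are illustrative, matching the snippets above where they appear):

```rust
use std::sync::{Arc, Mutex};

// Sketch: gpu_count is derived from the inventory on demand rather
// than cached in a separate VmDriver field, so the two values can
// never drift apart.
struct GpuInventory {
    devices: Vec<String>, // PCI addresses of managed GPUs
}

impl GpuInventory {
    fn gpu_count(&self) -> u32 {
        self.devices.len() as u32
    }
}

struct VmDriver {
    gpu_inventory: Option<Arc<Mutex<GpuInventory>>>,
}

impl VmDriver {
    /// What capabilities() reports: lock the inventory and count,
    /// or 0 when GPU support is disabled entirely.
    fn capabilities_gpu_count(&self) -> u32 {
        match &self.gpu_inventory {
            Some(inv) => inv.lock().unwrap().gpu_count(),
            None => 0,
        }
    }
}
```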

Comment on lines +374 to +376
command
.arg("--vm-krun-log-level")
.arg(self.config.krun_log_level.to_string());
Member:

Question: Why is --vm-krun-log-level set here and in the non-GPU branch?

Collaborator (Author):

You're right, no reason for it to be duplicated. Hoisted --vm-krun-log-level out of both the GPU and non-GPU branches — it's now set once after the if/else block since it's common to both backends.
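
The shape of the hoist, sketched with illustrative argument values (only --vm-krun-log-level comes from the diff; the backend flags are hypothetical stand-ins for the branch-specific setup):

```rust
use std::process::Command;

// Sketch of the refactor: branch-specific args stay inside the
// if/else, while the flag shared by both backends is set once after.
fn build_command(gpu_enabled: bool, log_level: &str) -> Command {
    let mut command = Command::new("openshell-vm");
    if gpu_enabled {
        command.arg("--backend").arg("qemu");
    } else {
        command.arg("--backend").arg("libkrun");
    }
    // Hoisted out of both branches: common to both backends.
    command.arg("--vm-krun-log-level").arg(log_level);
    command
}
```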

// there is a single OPENSHELL_ENDPOINT value in the env list.
let endpoint_override = if gpu_bdf.is_some() {
let subnet = match self
.subnet_allocator
Member:

The name subnet_allocator does not make it clear that this is required for GPU injection. If they're dependent on each other, maybe there's a better way to indicate this relationship.

Collaborator (Author):

Good point — TAP subnet allocation is exclusively a GPU concern. I suggest a follow-up where I can move SubnetAllocator into GpuInventory (or a new GpuNetworking wrapper) so the dependency is structurally explicit rather than relying on naming alone. That'll also let us wrap both behind the existing Option<Arc<Mutex<...>>> gate and skip initialization when GPUs are disabled.
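
One hypothetical shape for that refactor, purely as a sketch (GpuNetworking is the proposed wrapper name from this comment; everything else is illustrative):

```rust
// Sketch: the TAP subnet allocator lives inside the GPU-specific
// state, so it structurally cannot exist when GPUs are disabled.
struct SubnetAllocator {
    next_octet: u8,
}

impl SubnetAllocator {
    fn allocate(&mut self) -> String {
        let subnet = format!("10.77.{}.0/24", self.next_octet);
        self.next_octet += 1;
        subnet
    }
}

struct GpuNetworking {
    gpus: Vec<String>,        // managed GPU BDFs
    subnets: SubnetAllocator, // per-VM TAP subnets, GPU-only concern
}

impl GpuNetworking {
    /// Assign a GPU together with its TAP subnet, or None when
    /// no GPUs remain; a driver without GPUs simply holds no
    /// GpuNetworking at all.
    fn assign(&mut self) -> Option<(String, String)> {
        let bdf = self.gpus.pop()?;
        Some((bdf, self.subnets.allocate()))
    }
}
```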

Member:

TAP subnet allocation is exclusively a GPU concern

Not knowing enough about why this is required, a naive question I would have is whether this is always the case, or only the case for the current vm driver feature set? Is it realistic that a user would expect to be able to configure something like this in the future?

A follow up to make the dependency structurally explicit sounds good though.

drew previously approved these changes Apr 29, 2026
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch 2 times, most recently from b052bde to 76e54d8 on April 29, 2026 04:30
…upervisor reliability issues discovered during GPU VM bring-up.

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch from 76e54d8 to 38f069e on April 29, 2026 04:35