
Adding qemu vm driver support with GPU pass-through#992

Open
vince-brisebois wants to merge 2 commits into main from
vcauxbrisebo/vm-gpu-support-driver

Conversation

vince-brisebois (Collaborator):

Summary

Add QEMU backend support to the VM compute driver with VFIO GPU passthrough, enabling GPU-accelerated sandboxes on hosts without libkrun support. Includes a new openshell-vfio crate for safe GPU bind/unbind lifecycle, TAP networking with RAII cleanup, guest init GPU initialization, and automatic gateway registration in start.sh.

Related Issue

Changes

  • New openshell-vfio crate: Safe VFIO GPU bind/unbind with GpuBindGuard RAII, IOMMU group companion device handling, crash recovery via reconcile_stale_bindings, and atomic state persistence
  • QEMU launch path (runtime.rs): Q35/KVM with virtiofs, TAP networking, vhost-vsock, PCIe root ports for GPU passthrough; TapGuard RAII for leak-free TAP/iptables cleanup; procguard integration for virtiofsd and QEMU child processes
  • GPU inventory and subnet management (gpu.rs): GpuInventory for tracking GPU assignments, SubnetAllocator for per-VM TAP subnets, vsock CID allocation
  • Driver integration (driver.rs): GPU assignment/release in create_sandbox/delete_sandbox/monitor_sandbox, build_guest_environment with endpoint override for TAP path, GPU release on all error paths and abnormal VM exit
  • Guest init GPU support (openshell-vm-sandbox-init.sh): Kernel cmdline parsing for GPU_ENABLED, firmware staging to tmpfs, nvidia module loading, nvidia-smi validation, TAP static networking with DNS from kernel cmdline
  • Proto changes: Added supports_gpu, gpu_count to GetCapabilitiesResponse; gpu, gpu_device to DriverSandboxSpec and CreateSandboxRequest
  • CLI: Plumbed --gpu and --gpu-device flags through to CreateSandboxRequest
  • Gateway auto-registration: start.sh now runs gateway add before starting the server (using sudo -u $SUDO_USER for correct config ownership), eliminating the manual registration step
  • Documentation: Updated architecture/vm-gpu-sandbox-guide.md, crates/openshell-driver-vm/README.md
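
The GpuBindGuard mentioned above follows Rust's standard RAII pattern: the guard unbinds the device on Drop unless it is explicitly disarmed after a successful launch. A minimal sketch of that lifecycle (everything except the GpuBindGuard name is illustrative; the real openshell-vfio crate also handles IOMMU companion devices and persisted state):

```rust
// Minimal RAII sketch of the bind/unbind lifecycle. The real crate
// writes to the vfio-pci sysfs bind/unbind paths; this only shows
// the Drop-based cleanup idea.
struct GpuBindGuard {
    bdf: String,    // PCI address, e.g. "0000:2d:00.0"
    released: bool, // set once the VM successfully owns the device
}

impl GpuBindGuard {
    fn bind(bdf: &str) -> Self {
        // Real implementation: bind the device to vfio-pci here.
        GpuBindGuard { bdf: bdf.to_string(), released: false }
    }

    /// Disarm the guard once the VM owns the device.
    fn release(mut self) -> String {
        self.released = true;
        self.bdf.clone()
    }
}

impl Drop for GpuBindGuard {
    fn drop(&mut self) {
        if !self.released {
            // Real implementation: unbind from vfio-pci and restore
            // the host driver so a failed launch never leaks the GPU.
            eprintln!("unbinding {} after failed launch", self.bdf);
        }
    }
}
```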

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

copy-pr-bot (Bot) commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vince-brisebois vince-brisebois marked this pull request as ready for review April 27, 2026 22:03
@vince-brisebois vince-brisebois requested a review from a team as a code owner April 27, 2026 22:03
@vince-brisebois vince-brisebois requested a review from drew April 27, 2026 22:03
Comment thread architecture/podman-rootless-networking.md
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch from 7a747ab to 98c8eca on April 27, 2026 22:07
Comment thread crates/openshell-driver-vm/src/main.rs
Comment thread crates/openshell-driver-vm/start.sh
Comment on lines +1141 to +1144
/// Target a specific GPU by PCI address (e.g. "0000:2d:00.0") or index (e.g. "0", "1").
/// Only valid with --gpu. When omitted with --gpu, the first available GPU is assigned.
#[arg(long, requires = "gpu")]
gpu_device: Option<String>,
Member:

Just to clarify, this is not specific to the VM driver and could be mapped to requests in k8s, Docker, or Podman?

As a follow up question: Does it make sense to allow gpu_device to be specified multiple times to allow for multiple devices, or should validation (e.g. a comma-separated list) be delegated to the driver?

Collaborator (Author):

Good question on both points. Yes, --gpu and --gpu-device are intentionally driver-agnostic — the proto defines them on CreateSandboxRequest and DriverSandboxSpec, so k8s/Docker/Podman drivers can map them to their native GPU request mechanisms. For multi-device: today the proto field is a single string, so multi-GPU per sandbox would need a proto change (repeated string gpu_devices) plus inventory updates. I propose to update this in a follow-up PR.

Member:

I'm fine with a follow-up. Would an issue to discuss how users are expected to request GPUs be a good place to have a follow-up discussion? Some of the basic use cases that I can see are:

  1. A user wants a sandbox with any GPU. (count == 1)
  2. A user wants a sandbox with a specific number of GPUs. (count > 1).
  3. A user wants a sandbox with a SPECIFIC set of GPUs. (Specified by driver-specific IDs).

A more advanced use case that one could also start discussing is when a user wants a sandbox with access to one or more GPUs with specific properties. I would assume that this could also be reduced to a set of driver-specific IDs though, so maybe it is sufficient to demonstrate this transform.
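
The three request shapes above reduce to two cases: "any N GPUs" versus "exactly these device IDs". A hypothetical sketch of that model, purely for discussion (none of these types exist in the PR):

```rust
// Hypothetical request model for the follow-up discussion.
// Case 1 is Any { count: 1 }, case 2 is Any { count: n },
// case 3 is Specific with driver-specific device IDs.
enum GpuRequest {
    Any { count: u32 },
    Specific(Vec<String>),
}

/// Resolve a request against the driver's available devices,
/// returning the concrete assignment or None if unsatisfiable.
fn normalize(req: &GpuRequest, available: &[String]) -> Option<Vec<String>> {
    match req {
        GpuRequest::Any { count } => {
            let n = *count as usize;
            if available.len() >= n {
                Some(available[..n].to_vec())
            } else {
                None
            }
        }
        GpuRequest::Specific(ids) => {
            if ids.iter().all(|id| available.contains(id)) {
                Some(ids.clone())
            } else {
                None
            }
        }
    }
}
```

The "GPUs with specific properties" case would then be a driver-side transform from a property filter down to a Specific set of IDs, as suggested above.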

driver_version: openshell_core::VERSION.to_string(),
default_image: self.config.default_image.clone(),
supports_gpu: self.has_gpu_capacity().await.unwrap_or(false),
gpu_count: 0,
Member:

Question: is a raw int rich enough here? Should a driver expose the valid names of devices that are available, for example?

Collaborator (Author):

Agreed, a raw int is limited. A richer repeated GpuDeviceInfo message (with BDF, device name, availability) on GetCapabilitiesResponse would let the CLI show available devices and validate --gpu-device client-side. I propose to address this, along with the previous one, in a follow-up PR.

Comment on lines +258 to +259
supports_gpu: self.gpu_inventory.is_some(),
gpu_count: self.gpu_count,
Member:

Why is gpu_count not just the length of gpu_inventory? Is there a chance that self.gpu_inventory and self.gpu_count get "out of sync"?

Collaborator (Author):

Good catch — removed the separate gpu_count field from VmDriver entirely. capabilities() now derives it on demand by locking the inventory and calling gpu_count(). This eliminates any possibility of the two getting out of sync.
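
A minimal sketch of the derive-on-demand shape described here (field and method names are illustrative, matching the snippets above where they appear):

```rust
use std::sync::{Arc, Mutex};

// Sketch: gpu_count is derived from the inventory on demand rather
// than cached in a separate VmDriver field, so the two values can
// never drift apart.
struct GpuInventory {
    devices: Vec<String>, // PCI addresses of managed GPUs
}

impl GpuInventory {
    fn gpu_count(&self) -> u32 {
        self.devices.len() as u32
    }
}

struct VmDriver {
    gpu_inventory: Option<Arc<Mutex<GpuInventory>>>,
}

impl VmDriver {
    /// What capabilities() reports: lock the inventory and count,
    /// or 0 when GPU support is disabled entirely.
    fn capabilities_gpu_count(&self) -> u32 {
        match &self.gpu_inventory {
            Some(inv) => inv.lock().unwrap().gpu_count(),
            None => 0,
        }
    }
}
```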

Comment on lines +374 to +376
command
.arg("--vm-krun-log-level")
.arg(self.config.krun_log_level.to_string());
Member:

Question: Why is --vm-krun-log-level set here and in the non-GPU branch?

Collaborator (Author):

You're right, no reason for it to be duplicated. Hoisted --vm-krun-log-level out of both the GPU and non-GPU branches — it's now set once after the if/else block since it's common to both backends.
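
The shape of the hoist, sketched with illustrative argument values (only --vm-krun-log-level comes from the diff; the backend flags are hypothetical stand-ins for the branch-specific setup):

```rust
use std::process::Command;

// Sketch of the refactor: branch-specific args stay inside the
// if/else, while the flag shared by both backends is set once after.
fn build_command(gpu_enabled: bool, log_level: &str) -> Command {
    let mut command = Command::new("openshell-vm");
    if gpu_enabled {
        command.arg("--backend").arg("qemu");
    } else {
        command.arg("--backend").arg("libkrun");
    }
    // Hoisted out of both branches: common to both backends.
    command.arg("--vm-krun-log-level").arg(log_level);
    command
}
```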

// there is a single OPENSHELL_ENDPOINT value in the env list.
let endpoint_override = if gpu_bdf.is_some() {
let subnet = match self
.subnet_allocator
Member:

The name subnet_allocator does not make it clear that this is required for GPU injection. If they're dependent on each other, maybe there's a better way to indicate this relationship.

Collaborator (Author):

Good point — TAP subnet allocation is exclusively a GPU concern. I suggest a follow-up where I can move SubnetAllocator into GpuInventory (or a new GpuNetworking wrapper) so the dependency is structurally explicit rather than relying on naming alone. That'll also let us wrap both behind the existing Option<Arc<Mutex<...>>> gate and skip initialization when GPUs are disabled.
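
One hypothetical shape for that refactor, purely as a sketch (GpuNetworking is the proposed wrapper name from this comment; everything else is illustrative):

```rust
// Sketch: the TAP subnet allocator lives inside the GPU-specific
// state, so it structurally cannot exist when GPUs are disabled.
struct SubnetAllocator {
    next_octet: u8,
}

impl SubnetAllocator {
    fn allocate(&mut self) -> String {
        let subnet = format!("10.77.{}.0/24", self.next_octet);
        self.next_octet += 1;
        subnet
    }
}

struct GpuNetworking {
    gpus: Vec<String>,        // managed GPU BDFs
    subnets: SubnetAllocator, // per-VM TAP subnets, GPU-only concern
}

impl GpuNetworking {
    /// Assign a GPU together with its TAP subnet, or None when
    /// no GPUs remain; a driver without GPUs simply holds no
    /// GpuNetworking at all.
    fn assign(&mut self) -> Option<(String, String)> {
        let bdf = self.gpus.pop()?;
        Some((bdf, self.subnets.allocate()))
    }
}
```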

Member:

TAP subnet allocation is exclusively a GPU concern

Not knowing enough about why this is required, a naive question I would have is whether this is always the case, or only the case for the current vm driver feature set? Is it realistic that a user would expect to be able to configure something like this in the future?

A follow up to make the dependency structurally explicit sounds good though.

drew previously approved these changes Apr 29, 2026
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch 2 times, most recently from b052bde to 76e54d8 on April 29, 2026 04:30
…upervisor reliability issues discovered during GPU VM bring-up.

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
@vince-brisebois vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support-driver branch from 76e54d8 to 38f069e on April 29, 2026 04:35