rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361
rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361jamesburton wants to merge 1 commit into
Conversation
…anaged memory ds4_gpu_tensor_alloc used device-only cudaMalloc for all non-KV tensors (prefill scratch, activations). On unified-memory APUs with a large BIOS VRAM carve-out (e.g. Strix Halo, 96 GB), once ~81 GB of weights are resident the remaining VRAM headroom is too small for the prefill scratch at higher context, so cudaMalloc OOMs (observed: 16k ctx fails with repeated "ROCm tensor alloc failed: out of memory", model loads fine). DS4_CUDA_MANAGED=1 routes this allocator through cudaMallocManaged so it can draw from the full UMA pool (incl. GTT beyond the carve-out), complementing the existing auto-managed KV-cache class. Opt-in, zero-overhead when unset. Measured on Strix Halo (Radeon 8060S, gfx1151), q2 model 80.76 GiB, BIOS 96 GB VRAM split: 16k context goes from OOM to 195 tok/s prefill / 13.5 gen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in path to allocate general GPU tensors from managed (unified) memory to reduce OOM risk on UMA systems with large VRAM carve-outs.
Changes:
- Introduce
DS4_CUDA_MANAGEDenv var to switchds4_gpu_tensor_allocfromcudaMalloctocudaMallocManaged. - Cache the env-var decision to avoid repeated
getenvcalls and include allocation-mode context in the error label.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static int managed_all = -1; | ||
| if (managed_all < 0) managed_all = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0; | ||
| cudaError_t rc = managed_all ? cudaMallocManaged(&t->ptr, (size_t)bytes) | ||
| : cudaMalloc(&t->ptr, (size_t)bytes); |
|
Correction — withdrawing the original justification. On a clean re-test (gfx1151, 96 GB-VRAM split, GPU exclusively free, idle-sleep disabled), managed ON vs OFF are within noise at every practical context (16k: off 186.9 / on 195.2 prefill tok/s; both succeed; 32k both ~185). The "16k OOMs without managed" I first reported was a VRAM-contention artifact (a concurrent GPU consumer), not deterministic — my mistake. managed OFF does genuinely OOM only at very large KV allocation ( The change is still gated/opt-in/zero-overhead-when-unset and correct in principle, but I can't substantiate a benefit on the hardware I have, so treat this as unvalidated — happy to close it, or leave it for someone who can test the small-VRAM UMA config. Sorry for the noise. |
Fixes #359.
On unified-memory APUs with a large BIOS VRAM carve-out (Strix Halo 96 GB split), ds4_gpu_tensor_alloc uses device-only cudaMalloc for the general tensor class (prefill scratch/activations); only the KV class is auto-managed. Once ~81 GB of weights are resident the ~15 GB VRAM headroom is too small for the prefill scratch at higher context, so it OOMs (16k ctx fails; model loads fine).
This adds an opt-in
DS4_CUDA_MANAGED=1that routes the general allocator throughcudaMallocManagedso it draws from the full UMA pool (incl. GTT beyond the carve-out). Zero-overhead when unset; complements the existing auto-managed KV class.Measured (Radeon 8060S gfx1151, q2 80.76 GiB, 96 GB split): 16k context OOM → 195 tok/s prefill / 13.5 gen. +11/-1 lines, ROCm-only.