From 8d5caf3324d6f436923269f58be4113304440b92 Mon Sep 17 00:00:00 2001
From: Walter Simson <walter.a.simson@gmail.com>
Date: Sat, 16 May 2026 18:30:25 +0000
Subject: [PATCH] docs: add CUDA binary release runbook for sm_120 (Blackwell)
 support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures the end-to-end pipeline for shipping a new pre-compiled CUDA
binary release: Makefile bumps in the per-platform binary repos →
submodule bump in kspacefirstorder-unified → CI builds against CUDA 13
→ tag releases on each binary repo → bump URL pins in
kwave/__init__.py.

Cross-links #656 (canonical sm_120 issue) and #622 (independent reporter
of the same underlying problem) so the open work is tracked in one
place. Both issues remain open until the v1.4.0 binary release ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/development/cuda_binary_release.md | 126 ++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 docs/development/cuda_binary_release.md

diff --git a/docs/development/cuda_binary_release.md b/docs/development/cuda_binary_release.md
new file mode 100644
index 000000000..b75c94508
--- /dev/null
+++ b/docs/development/cuda_binary_release.md
@@ -0,0 +1,126 @@
+# CUDA Binary Release Runbook (sm_120 / Blackwell)
+
+This runbook covers the end-to-end process for publishing a new set of pre-compiled CUDA binaries that support new GPU architectures. The motivating case is **NVIDIA Blackwell (sm_120, RTX 50xx series)** support, the lack of which currently blocks users in:
+
+- [#656](https://github.com/waltsims/k-wave-python/issues/656) — RTX 5070 Ti / sm_120 segfaults (canonical issue, with full repro and the working build recipe from @aconesac)
+- [#622](https://github.com/waltsims/k-wave-python/issues/622) — "Can not use kspaceFirstOrder-CUDA" (same root cause, confirmed by @faberno [in this comment](https://github.com/waltsims/k-wave-python/issues/622#issuecomment-2273106886))
+
+Both issues remain open until a v1.4.0 release is published per the steps below and verified on Blackwell hardware (Step 5).
+
+## Architecture
+
+The CUDA binaries are produced by a multi-repo build pipeline:
+
+```
+kspacefirstorder-unified  ─── multi-platform CI ───┐
+  ├─ repos/kspaceFirstOrder-cuda-linux  (submodule)│ matrices on CUDA 12.2 + 13.0
+  ├─ repos/kspaceFirstOrder-cuda-windows (submodule)│ uploads artifacts per leg
+  ├─ repos/kspaceFirstOrder-openmp-linux (submodule)│
+  ├─ repos/kspaceFirstOrder-openmp-windows         │
+  └─ repos/kspaceFirstOrder-openmp-darwin          │
+                                                   │
+   Each individual repo (CUDA-linux, CUDA-windows, OMP-*)
+   ships a release tag containing the binary, which k-wave-python
+   then downloads from a URL pinned in `kwave/__init__.py`.
+```
+
+## Release pipeline
+
+### Step 1 — Bump `CUDA_ARCH` in both CUDA repos
+
+Add `sm_100`, `sm_120`, and the forward-compatible PTX `compute_120` entry to `CUDA_ARCH` in the `Makefile` on the `cuda-12-support` branch of each repo.
+
+- **Linux**: [waltsims/kspaceFirstOrder-CUDA-linux#5](https://github.com/waltsims/kspaceFirstOrder-CUDA-linux/pull/5) (opened against `cuda-12-support`)
+- **Windows**: [waltsims/kspaceFirstOrder-CUDA-windows#1](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1) — needs minor style cleanup per [review](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1#pullrequestreview-) before merge
+
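+Note that CUDA 13 dropped support for all architectures older than Turing (`sm_75`): Maxwell (`sm_5x`), Pascal (`sm_6x`), and Volta (`sm_70`). The current `cuda-12-support` branches already omit `sm_50` / `sm_60`, so no extra deletion is needed for those; if an `sm_70` entry is still present, it has to be dropped for the CUDA 13 leg as well.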
+
+The `sm_103` (Blackwell B300 variant) entry is also valid under CUDA 13 but is not added by either PR. It targets a niche datacenter SKU and can be handled as an optional follow-up.
+
+### Step 2 — Bump submodule SHAs in `kspacefirstorder-unified`
+
+After both PRs land:
+
+```bash
+git clone --recurse-submodules https://github.com/waltsims/kspacefirstorder-unified
+cd kspacefirstorder-unified
+cd repos/kspaceFirstOrder-cuda-linux && git fetch && git checkout origin/cuda-12-support && cd -
+cd repos/kspaceFirstOrder-cuda-windows && git fetch && git checkout origin/cuda-12-support && cd -
+git add repos/kspaceFirstOrder-cuda-linux repos/kspaceFirstOrder-cuda-windows
+git commit -m "Bump CUDA submodules to sm_120-capable Makefile"
+git push
+```
+
+CI will then build all five binaries across the matrix (CUDA 12.2 + 13.0 for the CUDA legs). Only the CUDA 13.0.0 artifacts will have sm_120 support, so those are the ones we ship.
+
+The CI matrix is defined in `.github/workflows/ci-multi-platform.yml` and uploads per-leg artifacts with names like:
+- `kspaceFirstOrder-cuda-linux-13.0.0`
+- `kspaceFirstOrder-cuda-windows-13.0.0`
+- `kspaceFirstOrder-openmp-linux-*`, etc.
+
+### Step 3 — Tag releases on the individual binary repos
+
+Download the CUDA 13.0.0 artifacts from the unified CI run, then on each binary repo:
+
+```bash
+# kspaceFirstOrder-CUDA-linux
+gh release create v1.4.0 ./kspaceFirstOrder-CUDA \
+  --title "v1.4.0: Blackwell (sm_120) support" \
+  --notes "Adds sm_100 (Blackwell datacenter) and sm_120 (consumer RTX 50xx) compute capabilities, plus PTX compute_120 for forward compatibility. Requires CUDA 13 runtime to consume the new code paths."
+
+# kspaceFirstOrder-CUDA-windows
+gh release create v1.4.0 ./kspaceFirstOrder-CUDA.exe ./*.dll \
+  --title "v1.4.0: Blackwell (sm_120) support" \
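+  --notes "Adds sm_100 (Blackwell datacenter) and sm_120 (consumer RTX 50xx) compute capabilities, plus PTX compute_120 for forward compatibility. Requires CUDA 13 runtime to consume the new code paths."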
+```
+
+For the OMP repos, rebuild against the latest HDF5 so that the release also resolves [#661](https://github.com/waltsims/k-wave-python/issues/661) (macOS HDF5 ABI mismatch) and similar issues:
+
+```bash
+# Run in each of k-wave-omp-darwin, kspaceFirstOrder-OMP-linux, kspaceFirstOrder-OMP-windows (adjust the asset name and notes per platform)
+gh release create v0.4.0 ./kspaceFirstOrder-OMP \
+  --title "v0.4.0: HDF5 ABI refresh" \
+  --notes "Built against current Homebrew HDF5 to resolve libhdf5.310 vs .320 ABI mismatch on macOS (waltsims/k-wave-python#661)."
+```
+
+### Step 4 — Bump version pins in k-wave-python
+
+Edit `kwave/__init__.py`:
+
+```python
+URL_DICT = {
+    "linux": {
+        "cuda": [URL_BASE + f"kspaceFirstOrder-CUDA-{PLATFORM}/releases/download/v1.4.0/{EXECUTABLE_PREFIX}CUDA"],
+        "omp":  [URL_BASE + f"kspaceFirstOrder-OMP-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
+    },
+    "darwin": {
+        "cuda": [],
+        "omp":  [URL_BASE + f"k-wave-omp-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
+    },
+    ...
+}
+```
+
+Bump `BINARY_VERSION` as well if it is defined elsewhere, then open a PR. CI will download the binaries from the new URLs.
+
+### Step 5 — Verify and close issues
+
+After the new k-wave-python release is out:
+
+- Test on an actual Blackwell GPU (e.g. RTX 5070 Ti); Brno or @aconesac, who built the working binary, can confirm
+- Close [#656](https://github.com/waltsims/k-wave-python/issues/656) and [#622](https://github.com/waltsims/k-wave-python/issues/622) with the release version
+- If the OMP refresh was released, close [#661](https://github.com/waltsims/k-wave-python/issues/661) too
+
+## Test plan for this PR
+
+This PR adds documentation only — no code change. The runbook will be exercised by the actual binary release work described above.
+
+## Open work tracking
+
+- [ ] Merge waltsims/kspaceFirstOrder-CUDA-linux#5 (Linux sm_120)
+- [ ] Merge waltsims/kspaceFirstOrder-CUDA-windows#1 (Windows sm_120)
+- [ ] Bump submodules in `kspacefirstorder-unified`, run CI, download artifacts
+- [ ] Tag v1.4.0 releases on both CUDA binary repos
+- [ ] (optional) Tag v0.4.0 OMP releases with current HDF5 ABI to close #661
+- [ ] Open k-wave-python PR bumping version pins in `kwave/__init__.py`
+- [ ] Close #656 and #622 once verified on a Blackwell box