docs: CUDA binary release runbook (sm_120 / Blackwell) by waltsims · Pull Request #737 · waltsims/k-wave-python

waltsims · 2026-05-16T18:30:45Z

Summary

Adds `docs/development/cuda_binary_release.md` — an end-to-end runbook for publishing a new pre-compiled CUDA binary release. Motivating case is NVIDIA Blackwell (sm_120, RTX 50xx) support, which is blocking real users.

Why now

Two open issues report the same underlying problem (the bundled CUDA binary doesn't include sm_120):

- [BUG]RTX 5070 Ti (sm_120) Support: Precompiled CUDA binaries incompatible with Blackwell architecture GPUs #656: RTX 5070 Ti / sm_120 segfaults (canonical; full repro and aconesac's working recipe are in the thread)
- [BUG] Can not use kspaceFirstOrder-CUDA #622: a different reporter, confirmed by @faberno to be the same Blackwell sm gap

I've already started the upstream work:

✅ waltsims/kspaceFirstOrder-CUDA-linux#5 — Linux Makefile sm_120 bump (opened today)
✅ waltsims/kspaceFirstOrder-CUDA-windows#1 — Windows Makefile sm_120 bump (existing, reviewed today with minor style notes)

The runbook documents the remaining steps so this can be picked up and finished without re-deriving the architecture.

Open work checklist (from the runbook)

- Merge "Bump CUDA SM support to 120 (Blackwell)" kspaceFirstOrder-CUDA-linux#5
- Merge "Bump CUDA SM support to 120" kspaceFirstOrder-CUDA-windows#1
- Bump submodule SHAs in `kspacefirstorder-unified`, run CI, download CUDA 13.0.0 artifacts
- Tag v1.4.0 on `kspaceFirstOrder-CUDA-{linux,windows}` with the artifacts attached
- (optional) Tag v0.4.0 on `k-wave-omp-darwin` / `-linux` / `-windows` to also resolve "[BUG] hdf5 version mismatch on fresh install" #661 (HDF5 ABI)
- Open k-wave-python PR bumping URL pins in `kwave/__init__.py:56,63`
- Close "[BUG]RTX 5070 Ti (sm_120) Support: Precompiled CUDA binaries incompatible with Blackwell architecture GPUs" #656 and "[BUG] Can not use kspaceFirstOrder-CUDA" #622 once verified on a real Blackwell box (cc @aconesac or Brno team for the verification run)

Test plan

Docs-only PR. The runbook itself will be exercised by the actual binary release work above.

🤖 Generated with Claude Code

Greptile Summary

Adds docs/development/cuda_binary_release.md, a docs-only runbook that walks through the five-step process for publishing CUDA binaries with NVIDIA Blackwell (sm_120 / RTX 50xx) support, referencing upstream Makefile PRs and downstream kwave/__init__.py URL pins.

- The Step 4 snippet shows only "linux"/"darwin" URL updates and leaves Windows as `...`, while the actual `__init__.py` derives Windows CUDA URLs from `BINARY_VERSION` via `get_windows_release_urls`. Without an explicit instruction to set `BINARY_VERSION = "v1.4.0"`, Windows users would still receive the old pre-sm_120 binary.
- A fragment anchor in the Windows PR review link (`#pullrequestreview-`) is incomplete and will 404, and the `./*.dll` glob in the Windows `gh release create` command could attach unintended build artifacts.

Confidence Score: 3/5

Safe to merge as documentation, but the runbook as written would leave Windows users without sm_120 support if followed literally.

The Step 4 snippet omits the Windows URL path entirely and treats the BINARY_VERSION bump as an afterthought. In kwave/__init__.py, the Windows CUDA URL is entirely derived from BINARY_VERSION through get_windows_release_urls, while the Linux CUDA URL is a separate hardcoded string that does not use BINARY_VERSION at all. A developer following the runbook would update the Linux/Darwin entries directly and might never realize BINARY_VERSION must also be bumped to propagate the change to Windows.

docs/development/cuda_binary_release.md — Step 4 needs an explicit BINARY_VERSION update instruction and should show the Windows URL generation path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/development/cuda_binary_release.md | New end-to-end runbook for publishing CUDA binaries with Blackwell (sm_120) support; Step 4 omits the `BINARY_VERSION` update needed to cover the Windows CUDA URL, a broken review link exists in Step 1, and the `./*.dll` glob in the Windows release command may attach unintended files. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Step 1: Merge Makefile PRs] --> B[Step 2: Bump submodule SHAs]
    B --> C[CI builds matrix]
    C --> D{Pick CUDA 13.0.0 artifacts}
    D --> E[Step 3: Tag v1.4.0 releases]
    E --> F[Step 4: Update kwave/__init__.py]
    F --> G[Open k-wave-python PR]
    G --> H[Step 5: Verify on Blackwell GPU]

Reviews (1): Last reviewed commit: "docs: add CUDA binary release runbook fo..."

Greptile also left 3 inline comments on this PR.

Captures the end-to-end pipeline for shipping a new pre-compiled CUDA binary release: Makefile bumps in the per-platform binary repos → submodule bump in kspacefirstorder-unified → CI builds against CUDA 13 → tag releases on each binary repo → bump URL pins in kwave/__init__.py.

Cross-links #656 (canonical sm_120 issue) and #622 (independent reporter of the same underlying problem) so the open work is tracked in one place. Both issues remain open until the v1.4.0 binary release ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-16T18:33:08Z

+### Step 4 — Bump version pins in k-wave-python
+
+Edit `kwave/__init__.py`:
+
+```python
+URL_DICT = {
+    "linux": {
+        "cuda": [URL_BASE + f"kspaceFirstOrder-CUDA-{PLATFORM}/releases/download/v1.4.0/{EXECUTABLE_PREFIX}CUDA"],
+        "omp":  [URL_BASE + f"kspaceFirstOrder-OMP-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
+    },
+    "darwin": {
+        "cuda": [],
+        "omp":  [URL_BASE + f"k-wave-omp-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
+    },
+    ...
+}
+```
+
+Bump `BINARY_VERSION` if defined elsewhere. Open a PR. CI will re-download against the new URLs.


Windows CUDA URL will silently stay on the old binary

The Step 4 snippet shows only "linux" and "darwin" keys (with ... for Windows), and the note "Bump BINARY_VERSION if defined elsewhere" is easy to miss. In kwave/__init__.py, the Windows CUDA URL is generated by get_windows_release_urls("cuda") → PREFIX.format("CUDA", "windows"), where PREFIX embeds BINARY_VERSION (currently "v1.3.0"). The Linux CUDA entry is hardcoded to v1.3.1 and doesn't use BINARY_VERSION, so a developer updating that line might not realize BINARY_VERSION also controls the Windows path. Without an explicit update to BINARY_VERSION = "v1.4.0", Windows users would still download the pre-sm_120 binary after the release.
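The coupling described in this comment can be illustrated with a small, simplified model. This is only a sketch: the identifiers `BINARY_VERSION` and `get_windows_release_urls` and the versions `v1.3.0` / `v1.3.1` are quoted from the comment, but the function bodies, the `URL_BASE` value, the `linux_cuda_urls` helper, and the exact URL strings are assumptions made for illustration and are not copied from `kwave/__init__.py`.

```python
# Simplified model of the URL layout described in the review comment.
# Function bodies, URL_BASE, and linux_cuda_urls are illustrative assumptions.
URL_BASE = "https://github.com/waltsims/"  # assumed value
BINARY_VERSION = "v1.3.0"  # current value quoted in the review comment


def get_windows_release_urls(architecture: str, version: str = BINARY_VERSION) -> list[str]:
    """Build the Windows download URL from the shared version pin."""
    arch = architecture.upper()
    prefix = URL_BASE + f"kspaceFirstOrder-{arch}-windows/releases/download/{version}/"
    return [prefix + f"kspaceFirstOrder-{arch}.exe"]


def linux_cuda_urls(pinned: str = "v1.3.1") -> list[str]:
    """Hypothetical helper: the Linux CUDA URL carries its own hardcoded version."""
    return [URL_BASE + f"kspaceFirstOrder-CUDA-linux/releases/download/{pinned}/kspaceFirstOrder-CUDA"]


# Editing only the Linux pin leaves the Windows URL on the old release.
linux_after = linux_cuda_urls("v1.4.0")
windows_stale = get_windows_release_urls("cuda")
windows_fixed = get_windows_release_urls("cuda", version="v1.4.0")
```

In this model, `windows_stale` still points at the `v1.3.0` release even though `linux_after` points at `v1.4.0`, which is the failure mode the comment warns about.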

greptile-apps · 2026-05-16T18:33:09Z

+Add `sm_100`, `sm_120`, and the forward-compatible PTX `compute_120` entry to `Makefile` on the `cuda-12-support` branch of each repo.
+
+- **Linux**: [waltsims/kspaceFirstOrder-CUDA-linux#5](https://github.com/waltsims/kspaceFirstOrder-CUDA-linux/pull/5) (opened against `cuda-12-support`)
+- **Windows**: [waltsims/kspaceFirstOrder-CUDA-windows#1](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1) — needs minor style cleanup per [review](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1#pullrequestreview-) before merge


Broken review link

The fragment #pullrequestreview- is missing its numeric ID — this will 404. You can grab the full anchor from the PR review URL (e.g. #pullrequestreview-2XXXXXXXXX) and paste it here.
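A check like the following could catch this kind of incomplete anchor before a runbook is committed. It is a minimal sketch: the function name `find_incomplete_review_links` is hypothetical, and it only looks at inline Markdown links of the form `[text](url)`.

```python
import re

# Matches the "#pullrequestreview-" prefix when no numeric ID follows it.
INCOMPLETE_REVIEW_ANCHOR = re.compile(r"#pullrequestreview-(?!\d)")

# Captures the URL part of inline Markdown links: [text](url)
MARKDOWN_LINK_URL = re.compile(r"\((https?://[^)\s]+)\)")


def find_incomplete_review_links(markdown: str) -> list[str]:
    """Return every linked URL whose review anchor is missing its numeric ID."""
    urls = MARKDOWN_LINK_URL.findall(markdown)
    return [url for url in urls if INCOMPLETE_REVIEW_ANCHOR.search(url)]


sample = (
    "needs cleanup per [review](https://github.com/waltsims/"
    "kspaceFirstOrder-CUDA-windows/pull/1#pullrequestreview-) before merge"
)
broken = find_incomplete_review_links(sample)
```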

greptile-apps · 2026-05-16T18:33:10Z

+  --notes "Adds sm_100 (Blackwell datacenter) and sm_120 (consumer RTX 50xx) compute capabilities, plus PTX compute_120 for forward compatibility. Requires CUDA 13 runtime to consume the new code paths."
+
+# kspaceFirstOrder-CUDA-windows
+gh release create v1.4.0 ./kspaceFirstOrder-CUDA.exe ./*.dll \


./*.dll glob may pick up unintended files

kwave/__init__.py enumerates exactly 10 DLLs in WINDOWS_DLLS. Using ./*.dll in the gh release create command will attach every .dll present in the working directory at release time, which could include test or build-intermediate DLLs. Listing the same explicit filenames (or at least documenting they should match WINDOWS_DLLS) would be safer and self-documenting.
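One way to act on this suggestion is to compare the DLLs in the release directory against an explicit allowlist before uploading. The sketch below assumes a hypothetical helper `check_release_dlls`; the names in `EXPECTED_DLLS` are placeholders, since the actual contents of `WINDOWS_DLLS` are not shown in this thread.

```python
from pathlib import Path

# Placeholder names only; the real list is WINDOWS_DLLS in kwave/__init__.py.
EXPECTED_DLLS = {"example_a.dll", "example_b.dll"}


def check_release_dlls(directory: Path, expected: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) DLL names for a release directory.

    missing:    names in the allowlist that are absent from the directory
    unexpected: DLLs in the directory that a ./*.dll glob would also upload
    """
    present = {path.name for path in directory.glob("*.dll")}
    return expected - present, present - expected
```

A release script could refuse to call `gh release create` unless both returned sets are empty, and then pass the explicit file names instead of the glob.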

codecov · 2026-05-16T18:34:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (9f25de1) to head (8d5caf3).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #737      +/-   ##
==========================================
+ Coverage   74.82%   76.62%   +1.79%     
==========================================
  Files          56       57       +1     
  Lines        8095     8761     +666     
  Branches     1577     1854     +277     
==========================================
+ Hits         6057     6713     +656     
+ Misses       1422     1406      -16     
- Partials      616      642      +26

| Flag | Coverage Δ | |
| --- | --- | --- |
| 3.10 | `76.60% <ø> (+1.80%)` | ⬆️ |
| 3.11 | `76.60% <ø> (+1.80%)` | ⬆️ |
| 3.12 | `76.60% <ø> (+1.80%)` | ⬆️ |
| 3.13 | `76.60% <ø> (+1.80%)` | ⬆️ |
| macos-latest | `76.49% <ø> (+1.74%)` | ⬆️ |
| ubuntu-latest | `76.53% <ø> (+1.78%)` | ⬆️ |
| windows-latest | `76.40% <ø> (+1.64%)` | ⬆️ |

* Bump CUDA submodules to sm_120 PR HEADs for Blackwell CI run

  Submodule SHA bumps:
  - repos/kspaceFirstOrder-cuda-linux: da4e013 -> 65fbec6
  - repos/kspaceFirstOrder-cuda-windows: 319fec6 -> e3e2404

  Both new SHAs are the HEAD of the still-unmerged sm_120 PR branches. Pinning to PR HEADs lets the existing multi-platform CI matrix build against the sm_120-capable Makefiles to validate the Blackwell binaries are produced before the upstream PRs merge.

  Once both upstream PRs land, re-bump to the merge commits on `cuda-12-support` so we are not pinning detached refs.

  Companion documentation: waltsims/k-wave-python#737.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-bump cuda-linux submodule SHA to 7e887cf (gate sm_120 on CUDA 13+)

  Previous SHA (65fbec6) added sm_100/sm_120/PTX compute_120 unconditionally, which makes the CUDA 12.2.0 leg of this CI matrix fail with:

      nvcc fatal: Unsupported gpu architecture 'compute_100'

  The new SHA 7e887cf wraps those three lines in a Makefile `ifeq` guarded on `nvcc --version` major >= 13, so the 12.2 leg builds the original sm_75..sm_90a list and the 13.0 leg additionally builds the Blackwell arches.

  Pre-existing CI failures on this branch (windows-cuda CUDA 10.2.props not found, windows-openmp hdf5_hl.h not found) are unrelated to this submodule bump and reproduce on the base branch.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-bump cuda-windows submodule SHA to 34480ea (vcxproj CUDA 13 fix)

  Previous SHA (e3e2404) only updated the Makefile, but the unified CI uses the .vcxproj via MSBuild, and that file hardcoded the "CUDA 10.2.props" / "CUDA 10.2.targets" imports, causing the windows-cuda leg to fail with MSB4019.

  New SHA 34480ea fixes the .vcxproj to import CUDA 13.0.props/.targets (which the CI's "Register CUDA MSBuild customizations" step now copies into VCTargetsPath/BuildCustomizations from CUDA 13's installation), plus replaces the stale Release|x64 <CodeGeneration> list (which had compute_30..sm_75 from the CUDA 10 era) with the modern set including sm_86 and Blackwell.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-pin cuda-windows submodule to main + bump to PR #1 merge commit

  waltsims/kspaceFirstOrder-CUDA-windows#1 (the sm_120 / vcxproj CUDA 13 work) merged today into main (commit a6d6919), and the upstream cuda-12-support branch is no longer the long-lived integration branch.

  Two coordinated changes:
  - .gitmodules: switch cuda-windows from `branch = cuda-12-support` to `branch = main` so `git submodule update --remote` picks up new commits from the right ref going forward.
  - Bump the recorded SHA from 34480ea (which was the PR-head commit on the now-deleted bump-CUDA-sm-suppoprt-to-120 branch) to a6d6919 (the merge commit, reachable from main).

  cuda-linux still pins `branch = cuda-12-support` because waltsims/kspaceFirstOrder-CUDA-linux#5 hasn't merged yet. That submodule entry will be flipped to main once the Linux PR lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-bump CUDA submodules to PR merge commits on main

  Both Linux and Windows sm_120 work has now landed on main:
  - repos/kspaceFirstOrder-cuda-linux: 7e887cf -> 072ec8f (PR #4 "Update cufft error enumeration." merged, bringing the cuda-12-support branch into main: includes sm_120 Makefile gating + cuFFT #ifdef restoration)
  - repos/kspaceFirstOrder-cuda-windows: a6d6919 -> e8661b1 (PR #2 "Cuda 12 support" merged, bringing v143 toolset, CUDA 12.2/13.0 fallback Imports, VerifyCudaCustomizations + VerifyVcpkgRoot targets, and ResolvedVcpkgRoot property)

  Also flips cuda-linux's .gitmodules pin from `branch = cuda-12-support` to `branch = main` since that branch is no longer the long-lived integration branch (matches the cuda-windows side, which was flipped to main in commit 84f1a88).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Flip cuda-linux submodule pin from cuda-12-support to main

  PR #4 merged cuda-12-support into main on the upstream cuda-linux repo, so the cuda-12-support branch is no longer the long-lived integration target. Tracks main same as cuda-windows.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
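The CUDA 13 gating described in the commit messages above (the re-bump to 7e887cf) is implemented in the Makefile with `ifeq`, which is not reproduced here. The selection logic can still be sketched in Python under some assumptions: `BASE_ARCHES` is an illustrative subset standing in for the "sm_75..sm_90a" list, and the function names are hypothetical. The version string format follows the `release X.Y` line that `nvcc --version` prints.

```python
import re

# Illustrative subset only; the Makefile's full sm_75..sm_90a list is not shown in this thread.
BASE_ARCHES = ["sm_75", "sm_80", "sm_86", "sm_90a"]
# The three entries the commit message says are gated on CUDA 13+.
BLACKWELL_ARCHES = ["sm_100", "sm_120", "compute_120"]


def nvcc_major(version_output: str) -> int:
    """Extract the major CUDA version from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a release number in nvcc output")
    return int(match.group(1))


def select_arches(version_output: str) -> list[str]:
    """Return the base list, plus the Blackwell entries when nvcc is 13 or newer."""
    arches = list(BASE_ARCHES)
    if nvcc_major(version_output) >= 13:
        arches += BLACKWELL_ARCHES
    return arches


cuda_12 = select_arches("Cuda compilation tools, release 12.2, V12.2.140")
cuda_13 = select_arches("Cuda compilation tools, release 13.0, V13.0.48")
```

With this gating, the 12.2 leg never sees `compute_100` or the other Blackwell entries, which avoids the "Unsupported gpu architecture" failure quoted above.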

…po releases

Adds release-on-tag.yml (named for the eventual use case but currently only fires on workflow_dispatch). When manually triggered with a version string (e.g. v1.4.0), the workflow runs the existing multi-platform CI matrix (called via workflow_call), downloads each artifact, and uploads it to a release of that version name on the corresponding per-platform binary repository.

Two safety features deliberately built in:
- Trigger is workflow_dispatch only (no automatic firing on tag push) so a human always confirms the cross-repo publish operation.
- A BINARY_RELEASE_TOKEN repo secret is required (PAT or App token with contents:write on the five target repos). The workflow checks for it and fails loudly with a clear error rather than silently no-op'ing or trying default GITHUB_TOKEN.

Tiny companion change to ci-multi-platform.yml: add workflow_call to its on: triggers so it can be reused without duplicating the build matrix. No effect on push/PR-triggered runs.

Future hardening to consider:
- Switch trigger to push: tags: ['v*'] once the workflow has been proven on a couple of manual runs and the secret is known-good.
- Add a dry-run mode that builds artifacts but does not publish.

Companion docs: waltsims/k-wave-python#737.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

waltsims · 2026-05-17T03:00:49Z

Closing as bloat: this runbook documents a manual release flow for the 5-mirror architecture, which is being deprecated via mirror consolidation (tracked in waltsims/kspacefirstorder-unified#13). Most content is one-time-specific to v1.4.0 (the cuda-12-support branch, the specific PR/issue numbers, the HDF5 ABI mismatch all close with v1.4.0). The architecture diagram + cmake approach are already captured in kspacefirstorder-unified/plans/. Once consolidation lands, the release flow is a single tag on unified — no runbook needed.

greptile-apps Bot reviewed May 16, 2026

This was referenced May 16, 2026

Bump CUDA submodules for sm_120 (Blackwell) CI run [DRAFT] waltsims/kspacefirstorder-unified#5

Merged

ci: add manual-trigger release workflow [DRAFT] waltsims/kspacefirstorder-unified#6

Merged

waltsims closed this May 17, 2026

waltsims wants to merge 1 commit into master from docs/sm-120-binary-release-runbook