docs: CUDA binary release runbook (sm_120 / Blackwell) #737

Closed
waltsims wants to merge 1 commit into master from docs/sm-120-binary-release-runbook

@waltsims waltsims commented May 16, 2026

Summary

Adds `docs/development/cuda_binary_release.md` — an end-to-end runbook for publishing a new pre-compiled CUDA binary release. Motivating case is NVIDIA Blackwell (sm_120, RTX 50xx) support, which is blocking real users.

Why now

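Two open issues report the same underlying problem (the bundled CUDA binary doesn't include sm_120): #656 (the canonical sm_120 issue) and #622 (an independent report of the same problem).

I've already started the upstream work: waltsims/kspaceFirstOrder-CUDA-linux#5 and waltsims/kspaceFirstOrder-CUDA-windows#1, which add the sm_120 targets to each repo's Makefile.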

The runbook documents the remaining steps so this can be picked up and finished without re-deriving the architecture.

Open work checklist (from the runbook)

Test plan

Docs-only PR. The runbook itself will be exercised by the actual binary release work above.

🤖 Generated with Claude Code

Greptile Summary

Adds docs/development/cuda_binary_release.md, a docs-only runbook that walks through the five-step process for publishing CUDA binaries with NVIDIA Blackwell (sm_120 / RTX 50xx) support, referencing upstream Makefile PRs and downstream kwave/__init__.py URL pins.

  • The Step 4 snippet shows only "linux"/"darwin" URL updates and leaves Windows as ..., while the actual __init__.py derives Windows CUDA URLs from BINARY_VERSION via get_windows_release_urls. Without an explicit instruction to set BINARY_VERSION = "v1.4.0", Windows users would still receive the old pre-sm_120 binary.
  • The fragment anchor in the Windows PR review link (#pullrequestreview-) is missing its numeric ID, so the link will not land on the intended review, and the ./*.dll glob in the Windows gh release create command could attach unintended build artifacts.

Confidence Score: 3/5

Safe to merge as documentation, but the runbook as written would leave Windows users without sm_120 support if followed literally.

The Step 4 snippet omits the Windows URL path entirely and treats the BINARY_VERSION bump as an afterthought. In kwave/__init__.py, the Windows CUDA URL is entirely derived from BINARY_VERSION through get_windows_release_urls, while the Linux CUDA URL is a separate hardcoded string that does not use BINARY_VERSION at all. A developer following the runbook would update the Linux/Darwin entries directly and might never realize BINARY_VERSION must also be bumped to propagate the change to Windows.

docs/development/cuda_binary_release.md — Step 4 needs an explicit BINARY_VERSION update instruction and should show the Windows URL generation path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/development/cuda_binary_release.md | New end-to-end runbook for publishing CUDA binaries with Blackwell (sm_120) support; Step 4 omits the BINARY_VERSION update needed to cover the Windows CUDA URL, a broken review link exists in Step 1, and the ./*.dll glob in the Windows release command may attach unintended files. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Step 1: Merge Makefile PRs] --> B[Step 2: Bump submodule SHAs]
    B --> C[CI builds matrix]
    C --> D{Pick CUDA 13.0.0 artifacts}
    D --> E[Step 3: Tag v1.4.0 releases]
    E --> F[Step 4: Update kwave/__init__.py]
    F --> G[Open k-wave-python PR]
    G --> H[Step 5: Verify on Blackwell GPU]

Reviews (1): Last reviewed commit: "docs: add CUDA binary release runbook fo..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

Captures the end-to-end pipeline for shipping a new pre-compiled CUDA
binary release: Makefile bumps in the per-platform binary repos →
submodule bump in kspacefirstorder-unified → CI builds against CUDA 13
→ tag releases on each binary repo → bump URL pins in
kwave/__init__.py.

Cross-links #656 (canonical sm_120 issue) and #622 (independent reporter
of the same underlying problem) so the open work is tracked in one
place. Both issues remain open until the v1.4.0 binary release ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +86 to +104
### Step 4 — Bump version pins in k-wave-python

Edit `kwave/__init__.py`:

```python
URL_DICT = {
    "linux": {
        "cuda": [URL_BASE + f"kspaceFirstOrder-CUDA-{PLATFORM}/releases/download/v1.4.0/{EXECUTABLE_PREFIX}CUDA"],
        "omp": [URL_BASE + f"kspaceFirstOrder-OMP-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
    },
    "darwin": {
        "cuda": [],
        "omp": [URL_BASE + f"k-wave-omp-{PLATFORM}/releases/download/v0.4.0/{EXECUTABLE_PREFIX}OMP"],
    },
    ...
}
```

Bump `BINARY_VERSION` if defined elsewhere. Open a PR. CI will re-download against the new URLs.

P1 Windows CUDA URL will silently stay on the old binary

The Step 4 snippet shows only "linux" and "darwin" keys (with ... for Windows), and the note "Bump BINARY_VERSION if defined elsewhere" is easy to miss. In kwave/__init__.py, the Windows CUDA URL is generated by get_windows_release_urls("cuda"), which builds it from PREFIX.format("CUDA", "windows"), where PREFIX embeds BINARY_VERSION (currently "v1.3.0"). The Linux CUDA entry is hardcoded to v1.3.1 and doesn't use BINARY_VERSION, so a developer updating that line might not realize BINARY_VERSION also controls the Windows path. Without an explicit update to BINARY_VERSION = "v1.4.0", Windows users would still download the pre-sm_120 binary after the release.
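
To make the dependency concrete, here is a minimal sketch of how a Windows URL derived from BINARY_VERSION can lag behind a hardcoded Linux pin. The names get_windows_release_urls and BINARY_VERSION come from the comment above; the function bodies, the make_prefix helper, the URL_BASE value, and the URL layout are assumptions for illustration and are not copied from kwave/__init__.py.

```python
# Illustrative sketch only: the real kwave/__init__.py may be structured differently.
URL_BASE = "https://github.com/waltsims/"  # assumed base URL


def make_prefix(binary_version: str) -> str:
    # Hypothetical helper: one release URL template that embeds the version.
    return URL_BASE + "kspaceFirstOrder-{0}-{1}/releases/download/" + binary_version + "/"


def get_windows_release_urls(architecture: str, binary_version: str) -> list[str]:
    # Hypothetical body: the Windows URL is built entirely from binary_version.
    prefix = make_prefix(binary_version).format(architecture.upper(), "windows")
    return [prefix + f"kspaceFirstOrder-{architecture.upper()}.exe"]


def linux_cuda_url(pinned_version: str) -> str:
    # The Linux pin is a separate version string that ignores binary_version.
    return URL_BASE + f"kspaceFirstOrder-CUDA-linux/releases/download/{pinned_version}/kspaceFirstOrder-CUDA"


# Updating only the Linux pin leaves Windows on the old release.
linux = linux_cuda_url("v1.4.0")
windows_stale = get_windows_release_urls("cuda", binary_version="v1.3.0")
windows_fixed = get_windows_release_urls("cuda", binary_version="v1.4.0")
```

In this sketch, `windows_stale` still points at the v1.3.0 release even though `linux` has moved to v1.4.0, which is the failure mode the comment describes.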

Add `sm_100`, `sm_120`, and the forward-compatible PTX `compute_120` entry to `Makefile` on the `cuda-12-support` branch of each repo.

- **Linux**: [waltsims/kspaceFirstOrder-CUDA-linux#5](https://github.com/waltsims/kspaceFirstOrder-CUDA-linux/pull/5) (opened against `cuda-12-support`)
- **Windows**: [waltsims/kspaceFirstOrder-CUDA-windows#1](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1) — needs minor style cleanup per [review](https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1#pullrequestreview-) before merge

P2 Broken review link

The fragment #pullrequestreview- is missing its numeric ID, so the link cannot resolve to a specific review and will only open the PR page. You can grab the full anchor from the PR review URL (e.g. #pullrequestreview-2XXXXXXXXX) and paste it here.
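
A check like this could catch the problem before a runbook is merged. It is a small sketch, assuming review anchors always take the form `#pullrequestreview-<digits>`; the function name is hypothetical and the numeric ID in the usage lines is a placeholder, not a real review.

```python
import re

# Matches a trailing review anchor and captures whatever digits follow it.
REVIEW_ANCHOR = re.compile(r"#pullrequestreview-(\d*)$")


def has_incomplete_review_anchor(url: str) -> bool:
    """Return True if the URL ends in a review anchor with no numeric ID."""
    match = REVIEW_ANCHOR.search(url)
    return match is not None and match.group(1) == ""


pr_url = "https://github.com/waltsims/kspaceFirstOrder-CUDA-windows/pull/1"
broken = has_incomplete_review_anchor(pr_url + "#pullrequestreview-")
# Placeholder ID for illustration only.
complete = has_incomplete_review_anchor(pr_url + "#pullrequestreview-1234567890")
plain = has_incomplete_review_anchor(pr_url)
```

Only the first URL is flagged; a link with a full ID or with no anchor at all passes.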

--notes "Adds sm_100 (Blackwell datacenter) and sm_120 (consumer RTX 50xx) compute capabilities, plus PTX compute_120 for forward compatibility. Requires CUDA 13 runtime to consume the new code paths."

# kspaceFirstOrder-CUDA-windows
gh release create v1.4.0 ./kspaceFirstOrder-CUDA.exe ./*.dll \

P2 ./*.dll glob may pick up unintended files

kwave/__init__.py enumerates exactly 10 DLLs in WINDOWS_DLLS. Using ./*.dll in the gh release create command will attach every .dll present in the working directory at release time, which could include test or build-intermediate DLLs. Listing the same explicit filenames (or at least documenting they should match WINDOWS_DLLS) would be safer and self-documenting.
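
One way to follow this suggestion is to build the asset list from the explicit names and refuse to proceed if the build directory disagrees. The sketch below assumes a two-entry stand-in for WINDOWS_DLLS with made-up filenames (the real list has 10 entries that are not reproduced here), and release_assets is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical stand-in for WINDOWS_DLLS; these filenames are invented.
WINDOWS_DLLS = ["example_a.dll", "example_b.dll"]


def release_assets(build_dir: Path, expected: list[str]) -> list[Path]:
    """Return explicit asset paths, failing if a DLL is missing or unexpected."""
    present = {p.name for p in build_dir.glob("*.dll")}
    missing = sorted(set(expected) - present)
    unexpected = sorted(present - set(expected))
    if missing or unexpected:
        raise ValueError(f"missing={missing} unexpected={unexpected}")
    # Preserve the order of the expected list so the upload is reproducible.
    return [build_dir / name for name in expected]
```

The returned paths could then be passed to the release command one by one instead of relying on the shell glob.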

codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (9f25de1) to head (8d5caf3).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #737      +/-   ##
==========================================
+ Coverage   74.82%   76.62%   +1.79%     
==========================================
  Files          56       57       +1     
  Lines        8095     8761     +666     
  Branches     1577     1854     +277     
==========================================
+ Hits         6057     6713     +656     
+ Misses       1422     1406      -16     
- Partials      616      642      +26     
| Flag | Coverage Δ |
| --- | --- |
| 3.10 | 76.60% <ø> (+1.80%) ⬆️ |
| 3.11 | 76.60% <ø> (+1.80%) ⬆️ |
| 3.12 | 76.60% <ø> (+1.80%) ⬆️ |
| 3.13 | 76.60% <ø> (+1.80%) ⬆️ |
| macos-latest | 76.49% <ø> (+1.74%) ⬆️ |
| ubuntu-latest | 76.53% <ø> (+1.78%) ⬆️ |
| windows-latest | 76.40% <ø> (+1.64%) ⬆️ |

waltsims added a commit to waltsims/kspacefirstorder-unified that referenced this pull request May 16, 2026
* Bump CUDA submodules to sm_120 PR HEADs for Blackwell CI run

Submodule SHA bumps:
- repos/kspaceFirstOrder-cuda-linux:   da4e013 -> 65fbec6
- repos/kspaceFirstOrder-cuda-windows: 319fec6 -> e3e2404

Both new SHAs are the HEAD of the still-unmerged sm_120 PR branches.
Pinning to PR HEADs lets the existing multi-platform CI matrix build
against the sm_120-capable Makefiles to validate the Blackwell
binaries are produced before the upstream PRs merge.

Once both upstream PRs land, re-bump to the merge commits on
`cuda-12-support` so we are not pinning detached refs.

Companion documentation: waltsims/k-wave-python#737.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-bump cuda-linux submodule SHA to 7e887cf (gate sm_120 on CUDA 13+)

Previous SHA (65fbec6) added sm_100/sm_120/PTX compute_120 unconditionally,
which makes the CUDA 12.2.0 leg of this CI matrix fail with:
  nvcc fatal: Unsupported gpu architecture 'compute_100'

The new SHA 7e887cf wraps those three lines in a Makefile `ifeq` guarded
on `nvcc --version` major >= 13, so the 12.2 leg builds the original
sm_75..sm_90a list and the 13.0 leg additionally builds the Blackwell
arches.

Pre-existing CI failures on this branch (windows-cuda CUDA 10.2.props
not found, windows-openmp hdf5_hl.h not found) are unrelated to this
submodule bump and reproduce on the base branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
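
The version gate described in the cuda-linux re-bump above lives in a Makefile `ifeq`; the same logic can be sketched in Python to show what each CI leg ends up building. The base architecture list is an assumption (the commit message only says "sm_75..sm_90a"), and both function names are hypothetical.

```python
import re

# Assumed pre-Blackwell list; the commit message only gives the range sm_75..sm_90a.
BASE_ARCHS = ["sm_75", "sm_80", "sm_86", "sm_89", "sm_90", "sm_90a"]
BLACKWELL_ARCHS = ["sm_100", "sm_120"]


def nvcc_major(version_output: str) -> int:
    """Extract the major version from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a 'release X.Y' field")
    return int(match.group(1))


def gencode_flags(version_output: str) -> list[str]:
    """Build -gencode flags, adding Blackwell targets only on CUDA 13 or newer."""
    archs = list(BASE_ARCHS)
    ptx = []
    if nvcc_major(version_output) >= 13:
        archs += BLACKWELL_ARCHS
        # Forward-compatible PTX entry, as described in the runbook.
        ptx = ["-gencode arch=compute_120,code=compute_120"]
    flags = [f"-gencode arch=compute_{a[3:]},code={a}" for a in archs]
    return flags + ptx
```

With a 12.2 toolchain the function returns only the base list, so nvcc never sees an architecture it does not support.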

* Re-bump cuda-windows submodule SHA to 34480ea (vcxproj CUDA 13 fix)

Previous SHA (e3e2404) only updated the Makefile, but the unified CI
uses the .vcxproj via MSBuild and that hardcoded "CUDA 10.2.props" /
"CUDA 10.2.targets" imports — causing the windows-cuda leg to fail
with MSB4019.

New SHA 34480ea fixes the .vcxproj to import CUDA 13.0.props/.targets
(which the CI's "Register CUDA MSBuild customizations" step now copies
into VCTargetsPath/BuildCustomizations from CUDA 13's installation),
plus replaces the stale Release|x64 <CodeGeneration> list (which had
compute_30..sm_75 from the CUDA 10 era) with the modern set including
sm_86 and Blackwell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-pin cuda-windows submodule to main + bump to PR #1 merge commit

waltsims/kspaceFirstOrder-CUDA-windows#1 (the sm_120 / vcxproj CUDA 13
work) merged today into main (commit a6d6919), and the upstream
cuda-12-support branch is no longer the long-lived integration branch.

Two coordinated changes:
- .gitmodules: switch cuda-windows from `branch = cuda-12-support` to
  `branch = main` so `git submodule update --remote` picks up new
  commits from the right ref going forward.
- Bump the recorded SHA from 34480ea (which was the PR-head commit on
  the now-deleted bump-CUDA-sm-suppoprt-to-120 branch) to a6d6919
  (the merge commit, reachable from main).

cuda-linux still pins `branch = cuda-12-support` because waltsims/
kspaceFirstOrder-CUDA-linux#5 hasn't merged yet. That submodule entry
will be flipped to main once the Linux PR lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-bump CUDA submodules to PR merge commits on main

Both Linux and Windows sm_120 work has now landed on main:
- repos/kspaceFirstOrder-cuda-linux:   7e887cf -> 072ec8f
  (PR #4 "Update cufft error enumeration." merged, bringing the
   cuda-12-support branch into main: includes sm_120 Makefile gating
   + cuFFT #ifdef restoration)
- repos/kspaceFirstOrder-cuda-windows: a6d6919 -> e8661b1
  (PR #2 "Cuda 12 support" merged, bringing v143 toolset, CUDA 12.2/
   13.0 fallback Imports, VerifyCudaCustomizations + VerifyVcpkgRoot
   targets, and ResolvedVcpkgRoot property)

Also flips cuda-linux's .gitmodules pin from `branch = cuda-12-support`
to `branch = main` since that branch is no longer the long-lived
integration branch (matches the cuda-windows side, which was flipped
to main in commit 84f1a88).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Flip cuda-linux submodule pin from cuda-12-support to main

PR #4 merged cuda-12-support into main on the upstream cuda-linux
repo, so the cuda-12-support branch is no longer the long-lived
integration target. Tracks main same as cuda-windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
waltsims added a commit to waltsims/kspacefirstorder-unified that referenced this pull request May 16, 2026
…po releases

Adds release-on-tag.yml (named for the eventual use case but currently
only fires on workflow_dispatch). When manually triggered with a
version string (e.g. v1.4.0), the workflow runs the existing
multi-platform CI matrix (called via workflow_call), downloads each
artifact, and uploads it to a release of that version name on the
corresponding per-platform binary repository.

Two safety features deliberately built in:
- Trigger is workflow_dispatch only (no automatic firing on tag push)
  so a human always confirms the cross-repo publish operation.
- A BINARY_RELEASE_TOKEN repo secret is required (PAT or App token
  with contents:write on the five target repos). The workflow checks
  for it and fails loudly with a clear error rather than silently
  no-op'ing or trying default GITHUB_TOKEN.

Tiny companion change to ci-multi-platform.yml: add workflow_call to
its on: triggers so it can be reused without duplicating the build
matrix. No effect on push/PR-triggered runs.

Future hardening to consider:
- Switch trigger to push: tags: ['v*'] once the workflow has been
  proven on a couple of manual runs and the secret is known-good.
- Add a dry-run mode that builds artifacts but does not publish.

Companion docs: waltsims/k-wave-python#737.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
waltsims added a commit to waltsims/kspacefirstorder-unified that referenced this pull request May 16, 2026
…po releases (#6)

Adds release-on-tag.yml (named for the eventual use case but currently
only fires on workflow_dispatch). When manually triggered with a
version string (e.g. v1.4.0), the workflow runs the existing
multi-platform CI matrix (called via workflow_call), downloads each
artifact, and uploads it to a release of that version name on the
corresponding per-platform binary repository.

Two safety features deliberately built in:
- Trigger is workflow_dispatch only (no automatic firing on tag push)
  so a human always confirms the cross-repo publish operation.
- A BINARY_RELEASE_TOKEN repo secret is required (PAT or App token
  with contents:write on the five target repos). The workflow checks
  for it and fails loudly with a clear error rather than silently
  no-op'ing or trying default GITHUB_TOKEN.

Tiny companion change to ci-multi-platform.yml: add workflow_call to
its on: triggers so it can be reused without duplicating the build
matrix. No effect on push/PR-triggered runs.

Future hardening to consider:
- Switch trigger to push: tags: ['v*'] once the workflow has been
  proven on a couple of manual runs and the secret is known-good.
- Add a dry-run mode that builds artifacts but does not publish.

Companion docs: waltsims/k-wave-python#737.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
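
The "fail loudly" token check described in the commit message above can be sketched as follows. The workflow itself is a GitHub Actions YAML file, so this Python version is only an illustration of the behaviour; require_release_token is a hypothetical name and the error wording is invented.

```python
import os


def require_release_token(env=None) -> str:
    """Return the release token, or raise with a clear message if it is unset."""
    env = os.environ if env is None else env
    token = env.get("BINARY_RELEASE_TOKEN", "").strip()
    if not token:
        # Refuse to continue rather than silently using a less privileged token.
        raise RuntimeError(
            "BINARY_RELEASE_TOKEN is not set; refusing to fall back to GITHUB_TOKEN."
        )
    return token
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.
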
@waltsims

Closing as bloat: this runbook documents a manual release flow for the 5-mirror architecture, which is being deprecated via mirror consolidation (tracked in waltsims/kspacefirstorder-unified#13). Most content is one-time-specific to v1.4.0 (the cuda-12-support branch, the specific PR/issue numbers, the HDF5 ABI mismatch all close with v1.4.0). The architecture diagram + cmake approach are already captured in kspacefirstorder-unified/plans/. Once consolidation lands, the release flow is a single tag on unified — no runbook needed.

@waltsims waltsims closed this May 17, 2026