From 071c85d91de64985b57c7c9eee3c294c560dcd46 Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:30:18 +0200 Subject: [PATCH 1/9] add blogpost --- content/blog/2026-rsc-goes-nanobind.md | 175 +++++++++++++++++++++++++ 1 file changed, 175 insertions(+) create mode 100644 content/blog/2026-rsc-goes-nanobind.md diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md new file mode 100644 index 0000000..deec227 --- /dev/null +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -0,0 +1,175 @@ +# rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels + +*rapids-singlecell 0.15.0 now ships GPU kernels as precompiled extensions instead of being compiled at runtime. Here's what that means for you.* + +--- + +## Why the packaging changed + +In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels. These were compiled the first time you called them — in your environment, on your machine. That worked, but it came with friction: + +- **First-call latency.** The initial invocation of a kernel-backed function could take several seconds while nvrtc compiled the CUDA source. +- **Silent dtype/layout mismatches.** A RawKernel receives raw pointers. If the input array had the wrong dtype or wasn't C-contiguous, the kernel might silently produce garbage rather than raising an error. +- **CUDA code trapped in Python strings.** RawKernels are defined as CUDA source inside Python string literals. That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging C++ code buried in a Python string is nobody's idea of a good time. + +Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel. The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately. 
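The "silent dtype/layout mismatch" failure mode above is worth picturing concretely. A pure-Python sketch of the kind of boundary check that prevents it (illustrative only, not the actual binding code):

```python
import numpy as np

def validate_kernel_input(arr: np.ndarray) -> np.ndarray:
    # Stand-in for checks done at the Python/C++ boundary; with RawKernels,
    # nothing like this ran before the raw pointer reached the kernel.
    if arr.dtype not in (np.float32, np.float64):
        raise TypeError(f"expected float32 or float64, got {arr.dtype}")
    if not arr.flags["C_CONTIGUOUS"]:
        raise TypeError("expected a C-contiguous array")
    return arr

# A Fortran-ordered array is rejected up front instead of silently
# producing garbage downstream.
try:
    validate_kernel_input(np.asfortranarray(np.ones((4, 4), dtype=np.float32)))
except TypeError as err:
    print(err)  # → expected a C-contiguous array
```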
+ +## What changed + +The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake. This gives us: + +- **No runtime compilation** for any migrated kernel — the compiled code is in the wheel. +- **Typed bindings at the Python/C++ boundary.** nanobind enforces dtype (e.g. float32 vs float64) and memory layout (C-contiguous vs F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results. +- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines. Harmony2, shipping in this release, is the first example of a more complex function built on this foundation. +- **CUDA-versioned wheel packaging.** CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages. + +The Python API and import name are unchanged: + +```python +import rapids_singlecell as rsc +``` + +Your existing analysis scripts should work without modification. + +## CUDA-specific wheels + +Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version. (Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.) + +| Package name | Compiled with | Runtime CUDA support | Blackwell GPUs | +|---|---|---|---| +| `rapids-singlecell-cu12` | CUDA 12.2 | CUDA 12.2 – 12.9+ | Via PTX JIT (sm_90) | +| `rapids-singlecell-cu13` | CUDA 13.0 | CUDA 13.0+ | Native binaries | + +Both wheels are available for **x86_64** and **aarch64** on Linux. + +If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures. 
The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX. + +## How to install + +### Prebuilt wheel (recommended) + +Pick the wheel that matches your CUDA version: + +```bash +pip install rapids-singlecell-cu13 # CUDA 13 +pip install rapids-singlecell-cu12 # CUDA 12 +``` + +This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.). If you manage those dependencies separately — for example, through conda — this is all you need. + +### Prebuilt wheel with RAPIDS dependencies + +If you want pip to also install the matching RAPIDS and CuPy packages: + +```bash +pip install 'rapids-singlecell-cu13[rapids]' --extra-index-url=https://pypi.nvidia.com +pip install 'rapids-singlecell-cu12[rapids]' --extra-index-url=https://pypi.nvidia.com +``` + +Note: on the prebuilt wheels, the dependency extra is always `[rapids]`. The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`. If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`. + +### Conda / Mamba + +Environment files are provided in the repository: + +```bash +conda env create -f conda/rsc_rapids_26.04_cuda13.yml # Python 3.14, CUDA 13 +conda env create -f conda/rsc_rapids_26.04_cuda12.yml # Python 3.14, CUDA 12 +``` + +> **Note:** RAPIDS currently does not support `channel_priority: strict`. Use `channel_priority: flexible` instead. 
+ +### Docker / Apptainer + +Pre-built containers are available for both CUDA versions: + +```bash +docker pull ghcr.io/scverse/rapids-singlecell-cu13:latest +docker run --rm --gpus all ghcr.io/scverse/rapids-singlecell-cu13:latest +``` + +For HPC clusters using Apptainer/Singularity: + +```bash +apptainer pull rsc.sif docker://ghcr.io/scverse/rapids-singlecell-cu13:latest +apptainer run --nv rsc.sif +``` + +## Migration from 0.14.x + +For most users, upgrading is straightforward: + +1. **Change your pip install command.** Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version. +2. **No code changes needed.** The `import rapids_singlecell as rsc` import and all public APIs remain the same. +3. **Check your CUDA version.** Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel. If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel. + +## What about `pip install rapids-singlecell`? + +The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works. It will compile the CUDA extensions from source during installation. This is perfectly functional, but please be aware of what that means: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build will take longer than downloading a prebuilt wheel. + +When building from source, you can install the matching RAPIDS dependencies with the `[rapids-cu12]` or `[rapids-cu13]` extra: + +```bash +pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvidia.com +``` + +Or install the RAPIDS stack separately before or after the build. 
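Before starting a long source build, it can save time to confirm the toolchain is actually present. A small sketch of such a preflight check (the helper and the exact tool list are illustrative, not part of rapids-singlecell):

```python
import shutil

# CUDA compiler, CMake (>= 3.24 required), and a C++ compiler must all be
# on PATH for the source build to succeed.
REQUIRED_TOOLS = ("nvcc", "cmake", "g++")

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("source build will likely fail; missing:", ", ".join(missing))
```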
+ +For most users, we recommend the prebuilt CUDA wheels. They're faster to install and don't require a local compiler toolchain. For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html). + +Source builds are the right choice if you are: + +- **Contributing to rapids-singlecell** and need to iterate on C++ kernel code. +- **Debugging CUDA extensions** and want to compile with debug flags or sanitizers. +- **Targeting a custom GPU architecture** not covered by the prebuilt wheels (e.g. a future compute capability). +- **On a platform we don't publish wheels for** (though we cover x86_64 and aarch64 Linux). + +If none of those apply to you, use the prebuilt wheel. + +## Other highlights in 0.15.0 + +Beyond packaging, this release includes a substantial set of algorithmic and performance improvements built up across the 0.15.0 development cycle: + +### Harmony2 and C++ harmony + +Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient. On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)). This is also the first example of a more complex routine built on the new compiled-kernel infrastructure. + +### Contrast-based energy distance + +Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type. 
The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that. You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type). Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to GPU, and launches a single kernel call for all unique pairs. The result is a copy of your contrasts DataFrame with an `edistance` column appended. + +```python +from rapids_singlecell.pertpy_gpu import Distance + +dist = Distance("edistance") + +# Compare each perturbation against two controls, stratified by cell type +contrasts = Distance.create_contrasts( + adata, + groupby="target_gene", + selected_group=["Non_target", "Scramble"], + split_by="cell_type", +) + +result = dist.contrast_distances(adata, contrasts=contrasts) +``` + +`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)). Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)). + +### More highlights + +- **Dask support for `highly_variable_genes`** with the Seurat v3 flavor ([#616](https://github.com/scverse/rapids-singlecell/pull/616)). +- **CUDA kernel error surfacing** — launch errors are now raised instead of silently continuing ([#619](https://github.com/scverse/rapids-singlecell/pull/619)). +- **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments. 
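For intuition, the pair deduplication described in the contrast-distance section above amounts to collecting unique comparisons before a single kernel launch. A plain-Python sketch (the gene and group names are made up for illustration):

```python
# Each contrast row is a (target, reference, split group) comparison; rows
# that request the same comparison collapse to one distance computation.
contrasts = [
    ("gene_A", "Non_target", "B cells"),
    ("gene_A", "Scramble",   "B cells"),
    ("gene_B", "Non_target", "B cells"),
    ("gene_A", "Non_target", "B cells"),  # duplicate: computed only once
]
unique_pairs = sorted(set(contrasts))
print(f"{len(contrasts)} contrast rows -> {len(unique_pairs)} unique computations")
```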
+ +## Get started + +```bash +pip install rapids-singlecell-cu13 # or rapids-singlecell-cu12 +``` + +For questions and bug reports, visit the [GitHub issue tracker](https://github.com/scverse/rapids_singlecell/issues). + +--- + +*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem. If you use it in your research, please cite the project.* From 9076f80eddb9f2603b9df5927f2593ecb4702dba Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:36:00 +0200 Subject: [PATCH 2/9] update formating --- content/blog/2026-rsc-goes-nanobind.md | 88 +++++++++++++++++++------- 1 file changed, 65 insertions(+), 23 deletions(-) diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md index deec227..53158b3 100644 --- a/content/blog/2026-rsc-goes-nanobind.md +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -1,27 +1,48 @@ ++++ +title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels" +date = 2026-04-30T00:00:05+01:00 +description = "rapids-singlecell 0.15.0 ships GPU kernels as precompiled wheels — no more runtime compilation." +author = "Severin Dicks" +draft = false ++++ + # rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels -*rapids-singlecell 0.15.0 now ships GPU kernels as precompiled extensions instead of being compiled at runtime. Here's what that means for you.* +*rapids-singlecell 0.15.0 now ships GPU kernels as precompiled extensions instead of being compiled at runtime. +Here's what that means for you.* --- ## Why the packaging changed -In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels. These were compiled the first time you called them — in your environment, on your machine. That worked, but it came with friction: +In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels. +These were compiled the first time you called them — in your environment, on your machine. 
+That worked, but it came with friction: -- **First-call latency.** The initial invocation of a kernel-backed function could take several seconds while nvrtc compiled the CUDA source. -- **Silent dtype/layout mismatches.** A RawKernel receives raw pointers. If the input array had the wrong dtype or wasn't C-contiguous, the kernel might silently produce garbage rather than raising an error. -- **CUDA code trapped in Python strings.** RawKernels are defined as CUDA source inside Python string literals. That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging C++ code buried in a Python string is nobody's idea of a good time. +- **First-call latency.** + The initial invocation of a kernel-backed function could take several seconds while nvrtc compiled the CUDA source. +- **Silent dtype/layout mismatches.** + A RawKernel receives raw pointers. + If the input array had the wrong dtype or wasn't C-contiguous, the kernel might silently produce garbage rather than raising an error. +- **CUDA code trapped in Python strings.** + RawKernels are defined as CUDA source inside Python string literals. + That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging C++ code buried in a Python string is nobody's idea of a good time. -Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel. The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately. +Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel. +The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately. 
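The first-call latency described above is a classic compile-and-cache pattern. Nothing below touches CUDA or nvrtc; it is a toy sketch of the behavior:

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def compile_kernel(source: str) -> str:
    # Toy stand-in for runtime compilation: the expensive step runs once per
    # unique source, then the cached result is reused. Prebuilt wheels ship
    # with this work already done at build time.
    time.sleep(0.05)  # simulate compilation latency
    return f"<compiled:{len(source)}>"

start = time.perf_counter()
compile_kernel("__global__ void axpy(...) { ... }")
cold = time.perf_counter() - start

start = time.perf_counter()
compile_kernel("__global__ void axpy(...) { ... }")
warm = time.perf_counter() - start
assert warm < cold  # the second call skips "compilation"
```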
## What changed -The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake. This gives us: +The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake. +This gives us: - **No runtime compilation** for any migrated kernel — the compiled code is in the wheel. -- **Typed bindings at the Python/C++ boundary.** nanobind enforces dtype (e.g. float32 vs float64) and memory layout (C-contiguous vs F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results. -- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines. Harmony2, shipping in this release, is the first example of a more complex function built on this foundation. -- **CUDA-versioned wheel packaging.** CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages. +- **Typed bindings at the Python/C++ boundary.** + nanobind enforces dtype (e.g. float32 vs float64) and memory layout (C-contiguous vs F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results. +- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines. + Harmony2, shipping in this release, is the first example of a more complex function built on this foundation. +- **CUDA-versioned wheel packaging.** + CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages. 
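The naming scheme in the last bullet can be captured in a small helper (hypothetical, not shipped with the package):

```python
def wheel_name(cuda_major: int) -> str:
    # Illustrative mapping from CUDA major version to package name.
    if cuda_major == 12:
        return "rapids-singlecell-cu12"
    if cuda_major == 13:
        return "rapids-singlecell-cu13"
    raise ValueError("prebuilt wheels are published for CUDA 12 and 13 only")

print(wheel_name(13))  # → rapids-singlecell-cu13
```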
The Python API and import name are unchanged: @@ -33,7 +54,8 @@ Your existing analysis scripts should work without modification. ## CUDA-specific wheels -Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version. (Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.) +Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version. +(Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.) | Package name | Compiled with | Runtime CUDA support | Blackwell GPUs | |---|---|---|---| @@ -42,7 +64,8 @@ Because the kernels are now compiled binaries, we need to ship one wheel per CUD Both wheels are available for **x86_64** and **aarch64** on Linux. -If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures. The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX. +If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures. +The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX. ## How to install @@ -55,7 +78,8 @@ pip install rapids-singlecell-cu13 # CUDA 13 pip install rapids-singlecell-cu12 # CUDA 12 ``` -This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.). 
If you manage those dependencies separately — for example, through conda — this is all you need. +This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.). +If you manage those dependencies separately — for example, through conda — this is all you need. ### Prebuilt wheel with RAPIDS dependencies @@ -66,7 +90,9 @@ pip install 'rapids-singlecell-cu13[rapids]' --extra-index-url=https://pypi.nvid pip install 'rapids-singlecell-cu12[rapids]' --extra-index-url=https://pypi.nvidia.com ``` -Note: on the prebuilt wheels, the dependency extra is always `[rapids]`. The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`. If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`. +Note: on the prebuilt wheels, the dependency extra is always `[rapids]`. +The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`. +If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`. ### Conda / Mamba @@ -99,13 +125,19 @@ apptainer run --nv rsc.sif For most users, upgrading is straightforward: -1. **Change your pip install command.** Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version. -2. **No code changes needed.** The `import rapids_singlecell as rsc` import and all public APIs remain the same. -3. **Check your CUDA version.** Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel. If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel. +1. 
**Change your pip install command.** + Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version. +2. **No code changes needed.** + The `import rapids_singlecell as rsc` import and all public APIs remain the same. +3. **Check your CUDA version.** + Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel. + If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel. ## What about `pip install rapids-singlecell`? -The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works. It will compile the CUDA extensions from source during installation. This is perfectly functional, but please be aware of what that means: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build will take longer than downloading a prebuilt wheel. +The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works. +It will compile the CUDA extensions from source during installation. +This is perfectly functional, but please be aware of what that means: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build will take longer than downloading a prebuilt wheel. When building from source, you can install the matching RAPIDS dependencies with the `[rapids-cu12]` or `[rapids-cu13]` extra: @@ -115,7 +147,9 @@ pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvid Or install the RAPIDS stack separately before or after the build. -For most users, we recommend the prebuilt CUDA wheels. 
They're faster to install and don't require a local compiler toolchain. For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html). +For most users, we recommend the prebuilt CUDA wheels. +They're faster to install and don't require a local compiler toolchain. +For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html). Source builds are the right choice if you are: @@ -132,11 +166,17 @@ Beyond packaging, this release includes a substantial set of algorithmic and per ### Harmony2 and C++ harmony -Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient. On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)). This is also the first example of a more complex routine built on the new compiled-kernel infrastructure. +Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient. +On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)). 
+This is also the first example of a more complex routine built on the new compiled-kernel infrastructure. ### Contrast-based energy distance -Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type. The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that. You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type). Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to GPU, and launches a single kernel call for all unique pairs. The result is a copy of your contrasts DataFrame with an `edistance` column appended. +Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type. +The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that. +You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type). +Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to GPU, and launches a single kernel call for all unique pairs. +The result is a copy of your contrasts DataFrame with an `edistance` column appended. 
```python from rapids_singlecell.pertpy_gpu import Distance @@ -154,7 +194,8 @@ contrasts = Distance.create_contrasts( result = dist.contrast_distances(adata, contrasts=contrasts) ``` -`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)). Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)). +`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)). +Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)). ### More highlights @@ -172,4 +213,5 @@ For questions and bug reports, visit the [GitHub issue tracker](https://github.c --- -*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem. If you use it in your research, please cite the project.* +*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem. 
+If you use it in your research, please cite the project.* From e6639bf39ac8c96247160d2a765df646a1d6901c Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:41:40 +0200 Subject: [PATCH 3/9] fix title --- content/blog/2026-rsc-goes-nanobind.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md index 53158b3..91cd1d6 100644 --- a/content/blog/2026-rsc-goes-nanobind.md +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -1,18 +1,11 @@ +++ title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels" date = 2026-04-30T00:00:05+01:00 -description = "rapids-singlecell 0.15.0 ships GPU kernels as precompiled wheels — no more runtime compilation." +description = "Why we moved from CuPy RawKernels to nanobind C++ extensions, plus other release highlights." author = "Severin Dicks" draft = false +++ -# rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels - -*rapids-singlecell 0.15.0 now ships GPU kernels as precompiled extensions instead of being compiled at runtime. -Here's what that means for you.* - ---- - ## Why the packaging changed In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels. @@ -209,6 +202,8 @@ Both energy distance and co-occurrence kernels gained multi-GPU support ([#545]( pip install rapids-singlecell-cu13 # or rapids-singlecell-cu12 ``` +A big thank you to everyone who tested the pre-releases and helped surface issues before this release went out. + For questions and bug reports, visit the [GitHub issue tracker](https://github.com/scverse/rapids_singlecell/issues). 
--- From 3adcacde762e4728b3ef35fed827ba0e84644fd5 Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:42:22 +0200 Subject: [PATCH 4/9] fix description --- content/blog/2026-rsc-goes-nanobind.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md index 91cd1d6..6fd0407 100644 --- a/content/blog/2026-rsc-goes-nanobind.md +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -1,7 +1,7 @@ +++ title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels" date = 2026-04-30T00:00:05+01:00 -description = "Why we moved from CuPy RawKernels to nanobind C++ extensions, plus other release highlights." +description = "Why we moved from CuPy RawKernels to nanobind C++ extensions, and other release highlights." author = "Severin Dicks" draft = false +++ From 0576c6cfc2f0c487175eed5ed4c66fae42a8066a Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:42:45 +0200 Subject: [PATCH 5/9] fix description --- content/blog/2026-rsc-goes-nanobind.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md index 6fd0407..9e16e97 100644 --- a/content/blog/2026-rsc-goes-nanobind.md +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -1,7 +1,7 @@ +++ title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels" date = 2026-04-30T00:00:05+01:00 -description = "Why we moved from CuPy RawKernels to nanobind C++ extensions, and other release highlights." +description = "Why we moved from CuPy RawKernels to nanobind C++ extensions and other release highlights." 
author = "Severin Dicks" draft = false +++ From b240ddd392a00bfa9dc0d5a6438d1d76e5991479 Mon Sep 17 00:00:00 2001 From: Intron7 Date: Thu, 30 Apr 2026 09:46:46 +0200 Subject: [PATCH 6/9] fix talbe --- content/blog/2026-rsc-goes-nanobind.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md index 9e16e97..754db2f 100644 --- a/content/blog/2026-rsc-goes-nanobind.md +++ b/content/blog/2026-rsc-goes-nanobind.md @@ -50,10 +50,10 @@ Your existing analysis scripts should work without modification. Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version. (Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.) -| Package name | Compiled with | Runtime CUDA support | Blackwell GPUs | -|---|---|---|---| -| `rapids-singlecell-cu12` | CUDA 12.2 | CUDA 12.2 – 12.9+ | Via PTX JIT (sm_90) | -| `rapids-singlecell-cu13` | CUDA 13.0 | CUDA 13.0+ | Native binaries | +| Package | Build CUDA | Runtime CUDA | Blackwell (B200, GB200) | +| :----------------------- | :--------: | :----------: | :---------------------- | +| `rapids-singlecell-cu12` | 12.2 | 12.2 – 12.9+ | Supported via PTX JIT | +| `rapids-singlecell-cu13` | 13.0 | 13.0+ | Native binaries | Both wheels are available for **x86_64** and **aarch64** on Linux. 
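The table above can be read as a simple compatibility rule. A sketch of that rule in code (illustrative only, not an official support matrix):

```python
def runtime_compatible(package: str, cuda_version: tuple[int, int]) -> bool:
    # Mirrors the compatibility table above.
    major, minor = cuda_version
    if package == "rapids-singlecell-cu12":
        return major == 12 and minor >= 2  # built with 12.2, runs on 12.2 - 12.9+
    if package == "rapids-singlecell-cu13":
        return major >= 13                 # built with 13.0, runs on 13.0+
    raise ValueError(f"unknown package: {package}")

print(runtime_compatible("rapids-singlecell-cu12", (12, 4)))  # → True
```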
From 5865b9acabfac6aae2dd63cf6bb5fd0000c66199 Mon Sep 17 00:00:00 2001
From: Intron7
Date: Thu, 30 Apr 2026 10:07:50 +0200
Subject: [PATCH 7/9] make table nicer

---
 assets/main.scss | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/assets/main.scss b/assets/main.scss
index ac7594b..17e000b 100644
--- a/assets/main.scss
+++ b/assets/main.scss
@@ -957,6 +957,26 @@ body {
       Open Sans, sans-serif;
   }
+  > table {
+    width: 100%;
+    border-collapse: collapse;
+    margin: 1.5rem 0;
+    font-family: "Inter", sans-serif;
+    font-size: 1rem;
+    th,
+    td {
+      padding: 0.6rem 0.9rem;
+      border: 1px solid $overline;
+      text-align: left;
+    }
+    th {
+      background-color: $tilebg;
+      font-weight: 600;
+    }
+    tbody tr:nth-child(even) {
+      background-color: $tilebg4;
+    }
+  }
   @media (max-width: 50rem), (max-device-width: 40rem) {
     font-size: 1rem;
     line-height: 1.8rem;

From 1fa3f9b7b9c230090971713f4cd9989bf403bf58 Mon Sep 17 00:00:00 2001
From: Lukas Heumos
Date: Thu, 30 Apr 2026 14:54:47 +0200
Subject: [PATCH 8/9] headers

Signed-off-by: Lukas Heumos
---
 content/blog/2026-rsc-goes-nanobind.md | 30 ++++++++++++--------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md
index 754db2f..7e6b26d 100644
--- a/content/blog/2026-rsc-goes-nanobind.md
+++ b/content/blog/2026-rsc-goes-nanobind.md
@@ -6,7 +6,11 @@ author = "Severin Dicks"
 draft = false
 +++

-## Why the packaging changed
+# Rapids-singlecell release 0.15.0
+
+We are proud to announce rapids-singlecell release 0.15.0, which comes with lots of new features but also changes to the installation process.
+
+## Why the packaging changes

 In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels. These were compiled the first time you called them — in your environment, on your machine.
@@ -24,7 +28,7 @@ That worked, but it came with friction:

 Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel. The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately.

-## What changed
+### Packaging changes in detail

 The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake. This gives us:

@@ -45,7 +49,7 @@ import rapids_singlecell as rsc

 Your existing analysis scripts should work without modification.

-## CUDA-specific wheels
+### CUDA-specific wheels

 Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version. (Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.)

@@ -60,9 +64,9 @@ Both wheels are available for **x86_64** and **aarch64** on Linux.

 If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures. The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX.

-## How to install
+### How to install

-### Prebuilt wheel (recommended)
+#### Prebuilt wheel (recommended)

 Pick the wheel that matches your CUDA version:

@@ -74,7 +78,7 @@ pip install rapids-singlecell-cu12 # CUDA 12

 This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.). If you manage those dependencies separately — for example, through conda — this is all you need.
-### Prebuilt wheel with RAPIDS dependencies
+#### Prebuilt wheel with RAPIDS dependencies

 If you want pip to also install the matching RAPIDS and CuPy packages:

@@ -87,7 +91,7 @@
 Note: on the prebuilt wheels, the dependency extra is always `[rapids]`. The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`. If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`.

-### Conda / Mamba
+#### Conda / Mamba

 Environment files are provided in the repository:

@@ -98,7 +102,7 @@ conda env create -f conda/rsc_rapids_26.04_cuda12.yml # Python 3.14, CUDA 12

 > **Note:** RAPIDS currently does not support `channel_priority: strict`. Use `channel_priority: flexible` instead.

-### Docker / Apptainer
+#### Docker / Apptainer

 Pre-built containers are available for both CUDA versions:

@@ -114,7 +118,7 @@ apptainer pull rsc.sif docker://ghcr.io/scverse/rapids-singlecell-cu13:latest
 apptainer run --nv rsc.sif
 ```

-## Migration from 0.14.x
+### Migration from 0.14.x

 For most users, upgrading is straightforward:

@@ -126,7 +130,7 @@
 Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel. If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel.

-## What about `pip install rapids-singlecell`?
+### What about `pip install rapids-singlecell`?

 The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works. It will compile the CUDA extensions from source during installation.
@@ -196,12 +200,6 @@ Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](
 - **CUDA kernel error surfacing** — launch errors are now raised instead of silently continuing ([#619](https://github.com/scverse/rapids-singlecell/pull/619)).
 - **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments.

-## Get started
-
-```bash
-pip install rapids-singlecell-cu13 # or rapids-singlecell-cu12
-```
-
 A big thank you to everyone who tested the pre-releases and helped surface issues before this release went out.

 For questions and bug reports, visit the [GitHub issue tracker](https://github.com/scverse/rapids_singlecell/issues).

From c10e7c9f2b66a7982d837b4d8ccaa90afc49ce8a Mon Sep 17 00:00:00 2001
From: Lukas Heumos
Date: Thu, 30 Apr 2026 14:57:35 +0200
Subject: [PATCH 9/9] more details

Signed-off-by: Lukas Heumos
---
 content/blog/2026-rsc-goes-nanobind.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md
index 7e6b26d..3719e44 100644
--- a/content/blog/2026-rsc-goes-nanobind.md
+++ b/content/blog/2026-rsc-goes-nanobind.md
@@ -2,7 +2,7 @@ title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels"
 date = 2026-04-30T00:00:05+01:00
 description = "Why we moved from CuPy RawKernels to nanobind C++ extensions and other release highlights."
-author = "Severin Dicks"
+author = "Severin Dicks, Lukas Heumos"
 draft = false
 +++

@@ -196,9 +196,10 @@ Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](

 ### More highlights

+- **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments.
 - **Dask support for `highly_variable_genes`** with the Seurat v3 flavor ([#616](https://github.com/scverse/rapids-singlecell/pull/616)).
 - **CUDA kernel error surfacing** — launch errors are now raised instead of silently continuing ([#619](https://github.com/scverse/rapids-singlecell/pull/619)).
-- **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments.
+- **Additional tutorials** such as a Pertpy-GPU tutorial ([#645](https://github.com/scverse/rapids-singlecell/pull/645))

 A big thank you to everyone who tested the pre-releases and helped surface issues before this release went out.