diff --git a/assets/main.scss b/assets/main.scss
index ac7594b..17e000b 100644
--- a/assets/main.scss
+++ b/assets/main.scss
@@ -957,6 +957,26 @@ body {
     Open Sans,
     sans-serif;
 }
+  > table {
+    width: 100%;
+    border-collapse: collapse;
+    margin: 1.5rem 0;
+    font-family: "Inter", sans-serif;
+    font-size: 1rem;
+    th,
+    td {
+      padding: 0.6rem 0.9rem;
+      border: 1px solid $overline;
+      text-align: left;
+    }
+    th {
+      background-color: $tilebg;
+      font-weight: 600;
+    }
+    tbody tr:nth-child(even) {
+      background-color: $tilebg4;
+    }
+  }
 @media (max-width: 50rem), (max-device-width: 40rem) {
   font-size: 1rem;
   line-height: 1.8rem;
diff --git a/content/blog/2026-rsc-goes-nanobind.md b/content/blog/2026-rsc-goes-nanobind.md
new file mode 100644
index 0000000..3719e44
--- /dev/null
+++ b/content/blog/2026-rsc-goes-nanobind.md
@@ -0,0 +1,211 @@
++++
+title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels"
+date = 2026-04-30T00:00:05+01:00
+description = "Why we moved from CuPy RawKernels to nanobind C++ extensions and other release highlights."
+author = "Severin Dicks, Lukas Heumos"
+draft = false
++++
+
+# rapids-singlecell 0.15.0
+
+We are proud to announce rapids-singlecell 0.15.0, which comes with lots of new features but also some changes to the installation process.
+
+## Why the packaging changes
+
+In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels.
+These were compiled the first time you called them — in your environment, on your machine.
+That worked, but it came with friction:
+
+- **First-call latency.**
+  The initial invocation of a kernel-backed function could take several seconds while NVRTC compiled the CUDA source.
+- **Silent dtype/layout mismatches.**
+  A RawKernel receives raw pointers.
+  If the input array had the wrong dtype or wasn't C-contiguous, the kernel might silently produce garbage rather than raising an error (a concrete sketch of this follows below).
+- **CUDA code trapped in Python strings.**
+  RawKernels are defined as CUDA source inside Python string literals.
+  That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging CUDA C++ buried in a Python string is nobody's idea of a good time.
+
+Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel.
+The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately.
+
+### Packaging changes in detail
+
+The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake.
+This gives us:
+
+- **No runtime compilation** for any migrated kernel — the compiled code is in the wheel.
+- **Typed bindings at the Python/C++ boundary.**
+  nanobind enforces dtype (e.g. float32 vs. float64) and memory layout (C-contiguous vs. F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results.
+- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines.
+  Harmony2, shipping in this release, is the first example of a more complex function built on this foundation.
+- **CUDA-versioned wheel packaging.**
+  CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages.
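+
+To make the dtype hazard concrete, here is a minimal, self-contained CuPy sketch of the failure mode the old approach allowed. The toy kernel below is written purely for illustration; it is not one of the rapids-singlecell kernels:
+
+```python
+import cupy as cp
+
+# A toy RawKernel that doubles a float32 array in place.
+double_f32 = cp.RawKernel(r'''
+extern "C" __global__
+void double_f32(float* x, int n) {
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < n) x[i] *= 2.0f;
+}
+''', 'double_f32')
+
+x = cp.arange(4, dtype=cp.float64)        # wrong dtype: float64, not float32
+double_f32((1,), (4,), (x, cp.int32(4)))  # launches anyway: no dtype check
+print(x)  # wrong values: the float64 buffer was reinterpreted as float32
+```
+
+The RawKernel happily launches on the mismatched buffer and produces wrong numbers without raising. With the nanobind bindings in 0.15.0, the same mismatch is caught at the Python/C++ boundary and raises a `TypeError` before the kernel ever runs.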
+
+The Python API and import name are unchanged:
+
+```python
+import rapids_singlecell as rsc
+```
+
+Your existing analysis scripts should work without modification.
+
+### CUDA-specific wheels
+
+Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version.
+(Python wheel tags don't encode the CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.)
+
+| Package                  | Build CUDA | Runtime CUDA | Blackwell (B200, GB200) |
+| :----------------------- | :--------: | :----------: | :---------------------- |
+| `rapids-singlecell-cu12` | 12.2       | 12.2 – 12.9+ | Supported via PTX JIT   |
+| `rapids-singlecell-cu13` | 13.0       | 13.0+        | Native binaries         |
+
+Both wheels are available for **x86_64** and **aarch64** on Linux.
+
+If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures.
+The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX.
+
+### How to install
+
+#### Prebuilt wheel (recommended)
+
+Pick the wheel that matches your CUDA version:
+
+```bash
+pip install rapids-singlecell-cu13  # CUDA 13
+pip install rapids-singlecell-cu12  # CUDA 12
+```
+
+This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.).
+If you manage those dependencies separately — for example, through conda — this is all you need.
+
+#### Prebuilt wheel with RAPIDS dependencies
+
+If you want pip to also install the matching RAPIDS and CuPy packages:
+
+```bash
+pip install 'rapids-singlecell-cu13[rapids]' --extra-index-url=https://pypi.nvidia.com
+pip install 'rapids-singlecell-cu12[rapids]' --extra-index-url=https://pypi.nvidia.com
+```
+
+Note: on the prebuilt wheels, the dependency extra is always `[rapids]`.
+The CUDA version is determined by the package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`.
+If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`.
+
+#### Conda / Mamba
+
+Environment files are provided in the repository:
+
+```bash
+conda env create -f conda/rsc_rapids_26.04_cuda13.yml  # Python 3.14, CUDA 13
+conda env create -f conda/rsc_rapids_26.04_cuda12.yml  # Python 3.14, CUDA 12
+```
+
+> **Note:** RAPIDS currently does not support `channel_priority: strict`. Use `channel_priority: flexible` instead.
+
+#### Docker / Apptainer
+
+Prebuilt containers are available for both CUDA versions:
+
+```bash
+docker pull ghcr.io/scverse/rapids-singlecell-cu13:latest
+docker run --rm --gpus all ghcr.io/scverse/rapids-singlecell-cu13:latest
+```
+
+For HPC clusters using Apptainer/Singularity:
+
+```bash
+apptainer pull rsc.sif docker://ghcr.io/scverse/rapids-singlecell-cu13:latest
+apptainer run --nv rsc.sif
+```
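+
+Whichever route you choose, a quick smoke test confirms that the package and the CUDA runtime can see each other. This is just a minimal check we find handy, not an official diagnostic:
+
+```python
+import cupy as cp
+import rapids_singlecell as rsc
+
+print(rsc.__version__)                      # should report 0.15.0
+print(cp.cuda.runtime.runtimeGetVersion())  # e.g. 13000 for CUDA 13.0
+print(cp.cuda.runtime.getDeviceCount())     # at least 1 visible GPU
+```
+
+If the import succeeds and a GPU is visible, the precompiled kernels are ready to use with no further compilation step.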
+
+### Migration from 0.14.x
+
+For most users, upgrading is straightforward:
+
+1. **Change your pip install command.**
+   Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version.
+2. **No code changes needed.**
+   The `import rapids_singlecell as rsc` import and all public APIs remain the same.
+3. **Check your CUDA version.**
+   Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel.
+   If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel.
+
+### What about `pip install rapids-singlecell`?
+
+The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works.
+It compiles the CUDA extensions from source during installation.
+This is perfectly functional, but be aware of what it means: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build takes longer than downloading a prebuilt wheel.
+
+When building from source, you can install the matching RAPIDS dependencies with the `[rapids-cu12]` or `[rapids-cu13]` extra:
+
+```bash
+pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvidia.com
+```
+
+Or install the RAPIDS stack separately, before or after the build.
+
+For most users, we recommend the prebuilt CUDA wheels.
+They're faster to install and don't require a local compiler toolchain.
+For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html).
+
+Source builds are the right choice if you are:
+
+- **Contributing to rapids-singlecell** and need to iterate on C++ kernel code.
+- **Debugging CUDA extensions** and want to compile with debug flags or sanitizers.
+- **Targeting a custom GPU architecture** not covered by the prebuilt wheels (e.g. a future compute capability).
+- **On a platform we don't publish wheels for** (though we cover x86_64 and aarch64 Linux).
+
+If none of those apply to you, use the prebuilt wheel.
+
+## Other highlights in 0.15.0
+
+Beyond packaging, this release includes a substantial set of algorithmic and performance improvements built up across the 0.15.0 development cycle:
+
+### Harmony2 and C++ Harmony
+
+Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient.
+On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)).
+This is also the first example of a more complex routine built on the new compiled-kernel infrastructure.
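+
+From the Python side, the rewrite is invisible: Harmony is called the same way as before. A minimal sketch, assuming an `AnnData` object with a PCA embedding and a `"batch"` column in `.obs` (the file path and column name are placeholders):
+
+```python
+import scanpy as sc
+import rapids_singlecell as rsc
+
+adata = sc.read_h5ad("my_dataset.h5ad")  # placeholder path
+sc.pp.pca(adata, n_comps=50)
+
+# Runs the compiled C++/CUDA Harmony kernel; the corrected embedding
+# lands in adata.obsm["X_pca_harmony"], as in previous releases.
+rsc.pp.harmony_integrate(adata, key="batch")
+```
+
+The Harmony2 improvements described above happen inside this call; see the linked PRs for parameter details.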
+
+### Contrast-based energy distance
+
+Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type.
+The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that.
+You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type).
+Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to GPU, and launches a single kernel call for all unique pairs.
+The result is a copy of your contrasts DataFrame with an `edistance` column appended.
+
+```python
+from rapids_singlecell.pertpy_gpu import Distance
+
+dist = Distance("edistance")
+
+# Compare each perturbation against two controls, stratified by cell type
+contrasts = Distance.create_contrasts(
+    adata,
+    groupby="target_gene",
+    selected_group=["Non_target", "Scramble"],
+    split_by="cell_type",
+)
+
+result = dist.contrast_distances(adata, contrasts=contrasts)
+```
+
+`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)).
+Both the energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)).
+
+### More highlights
+
+- **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments.
+- **Dask support for `highly_variable_genes`** with the Seurat v3 flavor ([#616](https://github.com/scverse/rapids-singlecell/pull/616)).
+- **CUDA kernel error surfacing** — launch errors are now raised instead of being silently ignored ([#619](https://github.com/scverse/rapids-singlecell/pull/619)).
+- **Additional tutorials**, including a Pertpy-GPU tutorial ([#645](https://github.com/scverse/rapids-singlecell/pull/645)).
+
+A big thank you to everyone who tested the pre-releases and helped surface issues before this release went out.
+
+For questions and bug reports, visit the [GitHub issue tracker](https://github.com/scverse/rapids_singlecell/issues).
+
+---
+
+*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem.
+If you use it in your research, please cite the project.*