CUDA: Fix ssm_scan_f32 data-races by ORippler · Pull Request #24360 · ggml-org/llama.cpp

ORippler · 2026-06-09T13:44:19Z

Overview

Add the required __syncthreads() calls to avoid data races in ssm_scan_f32. Also remove unused smem from the kernel.

Additional information

Should supersede #23983, as it fixes the underlying issues (data races; the fix in 4fbecf7 applies to the HIP/MUSA backends as well). For more details on the races, refer to the individual commit messages.

Should resolve sporadic failures of CUDA CI such as https://github.com/ggml-org/llama.cpp/actions/runs/27192383880/job/80275487186?pr=24331 (verified this on a local DGX Spark)

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

IMbackK

Static analysis looks good. I can reproduce both the problem and the fix on gfx1100.

gaugarg-nv · 2026-06-10T11:29:13Z

Change looks good to me. Does it have any perf impact?

ORippler · 2026-06-10T12:23:05Z

Does it have any perf impact?

2.7% slowdown on the kernel on a B6000. Can always go for double-buffering should it turn out to affect E2E perf significantly
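To illustrate the double-buffering alternative mentioned here, the following is a minimal sketch on a hypothetical kernel (`neighbor_sum_db`, `BLOCK`, and the host helper are illustrative names, not the actual ssm_scan_f32 code). It assumes the loop writes shared memory, synchronizes, and then reads other threads' values. Alternating between two buffers means the next iteration writes to a different buffer than the one still being read, so a single barrier per iteration should suffice.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // threads per block (illustrative)

// Hypothetical kernel: per iteration, each thread publishes one value and
// accumulates its right-hand neighbour's value.
__global__ void neighbor_sum_db(const float * x, float * out, int n_iters) {
    __shared__ float smem[2][BLOCK];  // two buffers, used alternately
    const int tid = threadIdx.x;
    float acc = 0.0f;
    for (int it = 0; it < n_iters; ++it) {
        float * buf = smem[it & 1];
        buf[tid] = x[it * BLOCK + tid];
        __syncthreads();  // writes of this iteration are visible to all threads
        acc += buf[(tid + 1) % BLOCK];
        // No trailing barrier: the next iteration writes the other buffer, and
        // this buffer is only rewritten after the next iteration's barrier.
    }
    out[tid] = acc;
}

// Host helper (illustrative): runs the kernel and compares with a CPU reference.
bool run_neighbor_sum_db(int n_iters) {
    std::vector<float> hx(n_iters * BLOCK), hout(BLOCK, -1.0f);
    for (size_t i = 0; i < hx.size(); ++i) hx[i] = float(i % 7);
    float * dx = nullptr;
    float * dout = nullptr;
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dout, BLOCK * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
    neighbor_sum_db<<<1, BLOCK>>>(dx, dout, n_iters);
    bool ok = cudaMemcpy(hout.data(), dout, BLOCK * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dx);
    cudaFree(dout);
    for (int t = 0; t < BLOCK; ++t) {
        float ref = 0.0f;
        for (int it = 0; it < n_iters; ++it) ref += hx[it * BLOCK + (t + 1) % BLOCK];
        ok = ok && (hout[t] == ref);
    }
    return ok;
}
```

The trade-off in this sketch is twice the shared memory for one fewer barrier per iteration; whether that recovers the measured slowdown would need profiling on the real kernel.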

ORippler added 3 commits June 9, 2026 13:57

Add missing syncthreads before reusing cub_temp_storage

a941c2c

__syncthreads() is required before being allowed to reuse TempStorage smem: https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti
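As a rough illustration of the rule in this commit message, the sketch below reuses one `cub::BlockLoad` `TempStorage` for two consecutive loads, with a barrier in between. The kernel `add_tiles`, the constants, and the host helper are hypothetical and are not taken from ssm_scan_f32; only the `cub::BlockLoad` usage follows the linked CUB documentation.

```cuda
#include <cub/block/block_load.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS = 128;                    // illustrative block size
constexpr int ITEMS         = 4;                      // items per thread
constexpr int TILE          = BLOCK_THREADS * ITEMS;  // elements per block

// Hypothetical kernel: loads two tiles through the same TempStorage and adds them.
__global__ void add_tiles(const float * a, const float * b, float * out) {
    using BlockLoadT =
        cub::BlockLoad<float, BLOCK_THREADS, ITEMS, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    float va[ITEMS];
    float vb[ITEMS];

    BlockLoadT(temp_storage).Load(a, va);
    __syncthreads();  // every thread must be done with temp_storage before reuse
    BlockLoadT(temp_storage).Load(b, vb);

    // Load() yields a blocked arrangement: thread t owns items [t*ITEMS, t*ITEMS + ITEMS).
    for (int i = 0; i < ITEMS; ++i) {
        out[threadIdx.x * ITEMS + i] = va[i] + vb[i];
    }
}

// Host helper (illustrative): runs one block and checks the result on the CPU.
bool run_add_tiles() {
    std::vector<float> ha(TILE), hb(TILE), hout(TILE, -1.0f);
    for (int i = 0; i < TILE; ++i) {
        ha[i] = float(i);
        hb[i] = float(2 * i);
    }
    const size_t bytes = TILE * sizeof(float);
    float * da = nullptr;
    float * db = nullptr;
    float * dout = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
    add_tiles<<<1, BLOCK_THREADS>>>(da, db, dout);
    bool ok = cudaMemcpy(hout.data(), dout, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);
    for (int i = 0; i < TILE; ++i) {
        ok = ok && (hout[i] == float(3 * i));
    }
    return ok;
}
```

Without the barrier, a fast warp could start staging the second tile in `temp_storage` while a slower warp is still reading the first one, which is the kind of race that tends to show up only sporadically.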

Add one more missing __syncthreads

4fbecf7

Could also double-buffer, but alternative is to simply ensure all threads have read smem* before writing to it again in the next loop iteration
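The single-buffer approach described in this commit message can be sketched as follows, under the assumption that each loop iteration writes shared memory and then reads values written by other threads. The kernel `neighbor_sum`, `BLOCK`, and the host helper are hypothetical stand-ins and not the ssm_scan_f32 code. The second barrier is the one that keeps a fast thread from overwriting a slot that a slower thread has not read yet.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // threads per block (illustrative)

// Hypothetical kernel: per iteration, each thread publishes one value and
// accumulates its right-hand neighbour's value.
__global__ void neighbor_sum(const float * x, float * out, int n_iters) {
    __shared__ float smem[BLOCK];
    const int tid = threadIdx.x;
    float acc = 0.0f;
    for (int it = 0; it < n_iters; ++it) {
        smem[tid] = x[it * BLOCK + tid];
        __syncthreads();  // all writes are visible before anyone reads
        acc += smem[(tid + 1) % BLOCK];
        __syncthreads();  // all reads are done before the next iteration's writes
    }
    out[tid] = acc;
}

// Host helper (illustrative): runs the kernel and compares with a CPU reference.
bool run_neighbor_sum(int n_iters) {
    std::vector<float> hx(n_iters * BLOCK), hout(BLOCK, -1.0f);
    for (size_t i = 0; i < hx.size(); ++i) hx[i] = float(i % 7);
    float * dx = nullptr;
    float * dout = nullptr;
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dout, BLOCK * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
    neighbor_sum<<<1, BLOCK>>>(dx, dout, n_iters);
    bool ok = cudaMemcpy(hout.data(), dout, BLOCK * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dx);
    cudaFree(dout);
    for (int t = 0; t < BLOCK; ++t) {
        float ref = 0.0f;
        for (int it = 0; it < n_iters; ++it) ref += hx[it * BLOCK + (t + 1) % BLOCK];
        ok = ok && (hout[t] == ref);
    }
    return ok;
}
```

A passing run of a kernel without the trailing barrier would not prove the absence of a race; a tool such as `compute-sanitizer --tool racecheck` is a more reliable way to confirm shared-memory hazards.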

Remove unused smem from ssm_scan_f32

814e106

ORippler requested a review from a team as a code owner June 9, 2026 13:44

github-actions Bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Jun 9, 2026

ORippler mentioned this pull request Jun 9, 2026

HIP: fixing SSM_SCAN backend testcase error #23983

Closed

IMbackK approved these changes Jun 9, 2026

ORippler requested a review from gaugarg-nv June 10, 2026 08:24

gaugarg-nv approved these changes Jun 10, 2026

ORippler merged commit fb83cc9 into ggml-org:master Jun 10, 2026
22 checks passed

ORippler deleted the osimons/fix_ssm_scan_f32 branch June 10, 2026 12:27
