CUDA: Fix ssm_scan_f32 data-races#24360
Merged
Merged
Conversation
__syncthreads() is required before being allowed to resue TempStorage smem: https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti
Could also double-buffer, but alternative is to simply ensure all threads have read smem* before writing to it again in the next loop iteration
IMbackK
approved these changes
Jun 9, 2026
IMbackK
left a comment
Collaborator
There was a problem hiding this comment.
Static analysis looks good, Can reproduce this problem and fix with gfx1100
Contributor
|
Change looks good to me. Does it have any perf impact? |
gaugarg-nv
approved these changes
Jun 10, 2026
Collaborator
Author
2.7% slowdown on the kernel on a B6000. Can always go for double-buffering should it turn out to affect E2E perf significantly |
Jcfunk
added a commit
to Jcfunk/llama.cpp
that referenced
this pull request
Jun 11, 2026
* upstream/HEAD: (329 commits) vendor : update LibreSSL to 4.3.2 (ggml-org#24397) Remove padding and multiple D2D copies for MTP (ggml-org#24086) chat: fix LFM2/LFM2.5 ignoring json_schema (ggml-org#24377) CUDA: Fix ssm_scan_f32 data-races (ggml-org#24360) ci : bump komac version (ggml-org#24396) speculative : fix "ngram-map-k4v" name in logging (ggml-org#24253) webui: implement pinned conversations support (ggml-org#21387) graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (ggml-org#24357) ci : fix windows release (ggml-org#24369) ui: add opt-in run_javascript frontend tool (ggml-org#24244) mtmd: build_vit batching (ggml-org#24352) vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287) vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (ggml-org#24123) ui: Fix excessive style recalculation on hover (ggml-org#24243) mtmd: refactor video subproc handling (ggml-org#24316) server: log prompts to directory (ggml-org#22031) ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158) ggml : add GGML_OP_COL2IM_1D (ggml-org#24206) server : do not clear slots without unified KV cache (ggml-org#24190) models : fix plamo2 attention_key/value_length regression (ggml-org#24317) ...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Add required
__synchthreads()to avoid data-races inssm_scan_f32. Also remove unused smem from the kernel.Additional information
Should supersede #23983 as it fixes the underlying issues (which are data-races, where 4fbecf7 applies to HIP/MUSA backends as well). For more details on the races, refer the individual commit messages.
Should resolve sporadic failures of CUDA CI such as https://github.com/ggml-org/llama.cpp/actions/runs/27192383880/job/80275487186?pr=24331 (verified this on a local DGX Spark)
Requirements