HIP: fixing SSM_SCAN backend testcase error #23983
JohannesGaessler
left a comment
Sorry, I don't understand the goal of this PR. It sounds to me like there are issues with correctness but whether or not there is a template specialization for 32 should only affect performance.
The issue isn't that the specialization changes correctness directly. The problem is that the host can select mmq_x = 32, but there is no matching device implementation. This PR adds the missing specialization so the selected configuration is actually supported, fixing the correctness bug seen on Strix Point (gfx1150) specifically.
@jiachengjason Is this potentially affecting the CUDA backend as well? https://github.com/ggml-org/llama.cpp/actions/runs/27192383880/job/80275487186?pr=24331 I quickly verified that it is not a PDL issue on DGX Spark & efbacf8. It seems we are likely missing a syncthreads somewhere, at least in the CUB path.
@jiachengjason please see if #24360 fixes the correctness issues on your side as well.
Yes, it does.
Closing this PR, as #24360 fixes this.
Overview
On Strix Point (AMD Radeon(TM) 890M Graphics, gfx1150), ctest test-backend-ops fails because the SSM_SCAN result on ROCm deviates from the reference by more than the test's maximum allowed error.
The Mamba-1 HIP/CUDA SSM_SCAN path has compile-time specializations for n_tok = 1..8; n_tok = 32 previously fell through to the generic runtime-length kernel (ssm_scan_f32<threads, 16, 0>).
Add a case 32 specialization (ssm_scan_f32<threads, 16, 32>), which routes the failing Mamba-1 shape away from the generic runtime-length kernel. The actual SSM recurrence math is unchanged.
Validation
Full backend op suite passed:
ctest -R test-backend-ops --verbose
12383/12383 tests passed
No regression on various models.
Full performance results for llama-bench across various quantizations
Requirements