HIP: fixing SSM_SCAN backend testcase error #23983

Closed
jiachengjason wants to merge 1 commit into ggml-org:master from jiachengjason:fix/jiachengjason/ssm_scan_error

Conversation

@jiachengjason

Contributor

Overview

The test-backend-ops ctest fails on Strix Point (AMD Radeon(TM) 890M Graphics, gfx1150) because the ROCm SSM_SCAN result differs from the reference by more than the test's maximum allowed error.
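As a rough illustration of what "beyond the maximum allowed error" means, the sketch below compares a backend result against a reference with a normalized mean squared error. The choice of metric and the names `nmse` and `within_tolerance` are assumptions made for this example; the authoritative metric and threshold live in test-backend-ops itself.

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error between a backend result and a reference:
// sum((out - ref)^2) / sum(ref^2).
// Assumption: an NMSE-style metric; the real check is defined by the test
// harness, not by this sketch. The reference must not be all zeros.
double nmse(const std::vector<float> & out, const std::vector<float> & ref) {
    double err  = 0.0; // accumulated squared difference
    double norm = 0.0; // accumulated squared reference magnitude
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(out[i]) - double(ref[i]);
        err  += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    return err / norm;
}

// Hypothetical helper: true when the backend result is within tolerance.
bool within_tolerance(const std::vector<float> & out,
                      const std::vector<float> & ref,
                      double max_err) {
    return nmse(out, ref) <= max_err;
}
```

With a limit of 1e-7, which is the limit printed in the failing CUDA log further down this thread, even a single badly wrong element in a small tensor pushes the error over the threshold.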

The Mamba-1 HIP/CUDA SSM_SCAN path has compile-time specializations for n_tok = 1..8. Previously, n_tok = 32 fell through to the generic runtime-length kernel (ssm_scan_f32<threads, 16, 0>).

This PR adds a case 32 specialization (ssm_scan_f32<threads, 16, 32>), which routes the failing Mamba-1 shape away from the generic runtime-length kernel. The actual SSM recurrence math is unchanged.
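The dispatch pattern described above can be sketched in plain C++ as follows. This is a host-only stand-in, not the code from the PR: `launch`, `dispatch`, and the `has_case_32` flag are hypothetical names, and the template argument 0 stands for the generic runtime-length kernel.

```cpp
#include <string>

// Stand-in for a kernel launch; it only reports which instantiation was
// selected. L == 0 means the sequence length is known only at run time.
template <int L>
std::string launch() {
    if (L == 0) {
        return std::string("generic");
    }
    return "specialized<" + std::to_string(L) + ">";
}

// Picks an instantiation from the runtime token count.
// has_case_32 == false models the behavior before this PR,
// has_case_32 == true models the behavior with the added case.
std::string dispatch(int n_tok, bool has_case_32) {
    switch (n_tok) {
        case 1:  return launch<1>();
        case 2:  return launch<2>();
        case 3:  return launch<3>();
        case 4:  return launch<4>();
        case 5:  return launch<5>();
        case 6:  return launch<6>();
        case 7:  return launch<7>();
        case 8:  return launch<8>();
        case 32: return has_case_32 ? launch<32>() : launch<0>();
        default: return launch<0>(); // every other length uses the generic path
    }
}
```

In this sketch `dispatch(32, false)` selects the generic path and `dispatch(32, true)` selects the new specialization, while lengths such as 16 still go to the generic path in both cases.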

Validation

Full backend op suite passed:

ctest -R test-backend-ops --verbose
12383/12383 tests passed

No regression on various models.

Full llama-bench performance results across various quantizations:
GPU | Model | Microbatch size | Test | t/s (master) | t/s (fix/jiachengjason/ssm_scan_error) | Speedup
890M Graphics llama 8B IQ1_S - 1.5625 bpw 1 pp2048 36.57 36.57 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 1 tg128 31.86 31.76 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 2 pp2048 61.27 61.29 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 2 tg128 31.81 31.81 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 4 pp2048 85.41 85.62 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 4 tg128 31.81 31.90 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 8 pp2048 102.14 102.39 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 8 tg128 31.83 31.82 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 16 pp2048 238.49 238.71 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 16 tg128 31.91 31.91 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 32 pp2048 299.34 299.41 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 32 tg128 31.91 31.91 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 64 pp2048 377.82 378.10 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 64 tg128 31.90 31.82 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 128 pp2048 422.25 426.81 1.01
890M Graphics llama 8B IQ1_S - 1.5625 bpw 128 tg128 31.81 31.91 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 256 pp2048 431.24 432.57 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 256 tg128 31.82 31.90 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 512 pp2048 437.93 437.07 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 512 tg128 31.79 31.84 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 1024 pp2048 441.16 441.53 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 1024 tg128 31.82 31.91 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 2048 pp2048 437.11 437.67 1.00
890M Graphics llama 8B IQ1_S - 1.5625 bpw 2048 tg128 31.88 31.81 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 1 pp2048 20.21 20.20 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 1 tg128 18.41 18.39 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 2 pp2048 35.94 35.76 0.99
890M Graphics llama 8B IQ2_S - 2.5 bpw 2 tg128 18.44 18.43 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 4 pp2048 57.39 57.16 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 4 tg128 18.43 18.41 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 8 pp2048 79.16 78.86 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 8 tg128 18.46 18.44 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 16 pp2048 133.13 133.26 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 16 tg128 18.45 18.44 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 32 pp2048 226.37 226.67 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 32 tg128 18.46 18.44 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 64 pp2048 305.04 305.83 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 64 tg128 18.47 18.45 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 128 pp2048 376.43 380.55 1.01
890M Graphics llama 8B IQ2_S - 2.5 bpw 128 tg128 18.44 18.42 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 256 pp2048 386.74 385.20 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 256 tg128 18.44 18.41 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 512 pp2048 384.26 383.30 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 512 tg128 18.45 18.41 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 1024 pp2048 384.29 384.02 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 1024 tg128 18.48 18.41 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 2048 pp2048 381.21 380.63 1.00
890M Graphics llama 8B IQ2_S - 2.5 bpw 2048 tg128 18.45 18.41 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 1 pp2048 20.56 20.49 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 1 tg128 18.68 18.66 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 2 pp2048 36.52 36.44 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 2 tg128 18.72 18.71 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 4 pp2048 56.93 56.88 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 4 tg128 18.69 18.70 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 8 pp2048 77.35 77.30 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 8 tg128 18.82 18.82 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 16 pp2048 130.97 131.18 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 16 tg128 18.72 18.81 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 32 pp2048 226.36 226.31 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 32 tg128 18.82 18.73 0.99
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 64 pp2048 299.03 300.98 1.01
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 64 tg128 18.73 18.81 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 128 pp2048 362.84 367.00 1.01
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 128 tg128 18.71 18.72 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 256 pp2048 377.14 376.25 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 256 tg128 18.80 18.82 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 512 pp2048 377.74 378.36 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 512 tg128 18.71 18.71 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 1024 pp2048 381.70 381.27 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 1024 tg128 18.81 18.71 0.99
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 2048 pp2048 379.66 378.78 1.00
890M Graphics llama 8B IQ2_XS - 2.3125 bpw 2048 tg128 18.71 18.72 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 1 pp2048 21.23 21.22 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 1 tg128 19.25 19.25 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 2 pp2048 37.90 37.98 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 2 tg128 19.28 19.32 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 4 pp2048 59.61 59.78 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 4 tg128 19.37 19.38 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 8 pp2048 83.44 83.66 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 8 tg128 19.39 19.30 0.99
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 16 pp2048 170.06 169.89 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 16 tg128 19.32 19.30 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 32 pp2048 230.48 230.22 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 32 tg128 19.36 19.31 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 64 pp2048 346.52 346.67 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 64 tg128 19.30 19.30 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 128 pp2048 406.85 405.05 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 128 tg128 19.26 19.28 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 256 pp2048 410.89 409.64 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 256 tg128 19.36 19.38 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 512 pp2048 414.14 414.00 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 512 tg128 19.41 19.27 0.99
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 1024 pp2048 417.32 416.88 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 1024 tg128 19.27 19.27 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 2048 pp2048 414.62 414.79 1.00
890M Graphics llama 8B IQ2_XXS - 2.0625 bpw 2048 tg128 19.39 19.27 0.99
890M Graphics llama 8B IQ3_S - 3.4375 bpw 1 pp2048 18.09 18.09 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 1 tg128 16.34 16.35 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 2 pp2048 32.50 32.62 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 2 tg128 16.36 16.38 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 4 pp2048 54.41 54.46 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 4 tg128 16.35 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 8 pp2048 81.95 82.03 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 8 tg128 16.36 16.38 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 16 pp2048 156.97 156.99 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 16 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 32 pp2048 204.42 205.03 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 32 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 64 pp2048 321.51 323.04 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 64 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 128 pp2048 402.74 402.92 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 128 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 256 pp2048 411.55 409.47 0.99
890M Graphics llama 8B IQ3_S - 3.4375 bpw 256 tg128 16.38 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 512 pp2048 410.19 409.15 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 512 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 1024 pp2048 410.15 410.97 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 1024 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 2048 pp2048 405.98 406.25 1.00
890M Graphics llama 8B IQ3_S - 3.4375 bpw 2048 tg128 16.37 16.37 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 1 pp2048 17.91 17.85 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 1 tg128 16.19 16.15 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 2 pp2048 32.35 31.62 0.98
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 2 tg128 16.22 16.20 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 4 pp2048 52.64 51.62 0.98
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 4 tg128 16.21 16.18 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 8 pp2048 74.86 73.89 0.99
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 8 tg128 16.21 16.20 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 16 pp2048 158.42 158.48 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 16 tg128 16.21 16.19 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 32 pp2048 209.39 210.10 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 32 tg128 16.21 16.20 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 64 pp2048 325.22 325.74 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 64 tg128 16.22 16.20 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 128 pp2048 402.52 403.57 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 128 tg128 16.22 16.19 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 256 pp2048 409.95 410.68 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 256 tg128 16.22 16.19 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 512 pp2048 409.22 409.52 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 512 tg128 16.22 16.18 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 1024 pp2048 409.30 411.77 1.01
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 1024 tg128 16.22 16.19 1.00
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 2048 pp2048 405.64 402.82 0.99
890M Graphics llama 8B IQ3_S mix - 3.66 bpw 2048 tg128 16.21 16.20 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 1 pp2048 18.30 18.25 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 1 tg128 16.49 16.45 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 2 pp2048 34.12 33.76 0.99
890M Graphics llama 8B IQ3_XS - 3.3 bpw 2 tg128 16.53 16.50 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 4 pp2048 54.76 54.26 0.99
890M Graphics llama 8B IQ3_XS - 3.3 bpw 4 tg128 16.52 16.50 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 8 pp2048 81.56 80.92 0.99
890M Graphics llama 8B IQ3_XS - 3.3 bpw 8 tg128 16.53 16.48 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 16 pp2048 161.65 161.32 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 16 tg128 16.53 16.50 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 32 pp2048 206.30 204.35 0.99
890M Graphics llama 8B IQ3_XS - 3.3 bpw 32 tg128 16.52 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 64 pp2048 321.84 322.08 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 64 tg128 16.52 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 128 pp2048 408.54 408.46 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 128 tg128 16.52 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 256 pp2048 415.98 416.64 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 256 tg128 16.53 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 512 pp2048 414.60 414.25 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 512 tg128 16.53 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 1024 pp2048 416.63 415.71 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 1024 tg128 16.53 16.49 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 2048 pp2048 410.27 411.02 1.00
890M Graphics llama 8B IQ3_XS - 3.3 bpw 2048 tg128 16.52 16.49 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 1 pp2048 18.81 18.71 0.99
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 1 tg128 17.20 17.12 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 2 pp2048 35.25 34.75 0.99
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 2 tg128 17.25 17.19 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 4 pp2048 55.05 54.20 0.98
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 4 tg128 17.24 17.16 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 8 pp2048 80.86 79.93 0.99
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 8 tg128 17.24 17.16 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 16 pp2048 158.49 158.32 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 16 tg128 17.25 17.17 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 32 pp2048 210.59 210.42 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 32 tg128 17.24 17.16 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 64 pp2048 322.14 323.35 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 64 tg128 17.24 17.16 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 128 pp2048 406.27 401.38 0.99
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 128 tg128 17.25 17.16 0.99
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 256 pp2048 415.45 415.68 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 256 tg128 17.25 17.19 1.00
890M Graphics llama 8B IQ3_XXS - 3.0625 bpw 512 pp2048 411.55 414.68 1.01
890M Graphics llama 8B IQ4_NL - 4.5 bpw 2 pp2048 33.89 33.66 0.99
890M Graphics llama 8B IQ4_NL - 4.5 bpw 2 tg128 15.89 15.77 0.99
890M Graphics llama 8B IQ4_NL - 4.5 bpw 32 tg128 15.83 15.71 0.99
890M Graphics llama 8B IQ4_NL - 4.5 bpw 64 pp2048 406.78 405.96 1.00
890M Graphics llama 8B IQ4_NL - 4.5 bpw 64 tg128 15.90 15.76 0.99
890M Graphics llama 8B IQ4_NL - 4.5 bpw 1024 tg128 15.83 15.70 0.99
890M Graphics llama 8B IQ4_XS - 4.25 bpw 1 pp2048 18.61 18.56 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 2 tg128 16.87 16.81 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 4 pp2048 66.56 66.24 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 4 tg128 16.84 16.79 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 32 pp2048 206.51 205.64 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 32 tg128 16.85 16.80 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 64 pp2048 395.79 395.91 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 64 tg128 16.87 16.81 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 128 pp2048 445.24 442.70 0.99
890M Graphics llama 8B IQ4_XS - 4.25 bpw 128 tg128 16.83 16.79 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 512 pp2048 457.69 457.57 1.00
890M Graphics llama 8B IQ4_XS - 4.25 bpw 2048 tg128 16.84 16.74 0.99
890M Graphics llama 8B Q2_K_S 2 pp2048 42.05 42.28 1.01
890M Graphics llama 8B Q2_K_S 4 tg128 23.29 23.35 1.00
890M Graphics llama 8B Q2_K_S 8 pp2048 50.49 50.66 1.00
890M Graphics llama 8B Q2_K_S 8 tg128 23.30 23.38 1.00
890M Graphics llama 8B Q2_K_S 16 pp2048 132.95 132.89 1.00
890M Graphics llama 8B Q2_K_S 16 tg128 23.31 23.38 1.00
890M Graphics llama 8B Q2_K_S 32 pp2048 175.47 175.73 1.00
890M Graphics llama 8B Q2_K_S 32 tg128 23.30 23.36 1.00
890M Graphics llama 8B Q2_K_S 64 pp2048 229.07 229.34 1.00
890M Graphics llama 8B Q2_K_S 64 tg128 23.31 23.38 1.00
890M Graphics llama 8B Q2_K_S 128 pp2048 232.60 232.67 1.00
890M Graphics llama 8B Q2_K_S 128 tg128 23.31 23.35 1.00
890M Graphics llama 8B Q2_K_S 256 pp2048 252.69 252.08 1.00
890M Graphics llama 8B Q2_K_S 256 tg128 23.22 23.35 1.01
890M Graphics llama 8B Q2_K_S 512 pp2048 318.47 317.78 1.00
890M Graphics llama 8B Q2_K_S 512 tg128 23.22 23.33 1.00
890M Graphics llama 8B Q2_K_S 1024 pp2048 355.28 355.33 1.00
890M Graphics llama 8B Q2_K_S 1024 tg128 23.22 23.31 1.00
890M Graphics llama 8B Q2_K_S 2048 pp2048 367.43 365.72 1.00
890M Graphics llama 8B Q2_K_S 2048 tg128 23.21 23.32 1.00
890M Graphics llama 8B Q3_K_S 1 pp2048 21.40 21.23 0.99
890M Graphics llama 8B Q3_K_S 1 tg128 19.12 18.98 0.99
890M Graphics llama 8B Q3_K_S 2 pp2048 37.34 37.30 1.00
890M Graphics llama 8B Q3_K_S 2 tg128 19.15 19.01 0.99
890M Graphics llama 8B Q3_K_S 4 pp2048 51.56 51.27 0.99
890M Graphics llama 8B Q3_K_S 4 tg128 19.14 19.01 0.99
890M Graphics llama 8B Q3_K_S 8 pp2048 55.21 55.31 1.00
890M Graphics llama 8B Q3_K_S 8 tg128 19.14 19.03 0.99
890M Graphics llama 8B Q3_K_S 16 pp2048 178.75 178.89 1.00
890M Graphics llama 8B Q3_K_S 16 tg128 19.15 19.01 0.99
890M Graphics llama 8B Q3_K_S 32 pp2048 270.86 270.58 1.00
890M Graphics llama 8B Q3_K_S 32 tg128 19.14 19.06 1.00
890M Graphics llama 8B Q3_K_S 64 pp2048 347.10 347.42 1.00
890M Graphics llama 8B Q3_K_S 64 tg128 19.15 19.04 0.99
890M Graphics llama 8B Q3_K_S 128 pp2048 390.06 392.93 1.01
890M Graphics llama 8B Q3_K_S 128 tg128 19.15 19.06 1.00
890M Graphics llama 8B Q3_K_S 256 pp2048 400.68 399.53 1.00
890M Graphics llama 8B Q3_K_S 256 tg128 19.16 19.05 0.99
890M Graphics llama 8B Q3_K_S 512 pp2048 403.20 403.94 1.00
890M Graphics llama 8B Q3_K_S 512 tg128 19.15 19.03 0.99
890M Graphics llama 8B Q3_K_S 1024 pp2048 405.46 406.36 1.00
890M Graphics llama 8B Q3_K_S 1024 tg128 19.14 19.02 0.99
890M Graphics llama 8B Q3_K_S 2048 pp2048 401.53 402.33 1.00
890M Graphics llama 8B Q3_K_S 2048 tg128 19.16 19.03 0.99
890M Graphics llama 8B Q4_0 1 pp2048 17.26 17.24 1.00
890M Graphics llama 8B Q4_0 1 tg128 15.69 15.68 1.00
890M Graphics llama 8B Q4_0 2 pp2048 33.75 33.70 1.00
890M Graphics llama 8B Q4_0 2 tg128 15.74 15.72 1.00
890M Graphics llama 8B Q4_0 4 pp2048 64.12 64.14 1.00
890M Graphics llama 8B Q4_0 4 tg128 15.68 15.68 1.00
890M Graphics llama 8B Q4_0 8 pp2048 107.03 107.45 1.00
890M Graphics llama 8B Q4_0 8 tg128 15.70 15.67 1.00
890M Graphics llama 8B Q4_0 16 pp2048 204.84 204.94 1.00
890M Graphics llama 8B Q4_0 16 tg128 15.68 15.69 1.00
890M Graphics llama 8B Q4_0 32 pp2048 211.92 211.25 1.00
890M Graphics llama 8B Q4_0 32 tg128 15.69 15.68 1.00
890M Graphics llama 8B Q4_0 64 pp2048 391.87 392.26 1.00
890M Graphics llama 8B Q4_0 64 tg128 15.74 15.74 1.00
890M Graphics llama 8B Q4_0 128 pp2048 435.83 434.69 1.00
890M Graphics llama 8B Q4_0 128 tg128 15.68 15.68 1.00
890M Graphics llama 8B Q4_0 256 pp2048 444.76 444.46 1.00
890M Graphics llama 8B Q4_0 256 tg128 15.68 15.68 1.00
890M Graphics llama 8B Q4_0 512 pp2048 443.89 442.61 1.00
890M Graphics llama 8B Q4_0 512 tg128 15.69 15.68 1.00
890M Graphics llama 8B Q4_0 1024 pp2048 446.93 443.11 0.99
890M Graphics llama 8B Q4_0 1024 tg128 15.68 15.68 1.00
890M Graphics llama 8B Q4_0 2048 pp2048 443.72 441.71 1.00
890M Graphics llama 8B Q4_0 2048 tg128 15.68 15.68 1.00
890M Graphics llama 8B Q4_1 1 pp2048 15.82 15.83 1.00
890M Graphics llama 8B Q4_1 1 tg128 14.49 14.51 1.00
890M Graphics llama 8B Q4_1 2 pp2048 31.11 31.14 1.00
890M Graphics llama 8B Q4_1 2 tg128 14.54 14.54 1.00
890M Graphics llama 8B Q4_1 4 pp2048 58.77 58.72 1.00
890M Graphics llama 8B Q4_1 4 tg128 14.47 14.47 1.00
890M Graphics llama 8B Q4_1 8 pp2048 98.59 98.83 1.00
890M Graphics llama 8B Q4_1 8 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_1 16 pp2048 172.82 173.29 1.00
890M Graphics llama 8B Q4_1 16 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_1 32 pp2048 270.18 271.56 1.01
890M Graphics llama 8B Q4_1 32 tg128 14.48 14.49 1.00
890M Graphics llama 8B Q4_1 64 pp2048 372.27 372.04 1.00
890M Graphics llama 8B Q4_1 64 tg128 14.53 14.54 1.00
890M Graphics llama 8B Q4_1 128 pp2048 391.85 397.34 1.01
890M Graphics llama 8B Q4_1 128 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_1 256 pp2048 404.48 401.70 0.99
890M Graphics llama 8B Q4_1 256 tg128 14.47 14.48 1.00
890M Graphics llama 8B Q4_1 512 pp2048 405.96 403.57 0.99
890M Graphics llama 8B Q4_1 512 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_1 1024 pp2048 406.38 401.09 0.99
890M Graphics llama 8B Q4_1 1024 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_1 2048 pp2048 402.98 400.61 0.99
890M Graphics llama 8B Q4_1 2048 tg128 14.48 14.48 1.00
890M Graphics llama 8B Q4_K_S 1 pp2048 16.87 16.82 1.00
890M Graphics llama 8B Q4_K_S 1 tg128 15.39 15.31 0.99
890M Graphics llama 8B Q4_K_S 2 pp2048 29.53 28.68 0.97
890M Graphics llama 8B Q4_K_S 2 tg128 15.41 15.34 1.00
890M Graphics llama 8B Q4_K_S 4 pp2048 40.56 39.12 0.96
890M Graphics llama 8B Q4_K_S 4 tg128 15.35 15.32 1.00
890M Graphics llama 8B Q4_K_S 8 pp2048 43.67 42.78 0.98
890M Graphics llama 8B Q4_K_S 8 tg128 15.36 15.33 1.00
890M Graphics llama 8B Q4_K_S 16 pp2048 178.48 177.34 0.99
890M Graphics llama 8B Q4_K_S 16 tg128 15.35 15.31 1.00
890M Graphics llama 8B Q4_K_S 32 pp2048 258.60 257.78 1.00
890M Graphics llama 8B Q4_K_S 32 tg128 15.38 15.33 1.00
890M Graphics llama 8B Q4_K_S 64 pp2048 341.78 340.61 1.00
890M Graphics llama 8B Q4_K_S 64 tg128 15.40 15.36 1.00
890M Graphics llama 8B Q4_K_S 128 pp2048 397.15 394.15 0.99
890M Graphics llama 8B Q4_K_S 128 tg128 15.35 15.32 1.00
890M Graphics llama 8B Q4_K_S 256 pp2048 405.46 401.91 0.99
890M Graphics llama 8B Q4_K_S 256 tg128 15.35 15.30 1.00
890M Graphics llama 8B Q4_K_S 512 pp2048 403.01 399.07 0.99
890M Graphics llama 8B Q4_K_S 512 tg128 15.35 15.30 1.00
890M Graphics llama 8B Q4_K_S 1024 pp2048 405.75 401.45 0.99
890M Graphics llama 8B Q4_K_S 1024 tg128 15.36 15.31 1.00
890M Graphics llama 8B Q4_K_S 2048 pp2048 402.21 395.92 0.98
890M Graphics llama 8B Q4_K_S 2048 tg128 15.34 15.31 1.00
890M Graphics llama 8B Q5_1 1 pp2048 13.38 13.25 0.99
890M Graphics llama 8B Q5_1 1 tg128 12.35 12.23 0.99
890M Graphics llama 8B Q5_1 2 pp2048 26.21 25.91 0.99
890M Graphics llama 8B Q5_1 2 tg128 12.36 12.24 0.99
890M Graphics llama 8B Q5_1 4 pp2048 50.15 49.61 0.99
890M Graphics llama 8B Q5_1 4 tg128 12.35 12.23 0.99
890M Graphics llama 8B Q5_1 8 pp2048 87.26 85.48 0.98
890M Graphics llama 8B Q5_1 8 tg128 12.35 12.24 0.99
890M Graphics llama 8B Q5_1 16 pp2048 113.00 109.96 0.97
890M Graphics llama 8B Q5_1 16 tg128 12.35 12.24 0.99
890M Graphics llama 8B Q5_1 32 pp2048 178.24 178.22 1.00
890M Graphics llama 8B Q5_1 32 tg128 12.36 12.23 0.99
890M Graphics llama 8B Q5_1 64 pp2048 281.86 282.92 1.00
890M Graphics llama 8B Q5_1 64 tg128 12.36 12.25 0.99
890M Graphics llama 8B Q5_1 128 pp2048 347.40 349.11 1.00
890M Graphics llama 8B Q5_1 128 tg128 12.34 12.23 0.99
890M Graphics llama 8B Q5_1 256 pp2048 363.46 361.36 0.99
890M Graphics llama 8B Q5_1 256 tg128 12.34 12.23 0.99
890M Graphics llama 8B Q5_1 512 pp2048 365.58 363.69 0.99
890M Graphics llama 8B Q5_1 512 tg128 12.34 12.23 0.99
890M Graphics llama 8B Q5_1 1024 pp2048 372.09 366.82 0.99
890M Graphics llama 8B Q5_1 1024 tg128 12.34 12.23 0.99
890M Graphics llama 8B Q5_1 2048 pp2048 368.54 361.10 0.98
890M Graphics llama 8B Q5_1 2048 tg128 12.34 12.23 0.99
890M Graphics llama 8B Q5_K_S 1 pp2048 13.91 13.83 0.99
890M Graphics llama 8B Q5_K_S 1 tg128 12.86 12.79 0.99
890M Graphics llama 8B Q5_K_S 2 pp2048 26.94 25.77 0.96
890M Graphics llama 8B Q5_K_S 2 tg128 12.88 12.81 0.99
890M Graphics llama 8B Q5_K_S 4 pp2048 38.46 36.07 0.94
890M Graphics llama 8B Q5_K_S 4 tg128 12.81 12.73 0.99
890M Graphics llama 8B Q5_K_S 8 pp2048 43.22 41.55 0.96
890M Graphics llama 8B Q5_K_S 8 tg128 12.82 12.75 0.99
890M Graphics llama 8B Q5_K_S 16 pp2048 183.03 175.33 0.96
890M Graphics llama 8B Q5_K_S 16 tg128 12.82 12.76 1.00
890M Graphics llama 8B Q5_K_S 32 pp2048 282.56 280.22 0.99
890M Graphics llama 8B Q5_K_S 32 tg128 12.84 12.77 0.99
890M Graphics llama 8B Q5_K_S 64 pp2048 375.84 376.11 1.00
890M Graphics llama 8B Q5_K_S 64 tg128 12.88 12.80 0.99
890M Graphics llama 8B Q5_K_S 128 pp2048 395.64 402.77 1.02
890M Graphics llama 8B Q5_K_S 128 tg128 12.83 12.74 0.99
890M Graphics llama 8B Q5_K_S 256 pp2048 408.23 406.87 1.00
890M Graphics llama 8B Q5_K_S 256 tg128 12.82 12.74 0.99
890M Graphics llama 8B Q5_K_S 512 pp2048 409.80 407.20 0.99
890M Graphics llama 8B Q5_K_S 512 tg128 12.83 12.75 0.99
890M Graphics llama 8B Q5_K_S 1024 pp2048 420.12 411.83 0.98
890M Graphics llama 8B Q5_K_S 1024 tg128 12.83 12.75 0.99
890M Graphics llama 8B Q5_K_S 2048 pp2048 414.25 410.02 0.99
890M Graphics llama 8B Q5_K_S 2048 tg128 12.83 12.75 0.99
890M Graphics llama 8B Q6_K 1 pp2048 12.01 12.06 1.00
890M Graphics llama 8B Q6_K 1 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 2 pp2048 23.90 24.16 1.01
890M Graphics llama 8B Q6_K 2 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 4 pp2048 43.31 44.67 1.03
890M Graphics llama 8B Q6_K 4 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q6_K 8 pp2048 56.66 57.93 1.02
890M Graphics llama 8B Q6_K 8 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 16 pp2048 141.53 145.66 1.03
890M Graphics llama 8B Q6_K 16 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 32 pp2048 211.04 212.63 1.01
890M Graphics llama 8B Q6_K 32 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 64 pp2048 272.83 273.45 1.00
890M Graphics llama 8B Q6_K 64 tg128 11.15 11.19 1.00
890M Graphics llama 8B Q6_K 128 pp2048 301.31 302.51 1.00
890M Graphics llama 8B Q6_K 128 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q6_K 256 pp2048 306.52 307.73 1.00
890M Graphics llama 8B Q6_K 256 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q6_K 512 pp2048 304.33 304.21 1.00
890M Graphics llama 8B Q6_K 512 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q6_K 1024 pp2048 348.21 348.34 1.00
890M Graphics llama 8B Q6_K 1024 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q6_K 2048 pp2048 360.58 359.34 1.00
890M Graphics llama 8B Q6_K 2048 tg128 11.14 11.18 1.00
890M Graphics llama 8B Q8_0 1 pp2048 9.12 9.10 1.00
890M Graphics llama 8B Q8_0 1 tg128 8.44 7.95 0.94
890M Graphics llama 8B Q8_0 2 pp2048 18.54 18.47 1.00
890M Graphics llama 8B Q8_0 2 tg128 8.46 8.07 0.95
890M Graphics llama 8B Q8_0 4 pp2048 36.40 36.30 1.00
890M Graphics llama 8B Q8_0 4 tg128 8.44 8.11 0.96
890M Graphics llama 8B Q8_0 8 pp2048 68.44 68.33 1.00
890M Graphics llama 8B Q8_0 8 tg128 8.44 8.14 0.96
890M Graphics llama 8B Q8_0 16 pp2048 126.05 125.69 1.00
890M Graphics llama 8B Q8_0 16 tg128 8.44 8.21 0.97
890M Graphics llama 8B Q8_0 32 pp2048 170.86 169.19 0.99
890M Graphics llama 8B Q8_0 32 tg128 8.44 8.17 0.97
890M Graphics llama 8B Q8_0 64 pp2048 322.36 322.29 1.00
890M Graphics llama 8B Q8_0 64 tg128 8.46 8.25 0.98
890M Graphics llama 8B Q8_0 128 pp2048 389.51 377.65 0.97
890M Graphics llama 8B Q8_0 128 tg128 8.43 8.27 0.98
890M Graphics llama 8B Q8_0 256 pp2048 399.14 379.74 0.95
890M Graphics llama 8B Q8_0 256 tg128 8.43 8.27 0.98
890M Graphics llama 8B Q8_0 512 pp2048 400.56 371.29 0.93
890M Graphics llama 8B Q8_0 512 tg128 8.44 8.34 0.99
890M Graphics llama 8B Q8_0 1024 pp2048 404.39 350.36 0.87
890M Graphics llama 8B Q8_0 1024 tg128 8.43 8.32 0.99
890M Graphics llama 8B Q8_0 2048 pp2048 402.18 303.54 0.75
890M Graphics llama 8B Q8_0 2048 tg128 8.43 8.35 0.99
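The Speedup column appears to be the PR throughput divided by the master throughput, rounded to two decimals. A minimal sketch of that arithmetic, where `speedup` is a hypothetical helper name:

```cpp
#include <cmath>

// Ratio of PR throughput to master throughput, rounded to two decimals.
// Values below 1.00 mean the PR branch was slower in that row.
double speedup(double ts_master, double ts_pr) {
    return std::round(ts_pr / ts_master * 100.0) / 100.0;
}
```

For the Q8_0 pp2048 row at microbatch size 2048, `speedup(402.18, 303.54)` reproduces the 0.75 listed in the table.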

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, I used AI to help me understand the code and to give me ideas for the initial implementation.

@jiachengjason changed the title from "fixing SSM_SCAN backend testcase error" to "HIP: fixing SSM_SCAN backend testcase error" Jun 1, 2026
@jiachengjason marked this pull request as ready for review June 1, 2026 16:50
@jiachengjason requested a review from a team as a code owner June 1, 2026 16:50
@github-actions (bot) added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 1, 2026
@ggml-gh-bot

This comment was marked as resolved.

@JohannesGaessler left a comment

Contributor

Sorry, I don't understand the goal of this PR. It sounds to me like there are issues with correctness but whether or not there is a template specialization for 32 should only affect performance.
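To make the reviewer's point concrete: a loop whose trip count is fixed at compile time and the same loop with a runtime trip count should produce the same values. The sketch below uses a simplified single-channel, single-state recurrence as an assumption; it is not the ggml kernel and does not reproduce the GPU failure. The function name `scan` and its parameters are illustrative.

```cpp
#include <cmath>
#include <vector>

// Simplified scan: h = h * exp(dt * A) + dt * B * x[t], y[t] = C * h.
// L > 0: the trip count is a compile-time constant (x must hold at least L values).
// L == 0: the trip count is taken from x.size() at run time.
template <int L>
std::vector<float> scan(const std::vector<float> & x,
                        float A, float B, float C, float dt) {
    const int n = L > 0 ? L : static_cast<int>(x.size());
    std::vector<float> y(n);
    float h = 0.0f; // recurrent state, starts at zero in this sketch
    for (int t = 0; t < n; ++t) {
        h = h * std::exp(dt * A) + dt * B * x[t];
        y[t] = C * h;
    }
    return y;
}
```

Calling `scan<32>` and `scan<0>` on the same 32-element input is expected to give matching outputs, which is why a missing specialization on its own would normally change only performance.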

@jiachengjason

Contributor Author

> Sorry, I don't understand the goal of this PR. It sounds to me like there are issues with correctness but whether or not there is a template specialization for 32 should only affect performance.

The issue isn't that the specialization changes correctness directly. The problem is that the host can select mmq_x = 32, but there is no matching device implementation. This PR adds the missing specialization so the selected configuration is actually supported, fixing the correctness bug seen on Strix Point (gfx1150) specifically.

@ORippler

ORippler commented Jun 9, 2026

Collaborator

@jiachengjason Could this be affecting the CUDA backend as well? https://github.com/ggml-org/llama.cpp/actions/runs/27192383880/job/80275487186?pr=24331

I quickly verified that it is not a PDL issue on DGX Spark at efbacf8:

while true; do GGML_CUDA_PDL=0 ./build/bin/test-backend-ops -o SSM_SCAN || break; done

ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 97246976
Backend 1/2: CUDA0
  Device description: NVIDIA GB10
  Device memory: 126758 MB (94967 MB free)

  SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): OK
  SSM_SCAN(type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): OK
  SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): OK
  SSM_SCAN(type=f32,d_state=128,head_dim=128,n_head=4,n_group=4,n_seq_tokens=16,n_seqs=2,xbc_overlap=1): OK
  4/4 tests passed
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 126758 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 126758 MiB
Testing 2 devices

ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 97161536
Backend 1/2: CUDA0
  Device description: NVIDIA GB10
  Device memory: 126758 MB (94884 MB free)

[SSM_SCAN] ERR = 0.000432363 > 0.000000100   SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): FAIL
  SSM_SCAN(type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): OK
  SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4,xbc_overlap=0): OK
  SSM_SCAN(type=f32,d_state=128,head_dim=128,n_head=4,n_group=4,n_seq_tokens=16,n_seqs=2,xbc_overlap=1): OK
  3/4 tests passed

Failing tests:
  SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4,xbc_overlap=0)
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

It seems we are missing a __syncthreads() somewhere, at least in the CUB path.

@ORippler

ORippler commented Jun 9, 2026

Collaborator

@jiachengjason please see if #24360 fixes the correctness issues on your side as well.

@jiachengjason

Contributor Author

> @jiachengjason please see if #24360 fixes the correctness issues on your side as well.

Yes, it does.

@jiachengjason

Contributor Author

Closing this PR, as #24360 fixes this.

Labels

  • ggml: changes relating to the ggml tensor library for machine learning
  • Nvidia GPU: Issues specific to Nvidia GPUs
