ggml: tune RDNA4 MMVQ warps for K-quants #24386
Open
ammarwa wants to merge 1 commit into
Summary
Tune the RDNA4 MMVQ ncols_dst == 1 warp counts for Q4_K and Q6_K instead of grouping them with the 8-warp simple vec-dot path.
Target GPU
This optimization was developed, profiled, and benchmarked on an AMD gfx1200 GPU using the ROCm/HIP backend. The measured results below are specifically for AMD gfx1200; other AMD GPU architectures may need their own tuning.
Why this improves performance
For Q4_K_M decode workloads on AMD RDNA4 / AMD gfx1200 GPU, profiling showed that the dominant kernels are the HIP MMVQ quantized matrix-vector kernels:
mul_mat_vec_q<(ggml_type)12, 1, true, false> for Q4_K
mul_mat_vec_q<(ggml_type)14, 1, true, false> for Q6_K
These kernels run in the ncols_dst == 1 path used heavily during token generation. The previous RDNA4 parameter table grouped Q4_K and Q6_K with other simple vec-dot quant types at nwarps = 8. PerfXpert-guided testing on AMD gfx1200 showed that this over-provisions warps for these K-quant decode kernels, increasing overhead/pressure without improving throughput.
The tuned values, Q4_K = 3 and Q6_K = 5, better match the measured AMD gfx1200 execution behavior for the dominant decode kernels. Prompt processing remains effectively unchanged, while token-generation throughput improves substantially.
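The selection described above can be sketched as a small lookup. This is a minimal illustration, not the code from the PR: `sketch_type`, `sketch_nwarps_rdna4_decode`, `sketch_block_threads` and `SKETCH_WARP_SIZE` are hypothetical names, the 32-lane wavefront width is an assumption, and only the enum values 12 and 14 and the warp counts 8, 3 and 5 come from the description.

```cpp
#include <cassert>

// Hypothetical stand-in for ggml_type. Only the values 12 (Q4_K) and 14 (Q6_K)
// are taken from the kernel names quoted in the description.
enum sketch_type : int {
    SKETCH_TYPE_Q4_K  = 12,
    SKETCH_TYPE_Q6_K  = 14,
    SKETCH_TYPE_OTHER = -1, // any other simple vec-dot quant type
};

// Assumed wavefront width for RDNA4 in this sketch.
constexpr int SKETCH_WARP_SIZE = 32;

// Warp count for the ncols_dst == 1 decode path.
// tuned == false models the previous table (everything at 8 warps);
// tuned == true models the values proposed in this PR.
int sketch_nwarps_rdna4_decode(sketch_type type, bool tuned) {
    if (!tuned) {
        return 8;
    }
    switch (type) {
        case SKETCH_TYPE_Q4_K: return 3;
        case SKETCH_TYPE_Q6_K: return 5;
        default:               return 8; // other types stay on the 8-warp path
    }
}

// Threads per block implied by a warp count, under the assumed warp size.
int sketch_block_threads(int nwarps) {
    assert(nwarps > 0);
    return nwarps * SKETCH_WARP_SIZE;
}
```

Under the assumed 32-lane width, the tuned values shrink the decode launch from 256 threads per block to 96 for Q4_K and 160 for Q6_K.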
Benchmark results
Tested on AMD gfx1200 GPU with:
Granite 3B Code Base 2K Q4_K_M
Model: granite-3b-code-base-2k-q4_k_m.gguf
AMD gfx1200 GPU results:
Qwen2.5 1.5B Instruct Q4_K_M
Model: Qwen/Qwen2.5-1.5B-Instruct-GGUF, file qwen2.5-1.5b-instruct-q4_k_m.gguf
AMD gfx1200 GPU baseline RDNA4 Q4_K/Q6_K = 8 warps:
AMD gfx1200 GPU tuned RDNA4 Q4_K = 3, Q6_K = 5:
Approximate decode improvement: +36.0%
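For reference, a relative decode improvement of this kind is normally computed as the tuned throughput over the baseline throughput, minus one. The helper below is a hypothetical sketch of that arithmetic; the function name is invented here and no benchmark numbers from the PR are encoded in it.

```cpp
#include <cassert>

// Relative throughput change in percent: (tuned / baseline - 1) * 100.
// Inputs are tokens per second as reported by a benchmark run.
double sketch_improvement_pct(double baseline_tps, double tuned_tps) {
    assert(baseline_tps > 0.0); // guard against division by zero
    return (tuned_tps / baseline_tps - 1.0) * 100.0;
}
```

With placeholder inputs of 100.0 and 136.0 tokens per second (illustrative values, not measurements from this PR), the helper returns approximately 36.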
Effect on models
The change was validated on two different Q4_K_M GGUF models/architectures on AMD gfx1200 GPU:
Granite 3B Code Base 2K Q4_K_M
Qwen2.5 1.5B Instruct Q4_K_M
Both contain many Q4_K tensors and some Q6_K tensors, so they exercise the affected MMVQ paths. The measured effect is concentrated in token generation (tg128), matching the targeted ncols_dst == 1 decode path. Prompt-processing throughput (pp512) is effectively unchanged on the Qwen validation model, which is expected because this change targets single-token decode geometry rather than bulk prompt processing.
Validation
cmake --build build --config Release --target llama-bench llama-simple -j 16
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../granite-3b-code-base-2k-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 10 -o md
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../qwen2.5-1.5b-instruct-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 7 -o md
HIP_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/llama-simple -m qwen2.5-1.5b-instruct-q4_k_m.gguf -n 128 -ngl 999
A focused code review also checked the non-power-of-two warp counts (3 and 5) for launch/reduction/indexing assumptions and found no high-confidence correctness issues.
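To illustrate the kind of assumption such a review looks for, the host-side sketch below contrasts two ways of combining one partial sum per warp. It is not the MMVQ kernel's actual reduction; both function names are hypothetical. A plain loop over the partials is correct for any warp count, while a halving tree that assumes a power-of-two count silently drops trailing elements when given 3 or 5 partials.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum one partial result per warp with a plain loop.
// Correct for any number of warps, including 3 and 5.
float sketch_reduce_linear(const std::vector<float> & partial) {
    float sum = 0.0f;
    for (float p : partial) {
        sum += p;
    }
    return sum;
}

// Halving tree reduction that implicitly assumes a power-of-two count.
// For other counts, elements beyond the largest covered index are never added.
float sketch_reduce_pow2_only(std::vector<float> partial) {
    assert(!partial.empty());
    for (std::size_t stride = partial.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            partial[i] += partial[i + stride];
        }
    }
    return partial[0];
}
```

With three partials {1, 2, 4} the linear version returns 7 while the halving version returns 3, which is the class of bug a non-power-of-two warp count could expose if a kernel relied on that pattern.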