ggml: tune RDNA4 MMVQ warps for K-quants #24386

Open: ammarwa wants to merge 1 commit into ggml-org:master from ammarwa:amd-rdna4-mmvq-warp-tuning

ammarwa commented Jun 10, 2026

Summary

  • Tune the AMD RDNA4 (gfx1200) MMVQ warp counts for Q4_K and Q6_K in the ncols_dst == 1 path, instead of grouping them with the simple vec-dot quant types at 8 warps.
  • Improves single-token decode throughput on AMD gfx1200 for Q4_K_M GGUF models while leaving the warp counts for other quant types unchanged.
  • This fix was generated and validated using the AMD ROCm PerfXpert AI Tool.

Target GPU

This optimization was developed, profiled, and benchmarked on an AMD gfx1200 GPU using the ROCm/HIP backend. The measured results below are specifically for AMD gfx1200; other AMD GPU architectures may need their own tuning.

Why this improves performance

For Q4_K_M decode workloads on AMD RDNA4 / AMD gfx1200 GPU, profiling showed that the dominant kernels are the HIP MMVQ quantized matrix-vector kernels:

  • mul_mat_vec_q<(ggml_type)12, 1, true, false> for Q4_K
  • mul_mat_vec_q<(ggml_type)14, 1, true, false> for Q6_K

These kernels run in the ncols_dst == 1 path, which is used heavily during token generation. The previous RDNA4 parameter table grouped Q4_K and Q6_K with the other simple vec-dot quant types at nwarps = 8. PerfXpert-guided testing on AMD gfx1200 showed that 8 warps over-provisions these K-quant decode kernels: the extra warps add overhead and resource pressure without improving throughput.

The tuned values:

case GGML_TYPE_Q4_K:
    return 3;
case GGML_TYPE_Q6_K:
    return 5;

better match the measured AMD gfx1200 execution behavior for the dominant decode kernels. Prompt processing remains effectively unchanged, while token-generation throughput improves substantially.
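As a rough illustration of what such a per-type lookup looks like, here is a minimal C++ sketch. It is not the actual ggml code: `QuantType`, `rdna4_nwarps_ncols1`, and `threads_per_block` are hypothetical names, the default of 8 merely stands in for the "simple vec-dot" group described above, and the block-size arithmetic assumes each warp contributes `warp_size` threads.

```cpp
#include <cassert>

// Hypothetical stand-ins for a few quant types; this is NOT the real ggml_type enum.
enum class QuantType { Q4_0, Q8_0, Q4_K, Q6_K };

// Sketch of a per-type warp-count lookup for the RDNA4 ncols_dst == 1 path.
// Only Q4_K = 3 and Q6_K = 5 come from this PR. The default of 8 represents
// the "simple vec-dot" group and is an assumption of this sketch.
constexpr int rdna4_nwarps_ncols1(QuantType type) {
    switch (type) {
        case QuantType::Q4_K: return 3;
        case QuantType::Q6_K: return 5;
        default:              return 8;
    }
}

// Threads launched per block, assuming every warp contributes warp_size lanes.
constexpr int threads_per_block(int nwarps, int warp_size) {
    return nwarps * warp_size;
}

// Compile-time check that the tuned values match the PR text.
static_assert(rdna4_nwarps_ncols1(QuantType::Q4_K) == 3, "Q4_K should use 3 warps");
static_assert(rdna4_nwarps_ncols1(QuantType::Q6_K) == 5, "Q6_K should use 5 warps");
```

With a hypothetical warp size of 32, `threads_per_block(rdna4_nwarps_ncols1(QuantType::Q4_K), 32)` gives 96 threads per block, compared with 256 at the previous setting of 8 warps.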

Benchmark results

Tested on AMD gfx1200 GPU with:

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
  -m <model.gguf> \
  -p 512 -n 128 -ngl 999 -fa on -o md

Granite 3B Code Base 2K Q4_K_M

Model: granite-3b-code-base-2k-q4_k_m.gguf

AMD gfx1200 GPU results:

  • Baseline (single gfx1200, Release build): ~86.01 t/s (tg128)
  • Tuned RDNA4 MMVQ: 101.92 ± 0.14 t/s (tg128)
  • Approximate decode improvement: +18.5%

Qwen2.5 1.5B Instruct Q4_K_M

Model: Qwen/Qwen2.5-1.5B-Instruct-GGUF, file qwen2.5-1.5b-instruct-q4_k_m.gguf

Baseline on AMD gfx1200 (RDNA4 Q4_K and Q6_K at 8 warps):

  • pp512: 8550.27 ± 212.03 t/s
  • tg128: 140.26 ± 1.53 t/s

Tuned on AMD gfx1200 (RDNA4 Q4_K at 3 warps, Q6_K at 5 warps):

  • pp512: 8572.96 ± 195.18 t/s
  • tg128: 190.70 ± 0.27 t/s

Approximate decode improvement: +36.0%
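The quoted percentages follow from the reported mean throughputs as (tuned / baseline − 1) × 100. The small C++ sketch below reproduces that arithmetic. The helper name `percent_improvement` is hypothetical, and the calculation uses only the means, ignoring the ± spread.

```cpp
#include <cassert>
#include <cmath>

// Relative improvement in percent: (tuned / baseline - 1) * 100.
inline double percent_improvement(double baseline_tps, double tuned_tps) {
    return (tuned_tps / baseline_tps - 1.0) * 100.0;
}

// Re-derives the percentages quoted in the benchmark sections from the
// reported mean tokens-per-second values.
inline void check_reported_improvements() {
    // Granite 3B tg128: ~86.01 -> 101.92 t/s, quoted as +18.5%
    assert(std::fabs(percent_improvement(86.01, 101.92) - 18.5) < 0.1);
    // Qwen2.5 1.5B tg128: 140.26 -> 190.70 t/s, quoted as +36.0%
    assert(std::fabs(percent_improvement(140.26, 190.70) - 36.0) < 0.1);
}
```

Applied to the Qwen pp512 means (8550.27 and 8572.96 t/s), the same formula gives roughly +0.27%, which is well inside the reported ± spread of about 200 t/s.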

Effect on models

The change was validated on two different Q4_K_M GGUF models/architectures on AMD gfx1200 GPU:

  • Granite 3B Code Base 2K, LLaMA architecture
  • Qwen2.5 1.5B Instruct, Qwen2 architecture

Both contain many Q4_K tensors and some Q6_K tensors, so they exercise the affected MMVQ paths. The measured effect is concentrated in token generation (tg128), matching the targeted ncols_dst == 1 decode path. Prompt-processing throughput (pp512) is effectively unchanged on the Qwen validation model, which is expected because this change targets single-token decode geometry rather than bulk prompt processing.

Validation

  • cmake --build build --config Release --target llama-bench llama-simple -j 16
  • HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../granite-3b-code-base-2k-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 10 -o md
  • HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../qwen2.5-1.5b-instruct-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 7 -o md
  • HIP_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/llama-simple -m qwen2.5-1.5b-instruct-q4_k_m.gguf -n 128 -ngl 999

A focused code review also checked the non-power-of-two warp counts (3 and 5) for launch/reduction/indexing assumptions and found no high-confidence correctness issues.
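To illustrate the kind of assumption such a review looks for, the following host-side C++ sketch contrasts two ways of combining per-warp partial sums. It is not the MMVQ kernel code; both functions are hypothetical and run on the CPU. A linear sum is correct for any warp count, whereas a naive stride-halving tree reduction silently drops elements when the count is not a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear reduction over per-warp partial sums: correct for any warp count.
inline float reduce_linear(const std::vector<float>& partial) {
    float sum = 0.0f;
    for (float p : partial) {
        sum += p;
    }
    return sum;
}

// Naive halving tree reduction: only correct when the number of partial
// sums is a power of two, because odd leftovers are never folded in.
inline float reduce_tree_pow2_only(std::vector<float> partial) {
    for (std::size_t stride = partial.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            partial[i] += partial[i + stride];
        }
    }
    return partial.empty() ? 0.0f : partial[0];
}
```

For three partial sums {1, 2, 3}, the linear version returns 6 while the halving version returns 3. A kernel that used the second pattern would be the sort of issue that warp counts of 3 and 5 could expose.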

ammarwa requested a review from a team as a code owner on June 10, 2026 01:05
github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Jun 10, 2026