ggml: tune RDNA4 MMVQ warps for K-quants by ammarwa · Pull Request #24386 · ggml-org/llama.cpp

ammarwa · 2026-06-10T01:04:58Z

Summary

Tune the MMVQ warp counts for the ncols_dst == 1 path on AMD RDNA4 (gfx1200) GPUs: Q4_K and Q6_K now get their own values instead of sharing the 8-warp setting used for the simple vec-dot quant types.
This improves single-token decode throughput on gfx1200 for Q4_K_M GGUF models while leaving other quant types unchanged.
This change was generated and validated using the AMD ROCm PerfXpert AI Tool.

Target GPU

This optimization was developed, profiled, and benchmarked on an AMD gfx1200 GPU using the ROCm/HIP backend. The measured results below are specifically for AMD gfx1200; other AMD GPU architectures may need their own tuning.

Why this improves performance

For Q4_K_M decode workloads on AMD RDNA4 (gfx1200), profiling showed that the dominant kernels are the HIP MMVQ quantized matrix-vector kernels:

mul_mat_vec_q<(ggml_type)12, 1, true, false> for Q4_K
mul_mat_vec_q<(ggml_type)14, 1, true, false> for Q6_K

These kernels run in the ncols_dst == 1 path, which is used heavily during token generation. The previous RDNA4 parameter table grouped Q4_K and Q6_K with the other simple vec-dot quant types at nwarps = 8. PerfXpert-guided testing on gfx1200 showed that 8 warps over-provisions these K-quant decode kernels: the extra warps add overhead and pressure without improving throughput.

The tuned values:

case GGML_TYPE_Q4_K:
    return 3;
case GGML_TYPE_Q6_K:
    return 5;

better match the measured AMD gfx1200 execution behavior for the dominant decode kernels. Prompt processing remains effectively unchanged, while token-generation throughput improves substantially.
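As a rough illustration of how the tuned values slot into a per-type lookup, here is a minimal sketch. The enum, the function names, and the single `OTHER_SIMPLE_VEC_DOT` placeholder are hypothetical stand-ins, not the actual ggml identifiers or table layout; only the values 3, 5, and the previous 8 come from this PR. The 32-lane warp size used in the usage note is likewise an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for the quant types involved; not the real ggml_type enum.
enum class quant_type { OTHER_SIMPLE_VEC_DOT, Q4_K, Q6_K };

// Sketch of an RDNA4 warp-count lookup for the ncols_dst == 1 path.
// Q4_K and Q6_K get the tuned values from this PR; the placeholder type
// keeps the 8 warps that Q4_K and Q6_K previously shared with it.
constexpr int rdna4_nwarps_ncols1(quant_type type) {
    switch (type) {
        case quant_type::Q4_K: return 3;
        case quant_type::Q6_K: return 5;
        default:               return 8;
    }
}

// Threads per block, assuming one block holds nwarps full warps.
constexpr int threads_per_block(quant_type type, int warp_size) {
    return rdna4_nwarps_ncols1(type) * warp_size;
}
```

Under the assumed warp size of 32, this gives 96 threads per block for Q4_K and 160 for Q6_K, compared with 256 at the previous 8 warps.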

Benchmark results

Tested on AMD gfx1200 GPU with:

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
  -m <model.gguf> \
  -p 512 -n 128 -ngl 999 -fa on -o md

Granite 3B Code Base 2K Q4_K_M

Model: granite-3b-code-base-2k-q4_k_m.gguf

AMD gfx1200 GPU results:

Original single-gfx1200 Release baseline: ~86.01 tg128 t/s
Tuned RDNA4 MMVQ result: 101.92 ± 0.14 tg128 t/s
Approximate decode improvement: +18.5%

Qwen2.5 1.5B Instruct Q4_K_M

Model: Qwen/Qwen2.5-1.5B-Instruct-GGUF, file qwen2.5-1.5b-instruct-q4_k_m.gguf

AMD gfx1200 GPU, baseline (RDNA4 Q4_K and Q6_K both at 8 warps):

pp512: 8550.27 ± 212.03 t/s
tg128: 140.26 ± 1.53 t/s

AMD gfx1200 GPU, tuned (RDNA4 Q4_K at 3 warps, Q6_K at 5 warps):

pp512: 8572.96 ± 195.18 t/s
tg128: 190.70 ± 0.27 t/s

Approximate decode improvement: +36.0%
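The improvement percentages quoted for both models follow directly from the tg128 figures. A small sketch of the arithmetic, with `percent_gain` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of `tuned` over `baseline`, in percent.
double percent_gain(double baseline, double tuned) {
    return (tuned / baseline - 1.0) * 100.0;
}
```

Applied to the reported numbers, `percent_gain(86.01, 101.92)` is roughly 18.5 for Granite and `percent_gain(140.26, 190.70)` is roughly 36.0 for Qwen2.5, matching the rounded figures above.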

Effect on models

The change was validated on two different Q4_K_M GGUF models/architectures on AMD gfx1200 GPU:

Granite 3B Code Base 2K, LLaMA architecture
Qwen2.5 1.5B Instruct, Qwen2 architecture

Both contain many Q4_K tensors and some Q6_K tensors, so they exercise the affected MMVQ paths. The measured effect is concentrated in token generation (tg128), matching the targeted ncols_dst == 1 decode path. Prompt-processing throughput (pp512) is effectively unchanged on the Qwen validation model (8550.27 vs. 8572.96 t/s, about +0.3%, well inside the reported spread of roughly ±200 t/s). This is expected because the change targets single-token decode geometry rather than bulk prompt processing.

Validation

cmake --build build --config Release --target llama-bench llama-simple -j 16
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../granite-3b-code-base-2k-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 10 -o md
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ../qwen2.5-1.5b-instruct-q4_k_m.gguf -p 512 -n 128 -ngl 999 -fa on -r 7 -o md
HIP_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/llama-simple -m qwen2.5-1.5b-instruct-q4_k_m.gguf -n 128 -ngl 999

A focused code review also checked the non-power-of-two warp counts (3 and 5) for launch/reduction/indexing assumptions and found no high-confidence correctness issues.
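To show the kind of assumption such a review looks for, here is a host-side sketch contrasting two ways of combining per-warp partial sums. It is not taken from the MMVQ kernel; both functions are hypothetical and only illustrate why a power-of-two assumption would matter for warp counts of 3 or 5.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear reduction over per-warp partial sums: correct for any warp count.
float reduce_linear(const std::vector<float>& partial) {
    float sum = 0.0f;
    for (float p : partial) {
        sum += p;
    }
    return sum;
}

// Halving tree reduction: only correct when the warp count is a power of two.
// For other sizes the trailing element(s) are silently dropped.
float reduce_halving(std::vector<float> partial) {
    for (std::size_t stride = partial.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            partial[i] += partial[i + stride];
        }
    }
    return partial.empty() ? 0.0f : partial[0];
}
```

With three partial sums of 1.0 each, the linear version returns 3.0 while the halving version returns 2.0. A kernel that reduced across warps in the second style would be wrong at nwarps = 3 or 5, which is why this class of assumption is worth checking.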

ggml : tune RDNA4 MMVQ warps for K-quants

5e3b312

ammarwa requested a review from a team as a code owner June 10, 2026 01:05

github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Jun 10, 2026

ammarwa wants to merge 1 commit into ggml-org:master from ammarwa:amd-rdna4-mmvq-warp-tuning
