vulkan: disable FA mask_opt on GCN to improve performance by 0cc4m · Pull Request #24362 · ggml-org/llama.cpp

0cc4m · 2026-06-09T14:02:25Z

Overview

I noticed this by accident while testing some other changes. Disabling mask_opt is not a universal improvement, but it seems to be mostly positive on GCN. I don't really know why mask_opt slows things down so much there, or whether there's a good way to predict where it helps and where it doesn't. Maybe mask_opt helps on DeepSeek (GLM4.7 Flash) because of the large head sizes? Let me know if you have ideas, @jeffbolznv.

AMD Radeon Pro VII on Linux RADV:

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| ------------------------------ | ---------: | ---------: | --: | --: | --------------: | -------------------: | -------------------: | -----: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | pp512 | 845.80 ± 1.20 | 855.26 ± 0.66 | +1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | pp512 @ d4096 | 455.09 ± 4.63 | 559.51 ± 0.31 | +22.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | pp512 @ d8192 | 274.00 ± 2.80 | 386.68 ± 0.59 | +41.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | pp512 | 784.35 ± 4.41 | 782.86 ± 4.07 | -0.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | pp512 @ d4096 | 436.08 ± 1.83 | 382.73 ± 1.09 | -12.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | pp512 @ d8192 | 284.13 ± 0.58 | 240.42 ± 0.41 | -15.4% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | pp512 | 722.12 ± 6.36 | 719.99 ± 4.29 | -0.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | pp512 @ d4096 | 649.81 ± 3.34 | 656.49 ± 3.78 | +1.0% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | pp512 @ d8192 | 592.58 ± 4.44 | 604.61 ± 3.73 | +2.0% |
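For anyone reproducing the diff column: it appears to be the relative change of the mean t/s, ignoring the ± spread. A minimal sketch, using the llama 8B rows from the table above (the helper name `percent_diff` is only illustrative):

```python
def percent_diff(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (test, mean t/s before, mean t/s after) for llama 8B Q4_0, copied from the table.
rows = [
    ("pp512", 845.80, 855.26),
    ("pp512 @ d4096", 455.09, 559.51),
    ("pp512 @ d8192", 274.00, 386.68),
]

for test, before, after in rows:
    # Formats with an explicit sign and one decimal, like the diff column.
    print(f"{test}: {percent_diff(before, after):+.1f}%")
```

Since only the means are compared, small values such as the ±1% rows are probably within run-to-run noise.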

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES

jeffbolznv · 2026-06-09T14:15:38Z

Do you know whether the extra time is in the mask_opt shader itself or in the flash attention shader? There might be things we could do to speed up the mask_opt shader. If it's the FA shader then it might be occupancy or something.

0cc4m · 2026-06-09T14:24:31Z

It's the mask_opt shader itself. I have tried a few things to speed it up (more workgroups, less work per thread, use shmem to reduce instead of subgroup functions, use one subgroup of 64 without barriers), but without success. It might also just be dispatch overhead + pipeline barrier?
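For readers following along: this thread does not spell out what the mask_opt pass computes. Assuming it is a pre-pass that classifies blocks of the attention mask so the FA shader can skip fully masked blocks, the general idea can be sketched on the CPU as below. The tile sizes, the three class names, and both function names are hypothetical and not taken from the Vulkan shader.

```python
import math

# Hypothetical tile classes; the real shader's encoding may differ.
ALL_ZERO, ALL_NEG_INF, MIXED = 0, 1, 2


def classify_tiles(mask, tile_rows, tile_cols):
    """Classify each tile of a 2-D additive attention mask."""
    n_rows, n_cols = len(mask), len(mask[0])
    out = []
    for r0 in range(0, n_rows, tile_rows):
        row_out = []
        for c0 in range(0, n_cols, tile_cols):
            # Gather the tile's values, clamping at the mask edges.
            values = [
                mask[r][c]
                for r in range(r0, min(r0 + tile_rows, n_rows))
                for c in range(c0, min(c0 + tile_cols, n_cols))
            ]
            if all(v == 0.0 for v in values):
                row_out.append(ALL_ZERO)      # mask has no effect on this tile
            elif all(v == -math.inf for v in values):
                row_out.append(ALL_NEG_INF)   # tile contributes nothing to attention
            else:
                row_out.append(MIXED)         # mask must be applied per element
        out.append(row_out)
    return out


def causal_mask(n):
    """Additive causal mask: query q may attend to keys k <= q."""
    return [[0.0 if k <= q else -math.inf for k in range(n)] for q in range(n)]


print(classify_tiles(causal_mask(4), 2, 2))
```

Under that assumption the trade-off discussed here is that the classification is a separate dispatch with its own pipeline barrier, so it only pays off when the FA shader saves more time on skipped tiles than the pre-pass costs.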

engrtipusultan · 2026-06-11T06:53:51Z

Hi Ruben, the following are the results on Vega 8, Ubuntu 24.04.

Master

tipu-dev-machine ~/Development/GH/llama.cpp/build/bin master 09:52:10

./llama-bench -m /home/tipu/AI/models/unsloth/gemma-4-26B-A4B/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -m /home/tipu/AI/models/other/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3

| model                          |       size |     params | backend    | threads | n_ubatch |  fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | --: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        131.70 ± 0.49 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         16.51 ± 0.39 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |        100.05 ± 1.32 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         14.30 ± 0.31 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        115.69 ± 0.70 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.14 ± 0.17 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         99.20 ± 0.52 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         16.11 ± 0.01 |

build: ac4cdde (9592)

PR

tipu-dev-machine ~/Development/GH/llama.cpp/build/bin 0cc4m/vulkan-fa-mask-opt-gcn 11:44:30

./llama-bench -m /home/tipu/AI/models/unsloth/gemma-4-26B-A4B/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -m /home/tipu/AI/models/other/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3

| model                          |       size |     params | backend    | threads | n_ubatch |  fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | --: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        132.04 ± 0.85 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.74 ± 0.09 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         98.55 ± 0.48 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         15.17 ± 0.02 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        113.36 ± 2.94 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.24 ± 0.01 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         92.06 ± 0.16 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         16.02 ± 0.05 |

build: 6c2cbc4 (9582)
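The two runs above have no diff column. A minimal sketch of how the markdown output of two llama-bench runs could be joined and compared is shown below; `parse_bench` is a hypothetical helper, and the sample strings contain only the two gemma4 `d8096` rows from each table, with the padding collapsed.

```python
def parse_bench(markdown: str) -> dict:
    """Map (model, test) to mean t/s from a llama-bench markdown table."""
    results = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip non-table lines, the header row, and the separator row.
        if len(cells) < 3 or cells[0] == "model" or set(cells[0]) <= set("-: "):
            continue
        # The last column looks like "131.70 ± 0.49"; keep only the mean.
        mean = float(cells[-1].split("±")[0])
        results[(cells[0], cells[-2])] = mean
    return results


master = """
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --: | ---: | ---: | ---: |
| gemma4 26B.A4B Q4_0 | 13.26 GiB | 25.23 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 @ d8096 | 100.05 ± 1.32 |
| gemma4 26B.A4B Q4_0 | 13.26 GiB | 25.23 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 @ d8096 | 14.30 ± 0.31 |
"""

pr = """
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --: | ---: | ---: | ---: |
| gemma4 26B.A4B Q4_0 | 13.26 GiB | 25.23 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 @ d8096 | 98.55 ± 0.48 |
| gemma4 26B.A4B Q4_0 | 13.26 GiB | 25.23 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 @ d8096 | 15.17 ± 0.02 |
"""

after_by_key = parse_bench(pr)
for key, before in parse_bench(master).items():
    after = after_by_key[key]
    print(f"{key[1]}: {before:.2f} -> {after:.2f} ({(after - before) / before * 100:+.1f}%)")
```

With `-r 3` each mean comes from only three repetitions, so differences of a percent or two are hard to separate from noise.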

vulkan: disable FA mask_opt on GCN to improve performance

6c2cbc4

0cc4m requested a review from a team as a code owner June 9, 2026 14:02

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 9, 2026

0cc4m wants to merge 1 commit into master from 0cc4m/vulkan-fa-mask-opt-gcn
