vulkan: disable FA mask_opt on GCN to improve performance#24362

Open
0cc4m wants to merge 1 commit into master from 0cc4m/vulkan-fa-mask-opt-gcn


0cc4m commented Jun 9, 2026

Contributor

Overview

I noticed this by accident while testing some other changes. Disabling mask_opt is not a universal improvement, but it seems to be mostly positive on GCN. I don't really know why mask_opt slows GCN down so much, or whether there's a good way to predict where it helps and where it doesn't. Maybe mask_opt helps on DeepSeek (GLM4.7 Flash) because of the large head sizes? Let me know if you have ideas @jeffbolznv.

AMD Radeon Pro VII on Linux RADV:

| model | size | params | ngl | fa | mmap | test | t/s (before) | t/s (after) | diff |
| ------------------------------ | ---------: | ---------: | --: | --: | ---: | --------------: | -------------------: | -------------------: | -----: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | 0 | pp512 | 845.80 ± 1.20 | 855.26 ± 0.66 | +1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | 0 | pp512 @ d4096 | 455.09 ± 4.63 | 559.51 ± 0.31 | +22.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | -1 | 1 | 0 | pp512 @ d8192 | 274.00 ± 2.80 | 386.68 ± 0.59 | +41.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | 0 | pp512 | 784.35 ± 4.41 | 782.86 ± 4.07 | -0.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | 0 | pp512 @ d4096 | 436.08 ± 1.83 | 382.73 ± 1.09 | -12.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | -1 | 1 | 0 | pp512 @ d8192 | 284.13 ± 0.58 | 240.42 ± 0.41 | -15.4% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | 0 | pp512 | 722.12 ± 6.36 | 719.99 ± 4.29 | -0.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | 0 | pp512 @ d4096 | 649.81 ± 3.34 | 656.49 ± 3.78 | +1.0% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.74 GiB | 34.66 B | -1 | 1 | 0 | pp512 @ d8192 | 592.58 ± 4.44 | 604.61 ± 3.73 | +2.0% |

Requirements

@0cc4m 0cc4m requested a review from a team as a code owner June 9, 2026 14:02
@github-actions github-actions Bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 9, 2026
@jeffbolznv

Contributor

Do you know whether the extra time is in the mask_opt shader itself or in the flash attention shader? There might be things we could do to speed up the mask_opt shader. If it's the FA shader then it might be occupancy or something.

0cc4m commented Jun 9, 2026

Contributor Author

It's the mask_opt shader itself. I have tried a few things to speed it up (more workgroups, less work per thread, use shmem to reduce instead of subgroup functions, use one subgroup of 64 without barriers), but without success. It might also just be dispatch overhead + pipeline barrier?
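For context, a mask pre-pass of this kind generally scans the attention mask in tiles and records, per tile, whether every entry is -inf (the tile can be skipped) or every entry is zero (the mask add can be skipped). The Python sketch below only illustrates that idea. It is an assumption about the general technique, not a transcription of the Vulkan shader, and the tile sizes, the `classify_mask_tiles` name and the `MIXED`/`ALL_NEG_INF`/`ALL_ZERO` codes are all hypothetical.

```python
import math

# Hypothetical tile codes, chosen for this illustration only.
MIXED, ALL_NEG_INF, ALL_ZERO = 0, 1, 2

def classify_mask_tiles(mask, tile_rows, tile_cols):
    """Classify each tile of a 2D mask (list of lists of floats)."""
    n_rows, n_cols = len(mask), len(mask[0])
    out = []
    for r0 in range(0, n_rows, tile_rows):
        row_codes = []
        for c0 in range(0, n_cols, tile_cols):
            # Gather the tile's values, clamping at the mask edges.
            tile = [mask[r][c]
                    for r in range(r0, min(r0 + tile_rows, n_rows))
                    for c in range(c0, min(c0 + tile_cols, n_cols))]
            if all(v == -math.inf for v in tile):
                row_codes.append(ALL_NEG_INF)  # nothing to attend to
            elif all(v == 0.0 for v in tile):
                row_codes.append(ALL_ZERO)     # mask has no effect
            else:
                row_codes.append(MIXED)        # must read the mask
        out.append(row_codes)
    return out

# Causal mask for 4 queries x 4 keys: query q may attend to keys 0..q.
causal = [[0.0 if k <= q else -math.inf for k in range(4)] for q in range(4)]
codes = classify_mask_tiles(causal, tile_rows=2, tile_cols=2)
```

Since such a pre-pass would run as its own dispatch, its cost (plus any barrier before the attention dispatch) has to be smaller than the work it saves, which is where the overhead question above comes from.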

@engrtipusultan


Hi Ruben, here are the results on a Vega 8 (Ubuntu 24.04).

Master

tipu-dev-machine ~/Development/GH/llama.cpp/build/bin (master) 09:52:10

./llama-bench -m /home/tipu/AI/models/unsloth/gemma-4-26B-A4B/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -m /home/tipu/AI/models/other/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model                          |       size |     params | backend    | threads | n_ubatch |  fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | --: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        131.70 ± 0.49 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         16.51 ± 0.39 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |        100.05 ± 1.32 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         14.30 ± 0.31 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        115.69 ± 0.70 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.14 ± 0.17 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         99.20 ± 0.52 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         16.11 ± 0.01 |

build: ac4cdde (9592)

PR

tipu-dev-machine ~/Development/GH/llama.cpp/build/bin (0cc4m/vulkan-fa-mask-opt-gcn) 11:44:30

./llama-bench -m /home/tipu/AI/models/unsloth/gemma-4-26B-A4B/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -m /home/tipu/AI/models/other/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model                          |       size |     params | backend    | threads | n_ubatch |  fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | --: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        132.04 ± 0.85 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.74 ± 0.09 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         98.55 ± 0.48 |
| gemma4 26B.A4B Q4_0            |  13.26 GiB |    25.23 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         15.17 ± 0.02 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           pp512 |        113.36 ± 2.94 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |           tg128 |         17.24 ± 0.01 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   pp512 @ d8096 |         92.06 ± 0.16 |
| qwen35moe 35B.A3B all F32 (guessed) |  17.32 GiB |    35.51 B | Vulkan,BLAS |       8 |     1088 |   1 |    0 |   tg128 @ d8096 |         16.02 ± 0.05 |

build: 6c2cbc4 (9582)
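To make the two Vega 8 runs easier to compare, the relative change can be computed the same way as the diff column in the PR description, i.e. (after - before) / before. The sketch below is only an illustration: the mean t/s values are copied from the two llama-bench tables above, while the `results` layout and the `pct_change` helper are names made up for this example and are not part of llama-bench.

```python
# Mean t/s copied from the two llama-bench tables above: (master, PR).
results = {
    ("gemma4 26B.A4B", "pp512"): (131.70, 132.04),
    ("gemma4 26B.A4B", "tg128"): (16.51, 17.74),
    ("gemma4 26B.A4B", "pp512 @ d8096"): (100.05, 98.55),
    ("gemma4 26B.A4B", "tg128 @ d8096"): (14.30, 15.17),
    ("qwen35moe 35B.A3B", "pp512"): (115.69, 113.36),
    ("qwen35moe 35B.A3B", "tg128"): (17.14, 17.24),
    ("qwen35moe 35B.A3B", "pp512 @ d8096"): (99.20, 92.06),
    ("qwen35moe 35B.A3B", "tg128 @ d8096"): (16.11, 16.02),
}

def pct_change(before: float, after: float) -> float:
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0

# Print one line per (model, test) pair with a signed percentage.
for (model, test), (before, after) in results.items():
    print(f"{model:18s} {test:14s} {pct_change(before, after):+6.1f}%")
```

The reported ± deviations are ignored here, so small changes should be read with the run-to-run noise in mind.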
