vulkan: disable FA mask_opt on GCN to improve performance #24362
Do you know whether the extra time is in the mask_opt shader itself or in the flash attention shader? There might be things we could do to speed up the mask_opt shader. If it's the FA shader then it might be occupancy or something.

It's the mask_opt shader itself. I have tried a few things to speed it up (more workgroups, less work per thread, use shmem to reduce instead of subgroup functions, use one subgroup of 64 without barriers), but without success. It might also just be dispatch overhead + pipeline barrier?

Hi Ruben, the following are the results on Vega 8, Ubuntu 24.04.
Master
tipu-dev-machine ~/Development/GH/llama.cpp/build/bin master ≡ 09:52:10
ggml_vulkan: Found 1 Vulkan devices:
build: ac4cdde (9592)
PR
tipu-dev-machine ~/Development/GH/llama.cpp/build/bin 0cc4m/vulkan-fa-mask-opt-gcn ≡ 11:44:30
ggml_vulkan: Found 1 Vulkan devices:
build: 6c2cbc4 (9582)
Overview
I accidentally noticed this while testing some other changes. It's not a universal improvement, but it mostly seems to be positive on GCN. I don't really know why mask_opt slows things down so much there, or if there's a good way to predict where it helps and where it doesn't. Maybe it helps on DeepSeek (GLM4.7 Flash) because of the large head sizes? Let me know if you have ideas @jeffbolznv.
AMD Radeon Pro VII on Linux RADV:
Requirements