vulkan: add Intel Xe flash attention optimization kernels (2/3, Xe-LPG Plus/Xe2/Xe3) #24406

Draft: fish-jiang wants to merge 2 commits into ggml-org:master from fish-jiang:intel/xe-flash-attn

@fish-jiang fish-jiang commented Jun 10, 2026

Overview

Co-authors: @jxia4intel, @sliu39

PR 2/3 of the Intel Xe optimization series — see #24408 (mega PR, draft) for the full feature set.

Target platforms: Xe-LPG Plus, Xe2, Xe3

This PR adds Intel Xe-specific flash attention optimization kernels for both the ARLH iGPU (Xe1, UMA, coopmat1) and Xe2/Xe3. Dependency: it builds on #24404 (Xe-LPG Plus coopmat1 enable) and is independent of #24407 (GEMM+CW).

Flash Attention (Intel Xe)

  • New Vulkan shaders: single-phase prefill (flash_attn_hdim64/96/128) and two-phase split prefill/decode variants
  • Pipelines keyed by (head_dim, gqa_ratio) for runtime dispatch across various GQA ratios without combinatorial pipeline proliferation
  • Supports non-power-of-two GQA ratios via subgroup splitting (qk_groups)
  • Intel Xe1 (integrated GPU, UMA, cooperative matrix) and Xe2 paths with separate warptile tuning
  • Two-phase decode splits softmax reduction across subgroups; shared QK state copy (fa_copy_qstate) between prefill phases
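For readers unfamiliar with split softmax reductions, the two-phase idea in the last bullet can be sketched on the CPU. This is a generic flash-attention-style merge written in scalar C++, not the shader code from this PR: `PartialState`, `accumulate`, `merge` and `finalize` are hypothetical names, each "subgroup" is modelled as a plain loop over a slice of the KV range, and all scores are assumed finite (no masking).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical partial attention state for one query row, as one subgroup
// might hold it after processing its slice of the KV range.
struct PartialState {
    float m;               // maximum score seen so far
    float l;               // sum of exp(score - m) over the slice
    std::vector<float> o;  // unnormalized output: sum of exp(score - m) * V row
};

// Phase 1: fold scores[begin, end) and the matching value rows into a state.
PartialState accumulate(const std::vector<float>& scores,
                        const std::vector<std::vector<float>>& values,
                        std::size_t begin, std::size_t end, std::size_t head_dim) {
    PartialState s{-std::numeric_limits<float>::infinity(), 0.0f,
                   std::vector<float>(head_dim, 0.0f)};
    for (std::size_t i = begin; i < end; ++i) {
        const float m_new = std::max(s.m, scores[i]);
        const float rescale = std::exp(s.m - m_new);  // 0 on the first element
        const float p = std::exp(scores[i] - m_new);
        s.l = s.l * rescale + p;
        for (std::size_t d = 0; d < head_dim; ++d) {
            s.o[d] = s.o[d] * rescale + p * values[i][d];
        }
        s.m = m_new;
    }
    return s;
}

// Phase 2: combine two partial states by rescaling both to a common maximum.
// Assumes at least one of the two states covers a non-empty slice.
PartialState merge(const PartialState& a, const PartialState& b) {
    assert(a.o.size() == b.o.size());
    const float m = std::max(a.m, b.m);
    const float wa = std::exp(a.m - m);
    const float wb = std::exp(b.m - m);
    PartialState r{m, a.l * wa + b.l * wb, std::vector<float>(a.o.size(), 0.0f)};
    for (std::size_t d = 0; d < r.o.size(); ++d) {
        r.o[d] = a.o[d] * wa + b.o[d] * wb;
    }
    return r;
}

// Final step: divide by the softmax denominator to get the attention output.
std::vector<float> finalize(const PartialState& s) {
    std::vector<float> out(s.o.size());
    for (std::size_t d = 0; d < out.size(); ++d) {
        out[d] = s.o[d] / s.l;
    }
    return out;
}
```

Because `merge` only needs the `(m, l, o)` triple from each slice, the slices do not have to be equal in size. Splitting four keys as 1 + 3 and merging gives the same result as a single pass, up to floating-point rounding.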

Performance (Panther Lake B390 + Windows OS)

BEFORE (prompt processing, pp8192):
C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>llama-bench.exe -p 8192 -n 0 -r 3 -fa 1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-0.6B.Q4_K_M\Qwen3-0.6B.Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3.5-4B-Q4_K_M\Qwen3.5-4B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-cpu-alderlake.dll
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |          pp8192 |        397.28 ± 1.83 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |          pp8192 |        557.55 ± 2.70 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |          pp8192 |        341.19 ± 1.37 |
| qwen3 0.6B Q4_K - Medium       | 456.11 MiB |   751.63 M | Vulkan     |  99 |   1 |          pp8192 |        802.22 ± 6.66 |
| qwen35 4B Q4_K - Medium        |   2.54 GiB |     4.21 B | Vulkan     |  99 |   1 |          pp8192 |       723.48 ± 11.85 |

build: 3571fa543 (9490)


AFTER (prompt processing, pp8192):
C:\upsteaming_build\subPR3_FA\Release>llama-bench.exe -p 8192 -n 0 -r 3 -fa 1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-0.6B.Q4_K_M\Qwen3-0.6B.Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3.5-4B-Q4_K_M\Qwen3.5-4B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |          pp8192 |       551.12 ± 30.16 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |          pp8192 |        819.09 ± 7.08 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |          pp8192 |        624.12 ± 2.05 |
| qwen3 0.6B Q4_K - Medium       | 456.11 MiB |   751.63 M | Vulkan     |  99 |   1 |          pp8192 |      3395.03 ± 29.09 |
| qwen35 4B Q4_K - Medium        |   2.54 GiB |     4.21 B | Vulkan     |  99 |   1 |          pp8192 |      1031.70 ± 44.36 |

build: 472f80478 (9492)


BEFORE (token generation, tg128 at depth 8192):
C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>llama-bench.exe -p 0 -n 128 -d 8192 -r 3 -fa 1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-0.6B.Q4_K_M\Qwen3-0.6B.Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3.5-4B-Q4_K_M\Qwen3.5-4B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-cpu-alderlake.dll
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         25.72 ± 0.49 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         25.27 ± 0.06 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         20.72 ± 0.11 |
| qwen3 0.6B Q4_K - Medium       | 456.11 MiB |   751.63 M | Vulkan     |  99 |   1 |   tg128 @ d8192 |         56.30 ± 0.06 |
| qwen35 4B Q4_K - Medium        |   2.54 GiB |     4.21 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         26.47 ± 0.52 |

build: 3571fa543 (9490)


AFTER (token generation, tg128 at depth 8192):
C:\upsteaming_build\subPR3_FA\Release>llama-bench.exe -p 0 -n 128 -d 8192 -r 3 -fa 1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-0.6B.Q4_K_M\Qwen3-0.6B.Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3.5-4B-Q4_K_M\Qwen3.5-4B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         27.44 ± 0.19 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         31.86 ± 0.21 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         22.20 ± 0.11 |
| qwen3 0.6B Q4_K - Medium       | 456.11 MiB |   751.63 M | Vulkan     |  99 |   1 |   tg128 @ d8192 |         60.08 ± 0.15 |
| qwen35 4B Q4_K - Medium        |   2.54 GiB |     4.21 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         27.79 ± 0.60 |

build: 472f80478 (9492)
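To make the tables easier to compare, the speedups can be derived from the mean t/s columns. The helper below is only a convenience sketch: `BenchRow`, `speedup`, `prefill_rows` and `decode_rows` are hypothetical names, the numbers are copied from the four tables above, and the ratio ignores the reported ± spread (each run used only `-r 3` repetitions).

```cpp
#include <cassert>
#include <string>
#include <vector>

// One model's mean throughput (tokens/s) before and after this PR.
struct BenchRow {
    std::string model;
    double before_tps;
    double after_tps;
};

// Speedup as the ratio of mean throughputs.
double speedup(const BenchRow& row) {
    assert(row.before_tps > 0.0);
    return row.after_tps / row.before_tps;
}

// Mean t/s from the pp8192 tables.
std::vector<BenchRow> prefill_rows() {
    return {
        {"qwen35moe 35B.A3B", 397.28, 551.12},
        {"gpt-oss 20B",       557.55, 819.09},
        {"gemma4 26B.A4B",    341.19, 624.12},
        {"qwen3 0.6B",        802.22, 3395.03},
        {"qwen35 4B",         723.48, 1031.70},
    };
}

// Mean t/s from the tg128 @ d8192 tables.
std::vector<BenchRow> decode_rows() {
    return {
        {"qwen35moe 35B.A3B", 25.72, 27.44},
        {"gpt-oss 20B",       25.27, 31.86},
        {"gemma4 26B.A4B",    20.72, 22.20},
        {"qwen3 0.6B",        56.30, 60.08},
        {"qwen35 4B",         26.47, 27.79},
    };
}
```

Calling `speedup` on each row gives roughly 1.39x, 1.47x, 1.83x, 4.23x and 1.43x for prompt processing, and roughly 1.07x, 1.26x, 1.07x, 1.07x and 1.05x for token generation, in the order listed. The gain is concentrated in prefill, with the smallest model (qwen3 0.6B) benefiting most.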

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Claude Code was used, followed by extensive manual review and tweaking.

fish-jiang and others added 2 commits June 10, 2026 17:18
…PG Plus (1/3, Xe1-ARLH)

Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
…G Plus/Xe2/Xe3)

Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
@fish-jiang fish-jiang requested a review from a team as a code owner June 10, 2026 09:32
@fish-jiang fish-jiang marked this pull request as draft June 10, 2026 09:32
@github-actions github-actions Bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 10, 2026
ggml-gh-bot Bot commented Jun 10, 2026

Hi @fish-jiang, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
