vulkan: add Intel Xe flash attention optimization kernels (2/3, Xe-LPG Plus/Xe2/Xe3)#24406
fish-jiang wants to merge 2 commits
…PG Plus (1/3, Xe1-ARLH) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>
…G Plus/Xe2/Xe3) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>
Overview
Co-authors: @jxia4intel, @sliu39
PR 2/3 of the Intel Xe optimization series — see #24408 (mega PR, draft) for the full feature set.
Target platforms: Xe-LPG Plus, Xe2, Xe3
This PR adds Intel Xe-specific flash attention optimization kernels for both the ARLH iGPU (Xe1, UMA, coopmat1) and Xe2/Xe3. Dependency: builds on #24404 (enables coopmat1 on Xe-LPG Plus). Independent of #24407 (GEMM+CW).
Flash Attention (Intel Xe)
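The feature list that follows mentions dispatching on a (head_dim, gqa_ratio) pair across GQA ratios "without combinatorial pipeline proliferation". The C++ sketch below shows one generic way to get that property: a lazily populated pipeline cache. It is not the PR's implementation. FaPipeline, FaPipelineCache, gqa_ratio_of and the "flash_attn_generic" fallback are hypothetical names. The sketch also assumes both key fields are fixed at pipeline-creation time; a real Vulkan backend would create a VkPipeline with specialization constants instead of filling a plain struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a compiled Vulkan compute pipeline.
struct FaPipeline {
    std::string shader;       // kernel the pipeline was built from
    std::uint32_t head_dim;   // assumed fixed at pipeline creation
    std::uint32_t gqa_ratio;  // assumed fixed at pipeline creation
};

// GQA ratio = query heads per KV head (e.g. 32 Q heads / 8 KV heads = 4).
inline std::uint32_t gqa_ratio_of(std::uint32_t n_head_q, std::uint32_t n_head_kv) {
    assert(n_head_kv > 0 && n_head_q % n_head_kv == 0);
    return n_head_q / n_head_kv;
}

// Hypothetical cache keyed by (head_dim, gqa_ratio). Pipelines are built on
// first request, so only combinations a model actually uses get compiled,
// rather than the full cross product of head dims and GQA ratios up front.
class FaPipelineCache {
public:
    const FaPipeline& get(std::uint32_t head_dim, std::uint32_t gqa_ratio) {
        assert(gqa_ratio >= 1);
        const std::pair<std::uint32_t, std::uint32_t> key{head_dim, gqa_ratio};
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // First request for this key: build once and keep it.
            it = cache_.emplace(key, build(head_dim, gqa_ratio)).first;
        }
        return it->second;  // std::map references stay valid across inserts
    }

    std::size_t size() const { return cache_.size(); }

private:
    static FaPipeline build(std::uint32_t head_dim, std::uint32_t gqa_ratio) {
        // Names mirror the PR's flash_attn_hdim64/96/128 kernels;
        // "flash_attn_generic" is a hypothetical fallback for other head dims.
        const bool specialized = head_dim == 64 || head_dim == 96 || head_dim == 128;
        std::string shader = specialized
            ? "flash_attn_hdim" + std::to_string(head_dim)
            : std::string("flash_attn_generic");
        return FaPipeline{shader, head_dim, gqa_ratio};
    }

    std::map<std::pair<std::uint32_t, std::uint32_t>, FaPipeline> cache_;
};
```

Requesting (128, 4) twice builds a single pipeline. A model that only ever uses head_dim 128 with GQA ratio 4 never compiles any other combination. The PR may instead pass the GQA ratio at dispatch time (for example as a push constant); the text does not say which.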
- … (flash_attn_hdim64/96/128) and two-phase split prefill/decode variants
- … (head_dim, gqa_ratio) for runtime dispatch across various GQA ratios without combinatorial pipeline proliferation
- … (qk_groups)
- … (fa_copy_qstate) between prefill phases
Performance (Panther Lake B390 + Windows OS)
Requirements