vulkan: Intel Xe flash attention, GEMM optimizations, and optional weight compression (Xe-LPG Plus/Xe2/Xe3) [MEGA PR]#24408
…PG Plus (1/3, Xe1-ARLH) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>
…G Plus/Xe2/Xe3) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>
…n for Intel MoE path (3/3, Xe-LPG Plus/Xe2/Xe3) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>
Hi @fish-jiang, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Co-authors: @jxia4intel, @sliu39
Target platforms: Xe-LPG Plus (Arrow Lake-H iGPU), Xe2, Xe3
Flash Attention (Intel Xe)
- … (`flash_attn_hdim64/96/128`) and two-phase split prefill/decode variants
- … `(head_dim, gqa_ratio)` for runtime dispatch across various GQA ratios without combinatorial pipeline proliferation
- … (`qk_groups`)
- … (`fa_copy_qstate`) between prefill phases

GEMM kernel optimizations (Intel Xe)
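For readers unfamiliar with the `bitfieldExtract` built-in named in this section, here is a minimal C++ sketch of its unsigned semantics as defined by GLSL. The names `bitfield_extract_u32` and `unpack_nibble` are hypothetical and exist only for illustration. The 4-bit unpacking use case is an assumption, not a description of what the shaders in this PR actually do.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned bitfieldExtract as specified by GLSL: return `bits` bits of
// `value` starting at bit `offset`, placed in the least significant bits.
// GLSL leaves the result undefined when offset + bits > 32; this sketch
// asserts on that case instead.
inline uint32_t bitfield_extract_u32(uint32_t value, uint32_t offset, uint32_t bits) {
    assert(offset + bits <= 32);
    if (bits == 0) {
        return 0;  // GLSL: the result is zero when bits is zero
    }
    // Avoid the undefined 1u << 32 when the full word is requested.
    const uint32_t mask = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (value >> offset) & mask;
}

// Hypothetical helper (assumption): unpack the i-th 4-bit value, i in 0..7,
// from a 32-bit word that packs eight 4-bit values.
inline uint32_t unpack_nibble(uint32_t packed, uint32_t i) {
    assert(i < 8);
    return bitfield_extract_u32(packed, 4u * i, 4u);
}
```

On hardware with a native bitfield-extract instruction, the built-in can replace a separate shift and mask, which is presumably the motivation for the change listed here.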
- `LOAD_A_OPT` path: SLM-based A-matrix layout optimization for coopmat1
- … `bitfieldExtract` optimization
- … (`l_alt`/`a_l_alt`, BM=128 warptile) for runtime selection when problem dimensions are small
- `vulkan-shaders-gen.cpp`: registers all new pipeline variants

MoE optimizations (Intel Xe)
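As a CPU-side reference for the `rms_norm + mul` fusion listed in this section, the sketch below compares the two-step computation with a single fused function. It assumes the common RMS norm definition `y[i] = x[i] / sqrt(mean(x^2) + eps)` followed by an element-wise multiply. All function names are hypothetical and do not mirror the `RMS_NORM_MUL` shader itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused step 1: y[i] = x[i] / sqrt(mean(x^2) + eps).
inline std::vector<float> rms_norm(const std::vector<float>& x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}

// Unfused step 2: element-wise multiply (a second pass over memory).
inline std::vector<float> mul(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i];
    return out;
}

// Fused variant: the normalized intermediate is never written out; each
// element is scaled and multiplied by its weight in the same loop.
inline std::vector<float> rms_norm_mul(const std::vector<float>& x,
                                       const std::vector<float>& w, float eps) {
    assert(!x.empty() && x.size() == w.size());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * scale * w[i];
    return out;
}
```

On the GPU, the equivalent saving is one kernel dispatch and one round trip of the intermediate tensor through memory. The numerical result should match the two-step path up to floating-point rounding.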
- `mul_mm.comp` shader optimization for `MUL_MAT_ID`: reduces unnecessary memory loads and matrix core operations for MoE models
- `n_ubatch` auto-raised to 2048 for MoE models with flash attention on Intel Xe2 to match the optimal tile size
- Fuses `rms_norm + mul` into a single `RMS_NORM_MUL` kernel dispatch for the expert gate input calculation

Optional load-time weight compression for fast Intel MoE path (Intel Xe2+)
Performance (Windows OS)
ARLH

LNL

B70 Arc Pro


PTL


Requirements