vulkan: Intel Xe flash attention, GEMM optimizations, and optional weight compression (Xe-LPG Plus/Xe2/Xe3) [MEGA PR]#24408

Draft
fish-jiang wants to merge 3 commits into ggml-org:master from fish-jiang:intel/xe-all-opt

@fish-jiang fish-jiang commented Jun 10, 2026

Overview

Co-authors: @jxia4intel, @sliu39

Draft / Evaluation only — not for merge. This mega PR exists solely to show the full feature set in one place and will remain as a draft. Please refer to the individual PRs below for review and merging.

Target platforms: Xe-LPG Plus (Arrow Lake-H iGPU), Xe2, Xe3

| PR | Series | Description | Target |
|---|---|---|---|
| #24404 | 1/3 | Xe-LPG Plus coopmat1 enable + INTEL_PRE_XE2 enum | Xe-LPG Plus |
| #24406 | 2/3 | Intel Xe FA optimization kernels | Xe-LPG Plus, Xe2, Xe3 |
| #24407 | 3/3 | GEMM/Group GEMM optimizations + optional load-time weight compression | Xe-LPG Plus, Xe2, Xe3 |

Dependency graph:

#24404 (ARLH) ← #24406 (FA)
           ← #24407 (GEMM+CW)

#24406 and #24407 are independent of each other; both build on #24404.

Flash Attention (Intel Xe)

  • New Vulkan shaders: single-phase prefill (flash_attn_hdim64/96/128) and two-phase split prefill/decode variants
  • Pipelines keyed by (head_dim, gqa_ratio) for runtime dispatch across various GQA ratios without combinatorial pipeline proliferation
  • Supports non-power-of-two GQA ratios via subgroup splitting (qk_groups)
  • Intel Xe1 (integrated GPU, UMA, cooperative matrix) and Xe2 paths with separate warptile tuning
  • Two-phase decode splits softmax reduction across subgroups; shared QK state copy (fa_copy_qstate) between prefill phases
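
To illustrate the two-phase decode idea, here is a minimal Python sketch of a split softmax reduction: each "subgroup" processes a slice of the keys and emits a partial (max, sum, accumulator), and a second phase rescales and merges the partials. This is the generic split-softmax technique, not the PR's shader code; the function names and the way keys are sliced are assumptions made for illustration.

```python
import math

def partial_attention(scores, values):
    """Phase 1 (one subgroup): reduce a slice of keys.

    Returns (local max, sum of exponentials, unnormalized weighted sum of values).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, acc

def merge_partials(partials):
    """Phase 2: rescale every partial to the shared max, then normalize once."""
    m_all = max(m for m, _, _ in partials)
    l_all = 0.0
    out = [0.0] * len(partials[0][2])
    for m, l, acc in partials:
        scale = math.exp(m - m_all)  # correction factor for this partial's local max
        l_all += l * scale
        out = [o + a * scale for o, a in zip(out, acc)]
    return [o / l_all for o in out]

def reference_attention(scores, values):
    """Single-pass softmax-weighted sum, used only to check the merge."""
    _, l, acc = partial_attention(scores, values)
    return [a / l for a in acc]
```

Because the merge only needs the per-slice max and sum, the slices do not have to be equal in size, which is what makes an uneven split across subgroups workable in principle.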

GEMM kernel optimizations (Intel Xe)

  • LOAD_A_OPT path: SLM-based A-matrix layout optimization for coopmat1
  • MXFP4, Q4_K, Q5_K dequant via bitfieldExtract optimization
  • Alt pipeline (l_alt/a_l_alt, BM=128 warptile) for runtime selection when problem dimensions are small
  • f32→f16 activation conversion for Intel coopmat GEMM, scoped to Intel devices only
  • vulkan-shaders-gen.cpp: registers all new pipeline variants
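
As a rough illustration of the bitfieldExtract-based dequant, the sketch below mimics the unsigned form of GLSL's `bitfieldExtract(value, offset, bits)` in Python and uses it to pull eight 4-bit quants out of one 32-bit word. The nibble order, the zero point of 8, and the helper names are assumptions for illustration; the real Q4_K, Q5_K, and MXFP4 layouts are more involved.

```python
def bitfield_extract(value, offset, bits):
    """Unsigned bitfieldExtract: `bits` bits of `value`, starting at bit `offset`."""
    return (value >> offset) & ((1 << bits) - 1)

def unpack_nibbles(word):
    """Unpack eight 4-bit quants from a 32-bit word, lowest nibble first (assumed order)."""
    return [bitfield_extract(word, 4 * i, 4) for i in range(8)]

def dequant(word, scale, zero_point=8):
    """Map each 4-bit quant to a float: scale * (q - zero_point). Zero point is hypothetical."""
    return [scale * (q - zero_point) for q in unpack_nibbles(word)]
```

For example, `unpack_nibbles(0x76543210)` yields the quants 0 through 7 in order, since each hex digit occupies exactly one nibble.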

MoE optimizations (Intel Xe)

  • mul_mm.comp shader optimization for MUL_MAT_ID: reduces unnecessary memory loads and matrix core operations for MoE models
  • Separate warptile tuning for MoE expert GEMM
  • n_ubatch auto-raised to 2048 for MoE models with flash attention on Intel Xe2 to match the optimal tile size
  • Gemma4 MoE router: fuse rms_norm + mul into a single RMS_NORM_MUL kernel dispatch for the expert gate input calculation
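
The rms_norm + mul fusion can be sketched in plain Python as follows. The unfused path materializes the normalized vector and then scales it; the fused path computes the reciprocal RMS once and applies normalization and the elementwise weight in the same pass. The function names and the `eps` default are assumptions, and this shows only the arithmetic, not the shader.

```python
import math

def rms_norm(x, eps=1e-6):
    """Unfused step 1: divide by the root mean square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def mul(a, b):
    """Unfused step 2: elementwise multiply."""
    return [p * q for p, q in zip(a, b)]

def rms_norm_mul(x, w, eps=1e-6):
    """Fused: one reduction, then normalize and scale in a single pass."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]
```

The fused form avoids writing and re-reading the intermediate normalized vector, which is the usual motivation for collapsing the two ops into one dispatch.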

Optional load-time weight compression for fast Intel MoE path (Intel Xe2+)

  • This enables a fast path for specific MoE models at the cost of a slight quality reduction. To disable it, use -cw off
  • Layer eligibility requires uniform Q8_0 attn QKV and MoE expert weights < 5 bpw. At model load, eligible attention QKV tensors with bpw > 4 are downgraded to Q4_0 in memory; mmap is disabled when compression is active
  • On tested MoE models: 1) PPL degrades by <5%; 2) GSM8K is 194/200 = 97.00% with both -cw on and -cw off (Qwen3.6-35B-A3B-UD-Q4_K_M)
  • Enabled by default on Intel Xe2+; off elsewhere
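
The eligibility rule above can be sketched as a small Python check. The `Layer` structure, the function name, and the reading of "< 5 bpw" as a per-tensor test are assumptions for illustration and are not taken from the PR's code. The bits-per-weight figures follow from the ggml block sizes (for example, a Q8_0 block stores 32 weights in 34 bytes).

```python
from dataclasses import dataclass

# Bits per weight = block bytes * 8 / weights per block.
BPW = {
    "Q8_0": 34 * 8 / 32,    # 8.5
    "Q4_0": 18 * 8 / 32,    # 4.5
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
}

@dataclass
class Layer:
    """Hypothetical per-layer summary of quantization types."""
    qkv_types: tuple     # quant types of the attention Q, K, V weights
    expert_types: tuple  # quant types of the MoE expert weights

def layer_eligible(layer):
    """Eligible when all QKV weights are Q8_0 and every expert tensor is below 5 bpw."""
    uniform_q8 = all(t == "Q8_0" for t in layer.qkv_types)
    experts_small = all(BPW[t] < 5 for t in layer.expert_types)
    return uniform_q8 and experts_small

def q8_to_q4_size_ratio():
    """Size of a Q4_0 tensor relative to the same tensor in Q8_0."""
    return BPW["Q4_0"] / BPW["Q8_0"]
```

Under these block sizes, converting a Q8_0 tensor to Q4_0 shrinks it to 18/34 of its original size, a bit more than half.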

Performance (Windows OS)

ARLH: [figures: ARLH_prefill, ARLH_decode]

LNL: [figures: LNL_prefill, LNL_prefill]

B70 Arc Pro: [figures: B70_prefill, B70_decode]

PTL: [figures: PTL_prefill, PTLdecode]

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, used claude code, then lots of manual review/tweaking.

fish-jiang and others added 3 commits June 10, 2026 17:25
…PG Plus (1/3, Xe1-ARLH)

Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
…G Plus/Xe2/Xe3)

Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
…n for Intel MoE path (3/3, Xe-LPG Plus/Xe2/Xe3)

Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
@fish-jiang fish-jiang requested review from a team, CISC and ggerganov as code owners June 10, 2026 09:36
@fish-jiang fish-jiang marked this pull request as draft June 10, 2026 09:36
@github-actions github-actions Bot added the model (Model specific), Vulkan (Issues specific to the Vulkan backend), examples, and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 10, 2026
ggml-gh-bot Bot commented Jun 10, 2026

Hi @fish-jiang, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 4 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
