vulkan: Intel Xe flash attention, GEMM optimizations, and optional weight compression (Xe-LPG Plus/Xe2/Xe3) [MEGA PR] by fish-jiang · Pull Request #24408 · ggml-org/llama.cpp

fish-jiang · 2026-06-10T09:36:51Z

Overview

Draft / Evaluation only — not for merge. This mega PR exists solely to show the full feature set in one place and will remain as a draft. Please refer to the individual PRs below for review and merging.

Target platforms: Xe-LPG Plus (Arrow Lake-H iGPU), Xe2, Xe3

PR	Series	Description	Target
#24404	1/3	Xe-LPG Plus coopmat1 enable + INTEL_PRE_XE2 enum	Xe-LPG Plus
#24406	2/3	Intel Xe FA optimization kernels	Xe-LPG Plus, Xe2, Xe3
#24407	3/3	GEMM/Group GEMM optimizations + optional load-time weight compression	Xe-LPG Plus, Xe2, Xe3

Dependency graph:
#24404 (ARLH) ← #24406 (FA)
           ← #24407 (GEMM+CW)
#24406 and #24407 are independent of each other; both build on #24404.

Flash Attention (Intel Xe)

New Vulkan shaders: single-phase prefill (flash_attn_hdim64/96/128) and two-phase split prefill/decode variants
Pipelines keyed by (head_dim, gqa_ratio) for runtime dispatch across various GQA ratios without combinatorial pipeline proliferation
Supports non-power-of-two GQA ratios via subgroup splitting (qk_groups)
Intel Xe1 (integrated GPU, UMA, cooperative matrix) and Xe2 paths with separate warptile tuning
Two-phase decode splits softmax reduction across subgroups; shared QK state copy (fa_copy_qstate) between prefill phases
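
The two-phase decode item above describes splitting the softmax reduction across subgroups. As a rough CPU-side illustration of why such a split is possible (this is not the PR's shader code), the sketch below reduces each slice of the key range to a partial state (running max, exp-sum, weighted accumulator) and then merges the partial states by rescaling them to a common max. All names (`PartialSoftmax`, `reduce_slice`, `merge`, `attention_split`) are hypothetical, and a scalar value per key stands in for the per-head value vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical partial softmax state for one slice of the key range.
struct PartialSoftmax {
    float max;  // maximum score seen in the slice
    float sum;  // sum of exp(score - max) over the slice
    float acc;  // sum of exp(score - max) * value over the slice
};

// Reduce one non-empty slice [begin, end) of scores s and values v.
PartialSoftmax reduce_slice(const std::vector<float>& s, const std::vector<float>& v,
                            std::size_t begin, std::size_t end) {
    PartialSoftmax p{-std::numeric_limits<float>::infinity(), 0.0f, 0.0f};
    for (std::size_t i = begin; i < end; ++i) p.max = std::max(p.max, s[i]);
    for (std::size_t i = begin; i < end; ++i) {
        const float e = std::exp(s[i] - p.max);
        p.sum += e;
        p.acc += e * v[i];
    }
    return p;
}

// Merge two partial states by rescaling both to their common maximum.
PartialSoftmax merge(const PartialSoftmax& a, const PartialSoftmax& b) {
    const float m  = std::max(a.max, b.max);
    const float fa = std::exp(a.max - m);
    const float fb = std::exp(b.max - m);
    return {m, a.sum * fa + b.sum * fb, a.acc * fa + b.acc * fb};
}

// Softmax-weighted average of v, computed slice by slice as n_splits workers would.
float attention_split(const std::vector<float>& s, const std::vector<float>& v,
                      std::size_t n_splits) {
    assert(!s.empty() && s.size() == v.size() && n_splits > 0);
    const std::size_t n     = s.size();
    const std::size_t chunk = (n + n_splits - 1) / n_splits;  // ceil(n / n_splits)
    PartialSoftmax total = reduce_slice(s, v, 0, std::min(chunk, n));
    for (std::size_t begin = chunk; begin < n; begin += chunk) {
        total = merge(total, reduce_slice(s, v, begin, std::min(begin + chunk, n)));
    }
    return total.acc / total.sum;
}
```

Calling `attention_split` with one split or several should agree up to floating-point rounding, which is the property that lets the reduction be distributed.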

GEMM kernel optimizations (Intel Xe)

LOAD_A_OPT path: SLM-based A-matrix layout optimization for coopmat1
MXFP4, Q4_K, Q5_K dequant via bitfieldExtract optimization
Alt pipeline (l_alt/a_l_alt, BM=128 warptile) for runtime selection when problem dimensions are small
f32→f16 activation conversion for Intel coopmat GEMM, scoped to Intel devices only
vulkan-shaders-gen.cpp: registers all new pipeline variants
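
To make the bitfieldExtract item above more concrete, here is a small CPU-side sketch of the general idea: pull 4-bit quants out of a packed 32-bit word with a single extract operation per element, then apply a scale and offset. `bitfield_extract` is a stand-in for the GLSL built-in of the same purpose; `dequant_word`, the nibble order, and the `scale * q - offset` form are illustrative assumptions and do not reproduce the exact MXFP4, Q4_K, or Q5_K layouts used in the shaders.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// CPU stand-in for GLSL bitfieldExtract(uint value, int offset, int bits):
// returns bits [offset, offset + bits) of value, right-aligned.
inline std::uint32_t bitfield_extract(std::uint32_t value, int offset, int bits) {
    assert(offset >= 0 && bits >= 0 && offset + bits <= 32);
    if (bits == 32) return value;  // avoid an undefined 32-bit shift
    return (value >> offset) & ((1u << bits) - 1u);
}

// Illustrative dequantization of eight 4-bit quants packed into one word,
// lowest nibble first (an assumed order, chosen for simplicity).
inline std::array<float, 8> dequant_word(std::uint32_t packed, float scale, float offset) {
    std::array<float, 8> out{};
    for (int i = 0; i < 8; ++i) {
        const std::uint32_t q = bitfield_extract(packed, 4 * i, 4);
        out[i] = scale * static_cast<float>(q) - offset;
    }
    return out;
}
```

For example, `dequant_word(0x76543210u, 0.5f, 1.0f)` unpacks the quants 0 through 7 in order and maps them to `0.5 * q - 1`.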

MoE optimizations (Intel Xe)

mul_mm.comp shader optimization for MUL_MAT_ID: reduces unnecessary memory loads and matrix core operations for MoE models
Separate warptile tuning for MoE expert GEMM
n_ubatch auto-raised to 2048 for MoE models with flash attention on Intel Xe2 to match the optimal tile size
Gemma4 MoE router: fuse rms_norm + mul into a single RMS_NORM_MUL kernel dispatch for the expert gate input calculation
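
The rms_norm + mul fusion in the last item can be illustrated with a scalar reference. The sketch below is an assumption-laden CPU analogue, not the RMS_NORM_MUL shader: the unfused path materializes a normalized intermediate and then multiplies it, while the fused path computes `x[i] * rsqrt(mean(x^2) + eps) * w[i]` without storing the intermediate. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused step 1: RMS normalization, writing an intermediate buffer.
std::vector<float> rms_norm(const std::vector<float>& x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float xi : x) sum_sq += xi * xi;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}

// Unfused step 2: elementwise multiply, reading the intermediate back.
std::vector<float> mul(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> y(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) y[i] = a[i] * b[i];
    return y;
}

// Fused: one reduction pass plus one output pass, no intermediate buffer.
std::vector<float> rms_norm_mul(const std::vector<float>& x, const std::vector<float>& w,
                                float eps) {
    assert(!x.empty() && x.size() == w.size());
    float sum_sq = 0.0f;
    for (float xi : x) sum_sq += xi * xi;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale * w[i];
    return y;
}
```

On a GPU the benefit of fusing would come from one dispatch instead of two and from skipping the round trip of the intermediate through memory; the arithmetic itself is unchanged.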

Optional load-time weight compression for fast Intel MoE path (Intel Xe2+)

This enables a fast path for specific MoE models at the cost of a slight quality reduction. To disable it, use -cw off.
Layer eligibility requires uniform Q8_0 attention QKV weights and MoE expert weights below 5 bpw. At model load, eligible attention QKV tensors with bpw > 4 are downgraded to Q4_0 in memory; mmap is disabled when the feature is active.
On tested MoE models: 1) PPL degrades by less than 5%; 2) GSM8K scores 194/200 = 97.00% with both -cw on and -cw off (Qwen3.6-35B-A3B-UD-Q4_K_M).
Enabled by default on Intel Xe2+; off elsewhere.
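
The eligibility rule above can be read as a simple predicate over a layer's tensors. The following is only one possible reading, written as a sketch: `TensorInfo`, its fields, `layer_eligible`, and `should_downgrade` are hypothetical and do not correspond to llama.cpp structures, and requiring at least one QKV tensor and one expert tensor per layer is an added assumption. The bits-per-weight helper uses the ggml block sizes for Q8_0 (34 bytes per 32 weights, 8.5 bpw) and Q4_0 (18 bytes per 32 weights, 4.5 bpw).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-tensor summary used only for this illustration.
struct TensorInfo {
    std::string name;
    std::string type;   // quantization type name, e.g. "Q8_0"
    double bpw;         // bits per weight
    bool is_attn_qkv;   // attention Q/K/V projection weight
    bool is_moe_expert; // MoE expert weight
};

// Bits per weight of a block-quantized format.
constexpr double bits_per_weight(std::size_t block_bytes, std::size_t block_weights) {
    return 8.0 * static_cast<double>(block_bytes) / static_cast<double>(block_weights);
}

// One reading of the rule: every attention QKV tensor is Q8_0 and every
// MoE expert tensor is below 5 bpw (assumes both kinds must be present).
bool layer_eligible(const std::vector<TensorInfo>& layer) {
    bool has_qkv = false, has_expert = false;
    for (const TensorInfo& t : layer) {
        if (t.is_attn_qkv) {
            has_qkv = true;
            if (t.type != "Q8_0") return false;
        }
        if (t.is_moe_expert) {
            has_expert = true;
            if (t.bpw >= 5.0) return false;
        }
    }
    return has_qkv && has_expert;
}

// In an eligible layer, attention QKV tensors above 4 bpw are the ones
// the description says get re-quantized to Q4_0 at load time.
bool should_downgrade(const TensorInfo& t, bool layer_is_eligible) {
    return layer_is_eligible && t.is_attn_qkv && t.bpw > 4.0;
}
```

Under this reading, a Q8_0 QKV tensor at 8.5 bpw in an eligible layer would be flagged for downgrade, while the expert tensors themselves are left untouched.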

Performance (Windows OS)

[Benchmark results for ARLH, LNL, B70 Arc Pro, and PTL; figures not recoverable here]

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, used Claude Code, followed by extensive manual review and tweaking.

ggml-gh-bot · 2026-06-10T09:41:22Z

Hi @fish-jiang, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 4 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

fish-jiang and others added 3 commits June 10, 2026 17:25

7b24141 vulkan: add INTEL_PRE_XE2 arch enum and enable coopmat1 on Intel Xe-LPG Plus (1/3, Xe1-ARLH)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>

2ab4c94 vulkan: add Intel Xe flash attention optimization kernels (2/3, Xe-LPG Plus/Xe2/Xe3)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>

f9e1fd0 vulkan: GEMM/Group GEMM optimizations and load-time weight compression for Intel MoE path (3/3, Xe-LPG Plus/Xe2/Xe3)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>

fish-jiang requested review from a team, CISC and ggerganov as code owners June 10, 2026 09:36

fish-jiang marked this pull request as draft June 10, 2026 09:36

github-actions Bot added the model, Vulkan, examples, and ggml labels Jun 10, 2026

danielmayost mentioned this pull request Jun 11, 2026

Experiment Subgroup 8 for older gpus rillomas/llama.cpp#14

Draft

fish-jiang wants to merge 3 commits into ggml-org:master from fish-jiang:intel/xe-all-opt
