vulkan: GEMM/Group GEMM optimizations and optional load-time weight compression for Intel MoE path (3/3, Xe-LPG Plus/Xe2/Xe3) by fish-jiang · Pull Request #24407 · ggml-org/llama.cpp

fish-jiang · 2026-06-10T09:34:42Z

Overview

PR 3/3 of the Intel Xe optimization series — see #24408 (mega PR, draft) for the full feature set.

Target platforms: Xe-LPG Plus, Xe2, Xe3

This PR adds GEMM/Group GEMM kernel optimizations and optional load-time weight compression for the fast Intel MoE path. Dependency: builds on top of #24404 (Xe-LPG Plus coopmat1 enablement). Independent of #24406 (flash attention).

GEMM kernel optimizations (Intel Xe)

LOAD_A_OPT path: SLM-based A-matrix layout optimization for coopmat1
MXFP4, Q4_K, Q5_K dequantization optimized with bitfieldExtract
Alternative pipeline (l_alt/a_l_alt, BM=128 warptile), selected at runtime when problem dimensions are small
f32→f16 activation conversion for Intel coopmat GEMM, scoped to Intel devices only
vulkan-shaders-gen.cpp: registers all new pipeline variants
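
As a rough illustration of the bitfieldExtract item above, the following Python model mimics what the GLSL built-in does for unsigned values and uses it to pull eight 4-bit quants out of one packed 32-bit word. The helper names are hypothetical and the low-nibble-first order is an assumption made for illustration; the real MXFP4/Q4_K/Q5_K block layouts interleave values differently, and the shader code from this PR is not reproduced here.

```python
def bitfield_extract(value: int, offset: int, bits: int) -> int:
    """Reference model of GLSL bitfieldExtract for unsigned 32-bit inputs.

    Returns `bits` bits of `value`, starting at bit `offset` (bit 0 = LSB).
    """
    if bits == 0:
        return 0
    return (value >> offset) & ((1 << bits) - 1)


def unpack_nibbles(word: int) -> list:
    """Unpack eight 4-bit quants from one 32-bit word (assumed low nibble first)."""
    return [bitfield_extract(word, 4 * i, 4) for i in range(8)]


if __name__ == "__main__":
    # One extract per quant replaces a separate shift and mask in the source.
    print(unpack_nibbles(0x76543210))
```

In a shader, each call maps to a single built-in instead of a separate shift and mask, which is presumably where the dequantization saving comes from.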

MoE optimizations (Intel Xe)

mul_mm.comp shader optimization for MUL_MAT_ID: reduces unnecessary memory loads and matrix core operations for MoE models
Separate warptile tuning for MoE expert GEMM
n_ubatch auto-raised to 2048 for MoE models with flash attention on Intel Xe2 to match the optimal tile size
Gemma4 MoE router: fuse rms_norm + mul into a single RMS_NORM_MUL kernel dispatch for the expert gate input calculation
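
To make the RMS_NORM_MUL fusion above concrete, here is a minimal numerical sketch in Python, assuming the usual RMS norm definition (scale by the inverse root of the mean square plus a small epsilon). The function names and the epsilon value are assumptions for illustration, not taken from the PR; the point is only that one pass can produce the same result as a norm followed by an element-wise multiply.

```python
import math


def rms_norm(x, eps=1e-6):
    """Unfused step 1: normalize x by its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def mul(a, b):
    """Unfused step 2: element-wise multiply (a second kernel dispatch)."""
    return [u * v for u, v in zip(a, b)]


def rms_norm_mul(x, w, eps=1e-6):
    """Fused version: one pass computes the scale and applies the weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * wi for v, wi in zip(x, w)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 4.0]
    w = [1.0, 0.5, 2.0, -1.0]
    print(mul(rms_norm(x), w))
    print(rms_norm_mul(x, w))
```

On a GPU the fused form avoids writing the normalized intermediate to memory and reading it back, and saves one dispatch per router invocation.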

Optional load-time weight compression for fast Intel MoE path (Intel Xe2+)

This enables a fast path for specific MoE models at the cost of a slight quality reduction. To disable it, use -cw off.
A layer is eligible when its attention QKV weights are uniformly Q8_0 and its MoE expert weights are < 5 bpw. At model load, eligible attention QKV tensors with bpw > 4 are downgraded to Q4_0 in memory; mmap is disabled while compression is active.
On tested MoE models: 1) PPL degrades by <5%; 2) GSM8K is 194/200 = 97.00% with both -cw on and -cw off (Qwen3.6-35B-A3B-UD-Q4_K_M).
Enabled by default on Intel Xe2+; off elsewhere.
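
The eligibility rule and the Q8_0 to Q4_0 downgrade can be sketched as follows. This is a hedged Python illustration, not the PR's implementation: `LayerInfo`, `layer_is_eligible`, and the reading of the rule as "both conditions must hold" are assumptions. The bits-per-weight figures follow from the standard ggml block sizes (32 weights per block, a 2-byte scale, plus 32 or 16 bytes of quants), and the quantizer loosely follows the ggml reference Q4_0 scheme with the f16 rounding of the scale omitted.

```python
from dataclasses import dataclass

BLOCK = 32  # weights per Q8_0 / Q4_0 block

# Bits per weight from block sizes: (scale bytes + quant bytes) * 8 / 32.
BPW = {"Q8_0": (2 + 32) * 8 / BLOCK, "Q4_0": (2 + 16) * 8 / BLOCK}


@dataclass
class LayerInfo:  # hypothetical container, for illustration only
    qkv_types: tuple   # quant type names of attention Q, K, V
    expert_bpw: float  # bits per weight of the MoE expert tensors


def layer_is_eligible(layer: LayerInfo) -> bool:
    """Assumed reading: uniform Q8_0 QKV and expert weights below 5 bpw."""
    return all(t == "Q8_0" for t in layer.qkv_types) and layer.expert_bpw < 5.0


def quantize_q4_0_block(x):
    """Quantize 32 floats to one scale and 32 values in 0..15."""
    assert len(x) == BLOCK
    vmax = max(x, key=abs)        # signed value with the largest magnitude
    d = vmax / -8.0               # scale chosen so vmax maps to quant 0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    q = [min(15, int(v * inv_d + 8.5)) for v in x]
    return d, q


def dequantize_q4_0_block(d, q):
    return [(qi - 8) * d for qi in q]


if __name__ == "__main__":
    print(BPW)
    layer = LayerInfo(("Q8_0", "Q8_0", "Q8_0"), 4.5)
    print(layer_is_eligible(layer))
    d, q = quantize_q4_0_block([float(i - 16) for i in range(BLOCK)])
    print(d, q)
```

Under these block sizes Q8_0 is 8.5 bpw and Q4_0 is 4.5 bpw, so the downgrade nearly halves the memory traffic for the affected QKV tensors, which is consistent with the speed-for-quality trade-off described above.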

Performance (Panther Lake B390 + Windows OS)

BEFORE (prompt processing, pp8192):
C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>llama-bench.exe -p 8192 -n 0 -r 3 -fa 0,1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-cpu-alderlake.dll
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   0 |          pp8192 |       523.86 ± 21.17 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |          pp8192 |        439.07 ± 2.28 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   0 |          pp8192 |        496.90 ± 5.37 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |          pp8192 |        561.50 ± 1.79 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   0 |          pp8192 |        615.07 ± 7.95 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |          pp8192 |        414.96 ± 0.91 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   0 |          pp8192 |        330.11 ± 0.45 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   1 |          pp8192 |        220.38 ± 0.36 |

build: 3571fa543 (9490)

C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>

AFTER (prompt processing, pp8192):
C:\upsteaming_build\subPR2_GEMM_CW\Release>llama-bench.exe -p 8192 -n 0 -r 3 -fa 0,1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   0 |          pp8192 |       819.43 ± 47.14 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |          pp8192 |        689.30 ± 4.62 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   0 |          pp8192 |        727.97 ± 5.98 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |          pp8192 |       724.32 ± 14.87 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   0 |          pp8192 |       848.26 ± 10.87 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |          pp8192 |        493.29 ± 4.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   0 |          pp8192 |        469.47 ± 0.74 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   1 |          pp8192 |        229.64 ± 3.29 |

build: 84de82817 (9492)

C:\upsteaming_build\subPR2_GEMM_CW\Release>

BEFORE (token generation, tg128 @ d8192):
C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>llama-bench.exe -p 0 -n 128 -d 8192 -r 3 -fa 0,1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64\ggml-cpu-alderlake.dll
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         27.68 ± 0.08 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         25.68 ± 0.04 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         30.60 ± 0.49 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         25.77 ± 0.11 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         22.76 ± 0.07 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         20.80 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         26.63 ± 0.49 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         19.04 ± 0.02 |

build: 3571fa543 (9490)

C:\Users\dungeon\Downloads\llama-b9490-bin-win-vulkan-x64>

AFTER (token generation, tg128 @ d8192):
C:\upsteaming_build\subPR2_GEMM_CW\Release>llama-bench.exe -p 0 -n 128 -d 8192 -r 3 -fa 0,1 --delay 10 -ngl 99 -m C:\Users\dungeon\Desktop\models\Qwen3.5-35B-A3B-Q4_K_M\Qwen3.5-35B-A3B-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gpt-oss-20b-Q4_K_M\gpt-oss-20b-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\gemma-4-26B-A4B-it-UD-Q4_K_M\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf,C:\Users\dungeon\Desktop\models\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M\Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         36.15 ± 0.71 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         31.76 ± 0.38 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         30.69 ± 0.13 |
| gpt-oss 20B Q4_K - Medium      |  10.81 GiB |    20.91 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         25.68 ± 0.05 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         28.96 ± 0.14 |
| gemma4 26B.A4B Q4_K - Medium   |  15.70 GiB |    25.23 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         24.74 ± 1.05 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   0 |   tg128 @ d8192 |         27.08 ± 1.07 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | Vulkan     |  99 |   1 |   tg128 @ d8192 |         17.24 ± 0.04 |

build: 84de82817 (9492)
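
For readers who want the before/after ratios rather than raw throughput, the small Python script below computes them from the mean t/s values in the four tables above. The numbers are copied from the tables; the dictionary layout and function names are only an illustrative choice, and the ± spreads are ignored.

```python
# (model, fa) -> (before t/s, after t/s), means copied from the tables above.
PP8192 = {
    ("qwen35moe 35B.A3B", 0): (523.86, 819.43),
    ("qwen35moe 35B.A3B", 1): (439.07, 689.30),
    ("gpt-oss 20B", 0): (496.90, 727.97),
    ("gpt-oss 20B", 1): (561.50, 724.32),
    ("gemma4 26B.A4B", 0): (615.07, 848.26),
    ("gemma4 26B.A4B", 1): (414.96, 493.29),
    ("qwen3moe 30B.A3B", 0): (330.11, 469.47),
    ("qwen3moe 30B.A3B", 1): (220.38, 229.64),
}

TG128_D8192 = {
    ("qwen35moe 35B.A3B", 0): (27.68, 36.15),
    ("qwen35moe 35B.A3B", 1): (25.68, 31.76),
    ("gpt-oss 20B", 0): (30.60, 30.69),
    ("gpt-oss 20B", 1): (25.77, 25.68),
    ("gemma4 26B.A4B", 0): (22.76, 28.96),
    ("gemma4 26B.A4B", 1): (20.80, 24.74),
    ("qwen3moe 30B.A3B", 0): (26.63, 27.08),
    ("qwen3moe 30B.A3B", 1): (19.04, 17.24),
}


def speedup(before: float, after: float) -> float:
    """Throughput ratio; values above 1.0 mean the AFTER build is faster."""
    return after / before


def report(test_name: str, results: dict) -> None:
    for (model, fa), (before, after) in results.items():
        ratio = speedup(before, after)
        print(f"{test_name:14s} {model:18s} fa={fa} "
              f"{before:8.2f} -> {after:8.2f}  x{ratio:.2f}")


if __name__ == "__main__":
    report("pp8192", PP8192)
    report("tg128 @ d8192", TG128_D8192)
```

By these ratios, prompt processing improves in every row, with the largest gains on qwen35moe (roughly 1.56x to 1.57x). Token generation is mixed: qwen35moe and gemma4 improve, gpt-oss is essentially flat, and qwen3moe with fa=1 regresses from 19.04 to 17.24 t/s.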

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. Used Claude Code, followed by extensive manual review and tweaking.

ggml-gh-bot · 2026-06-10T09:39:09Z

Hi @fish-jiang, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 4 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

fish-jiang and others added 2 commits June 10, 2026 17:18

vulkan: add INTEL_PRE_XE2 arch enum and enable coopmat1 on Intel Xe-LPG Plus (1/3, Xe1-ARLH)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>

f7477c0

vulkan: GEMM/Group GEMM optimizations and load-time weight compression for Intel MoE path (3/3, Xe-LPG Plus/Xe2/Xe3)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>

500eb77

fish-jiang requested review from a team, CISC and ggerganov as code owners June 10, 2026 09:34

fish-jiang marked this pull request as draft June 10, 2026 09:34

github-actions Bot added the labels model (Model specific), Vulkan (Issues specific to the Vulkan backend), examples, and ggml (changes relating to the ggml tensor library for machine learning) on Jun 10, 2026

This was referenced Jun 10, 2026

vulkan: add Intel Xe flash attention optimization kernels (2/3, Xe-LPG Plus/Xe2/Xe3) #24406

Draft

vulkan: Intel Xe flash attention, GEMM optimizations, and optional weight compression (Xe-LPG Plus/Xe2/Xe3) [MEGA PR] #24408

Draft

fish-jiang wants to merge 2 commits into ggml-org:master from fish-jiang:intel/xe-gemm-cw
