Force NVFP4 W4A8 path for NVFP4_W4A16 layers on Blackwell, where NVFP4 normally uses the native W4A4 path. by ynankani · Pull Request #24364 · ggml-org/llama.cpp

ynankani · 2026-06-09T14:32:56Z

Overview

This PR adds support for forcing the W4A8 path for W4A16_NVFP4 HF model layers on Blackwell, where NVFP4 normally uses the native W4A4 path.
It includes the following changes:

Adds a GGUF metadata entry that records the NVFP4_W4A16 layers and the output weight.
Loads the NVFP4_W4A16 layer metadata into the model hparams.
Uses the metadata to set the new GGML_HINT_NO_QUANT_SRC1 hint on NVFP4_W4A16 layers.
On Blackwell GPUs, dispatches the hinted layers through the W4A8 path instead of the native W4A4 path.
Allows the new hint for the MoE mul_mat_id operation as well.
Adds test cases for mul_mat (dense) and mul_mat_id (MoE) models.
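
The flow in the list above (metadata, then hint, then dispatch) can be sketched roughly as follows. This is a simplified illustration, not the PR's code: only the role of GGML_HINT_NO_QUANT_SRC1 comes from the description, while `MatmulPath`, `LayerInfo`, `make_layer`, `select_path`, and the tensor names used in the note below are hypothetical. What happens on non-Blackwell GPUs is not described here, so the sketch leaves it as an opaque "backend default".

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; not identifiers from the PR.
enum class MatmulPath { W4A4_NATIVE_FP4, W4A8, BACKEND_DEFAULT };

struct LayerInfo {
    std::string name;
    bool is_nvfp4;            // weights are stored as NVFP4
    bool hint_no_quant_src1;  // plays the role of GGML_HINT_NO_QUANT_SRC1
};

// Steps 1-3: a layer listed in the (assumed) W4A16 metadata set gets the hint.
inline LayerInfo make_layer(const std::string & name, bool is_nvfp4,
                            const std::set<std::string> & w4a16_layers) {
    const bool hinted = is_nvfp4 && w4a16_layers.count(name) > 0;
    return LayerInfo{name, is_nvfp4, hinted};
}

// Step 4: on Blackwell, hinted NVFP4 layers take the W4A8 path and
// unhinted ones keep the native W4A4 path.
inline MatmulPath select_path(const LayerInfo & layer, bool has_blackwell) {
    assert(layer.is_nvfp4);
    if (!has_blackwell) {
        return MatmulPath::BACKEND_DEFAULT;  // not modelled in this sketch
    }
    return layer.hint_no_quant_src1 ? MatmulPath::W4A8
                                    : MatmulPath::W4A4_NATIVE_FP4;
}
```

With a metadata set containing only `blk.0.ffn_up.weight` (an illustrative name), that layer would select `MatmulPath::W4A8` on Blackwell, while any other NVFP4 layer would stay on `MatmulPath::W4A4_NATIVE_FP4`. This matches the claim later in the thread that pure W4A4 checkpoints are unaffected.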

Additional information

We observed a quality improvement with W4A8 compared to W4A4, and W4A8 is able to meet the stricter quality threshold of the test_mul_mat cases used by ggml.
It is also mentioned that "W4A4 sometimes difficult to achieve for small LLMs" in this paper URL
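
To give a feel for why 8-bit activations can be more accurate than 4-bit ones, here is a minimal numerical sketch. It makes several assumptions that are not stated in this PR: the 8-bit activations are modelled as int8 with one scale per block, the 4-bit activations as E2M1 values with one scale per block, and the scales are kept in full float precision (real NVFP4 stores block scales in a narrower format). All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Magnitudes representable by a 4-bit E2M1 float; the sign is a separate bit.
static const float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Quantize-dequantize x to the nearest E2M1 value, scaling so amax maps to 6.
inline float fake_quant_fp4(float x, float amax) {
    if (amax == 0.0f) return 0.0f;
    const float scale = amax / 6.0f;
    const float mag = std::fabs(x) / scale;
    float best = kE2M1Magnitudes[0];
    for (float v : kE2M1Magnitudes) {
        if (std::fabs(v - mag) < std::fabs(best - mag)) best = v;
    }
    return std::copysign(best * scale, x);
}

// Quantize-dequantize x to a symmetric int8 grid, scaling so amax maps to 127.
inline float fake_quant_int8(float x, float amax) {
    if (amax == 0.0f) return 0.0f;
    const float scale = amax / 127.0f;
    return std::round(x / scale) * scale;
}

// Root-mean-square round-trip error of a quantizer over one block of values.
template <typename Quantizer>
float rms_error(const std::vector<float> & block, Quantizer quant) {
    assert(!block.empty());
    float amax = 0.0f;
    for (float x : block) amax = std::fmax(amax, std::fabs(x));
    double sum_sq = 0.0;
    for (float x : block) {
        const double d = static_cast<double>(quant(x, amax)) - x;
        sum_sq += d * d;
    }
    return static_cast<float>(std::sqrt(sum_sq / block.size()));
}
```

Calling `rms_error(block, fake_quant_int8)` and `rms_error(block, fake_quant_fp4)` on the same block of activations shows the gap: the int8 grid has 255 evenly spaced levels across the block range, whereas E2M1 has only 15 distinct values, so the 4-bit round-trip error is typically much larger. This only illustrates activation rounding error in isolation and says nothing about end-to-end model quality.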
Tested on: nvidia/Qwen3.6-35B-A3B-NVFP4

Force W4A8 for W4A16_NVFP4 layers:

Model	Size	Params	Backend	ngl	Test	t/s
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp512	4492.20 +/- 25.66
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp1024	4689.26 +/- 47.87
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp2048	4822.40 +/- 75.42
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	tg128	148.25 +/- 1.42
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	tg256	164.13 +/- 6.55

Master Baseline:

Model	Size	Params	Backend	ngl	Test	t/s
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp512	5486.02 +/- 253.52
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp1024	5445.48 +/- 211.56
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	pp2048	5865.13 +/- 171.27
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	tg128	155.60 +/- 6.73
qwen35moe 35B.A3B NVFP4	21.09 GiB	35.51 B	CUDA	99	tg256	165.52 +/- 6.55

Perplexity Improvement:

Config	Command	PPL
Force W4A8 for W4A16_NVFP4 layers	llama-perplexity -c 2048 -b 2048 -ngl 99	6.0264 +/- 0.03777
Master baseline	llama-perplexity -c 2048 -b 2048 -ngl 99	6.1234 +/- 0.03851
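
For reading the tables above, the relative change of each measurement against the master baseline can be computed with a small helper. This is just arithmetic on the reported numbers; `percent_change` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Relative change of `value` against `baseline`, in percent.
// Negative means `value` is lower than the baseline.
inline double percent_change(double baseline, double value) {
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

Applied to the mean values reported above, `percent_change(5486.02, 4492.20)` gives roughly -18.1% for pp512, and the same calculation gives roughly -13.9% for pp1024, -17.8% for pp2048, -4.7% for tg128, and -0.8% for tg256. For perplexity, `percent_change(6.1234, 6.0264)` is roughly -1.6% (lower is better). These figures ignore the reported +/- spreads; the tg256 difference in particular is smaller than the spread on either measurement.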

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, paired with Claude for this

Signed-off-by: ynankani <ynankani@nvidia.com>

sanmai · 2026-06-10T09:38:47Z

+                mul_mat_q_case<GGML_TYPE_NVFP4, true>(ctx, args, stream);
+                break;
+            }
+#endif // GGML_CUDA_HAS_BLACKWELL_TARGET


How do we opt out of this? A drop from 5486.02 to 4492.20 t/s (pp512) is very severe.

If you want higher precision, there is a variety of Q4 quants that are just as small and even more precise (see #23572 for detailed comparisons).

ynankani · 2026-06-10

There is no need to opt out of this: if you want to run W4A4, use a W4A4 checkpoint.
The intention of this PR is that if a checkpoint has W4A16 layers, those layers should keep their activations in higher precision.
This PR doesn't cause a regression on pure W4A4_NVFP4 checkpoints.

You can confirm there is no regression by running llama-bench with this PR on the checkpoints below:

Pure W4A4_NVFP4: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4

W4A16_NVFP4: https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4

michaelw9999 · 2026-06-10

There may be cases where no other checkpoints exist, or where somebody prefers one checkpoint's calibration over another. I think many people prefer increased speed, and that is why they pick NVFP4. Q4_K~ quants will usually have better precision than NVFP4 if that is what they are going for, and they may then end up faster than a path that skips the native FP4. We're blending checkpoints that combine Q_K quants and NVFP4, which can compensate for the PPL loss. I think some selection control for the user would be a good balance that lets them decide.

am17an

I didn't look into the PR in detail, but does using 8-bit activation disable the fp4 tensor core?

ORippler · 2026-06-10T13:17:29Z

"I didn't look into the PR in detail, but does using 8-bit activation disable the fp4 tensor core?"

Yes. Basically, you can think of this as a step towards supporting weight-only quantization schemes in llama.cpp.
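
As a rough illustration of what "weight-only quantization" means, the sketch below decodes 4-bit E2M1 weight codes and multiplies them with activations that stay in float. It is a conceptual example under stated assumptions, not how llama.cpp's kernels work: it stores one code per byte (real layouts pack two codes per byte), uses a single float scale per block, and the names `decode_e2m1` and `dot_weight_only` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude.
inline float decode_e2m1(std::uint8_t code) {
    static const float magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float m = magnitudes[code & 0x7];
    return (code & 0x8) ? -m : m;
}

// Dot product of one block: weights are 4-bit codes sharing one scale,
// activations are left unquantized (the "A16" side, shown here as float).
inline float dot_weight_only(const std::vector<std::uint8_t> & w_codes, float w_scale,
                             const std::vector<float> & activations) {
    assert(w_codes.size() == activations.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < w_codes.size(); ++i) {
        acc += decode_e2m1(w_codes[i]) * w_scale * activations[i];
    }
    return acc;
}
```

Only the weights lose precision in this scheme; the activations enter the multiply unchanged. The W4A8 path discussed in this PR sits between this and W4A4, since it still quantizes activations but to 8 bits rather than 4.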

Force NVFP4 W4A8 path for NVFP4_W4A16 layers

b72a8c9

Signed-off-by: ynankani <ynankani@nvidia.com>

ynankani requested review from a team, CISC and ggerganov as code owners June 9, 2026 14:32

github-actions bot added the testing, Nvidia GPU, python, and ggml labels on Jun 9, 2026

sanmai reviewed Jun 10, 2026

am17an reviewed Jun 10, 2026

ynankani wants to merge 1 commit into ggml-org:master from ynankani:ynankani/Force_W4A16_NVFP4_to_W4A8

5 participants
