Force NVFP4 W4A8 path for NVFP4_W4A16 layers on Blackwell, where NVFP4 normally uses the native W4A4 path. #24364

Open
ynankani wants to merge 1 commit into ggml-org:master from ynankani:ynankani/Force_W4A16_NVFP4_to_W4A8

@ynankani ynankani commented Jun 9, 2026

Contributor

Overview

This PR adds support for forcing the W4A8 path for the W4A16_NVFP4 layers of HF models on Blackwell, where NVFP4 normally uses the native W4A4 path.
It includes the following changes:

  1. Adds GGUF metadata for storing the NVFP4_W4A16 layers and output weight.
  2. Loads the NVFP4_W4A16 layer metadata into the model hparams.
  3. Uses the metadata to set the new GGML_HINT_NO_QUANT_SRC1 hint on NVFP4_W4A16 layers.
  4. On Blackwell GPUs, dispatches the hinted layers through the W4A8 path instead of the native W4A4 path.
  5. Allows the new hint on the MoE layer op mul_mat_id as well.
  6. Adds test cases for mul_mat (dense models) and mul_mat_id (MoE models).
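
The dispatch rule described in items 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WeightType`, `MatmulPath`, `MatmulRequest`, and `select_path` are hypothetical names, and only `GGML_HINT_NO_QUANT_SRC1` comes from the description above.

```cpp
// Hypothetical sketch; none of these names are the actual ggml/llama.cpp API.
enum class WeightType { NVFP4, Other };

enum class MatmulPath {
    NativeW4A4,      // FP4 weights with FP4 activations (native Blackwell path)
    W4A8,            // FP4 weights with 8-bit activations
    BackendDefault,  // whatever the backend already does; not modelled here
};

struct MatmulRequest {
    WeightType weight_type;
    bool no_quant_src1_hint;  // stands in for GGML_HINT_NO_QUANT_SRC1 from this PR
    bool is_blackwell;        // device offers the native W4A4 path
};

// Chooses the path for one mul_mat / mul_mat_id call.
// Only NVFP4 weights on Blackwell are affected by the hint.
inline MatmulPath select_path(const MatmulRequest & req) {
    if (req.weight_type != WeightType::NVFP4 || !req.is_blackwell) {
        return MatmulPath::BackendDefault;
    }
    return req.no_quant_src1_hint ? MatmulPath::W4A8 : MatmulPath::NativeW4A4;
}
```

Under this sketch, a layer without the hint keeps the native W4A4 path, which matches the claim that pure W4A4_NVFP4 checkpoints are unaffected.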

Additional information

We observed a quality improvement with W4A8 compared to W4A4, and W4A8 is able to meet the stricter quality threshold of the test_mul_mat cases used by ggml.
It is also mentioned that "W4A4 sometimes difficult to achieve for small LLMs" in this paper URL
Tested on: nvidia/Qwen3.6-35B-A3B-NVFP4

Force W4A8 for W4A16_NVFP4 layers:

| Model | Size | Params | Backend | ngl | Test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp512 | 4492.20 +/- 25.66 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp1024 | 4689.26 +/- 47.87 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp2048 | 4822.40 +/- 75.42 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | tg128 | 148.25 +/- 1.42 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | tg256 | 164.13 +/- 6.55 |

Master Baseline:

| Model | Size | Params | Backend | ngl | Test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp512 | 5486.02 +/- 253.52 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp1024 | 5445.48 +/- 211.56 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | pp2048 | 5865.13 +/- 171.27 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | tg128 | 155.60 +/- 6.73 |
| qwen35moe 35B.A3B NVFP4 | 21.09 GiB | 35.51 B | CUDA | 99 | tg256 | 165.52 +/- 6.55 |

Perplexity Improvement:

| Config | Command | PPL |
| --- | --- | --- |
| Force W4A8 for W4A16_NVFP4 layers | llama-perplexity -c 2048 -b 2048 -ngl 99 | 6.0264 +/- 0.03777 |
| Master baseline | llama-perplexity -c 2048 -b 2048 -ngl 99 | 6.1234 +/- 0.03851 |
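
A rough way to read the perplexity numbers is to express the gap both as a relative reduction and in units of the reported errors. The sketch below uses hypothetical names (`Measurement`, `relative_reduction_pct`, `gap_in_combined_errors`) and assumes the two +/- values are independent, which is a simplification because both runs evaluate the same text.

```cpp
#include <cmath>

struct Measurement {
    double mean;
    double err;  // the reported +/- value
};

// Relative reduction of `candidate` versus `baseline`, in percent.
// Positive means the candidate has lower perplexity.
inline double relative_reduction_pct(const Measurement & baseline, const Measurement & candidate) {
    return (baseline.mean - candidate.mean) / baseline.mean * 100.0;
}

// Gap between the means divided by the two errors combined in quadrature.
// Assumption: the errors are treated as independent.
inline double gap_in_combined_errors(const Measurement & baseline, const Measurement & candidate) {
    const double combined = std::sqrt(baseline.err * baseline.err + candidate.err * candidate.err);
    return (baseline.mean - candidate.mean) / combined;
}
```

For the table above, with baseline `{6.1234, 0.03851}` and candidate `{6.0264, 0.03777}`, the reduction is about 1.58% and the gap is about 1.8 combined errors under that independence assumption.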

Requirements

Signed-off-by: ynankani <ynankani@nvidia.com>
@ynankani ynankani requested review from a team, CISC and ggerganov as code owners June 9, 2026 14:32
@github-actions github-actions bot added the testing, Nvidia GPU, python, and ggml labels Jun 9, 2026
Comment thread ggml/src/ggml-cuda/mmq.cu
mul_mat_q_case<GGML_TYPE_NVFP4, true>(ctx, args, stream);
break;
}
#endif // GGML_CUDA_HAS_BLACKWELL_TARGET

@sanmai sanmai Jun 10, 2026

Contributor

How do we opt out from this? A drop from 5486.02 to 4492.20 is very severe.

If you want higher precision, there's a variety of Q4 quants that are just as small and even more precise (see #23572 for detailed comparisons).

Contributor Author

There is no need to opt out of this. If you want to run W4A4, use a W4A4 checkpoint.
The intention of this PR is that if a checkpoint has W4A16 layers, its activations should be kept in higher precision.
This PR doesn't cause a regression on pure W4A4_NVFP4 checkpoints.

You can confirm there is no regression by running llama-bench with this PR on the checkpoints below:

  1. Pure W4A4_NVFP4: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4
  2. W4A16_NVFP4: https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4

Contributor

There may be cases where no other checkpoint exists, or where somebody prefers one checkpoint's calibration over another's. I think many people prefer increased speed, and that is why they pick NVFP4. Q4_K~ will usually have better precision than NVFP4 if that is what they are going for, and it may then end up faster than skipping the native FP4 path. We're blending checkpoints that combine Q_K quants and NVFP4, which can compensate for the PPL loss. I think giving the user some selection control would be a good balance, letting them decide.

@am17an am17an left a comment

Contributor

I didn't look into the PR in detail, but does using 8-bit activation disable the fp4 tensor core?

@ORippler

Collaborator

> I didn't look into the PR in detail, but does using 8-bit activation disable the fp4 tensor core?

Yes. Basically you can think of this as a step towards the support of weight-only-quantization-schemes in llama.cpp.
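
As a rough illustration of why activation precision matters for such weight-only-quantized layers, the toy below round-trips a block of activations through a 4-bit E2M1 grid and through symmetric int8, then compares the error. It is a sketch under stated assumptions, not the kernels in this PR: all function names are hypothetical, each block uses a single float scale (NVFP4 itself stores block scales in FP8), and int8 stands in for the 8-bit activation format.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitudes representable in FP4 E2M1; the sign is a separate bit.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float abs_max(const std::vector<float> & x) {
    float m = 0.0f;
    for (float v : x) m = std::max(m, std::fabs(v));
    return m;
}

// Quantize to the E2M1 grid with one float scale per block, then dequantize.
static std::vector<float> roundtrip_fp4(const std::vector<float> & x) {
    const float m = abs_max(x);
    const float scale = m > 0.0f ? m / 6.0f : 1.0f;  // largest value maps to 6
    std::vector<float> out;
    out.reserve(x.size());
    for (float v : x) {
        const float a = std::fabs(v) / scale;
        float best = kE2M1[0];
        for (float c : kE2M1) {
            if (std::fabs(a - c) < std::fabs(a - best)) best = c;
        }
        out.push_back(std::copysign(best * scale, v));
    }
    return out;
}

// Symmetric int8 round trip with one float scale per block.
static std::vector<float> roundtrip_int8(const std::vector<float> & x) {
    const float m = abs_max(x);
    const float scale = m > 0.0f ? m / 127.0f : 1.0f;  // largest value maps to 127
    std::vector<float> out;
    out.reserve(x.size());
    for (float v : x) out.push_back(std::round(v / scale) * scale);
    return out;
}

// Root-mean-square difference between two equally sized vectors.
static float rms_error(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return static_cast<float>(std::sqrt(sum / a.size()));
}
```

Comparing `rms_error(x, roundtrip_int8(x))` with `rms_error(x, roundtrip_fp4(x))` on a block of generic values shows the 8-bit round trip staying closer to the original, which is the trade this PR makes against the speed of the native FP4 path.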
