Force NVFP4 W4A8 path for NVFP4_W4A16 layers on Blackwell, where NVFP4 normally uses the native W4A4 path. #24364
Signed-off-by: ynankani <ynankani@nvidia.com>
        mul_mat_q_case<GGML_TYPE_NVFP4, true>(ctx, args, stream);
        break;
    }
#endif // GGML_CUDA_HAS_BLACKWELL_TARGET
How do we opt out of this? A drop from 5486.02 to 4492.20 (about 18%) is very severe.
If you want a higher precision, there's a variety of Q4 quants that are just as small and even more precise (see #23572 for detailed comparisons).
There is no need to opt out of this. If you want to run W4A4, use a W4A4 checkpoint.
The intention of this PR is that if a checkpoint has W4A16 layers, those layers should run with higher-precision activations.
This PR doesn't cause a regression on pure W4A4_NVFP4 checkpoints.
You can confirm there is no regression by running llama-bench with this PR on the checkpoints below:
- Pure W4A4_NVFP4: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4
- W4A16_NVFP4: https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4
There may be cases where no other checkpoint exists, or where somebody prefers one checkpoint's calibration over another's. I think many people pick NVFP4 because they prefer the increased speed. If precision is the goal, Q4_K~ will usually be more precise than NVFP4, and it may even end up faster than NVFP4 with the native FP4 path skipped. We're blending checkpoints that combine Q_K quants and NVFP4, which can compensate for the ppl loss. I think some selection control for the user would be a good balance, letting them decide.
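To make the "selection control" idea concrete, here is a minimal sketch of how a user override could sit on top of the checkpoint-driven default. Every name in it (ActPath, ActPolicy, parse_policy, resolve_path, and the "w4a4"/"w4a8" strings) is hypothetical and exists neither in this PR nor in llama.cpp; it only illustrates the proposal.

```cpp
#include <string>

// Hypothetical names throughout; none of these exist in llama.cpp or ggml.
enum class ActPath   { W4A4, W4A8 };
enum class ActPolicy { Auto, ForceW4A4, ForceW4A8 };

// Map a user-supplied setting to a policy; unknown values fall back to Auto.
inline ActPolicy parse_policy(const std::string & s) {
    if (s == "w4a4") return ActPolicy::ForceW4A4;
    if (s == "w4a8") return ActPolicy::ForceW4A8;
    return ActPolicy::Auto;
}

// Auto follows the checkpoint: W4A16 layers get 8-bit activations and
// everything else keeps the native FP4 path. Force* overrides the checkpoint.
inline ActPath resolve_path(ActPolicy policy, bool layer_is_w4a16) {
    switch (policy) {
        case ActPolicy::ForceW4A4: return ActPath::W4A4;
        case ActPolicy::ForceW4A8: return ActPath::W4A8;
        case ActPolicy::Auto:      break;
    }
    return layer_is_w4a16 ? ActPath::W4A8 : ActPath::W4A4;
}
```

With Auto as the default, behavior would match what the PR author describes, while a user who prefers speed on a W4A16 checkpoint could still force the native path.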
am17an left a comment:
I didn't look into the PR in detail, but does using 8-bit activation disable the fp4 tensor core?
Yes. Basically you can think of this as a step towards the support of weight-only-quantization-schemes in llama.cpp.
Overview
This PR adds support for forcing the W4A8 path for W4A16_NVFP4 HF model layers on Blackwell, where NVFP4 normally uses the native W4A4 path.
This PR includes the following:
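As a rough illustration of the dispatch described above, the sketch below picks a kernel from a per-op hint flag. It is a simplified model under stated assumptions, not the PR's code: the MatMulOp struct, the Kernel enum, select_kernel, and the flag's bit value are all hypothetical, and only the hint's name is borrowed from GGML_HINT_NO_QUANT_SRC1. The behavior on non-Blackwell targets is also an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for the hint; the real flag's value and storage
// are not shown in this thread.
constexpr uint32_t HINT_NO_QUANT_SRC1 = 1u << 0;

struct MatMulOp {
    uint32_t hints;     // bit flags attached to the op (assumed layout)
    bool     is_nvfp4;  // true when the weights (src0) are NVFP4
};

enum class Kernel { NVFP4_W4A4, NVFP4_W4A8, OTHER };

inline Kernel select_kernel(const MatMulOp & op, bool has_blackwell) {
    if (!op.is_nvfp4) {
        return Kernel::OTHER;
    }
    // Assumption: without a Blackwell target there is no native W4A4 path.
    if (!has_blackwell) {
        return Kernel::NVFP4_W4A8;
    }
    // On Blackwell, W4A4 stays the default; the hint asks for activations
    // (src1) not to be quantized to FP4, which selects the W4A8 path.
    return (op.hints & HINT_NO_QUANT_SRC1) ? Kernel::NVFP4_W4A8
                                           : Kernel::NVFP4_W4A4;
}
```

Keeping the decision in a per-op flag means pure W4A4 checkpoints never set it and so keep the native path, which is consistent with the no-regression claim made in the review thread.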
- GGML_HINT_NO_QUANT_SRC1 hint for NVFP4_W4A16 layers
- mul_mat_id as well
- mul_mat for dense and mul_mat_id MoE models
Additional information
We observed a quality improvement for W4A8 compared to W4A4 and were able to meet the stricter quality threshold of the test_mul_mat cases used by ggml.
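The intuition behind the quality gap can be sketched numerically. The example below quantizes the same activation vector two ways, per 16-element block with an absmax scale: once onto the FP4 E2M1 value grid and once onto symmetric int8. It is a simplified illustration, not ggml's kernels: the scale is kept in plain float (NVFP4 stores block scales in FP8), int8 is assumed as the 8-bit activation format, and sample_activations is synthetic data invented for the example.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitudes representable in FP4 E2M1.
static const std::array<float, 8> kE2M1 = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round to the nearest E2M1 value, keeping the sign.
inline float round_e2m1(float x) {
    float mag = std::fabs(x), best = kE2M1[0];
    for (float v : kE2M1) {
        if (std::fabs(v - mag) < std::fabs(best - mag)) best = v;
    }
    return std::copysign(best, x);
}

// Round to the nearest integer in [-127, 127] (assumed int8 activations).
inline float round_int8(float x) {
    return std::max(-127.0f, std::min(127.0f, std::nearbyint(x)));
}

// RMS error after block-wise quantize/dequantize with an absmax scale.
// qmax is the largest representable magnitude (6 for E2M1, 127 for int8).
template <typename Round>
double rms_error(const std::vector<float> & x, float qmax, Round round_fn) {
    constexpr std::size_t kBlock = 16;
    double sq = 0.0;
    for (std::size_t b = 0; b < x.size(); b += kBlock) {
        const std::size_t end = std::min(b + kBlock, x.size());
        float amax = 0.0f;
        for (std::size_t i = b; i < end; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float scale = amax > 0.0f ? amax / qmax : 1.0f;
        for (std::size_t i = b; i < end; ++i) {
            const double d = x[i] - scale * round_fn(x[i] / scale);
            sq += d * d;
        }
    }
    return x.empty() ? 0.0 : std::sqrt(sq / x.size());
}

// Synthetic, deterministic "activations" for illustration only.
inline std::vector<float> sample_activations(std::size_t n) {
    std::vector<float> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::sin(0.37f * i) * (1.0f + 0.05f * (i % 7));
    }
    return x;
}
```

Calling rms_error(x, 6.0f, round_e2m1) and rms_error(x, 127.0f, round_int8) on the same x gives the 4-bit and 8-bit activation errors respectively. The 8-bit grid is much finer within each block, so its error is smaller, which is the direction of the effect reported above. The sketch says nothing about the size of the perplexity change on a real model.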
Also it is mentioned that "W4A4 sometimes difficult to achieve for small LLMs" in this paper URL
Tested on: nvidia/Qwen3.6-35B-A3B-NVFP4
Force W4A8 for W4A16_NVFP4 layers:
Master Baseline:
Perplexity Improvement:
Requirements