hexagon: store HMX flash-attention softmax accumulators in FP32#24389

Draft
njsyw1997 wants to merge 1 commit into
ggml-org:masterfrom
aizip:hex-ml-fp32

@njsyw1997

Contributor

Overview

Store the online-softmax cross-block accumulators of the HMX flash-attention prefill kernel in FP32 instead of FP16. m (row max), l (row sum), and p_rowsum move to FP32; s_rowmax stays FP16 (lossless — it's a max of fp16 values).

Additional information

Q/K/V are FP16 on both the HMX path and the Metal reference. However, the HMX kernel also keeps the running softmax statistics (m/l) and the per-block p_rowsum in FP16, whereas Metal stores these intermediate results in FP32. Because the HMX values are re-quantized to FP16 after every KV block, the rounding error accumulates across blocks and grows with context length.
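
To illustrate the accumulation effect, here is a minimal scalar sketch of the online-softmax recurrence for a single query row. It is not the HMX kernel: the function names (`to_fp16`, `to_fp32`, `online_softmax_stats`, `relative_error`) are hypothetical, the arithmetic runs in Python doubles, and FP16/FP32 storage is only emulated by rounding through `struct` after each block. The block size and score range are arbitrary assumptions.

```python
import math
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def online_softmax_stats(scores, block, store):
    """Return the running (m, l) for one query row, processed block by block.

    `store` models the precision in which the cross-block accumulators are kept.
    """
    m = -math.inf  # running row max
    l = 0.0        # running row sum of exp(score - m)
    for start in range(0, len(scores), block):
        blk = scores[start:start + block]
        m_new = max(m, max(blk))
        # Per-block sum of probabilities, stored at accumulator precision.
        p_rowsum = store(sum(math.exp(s - m_new) for s in blk))
        # Rescale the old sum to the new max, add the block, then store.
        l = store(l * math.exp(m - m_new) + p_rowsum)
        m = store(m_new)
    return m, l


def relative_error(scores, block, store):
    """Relative error of the final row sum against a double-precision reference."""
    m_ref = max(scores)
    l_ref = sum(math.exp(s - m_ref) for s in scores)
    _, l = online_softmax_stats(scores, block, store)
    return abs(l - l_ref) / l_ref


if __name__ == "__main__":
    rng = random.Random(0)
    # FP16-representable scores (an assumption of this sketch).
    scores = [to_fp16(rng.uniform(-4.0, 4.0)) for _ in range(8192)]
    print("fp16 accumulators:", relative_error(scores, 64, to_fp16))
    print("fp32 accumulators:", relative_error(scores, 64, to_fp32))
```

In this model the row max is a max of FP16 values, so storing it in FP16 loses nothing; the error comes from rounding l and p_rowsum on every block. An extreme case shows the mechanism: with identical scores processed one key at a time, an FP16 running sum stalls at 2048, because 2049 is not representable in FP16 and rounds back to 2048.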

Perplexity (PPL) does not directly capture this shift, but in some real long-context examples we do observe the HMX output diverging from the CPU backend, while Metal remains aligned with it.

This PR slows prefill down by roughly 5%.

Potential bug

This change worked before #23796. Since that PR, some builds may produce incorrect results on Qwen3 4B Q4_0. As with the previous issue, the behavior is fixed per build rather than per run: a good build consistently produces correct results, while a bad build consistently produces incorrect results.

Adding #define FARF_HIGH 1 to enable log output makes the problem disappear, which suggests that the bug is timing-sensitive and likely sits in one of the synchronization procedures.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Mainly for analyzing the commit history and constructing tests. All the code is reviewed by me.

@github-actions github-actions bot added the ggml and Hexagon labels Jun 10, 2026
@njsyw1997

Contributor Author

@max-krasnyansky
Hi Max, can you have a look?
