fix(cpu): preserve raw fp16 bits in x86 quantize conversion by kkkzbh · Pull Request #685 · UbiquitousLearning/mllm

kkkzbh · 2026-06-13T11:31:06Z

Summary

This fixes x86 fp16 conversion in the common GGML quantization helpers by preserving raw fp16 bits instead of relying on implicit numeric conversion.

On non-ARM platforms, mllm_fp16_t is half_float::half. Passing it directly to helpers that take uint16_t converts the fp16 value numerically instead of passing its raw bits. Q4_K/Q6_K scales with magnitude below 1 are truncated to lookup index 0, whose table entry is +0.0, so dequantization produces all-zero output.
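To make the all-zero symptom concrete, here is a minimal sketch under stated assumptions. It assumes the fp16-to-fp32 lookup is a 65536-entry table indexed by raw fp16 bits, and it reduces block dequantization to `scale * quant`, which is far simpler than the real Q4_K/Q6_K layouts. The names `decode_fp16`, `fp16_table`, and `dequantize_block` are hypothetical and are not taken from mllm.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Decode one IEEE 754 binary16 bit pattern to float.
inline float decode_fp16(uint16_t h) {
  const int exp = (h >> 10) & 0x1f;
  const int frac = h & 0x3ff;
  float v;
  if (exp == 0) {
    v = std::ldexp(static_cast<float>(frac), -24);  // zero and subnormals
  } else if (exp == 31) {
    v = frac ? NAN : INFINITY;
  } else {
    v = std::ldexp(static_cast<float>(frac + 1024), exp - 25);
  }
  return (h & 0x8000) ? -v : v;
}

// Hypothetical lookup table: entry i holds the float value of fp16 bits i.
inline const std::vector<float>& fp16_table() {
  static const std::vector<float> table = [] {
    std::vector<float> t(65536);
    for (uint32_t i = 0; i < 65536; ++i) {
      t[i] = decode_fp16(static_cast<uint16_t>(i));
    }
    return t;
  }();
  return table;
}

// Simplified dequantization (an assumption): every output is scale * quant.
inline std::vector<float> dequantize_block(uint16_t scale_index,
                                           const std::vector<int>& quants) {
  const float d = fp16_table()[scale_index];
  std::vector<float> out;
  out.reserve(quants.size());
  for (int q : quants) out.push_back(d * static_cast<float>(q));
  return out;
}
```

In this sketch, a scale of 0.03125 has raw bits 0x2800. Indexing with 0x2800 recovers the scale, while indexing with 0 (the numerically truncated value) multiplies every quant by 0.0.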

Root cause

The x86 path currently expands MLLM_FP16_TO_FP32(x) to lookup_fp16_to_fp32(x), while lookup_fp16_to_fp32 takes uint16_t.

When x is half_float::half, this does not pass the underlying fp16 bit pattern. For example, fp16 1.0 has raw bits 0x3c00, but numeric conversion to uint16_t yields 1.
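The difference between the two conversions can be sketched without the half library. `HalfLike` below is a hypothetical stand-in that, like half_float::half, stores 16 raw bits and converts implicitly to float; `index_seen_by_lookup`, `numeric_index`, and `raw_index` are illustrative names and not part of mllm.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for half_float::half: 16 raw bits plus an implicit
// conversion to float, which is what lets the buggy call compile silently.
struct HalfLike {
  uint16_t bits;

  operator float() const {
    const int exp = (bits >> 10) & 0x1f;
    const int frac = bits & 0x3ff;
    float v;
    if (exp == 0) {
      v = std::ldexp(static_cast<float>(frac), -24);  // zero and subnormals
    } else if (exp == 31) {
      v = frac ? NAN : INFINITY;
    } else {
      v = std::ldexp(static_cast<float>(frac + 1024), exp - 25);
    }
    return (bits & 0x8000) ? -v : v;
  }
};

// Stand-in for a helper that expects a raw fp16 bit pattern.
inline uint16_t index_seen_by_lookup(uint16_t raw) { return raw; }

// Buggy path: HalfLike -> float -> uint16_t, a numeric (truncating) conversion.
inline uint16_t numeric_index(HalfLike h) { return index_seen_by_lookup(h); }

// Fixed path: copy the stored bytes, with no numeric conversion involved.
inline uint16_t raw_index(HalfLike h) {
  static_assert(sizeof(HalfLike) == sizeof(uint16_t), "must be 16 bits");
  uint16_t out;
  std::memcpy(&out, &h, sizeof(out));
  return index_seen_by_lookup(out);
}
```

With these definitions, `numeric_index(HalfLike{0x3c00})` yields 1, while `raw_index(HalfLike{0x3c00})` yields 0x3c00.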

Changes

Add mllm_fp16_bits() to read fp16 raw bits explicitly.
Add mllm_fp16_from_bits() to construct mllm_fp16_t from raw bits returned by _cvtss_sh.
Route x86 MLLM_COMPUTE_FP16_TO_FP32, MLLM_COMPUTE_FP32_TO_FP16, and MLLM_FP16_TO_FP32 through those helpers.
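A rough sketch of the two helpers and their round trip follows. The `_sketch` suffix marks these as illustrations rather than the PR's exact code, and `Fp16Storage` is a hypothetical 16-bit stand-in for mllm_fp16_t on x86 so the example compiles without the half library.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical 16-bit stand-in for mllm_fp16_t (half_float::half on x86).
struct Fp16Storage {
  uint16_t payload;
};

// Return the raw 16-bit pattern: integral inputs pass through, any other
// 16-bit type is bit-copied with memcpy.
template <typename T>
inline uint16_t fp16_bits_sketch(const T& f) {
  if constexpr (std::is_integral_v<std::decay_t<T>>) {
    return static_cast<uint16_t>(f);
  } else {
    static_assert(sizeof(T) == sizeof(uint16_t), "fp16 type must be 16 bits");
    uint16_t s;
    std::memcpy(&s, &f, sizeof(s));
    return s;
  }
}

// Build the fp16 storage type from a raw bit pattern, again via memcpy.
inline Fp16Storage fp16_from_bits_sketch(uint16_t bits) {
  static_assert(sizeof(Fp16Storage) == sizeof(uint16_t), "must be 16 bits");
  Fp16Storage h;
  std::memcpy(&h, &bits, sizeof(h));
  return h;
}
```

Because both directions copy bytes, `fp16_bits_sketch(fp16_from_bits_sketch(b))` returns `b` for every 16-bit pattern, including NaN payloads that a numeric conversion could alter.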

Context

This was found while investigating the documented Qwen3-0.6B x86 v1 q4_k path. The Qwen3 model-file naming mismatch is tracked separately in #684; this PR only fixes the common x86 quantization conversion issue.

Validation

git diff --check
Minimal Q4_K/Q6_K dequantization check compiled against this branch:

sizeof(mllm_fp16_t)=2 is_half_float=1 one_bits=0x3c00 macro_one=1
q4_sum_abs=128 first=0.5
q6_sum_abs=1856 first=-7.5

CMake object-level build for the affected quantization sources:

cmake --build /tmp/mllm-fp16-pr-build \
  --target mllm/backends/cpu/CMakeFiles/MllmCPUBackend.dir/kernels/common/ggml/quantize/quantize_q4.cpp.o \
           mllm/backends/cpu/CMakeFiles/MllmCPUBackend.dir/kernels/common/ggml/quantize/quantize_q6.cpp.o \
  -j6

A full mllm-qwen3-runner build currently reaches the link step and then fails on upstream main with the existing xxHash static-library PIC issue tracked in #683:

/usr/bin/ld.bfd: lib/libxxhash_static.a(xxhash.c.o): relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC

Summary by CodeRabbit

Bug Fixes
- Improved FP16 bit-level handling in CPU inference paths, correcting raw half-precision conversions to improve numerical accuracy, stability, and consistency of floating-point operations across platforms.

coderabbitai · 2026-06-13T11:31:19Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b35a504d-d667-43bc-9b1f-f160ea867396

📥 Commits

Reviewing files that changed from the base of the PR and between 8647bda and 7011227.

📒 Files selected for processing (1)

mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp

🚧 Files skipped from review as they are similar to previous changes (1)

mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp

📝 Walkthrough

Walkthrough

Adds type-safe helpers to extract raw 16-bit FP16 bit patterns and to construct FP16 values from raw bits, updates fp16 conversion macros to use these helpers, and adjusts the FP16-to-FP32 lookup to index by the extracted raw bits.

Changes

FP16 Bit Manipulation

Layer / File(s)	Summary
FP16 bit helpers and macro updates `mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp`	Added `<type_traits>`. Introduced `template<typename T> inline static uint16_t mllm_fp16_bits(const T& f)` that returns raw 16-bit patterns (passthrough for integral types; `memcpy` for fp16-sized non-integral types with `static_assert(sizeof(T) == sizeof(uint16_t))`). Added `inline static mllm_fp16_t mllm_fp16_from_bits(uint16_t bits)` (construct via `memcpy`). Updated non-ARM/non-MSC fp16 conversion macros to call `_cvtsh_ss(mllm_fp16_bits(x))` / `mllm_fp16_from_bits(_cvtss_sh(x, 0))`, and changed `MLLM_FP16_TO_FP32` to call `lookup_fp16_to_fp32(mllm_fp16_bits(x))`.

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit nibbles tiny bits of light,
Sixteen traces tucked away from sight,
Bits through memcpy, templates hum,
Conversions neat, no bytes undone,
Hoppy code that runs just right! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(cpu): preserve raw fp16 bits in x86 quantize conversion' directly and specifically describes the main change—preserving raw fp16 bit patterns in x86 quantization conversion to fix numeric conversion issues.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering Summary, Root cause, Changes, Context, and Validation sections with clear technical details and evidence of testing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)

109-119: Document mllm_fp16_bits helper (C++17+ already satisfied)

mllm_fp16_bits is a header-local conversion helper used by the fp16 quantization macros; adding a brief comment about supported T (integral vs 16-bit fp type) and that the non-integral path uses memcpy would improve readability/maintenance.
No C++17 compatibility concern: the project sets CMAKE_CXX_STANDARD 20 / CXX_STANDARD 20, which covers if constexpr and std::is_integral_v.

📝 Suggested documentation

+/**
+ * Extract the raw 16-bit representation from an fp16-like value or integral.
+ * Integral inputs are cast to uint16_t; non-integral inputs are bit-copied via
+ * memcpy after enforcing sizeof(T) == sizeof(uint16_t).
+ *
+ * @param f fp16-like value or integral to extract bits from
+ * @return Raw 16-bit representation
+ */
 template<typename T>
 inline static uint16_t mllm_fp16_bits(const T& f) {
   if constexpr (std::is_integral_v<std::decay_t<T>>) {
     return static_cast<uint16_t>(f);
   } else {
     static_assert(sizeof(T) == sizeof(uint16_t), "fp16 type must be 16 bits");
     uint16_t s;
     memcpy(&s, &f, sizeof(s));
     return s;
   }
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp` around lines 109
- 119, Add a short header comment above the mllm_fp16_bits<T> helper explaining
what T may be (integral types are returned as-is; non-integral types are
expected to be a 16-bit floating-point representation), that the non-integral
path uses memcpy to copy the 16-bit bit-pattern into a uint16_t, and that the
static_assert enforces sizeof(T)==sizeof(uint16_t); no code change needed beyond
the comment and an optional note that the project uses C++20 so if constexpr and
std::is_integral_v are available.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp`:
- Around line 121-125: Add a short doc comment for the helper
mllm_fp16_from_bits that mirrors the style of mllm_fp16_bits: describe its
purpose (constructs an mllm_fp16_t value from a raw uint16_t bit pattern),
document the parameter (bits: the raw 16-bit representation), describe the
return (an mllm_fp16_t whose bytes match bits), and note there are no error
conditions; place the comment immediately above the inline function definition
in quantize.hpp following the existing docstring conventions used in the file.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9fc972a-1134-42cd-91d5-96ad7b1cf3c2

📥 Commits

Reviewing files that changed from the base of the PR and between b9c3070 and 8647bda.

📒 Files selected for processing (1)

mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp

fix(cpu): preserve raw fp16 bits in x86 quantize

8647bda

kkkzbh requested a review from yirongjie as a code owner June 13, 2026 11:31

coderabbitai Bot reviewed Jun 13, 2026

docs(cpu): explain x86 fp16 bit helpers

7011227

kkkzbh wants to merge 2 commits into UbiquitousLearning:main from kkkzbh:codex/fix-x86-fp16-quantize
