fix(cpu): preserve raw fp16 bits in x86 quantize conversion#685
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds type-safe helpers to extract raw 16-bit FP16 bit patterns and to construct FP16 values from raw bits, updates fp16 conversion macros to use these helpers, and adjusts the FP16-to-FP32 lookup to index by the extracted raw bits. ChangesFP16 Bit Manipulation
Estimated Code Review Effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)
109-119: Documentmllm_fp16_bitshelper (C++17+ already satisfied)
mllm_fp16_bitsis a header-local conversion helper used by the fp16 quantization macros; adding a brief comment about supportedT(integral vs 16-bit fp type) and that the non-integral path usesmemcpywould improve readability/maintenance.- No C++17 compatibility concern: the project sets
CMAKE_CXX_STANDARD 20/CXX_STANDARD 20, which coversif constexprandstd::is_integral_v.📝 Suggested documentation
+/** + * Extract the raw 16-bit representation from an fp16-like value or integral. + * Integral inputs are cast to uint16_t; non-integral inputs are bit-copied via + * memcpy after enforcing sizeof(T) == sizeof(uint16_t). + * + * `@param` f fp16-like value or integral to extract bits from + * `@return` Raw 16-bit representation + */ template<typename T> inline static uint16_t mllm_fp16_bits(const T& f) { if constexpr (std::is_integral_v<std::decay_t<T>>) { return static_cast<uint16_t>(f); } else { static_assert(sizeof(T) == sizeof(uint16_t), "fp16 type must be 16 bits"); uint16_t s; memcpy(&s, &f, sizeof(s)); return s; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp` around lines 109 - 119, Add a short header comment above the mllm_fp16_bits<T> helper explaining what T may be (integral types are returned as-is; non-integral types are expected to be a 16-bit floating-point representation), that the non-integral path uses memcpy to copy the 16-bit bit-pattern into a uint16_t, and that the static_assert enforces sizeof(T)==sizeof(uint16_t); no code change needed beyond the comment and an optional note that the project uses C++20 so if constexpr and std::is_integral_v are available.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp`:
- Around line 121-125: Add a short doc comment for the helper
mllm_fp16_from_bits that mirrors the style of mllm_fp16_bits: describe its
purpose (constructs an mllm_fp16_t value from a raw uint16_t bit pattern),
document the parameter (bits: the raw 16-bit representation), describe the
return (an mllm_fp16_t whose bytes match bits), and note there are no error
conditions; place the comment immediately above the inline function definition
in quantize.hpp following the existing docstring conventions used in the file.
---
Nitpick comments:
In `@mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp`:
- Around line 109-119: Add a short header comment above the mllm_fp16_bits<T>
helper explaining what T may be (integral types are returned as-is; non-integral
types are expected to be a 16-bit floating-point representation), that the
non-integral path uses memcpy to copy the 16-bit bit-pattern into a uint16_t,
and that the static_assert enforces sizeof(T)==sizeof(uint16_t); no code change
needed beyond the comment and an optional note that the project uses C++20 so if
constexpr and std::is_integral_v are available.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c9fc972a-1134-42cd-91d5-96ad7b1cf3c2
📒 Files selected for processing (1)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
Summary
This fixes x86 fp16 conversion in the common GGML quantization helpers by preserving raw fp16 bits instead of relying on implicit numeric conversion.
On non-ARM platforms,
mllm_fp16_tishalf_float::half. Passing it directly to helpers that takeuint16_tconverts the fp16 value numerically, not as raw bits. For small Q4_K/Q6_K scales this can turn the lookup index into0, making dequantization produce all-zero output.Root cause
The x86 path currently expands
MLLM_FP16_TO_FP32(x)tolookup_fp16_to_fp32(x), whilelookup_fp16_to_fp32takesuint16_t.When
xishalf_float::half, this does not pass the underlying fp16 bit pattern. For example, fp161.0has raw bits0x3c00, but numeric conversion touint16_tyields1.Changes
mllm_fp16_bits()to read fp16 raw bits explicitly.mllm_fp16_from_bits()to constructmllm_fp16_tfrom raw bits returned by_cvtss_sh.MLLM_COMPUTE_FP16_TO_FP32,MLLM_COMPUTE_FP32_TO_FP16, andMLLM_FP16_TO_FP32through those helpers.Context
This was found while investigating the documented Qwen3-0.6B x86 v1 q4_k path. The Qwen3 model-file naming mismatch is tracked separately in #684; this PR only fixes the common x86 quantization conversion issue.
Validation
git diff --checkcmake --build /tmp/mllm-fp16-pr-build \ --target mllm/backends/cpu/CMakeFiles/MllmCPUBackend.dir/kernels/common/ggml/quantize/quantize_q4.cpp.o \ mllm/backends/cpu/CMakeFiles/MllmCPUBackend.dir/kernels/common/ggml/quantize/quantize_q6.cpp.o \ -j6A full
mllm-qwen3-runnerbuild currently reaches the link step and then fails on upstreammainwith the existing xxHash static-library PIC issue tracked in #683:Summary by CodeRabbit