Claude/index llama shards a2 qzr #52
Merged
Conversation
Processes 801.47 GB of Maverick 17B-128E (18 shards × ~43–48 GB each) sequentially with 256 MB HTTP range chunks, tail-deleting output files to stay within a 26 GB disk budget. Keeps the last 3 outputs, drops writer handles before cleanup, and accumulates per-type compression stats across the full model. https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
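The keep-last-3 tail-deletion scheme can be sketched as below. This is a minimal illustration, not the actual code: `write_shard_output`, `KEEP_LAST`, and the path handling are hypothetical stand-ins; the key points from the description are that the writer handle is dropped before any cleanup and that only the most recent outputs survive on disk.

```rust
use std::collections::VecDeque;
use std::fs;
use std::io::Write;

// Illustrative constant: keep at most the 3 most recent output files.
const KEEP_LAST: usize = 3;

fn write_shard_output(
    recent: &mut VecDeque<String>,
    path: &str,
    data: &[u8],
) -> std::io::Result<()> {
    {
        // Scope the writer so its handle is dropped before cleanup runs.
        let mut f = fs::File::create(path)?;
        f.write_all(data)?;
    } // writer handle dropped here

    recent.push_back(path.to_string());
    // Tail-delete: remove the oldest outputs beyond the retention window.
    while recent.len() > KEEP_LAST {
        if let Some(old) = recent.pop_front() {
            let _ = fs::remove_file(&old); // best-effort delete
        }
    }
    Ok(())
}
```

Dropping the handle before deletion matters on platforms where an open file cannot be unlinked; the scoped block makes that ordering explicit.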
Maverick shard 1 OOM'd allocating 20 GB for a single tensor — the original stream_index_gguf loads entire tensors as f32, which fails on Maverick's massive embedding/expert tensors (5B+ elements). The new stream_index_gguf_large function switches to row-by-row streaming for tensors exceeding 512M f32 elements (2 GB). Each row is read, dequantized, projected to Base17, and discarded — peak RAM per large tensor drops from 20+ GB to ~40 KB (one row). Small tensors still use the original bulk-load path. Also makes gguf::f16_to_f32 public for the F16 row reader. https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
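The size-based dispatch described above can be sketched roughly as follows. This is a hedged outline, not the real function: `read_row` and `project_row` are hypothetical callbacks standing in for the actual GGUF row readers and the Base17 projection, and only the element-count threshold (512M f32 elements) is taken from the description.

```rust
// Threshold from the description: tensors over 512M f32 elements (~2 GB)
// take the streaming path.
const LARGE_TENSOR_ELEMS: usize = 512 * 1024 * 1024;

fn index_tensor(
    n_rows: usize,
    n_cols: usize,
    mut read_row: impl FnMut(usize, &mut [f32]),
    mut project_row: impl FnMut(&[f32]),
) {
    if n_rows * n_cols > LARGE_TENSOR_ELEMS {
        // Large path: one reusable row buffer (~n_cols * 4 bytes peak RAM).
        let mut row = vec![0f32; n_cols];
        for r in 0..n_rows {
            read_row(r, &mut row);
            project_row(&row); // row is then discarded (buffer reused)
        }
    } else {
        // Small path: original bulk load of the whole tensor, then project.
        let mut all = vec![0f32; n_rows * n_cols];
        for r in 0..n_rows {
            read_row(r, &mut all[r * n_cols..(r + 1) * n_cols]);
        }
        for r in 0..n_rows {
            project_row(&all[r * n_cols..(r + 1) * n_cols]);
        }
    }
}
```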
- Chunked sequential reads: single seek per tensor, then 128 MB bulk reads. No per-row HTTP seeks. ~42 reads per 20 GB tensor vs millions.
- SIMD projection via crate::simd::F64x8 (AVX-512 f64 lanes)
- f64::consts::GOLDEN_RATIO for PHI-weighted octave decay
- f64::consts::EULER_GAMMA as harmonic noise floor threshold
- BF16/F16/F32 bulk dequant helpers for chunk processing

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
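The single-seek, bulk-chunk read pattern from the first bullet can be sketched as below. The function name and signature are illustrative; only the shape of the access pattern (one seek to the tensor's byte offset, then fixed-size sequential reads with no per-row seeks) is taken from the description.

```rust
use std::io::{Read, Seek, SeekFrom};

// 128 MB bulk-read size, per the description.
const CHUNK: usize = 128 * 1024 * 1024;

fn read_tensor_chunked<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    len: u64,
    mut on_chunk: impl FnMut(&[u8]),
) -> std::io::Result<()> {
    reader.seek(SeekFrom::Start(offset))?; // single seek per tensor
    let mut buf = vec![0u8; CHUNK.min(len as usize)]; // reused chunk buffer
    let mut remaining = len as usize;
    while remaining > 0 {
        let n = remaining.min(buf.len());
        reader.read_exact(&mut buf[..n])?; // sequential bulk read
        on_chunk(&buf[..n]); // dequant/project happens per chunk
        remaining -= n;
    }
    Ok(())
}
```

Over HTTP the same shape maps to one range request per chunk, which is what keeps the request count per tensor in the tens rather than the millions.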
round(17 / φ) = 11 is now computed as a const expression instead of a magic number. Verified identical value on 1.94.0. https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
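One way to write that const expression on stable Rust is sketched below; the names are illustrative. Since `.round()` is not usable in const context on stable, the usual add-0.5-and-truncate trick stands in for round() (valid here because the value is positive), and φ is written out inline.

```rust
// Golden ratio, written out inline for the const expression.
const PHI: f64 = 1.618_033_988_749_895;

// round(17 / φ): 17 / φ ≈ 10.5066, so adding 0.5 and truncating yields 11.
const HALFTONE_BINS: usize = (17.0 / PHI + 0.5) as usize;
```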
Replace chunked f32 streaming with a BF16-direct path:
- read_tensor_bf16_raw: reusable Vec<u16> buffer, no f32 alloc (141 MB vs 424 MB)
- project_row_bf16_direct: inline BF16→f64, 136 bytes stack
- project_row_bf16_strided: octave stride + halftone drop (97% fewer conversions)
- stream_index_gguf_bf16: combined optimized indexer with octave_stride param
- HALFTONE_POS/HALFTONE_TO_BIN: compile-time position tables
- 3 new unit tests: halftone coverage, bf16_to_f64 accuracy, strided vs full

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
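The inline BF16→f64 conversion underpinning this path can be sketched as follows. This relies on a standard property of the format (not on anything project-specific): BF16 is the top 16 bits of an IEEE-754 f32, so shifting left by 16 and reinterpreting as f32 recovers the value exactly, and widening f32→f64 is lossless.

```rust
/// Convert a raw BF16 bit pattern to f64 without an intermediate
/// f32 allocation: pad the 16-bit pattern back to 32 bits, then widen.
#[inline]
fn bf16_to_f64(bits: u16) -> f64 {
    f32::from_bits((bits as u32) << 16) as f64
}
```

This is why a reusable `Vec<u16>` buffer suffices: each element is converted on the fly at projection time instead of materializing a full f32 copy of the tensor.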
a048c07 to 8d2d372 (Compare)
project_tensor_bf16_simd: processes 8 rows per SIMD batch using F64x8 accumulators (9 halftone bins × 8 lanes = 9 zmm registers). Per octave: 9 gather+vaddpd ops. For a 5120-column row at stride=16: 19 octaves × 9 = 171 vaddpd per 8-row batch (vs 2.35M scalar ops). Integrated into the stream_index_gguf_bf16 BF16 fast path; a scalar tail handles remainder rows (<8). https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Replace naive SIMD with structured projection:
- gather_bf16_x8: explicit 8-lane gather from row offsets
- project_8rows_bf16_simd: 17 F64x8 accumulators (1088 bytes stack), halftone odd-bin interpolation in SIMD (normalize→average), mul_add finalization with simd_clamp
- project_1row_bf16_strided: scalar fallback matching SIMD algorithm
- project_tensor_bf16_simd: dispatches to 8-row batches + scalar tail
- 3 new tests: constant agreement, scalar parity, tail handling

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
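The 8-rows-per-batch accumulation shape can be sketched in portable form as below. The plain `[f64; 8]` stands in for the crate's AVX-512 `F64x8` wrapper, and the bin-selection rule here is illustrative rather than the actual halftone mapping; only the layout (one 8-lane accumulator per bin, lane i carrying row i of the batch) reflects the description.

```rust
// 17 accumulators of 8 f64 lanes each: 17 * 64 = 1088 bytes of stack,
// matching the figure in the description.
const BINS: usize = 17;

fn project_8rows(rows: &[&[f64]; 8], stride: usize) -> [[f64; 8]; BINS] {
    let mut acc = [[0.0f64; 8]; BINS];
    let cols = rows[0].len();
    // Walk columns at the octave stride; each step updates one bin
    // across all 8 lanes (the gather + vaddpd analogue).
    for (i, col) in (0..cols).step_by(stride).enumerate() {
        let bin = i % BINS; // illustrative bin mapping
        for lane in 0..8 {
            acc[bin][lane] += rows[lane][col];
        }
    }
    acc
}
```

A scalar fallback over single rows (the `project_1row_bf16_strided` role) would run the same loop with one lane, which is what makes scalar-parity tests against the batched path straightforward.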