
Claude/index llama shards a2 qzr #52

Merged
AdaWorldAPI merged 8 commits into master from claude/index-llama-shards-A2Qzr on Mar 30, 2026

Conversation

@AdaWorldAPI
Owner

No description provided.

claude added 6 commits March 30, 2026 07:10
Processes 801.47 GB (18 shards × ~43-48 GB each) of Maverick 17B-128E
sequentially with 256 MB HTTP range chunks, tail-deleting output files
to stay within a 26 GB disk budget. Keeps the last 3 outputs, drops
writer handles before cleanup, and accumulates per-type compression
stats across the full model.
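
A minimal sketch of the tail-delete bookkeeping described above, assuming a keep-last-N window; the type and method names are hypothetical, not the repo's API:

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

/// Keep at most `keep` finished output files on disk, deleting the
/// oldest once the window overflows. Writer handles must already be
/// dropped so the OS actually frees the space on delete.
struct TailDeleter {
    keep: usize,
    finished: VecDeque<PathBuf>,
}

impl TailDeleter {
    fn new(keep: usize) -> Self {
        Self { keep, finished: VecDeque::new() }
    }

    /// Register a finished shard output; evict beyond the budget window.
    fn push(&mut self, path: PathBuf) -> std::io::Result<()> {
        self.finished.push_back(path);
        while self.finished.len() > self.keep {
            let oldest = self.finished.pop_front().unwrap();
            fs::remove_file(oldest)?; // tail-delete to stay in budget
        }
        Ok(())
    }
}
```

With `keep = 3` this matches the keeps-last-3-outputs behavior described above.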

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Maverick shard 1 OOM'd while allocating 20 GB for a single tensor: the
original stream_index_gguf loads entire tensors as f32, which fails on
Maverick's massive embedding/expert tensors (5B+ elements).

New stream_index_gguf_large function switches to row-by-row streaming
for tensors exceeding 512M f32 elements (2 GB). Each row is read,
dequanted, projected to Base17, and discarded — peak RAM per large
tensor drops from 20+ GB to ~40 KB (one row). Small tensors still
use the original bulk-load path.

Also makes gguf::f16_to_f32 public for the F16 row reader.
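
A hedged sketch of the size-based dispatch; the 512M-element threshold comes from the message above, while the function names and signatures are illustrative, not the repo's actual API:

```rust
/// Tensors above this element count take the row-streaming path.
const LARGE_TENSOR_ELEMS: u64 = 512 * 1024 * 1024; // 512M f32 elements = 2 GB

fn index_tensor(rows: u64, cols: u64, read_row: impl Fn(u64) -> Vec<f32>) {
    if rows * cols > LARGE_TENSOR_ELEMS {
        // Large tensor: read, dequantize, project, and drop one row at a
        // time; peak RAM is one dequantized row (~40 KB per the message
        // above) instead of the whole tensor.
        for r in 0..rows {
            let row = read_row(r);
            project_row(&row);
        } // each `row` is freed here before the next read
    } else {
        // Small tensor: the original bulk-load path.
        let all: Vec<f32> = (0..rows).flat_map(|r| read_row(r)).collect();
        project_row(&all);
    }
}

fn project_row(_vals: &[f32]) { /* Base17 projection elided */ }
```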

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
- Chunked sequential reads: single seek per tensor, then 128 MB bulk
  reads. No per-row HTTP seeks. ~42 reads per 20 GB tensor vs millions.
- SIMD projection via crate::simd::F64x8 (AVX-512 f64 lanes)
- f64::consts::GOLDEN_RATIO for PHI-weighted octave decay
- f64::consts::EULER_GAMMA as harmonic noise floor threshold
- BF16/F16/F32 bulk dequant helpers for chunk processing (BF16 variant sketched below)
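
A minimal sketch of what the BF16 bulk dequant helper could look like; the name is taken from the list above, the signature is a guess. BF16 is the high 16 bits of an f32, so widening is a single bit shift:

```rust
/// Dequantize one raw little-endian BF16 chunk (e.g. one 128 MB bulk
/// read) into f32s, reusing the output buffer's capacity across chunks.
fn dequant_bf16_chunk(raw: &[u8], out: &mut Vec<f32>) {
    out.clear();
    out.reserve(raw.len() / 2);
    for pair in raw.chunks_exact(2) {
        let bits = u16::from_le_bytes([pair[0], pair[1]]);
        out.push(f32::from_bits((bits as u32) << 16));
    }
}
```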

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
round(17 / φ) = 11 is now computed as a const expression instead of
a magic number. Verified identical value on Rust 1.94.0.

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Replace chunked f32 streaming with BF16-direct path:
- read_tensor_bf16_raw: reusable Vec<u16> buffer, no f32 alloc (141 MB vs 424 MB; sketched below)
- project_row_bf16_direct: inline BF16→f64, 136 bytes stack
- project_row_bf16_strided: octave stride + halftone drop (97% fewer conversions)
- stream_index_gguf_bf16: combined optimized indexer with octave_stride param
- HALFTONE_POS/HALFTONE_TO_BIN: compile-time position tables
- 3 new unit tests: halftone coverage, bf16_to_f64 accuracy, strided vs full
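
A sketch of the reusable-buffer read named above, assuming an offset-addressable source; the signature is hypothetical. The raw BF16 payload lands in a caller-owned Vec<u16> whose capacity is reused across tensors, so no f32 buffer is ever allocated:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

fn read_tensor_bf16_raw<R: Read + Seek>(
    src: &mut R,
    offset: u64,        // tensor's byte offset in the shard
    n_elems: usize,     // number of BF16 values
    buf: &mut Vec<u16>, // reused across calls; grows once, then stable
) -> Result<()> {
    buf.clear();
    buf.reserve(n_elems);
    src.seek(SeekFrom::Start(offset))?;
    let mut scratch = [0u8; 8192]; // fixed stack scratch, no heap alloc
    let mut remaining = n_elems * 2;
    while remaining > 0 {
        let n = scratch.len().min(remaining);
        src.read_exact(&mut scratch[..n])?;
        for pair in scratch[..n].chunks_exact(2) {
            buf.push(u16::from_le_bytes([pair[0], pair[1]]));
        }
        remaining -= n;
    }
    Ok(())
}
```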

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
AdaWorldAPI force-pushed the claude/index-llama-shards-A2Qzr branch from a048c07 to 8d2d372 on March 30, 2026 07:16.
claude added 2 commits March 30, 2026 07:18
project_tensor_bf16_simd: processes 8 rows per SIMD batch using
F64x8 accumulators (9 halftone bins × 8 lanes = 9 zmm registers).
Per octave: 9 gather+vaddpd ops. For a 5120-col tensor at stride=16:
19 octaves × 9 = 171 vaddpd per 8-row batch (vs 2.35M scalar ops).

Integrated into the stream_index_gguf_bf16 BF16 fast path.
A scalar tail handles remainder rows (<8).
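
A structural sketch of the 8-row batch, with a plain [f64; 8] standing in for crate::simd::F64x8 (one zmm register) and the inner lane loop standing in for one gather + one vaddpd; the bin/octave indexing here is illustrative, not the repo's actual sampling scheme:

```rust
const BINS: usize = 9; // halftone bins, per the message above

fn accumulate_batch(rows: [&[u16]; 8], octaves: usize, stride: usize) -> [[f64; 8]; BINS] {
    let mut acc = [[0.0f64; 8]; BINS]; // 9 F64x8 accumulators
    for o in 0..octaves {
        // Per octave: BINS gather+add steps, matching "9 gather+vaddpd ops".
        for bin in 0..BINS {
            for lane in 0..8 {
                let idx = (o * stride + bin) % rows[lane].len(); // illustrative
                acc[bin][lane] += f32::from_bits((rows[lane][idx] as u32) << 16) as f64;
            }
        }
    }
    acc
}
```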

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Replace naive SIMD with structured projection:
- gather_bf16_x8: explicit 8-lane gather from row offsets (sketched below)
- project_8rows_bf16_simd: 17 F64x8 accumulators (1088 bytes stack),
  halftone odd-bin interpolation in SIMD (normalize→average),
  mul_add finalization with simd_clamp
- project_1row_bf16_strided: scalar fallback matching SIMD algorithm
- project_tensor_bf16_simd: dispatches to 8-row batches + scalar tail
- 3 new tests: constant agreement, scalar parity, tail handling
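
A scalar model of gather_bf16_x8; the signature is guessed from the description, and the real version would land the lanes in an F64x8 rather than a plain array:

```rust
/// Read the same column from eight BF16 rows and widen each lane to f64.
fn gather_bf16_x8(data: &[u16], row_starts: [usize; 8], col: usize) -> [f64; 8] {
    let mut lanes = [0.0f64; 8];
    for (l, &start) in row_starts.iter().enumerate() {
        lanes[l] = f32::from_bits((data[start + col] as u32) << 16) as f64;
    }
    lanes
}
```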

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
AdaWorldAPI merged commit a993794 into master on Mar 30, 2026.
5 of 14 checks passed.
AdaWorldAPI deleted the claude/index-llama-shards-A2Qzr branch on March 30, 2026 07:33.