
Claude/index llama shards a2 qzr #52

Merged
AdaWorldAPI merged 8 commits into master from claude/index-llama-shards-A2Qzr on Mar 30, 2026

Conversation

@AdaWorldAPI
Owner

No description provided.

claude added 6 commits March 30, 2026 07:10
Processes 801.47 GB (18 shards × ~43-48 GB each) of Maverick 17B-128E
sequentially with 256 MB HTTP range chunks, tail-deleting output files
to stay within a 26 GB disk budget. Keeps the last 3 outputs, drops
writer handles before cleanup, and accumulates per-type compression
stats across the full model.
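
A minimal sketch of the tail-delete bookkeeping described above, assuming a keep-last-N window; the type and method names are hypothetical, not the repo's API:

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

/// Keep at most `keep` finished output files on disk, deleting the
/// oldest once the window overflows. Writer handles must already be
/// dropped so the OS actually frees the space on delete.
struct TailDeleter {
    keep: usize,
    finished: VecDeque<PathBuf>,
}

impl TailDeleter {
    fn new(keep: usize) -> Self {
        Self { keep, finished: VecDeque::new() }
    }

    /// Register a finished shard output; evict beyond the budget window.
    fn push(&mut self, path: PathBuf) -> std::io::Result<()> {
        self.finished.push_back(path);
        while self.finished.len() > self.keep {
            let oldest = self.finished.pop_front().unwrap();
            fs::remove_file(oldest)?; // tail-delete to stay in budget
        }
        Ok(())
    }
}
```

With `keep = 3` this matches the keeps-last-3-outputs behavior described above.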

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Maverick shard 1 OOM'd while allocating 20 GB for a single tensor: the
original stream_index_gguf loads entire tensors as f32, which fails on
Maverick's massive embedding/expert tensors (5B+ elements).

New stream_index_gguf_large function switches to row-by-row streaming
for tensors exceeding 512M f32 elements (2 GB). Each row is read,
dequanted, projected to Base17, and discarded — peak RAM per large
tensor drops from 20+ GB to ~40 KB (one row). Small tensors still
use the original bulk-load path.

Also makes gguf::f16_to_f32 public for the F16 row reader.
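
A hedged sketch of the size-based dispatch; the 512M-element threshold comes from the message above, while the function names and signatures are illustrative, not the repo's actual API:

```rust
/// Tensors above this element count take the row-streaming path.
const LARGE_TENSOR_ELEMS: u64 = 512 * 1024 * 1024; // 512M f32 elements = 2 GB

fn index_tensor(rows: u64, cols: u64, read_row: impl Fn(u64) -> Vec<f32>) {
    if rows * cols > LARGE_TENSOR_ELEMS {
        // Large tensor: read, dequantize, project, and drop one row at a
        // time; peak RAM is one dequantized row (~40 KB per the message
        // above) instead of the whole tensor.
        for r in 0..rows {
            let row = read_row(r);
            project_row(&row);
        } // each `row` is freed here before the next read
    } else {
        // Small tensor: the original bulk-load path.
        let all: Vec<f32> = (0..rows).flat_map(|r| read_row(r)).collect();
        project_row(&all);
    }
}

fn project_row(_vals: &[f32]) { /* Base17 projection elided */ }
```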

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
- Chunked sequential reads: single seek per tensor, then 128 MB bulk
  reads. No per-row HTTP seeks. ~42 reads per 20 GB tensor vs millions.
- SIMD projection via crate::simd::F64x8 (AVX-512 f64 lanes)
- f64::consts::GOLDEN_RATIO for PHI-weighted octave decay
- f64::consts::EULER_GAMMA as harmonic noise floor threshold
- BF16/F16/F32 bulk dequant helpers for chunk processing (BF16 variant sketched below)
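
A minimal sketch of what the BF16 bulk dequant helper could look like; the name is taken from the list above, the signature is a guess. BF16 is the high 16 bits of an f32, so widening is a single bit shift:

```rust
/// Dequantize one raw little-endian BF16 chunk (e.g. one 128 MB bulk
/// read) into f32s, reusing the output buffer's capacity across chunks.
fn dequant_bf16_chunk(raw: &[u8], out: &mut Vec<f32>) {
    out.clear();
    out.reserve(raw.len() / 2);
    for pair in raw.chunks_exact(2) {
        let bits = u16::from_le_bytes([pair[0], pair[1]]);
        out.push(f32::from_bits((bits as u32) << 16));
    }
}
```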

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
round(17 / φ) = 11 is now computed as a const expression instead of
a magic number. Verified identical value on Rust 1.94.0.

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Replace chunked f32 streaming with BF16-direct path:
- read_tensor_bf16_raw: reusable Vec<u16> buffer, no f32 alloc (141 MB vs 424 MB; sketched below)
- project_row_bf16_direct: inline BF16→f64, 136 bytes stack
- project_row_bf16_strided: octave stride + halftone drop (97% fewer conversions)
- stream_index_gguf_bf16: combined optimized indexer with octave_stride param
- HALFTONE_POS/HALFTONE_TO_BIN: compile-time position tables
- 3 new unit tests: halftone coverage, bf16_to_f64 accuracy, strided vs full
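
A sketch of the reusable-buffer read named above, assuming an offset-addressable source; the signature is hypothetical. The raw BF16 payload lands in a caller-owned Vec<u16> whose capacity is reused across tensors, so no f32 buffer is ever allocated:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

fn read_tensor_bf16_raw<R: Read + Seek>(
    src: &mut R,
    offset: u64,        // tensor's byte offset in the shard
    n_elems: usize,     // number of BF16 values
    buf: &mut Vec<u16>, // reused across calls; grows once, then stable
) -> Result<()> {
    buf.clear();
    buf.reserve(n_elems);
    src.seek(SeekFrom::Start(offset))?;
    let mut scratch = [0u8; 8192]; // fixed stack scratch, no heap alloc
    let mut remaining = n_elems * 2;
    while remaining > 0 {
        let n = scratch.len().min(remaining);
        src.read_exact(&mut scratch[..n])?;
        for pair in scratch[..n].chunks_exact(2) {
            buf.push(u16::from_le_bytes([pair[0], pair[1]]));
        }
        remaining -= n;
    }
    Ok(())
}
```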

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
AdaWorldAPI force-pushed the claude/index-llama-shards-A2Qzr branch from a048c07 to 8d2d372 on March 30, 2026 07:16.
claude added 2 commits March 30, 2026 07:18
project_tensor_bf16_simd: processes 8 rows per SIMD batch using
F64x8 accumulators (9 halftone bins × 8 lanes = 9 zmm registers).
Per octave: 9 gather+vaddpd ops. For a 5120-col tensor at stride=16:
19 octaves × 9 = 171 vaddpd per 8-row batch (vs 2.35M scalar ops).

Integrated into the stream_index_gguf_bf16 BF16 fast path.
A scalar tail handles remainder rows (<8).
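
A structural sketch of the 8-row batch, with a plain [f64; 8] standing in for crate::simd::F64x8 (one zmm register) and the inner lane loop standing in for one gather + one vaddpd; the bin/octave indexing here is illustrative, not the repo's actual sampling scheme:

```rust
const BINS: usize = 9; // halftone bins, per the message above

fn accumulate_batch(rows: [&[u16]; 8], octaves: usize, stride: usize) -> [[f64; 8]; BINS] {
    let mut acc = [[0.0f64; 8]; BINS]; // 9 F64x8 accumulators
    for o in 0..octaves {
        // Per octave: BINS gather+add steps, matching "9 gather+vaddpd ops".
        for bin in 0..BINS {
            for lane in 0..8 {
                let idx = (o * stride + bin) % rows[lane].len(); // illustrative
                acc[bin][lane] += f32::from_bits((rows[lane][idx] as u32) << 16) as f64;
            }
        }
    }
    acc
}
```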

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
Replace naive SIMD with structured projection:
- gather_bf16_x8: explicit 8-lane gather from row offsets (sketched below)
- project_8rows_bf16_simd: 17 F64x8 accumulators (1088 bytes stack),
  halftone odd-bin interpolation in SIMD (normalize→average),
  mul_add finalization with simd_clamp
- project_1row_bf16_strided: scalar fallback matching SIMD algorithm
- project_tensor_bf16_simd: dispatches to 8-row batches + scalar tail
- 3 new tests: constant agreement, scalar parity, tail handling
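
A scalar model of gather_bf16_x8; the signature is guessed from the description, and the real version would land the lanes in an F64x8 rather than a plain array:

```rust
/// Read the same column from eight BF16 rows and widen each lane to f64.
fn gather_bf16_x8(data: &[u16], row_starts: [usize; 8], col: usize) -> [f64; 8] {
    let mut lanes = [0.0f64; 8];
    for (l, &start) in row_starts.iter().enumerate() {
        lanes[l] = f32::from_bits((data[start + col] as u32) << 16) as f64;
    }
    lanes
}
```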

https://claude.ai/code/session_01HmdXNPit7QsTCfhJFef3Ee
AdaWorldAPI merged commit a993794 into master on Mar 30, 2026.
5 of 14 checks passed.
AdaWorldAPI deleted the claude/index-llama-shards-A2Qzr branch on March 30, 2026 07:33.