Rebase Scout shards to BF16-direct + F64x8 SIMD pipeline#54
Merged
Conversation
run_llama4_shard() now uses stream_index_gguf_bf16() instead of stream_index_gguf(). Changes:

- BF16-direct: no f32 intermediate allocation (saves 283 MB/tensor)
- F64x8 SIMD: 8 rows projected in parallel per zmm register
- Strided octave (stride=16): 97% fewer BF16→f64 conversions
- Halftone drop: 9 of 17 golden positions, odd bins interpolated
- Exact shard sizes: SCOUT_SHARD_SIZES const replaces the 44 GB estimate
- Reusable u16 buffer inside the indexer (no per-tensor allocation)

Both Scout shard tests and the Maverick test now use the same BF16-direct pipeline. The old f32 path remains for non-BF16 formats (IQ1_S, Q8_0, etc.).
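The "no f32 intermediate allocation" point relies on a property of the BF16 encoding: BF16 is the top 16 bits of an IEEE-754 f32, so a left shift reconstructs the exact f32 bit pattern and a widening cast yields f64 without ever materializing an f32 buffer. A minimal standalone sketch (the function name `bf16_to_f64` is illustrative, not the crate's actual API):

```rust
/// Convert one BF16 word straight to f64, with no intermediate
/// f32 buffer. BF16 keeps the f32 sign, exponent, and the top 7
/// mantissa bits, so `(word << 16)` is a valid f32 bit pattern.
fn bf16_to_f64(word: u16) -> f64 {
    f32::from_bits((word as u32) << 16) as f64
}

fn main() {
    // 0x3F80 is the BF16 encoding of 1.0 (f32 1.0 = 0x3F80_0000).
    assert_eq!(bf16_to_f64(0x3F80), 1.0);
    // 0xC000 is the BF16 encoding of -2.0 (f32 -2.0 = 0xC000_0000).
    assert_eq!(bf16_to_f64(0xC000), -2.0);
    assert_eq!(bf16_to_f64(0x0000), 0.0);
    println!("bf16-direct conversion ok");
}
```

The conversion is exact: every BF16 value is representable in both f32 and f64, so skipping the intermediate allocation changes memory traffic, not results.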
Summary
Rebases run_llama4_shard() from the legacy f32 path (stream_index_gguf) to the BF16-direct pipeline (stream_index_gguf_bf16). Both Scout and Maverick now use the same optimized pipeline:

- No Vec<f32> intermediate (saves 283 MB per tensor)
- SCOUT_SHARD_SIZES const replaces the 44 GB estimate

Why golden step matters at this ratio
At 4,735× compression, every Base17 bin feeds the palette clustering.
Golden step (11 mod 17) visits all 17 residues → full-rank palette centroids.
Fibonacci had 4 dead bins → 23.5% of the CAM distance matrix was comparing noise floor to noise floor.
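Both claims are easy to check numerically in a standalone sketch (not project code): stepping by 11 mod 17 covers every residue because gcd(11, 17) = 1, while the Fibonacci residues over one full Pisano period mod 17 (36 terms) never reach 4 of the 17 bins, which is the 4/17 ≈ 23.5% dead fraction quoted above.

```rust
/// Count distinct residues visited by repeatedly adding `step`
/// mod 17, starting from 0, over 17 steps.
fn residues_covered(step: usize) -> usize {
    let mut seen = [false; 17];
    let mut pos = 0usize;
    for _ in 0..17 {
        seen[pos] = true;
        pos = (pos + step) % 17;
    }
    seen.iter().filter(|&&s| s).count()
}

/// Count residues mod 17 that Fibonacci numbers never hit.
/// The Pisano period for 17 is 36, so 36 terms cover every
/// residue the sequence will ever produce.
fn fibonacci_dead_bins() -> usize {
    let mut seen = [false; 17];
    let (mut a, mut b) = (0u64, 1u64);
    for _ in 0..36 {
        seen[(a % 17) as usize] = true;
        let next = (a + b) % 17;
        a = b;
        b = next;
    }
    seen.iter().filter(|&&s| !s).count()
}

fn main() {
    // Golden step: full-rank coverage of all 17 bins.
    assert_eq!(residues_covered(11), 17);
    // Fibonacci: residues 6, 7, 10, and 11 are never produced.
    assert_eq!(fibonacci_dead_bins(), 4);
    println!("golden: 17/17 bins; fibonacci: 4 dead bins");
}
```

Since 17 is prime, any nonzero step is full-rank here; 11 is the choice closest to 17/φ, which also spreads consecutive visits apart.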
What stays on the f32 path
- test_stream_index_llama4_scout_from_hf (IQ1_S format, not BF16)
- test_stream_index_openchat_q8 (Q8_0 format)
- test_stream_index_synthetic_gguf (synthetic F32)

These correctly use the old path because their dtypes need actual dequantization.
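The resulting split can be sketched as a dtype dispatch; the enum and function names below are hypothetical stand-ins, not the repo's actual types. Only BF16 tensors can stream straight into the f64 projection; quantized and plain-F32 tensors keep the legacy dequantizing path.

```rust
// Hypothetical dtype tags; the real indexer's type names may differ.
#[derive(Debug, PartialEq)]
enum GgufDtype {
    Bf16,
    F32,
    Q8_0,
    Iq1S,
}

/// Pick a pipeline label for a tensor dtype: only BF16 skips
/// dequantization, everything else takes the old f32 path.
fn pick_pipeline(dtype: &GgufDtype) -> &'static str {
    match dtype {
        GgufDtype::Bf16 => "bf16-direct",
        // IQ1_S, Q8_0, and plain F32 need real dequantization first.
        _ => "f32-path",
    }
}

fn main() {
    assert_eq!(pick_pipeline(&GgufDtype::Bf16), "bf16-direct");
    assert_eq!(pick_pipeline(&GgufDtype::Iq1S), "f32-path");
    assert_eq!(pick_pipeline(&GgufDtype::Q8_0), "f32-path");
    println!("dispatch ok");
}
```

Keeping the dispatch on dtype rather than on model name means any future BF16 checkpoint gets the fast path automatically.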