- 57 HPC modules in src/hpc/ (52K+ lines)
- 2,846 lines of portable SIMD polyfill: src/simd.rs → src/simd_avx512.rs → scalar fallback
- SIMD types: F32x16, F64x8, U8x64, I32x16, U32x16, U64x8 (AVX-512) + f32x8, f64x4 (AVX2)
- Full operator overloading, mul_add (FMA), sqrt, reduce_sum/min/max, simd_clamp
- AVX2 dot product with 4× unrolled accumulators (src/simd_avx2.rs:52)
- Runtime dispatch via is_x86_feature_detected! (65+ sites in bitwise.rs)

| # | Item | Status | Key Evidence |
|---|---|---|---|
| 1 | simd_map() | PARTIAL | SIMD types exist (F32x16 etc.), VML scalar. Missing: generic lane iteration |
| 2 | SpatialArray3<T> | PARTIAL | cam_index.rs CAM + dn_tree.rs spatial tree. Missing: f32 3D coordinates |
| 3 | xor_diff() | DONE | bitwise.rs AVX-512BW/AVX2/scalar XOR + popcount |
| 4 | gather_scatter() | MISSING | Only vpshufb nibble gathers in bitwise.rs |
| 5 | columnar_view() | PARTIAL | arrow_bridge.rs schema + SoakingBuffer. Missing: ArrayView bridge |
| 6 | Zip::simd_apply() | PARTIAL | kernels.rs K0→K1→K2 fusion. Missing: generic over closures |
| 7 | runtime_dispatch() | DONE | 65+ is_x86_feature_detected! sites + scalar fallbacks |
| 8 | stencil() | MISSING | BNN has neighbor patterns but no 3D stencil API |
| 9 | compact_palette() | PARTIAL | palette_distance.rs 256-entry codebook + quantized.rs + vpshufb |
| 10 | prefetch/stream | PARTIAL | packed.rs layout-for-prefetch. No explicit _mm_prefetch |

Exists: F32x16::from_slice(), copy_to_slice(), mul_add(), sqrt(), all operators. AVX2 dot_f32() with 4-accumulator unrolling in simd_avx2.rs:52-90.
Exists: src/hpc/vml.rs has vsexp, vssqrt, vsln, vsabs, vsadd, vsmul, vsdiv — but ALL are scalar loops.
Gap: Need vml.rs to use F32x16/f32x8 types. The types exist, the functions exist, they just aren't connected. Example:
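```rust
// Current vml.rs: a plain scalar loop.
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
    for (o, &v) in out.iter_mut().zip(x.iter()) {
        *o = v.sqrt();
    }
}

// Should be: full 16-lane chunks through F32x16, then a scalar tail.
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
    // Match the zip() semantics above: process only the common length.
    let n = x.len().min(out.len());
    let chunks = n / 16;
    for i in 0..chunks {
        let (lo, hi) = (i * 16, (i + 1) * 16);
        let v = F32x16::from_slice(&x[lo..hi]);
        v.sqrt().copy_to_slice(&mut out[lo..hi]);
    }
    // scalar remainder (fewer than 16 elements)
    for i in chunks * 16..n {
        out[i] = x[i].sqrt();
    }
}
```

Exists: cam_index.rs (multi-probe LSH CAM for 49,152-bit GraphHV binary vectors), dn_tree.rs (hierarchical spatial partitioning, 739 lines), and parallel_search.rs (dual-path HHTL + CLAM tree search).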
Gap: All CAM infrastructure operates on binary hypervectors, not f32 spatial coordinates. Need a SpatialCam3D adapter that uses spatial hashing (floor(x/cell_size)) for the Pumpkin entity bind/unbind pattern.
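
A minimal sketch of what such an adapter could look like, assuming a uniform grid keyed by floor(x/cell_size) per axis. The name SpatialCam3D comes from the gap description; every method name and the HashMap-backed storage are hypothetical and do not reflect existing code in cam_index.rs.

```rust
use std::collections::HashMap;

/// Integer cell coordinates obtained by flooring each axis.
type Cell = (i32, i32, i32);

/// Hypothetical adapter: maps f32 positions to entity ids via a uniform grid.
pub struct SpatialCam3D {
    cell_size: f32,
    cells: HashMap<Cell, Vec<u32>>,
}

impl SpatialCam3D {
    pub fn new(cell_size: f32) -> Self {
        assert!(cell_size > 0.0, "cell_size must be positive");
        SpatialCam3D { cell_size, cells: HashMap::new() }
    }

    /// floor(x / cell_size) per axis. Using floor rather than truncation
    /// keeps negative coordinates in their own cell instead of folding
    /// them into cell 0.
    fn cell_of(&self, p: [f32; 3]) -> Cell {
        (
            (p[0] / self.cell_size).floor() as i32,
            (p[1] / self.cell_size).floor() as i32,
            (p[2] / self.cell_size).floor() as i32,
        )
    }

    /// Registers `id` in the cell containing `p`.
    pub fn bind(&mut self, id: u32, p: [f32; 3]) {
        let cell = self.cell_of(p);
        self.cells.entry(cell).or_default().push(id);
    }

    /// Removes `id` from the cell containing `p`; returns false if absent.
    pub fn unbind(&mut self, id: u32, p: [f32; 3]) -> bool {
        let cell = self.cell_of(p);
        let Some(ids) = self.cells.get_mut(&cell) else { return false };
        let Some(pos) = ids.iter().position(|&e| e == id) else { return false };
        ids.swap_remove(pos);
        if ids.is_empty() {
            self.cells.remove(&cell);
        }
        true
    }

    /// All ids bound in the same cell as `p`.
    pub fn query_cell(&self, p: [f32; 3]) -> &[u32] {
        self.cells.get(&self.cell_of(p)).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

A moving entity would be handled as unbind at the old position followed by bind at the new one; a range query would visit the neighboring cells as well, which this sketch leaves out.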
DONE. bitwise.rs:
- hamming_avx2() (line 62): 32 bytes/iter via vpshufb
- hamming_avx512bw() (line 117): 64 bytes/iter via vpshufb-512
- hamming_avx512_vpopcnt(): native VPOPCNTDQ when available
- Runtime dispatch (line 234): avx512vpopcntdq → avx512bw → avx2 → scalar
- hamming_query_batch(): batch mode for tick N vs N+1 comparison

Only gap: No sparse nonzero_iter() returning positions of changed elements.
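
One way the missing sparse iterator could be sketched, assuming the two snapshots are available as equal-length u64 word slices. The function name changed_bits and the word-based layout are assumptions for illustration; bitwise.rs may well operate on byte slices instead.

```rust
/// Hypothetical helper: yields the bit positions at which `a` and `b` differ,
/// in ascending order. Position = word_index * 64 + bit_index_within_word.
pub fn changed_bits<'a>(a: &'a [u64], b: &'a [u64]) -> impl Iterator<Item = usize> + 'a {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).enumerate().flat_map(|(word_idx, (&x, &y))| {
        // Set bits of `diff` mark the changed positions inside this word.
        let mut diff = x ^ y;
        std::iter::from_fn(move || {
            if diff == 0 {
                return None;
            }
            let bit = diff.trailing_zeros() as usize;
            diff &= diff - 1; // clear the lowest set bit
            Some(word_idx * 64 + bit)
        })
    })
}
```

Unchanged words cost one XOR and one compare, so the iterator stays cheap when few elements change between tick N and N+1.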
MISSING. tekamolo.rs and cam_index.rs use hash-based lookups (conceptually gather) but no VPGATHERDD/VGATHERDPS intrinsics anywhere.
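
A hedged sketch of how a gather could be added, using the standard library's `_mm256_i32gather_ps` intrinsic (VGATHERDPS) behind the same runtime-detection pattern the crate already uses. The function names are hypothetical, and the per-chunk bounds check is a safety choice of this sketch, not something taken from the codebase.

```rust
/// Scalar reference gather: out[i] = table[idx[i]].
pub fn gather_f32_scalar(table: &[f32], idx: &[u32], out: &mut [f32]) {
    for (o, &i) in out.iter_mut().zip(idx) {
        *o = table[i as usize];
    }
}

/// AVX2 path: eight indices per VGATHERDPS. The caller must have verified AVX2
/// and that table.len() fits in i32, because the instruction treats indices
/// as signed 32-bit values.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather_f32_avx2(table: &[f32], idx: &[u32], out: &mut [f32]) {
    use std::arch::x86_64::*;
    let n = idx.len().min(out.len());
    let chunks = n / 8;
    for c in 0..chunks {
        let base = c * 8;
        // The hardware gather performs no bounds checks, so check here.
        for &i in &idx[base..base + 8] {
            assert!((i as usize) < table.len(), "gather index out of bounds");
        }
        unsafe {
            let vidx = _mm256_loadu_si256(idx.as_ptr().add(base) as *const __m256i);
            // SCALE = 4: indices count f32 elements, not bytes.
            let v = _mm256_i32gather_ps::<4>(table.as_ptr(), vidx);
            _mm256_storeu_ps(out.as_mut_ptr().add(base), v);
        }
    }
    gather_f32_scalar(table, &idx[chunks * 8..n], &mut out[chunks * 8..n]);
}

/// Dispatching entry point: AVX2 when detected at runtime, scalar otherwise.
pub fn gather_f32(table: &[f32], idx: &[u32], out: &mut [f32]) {
    assert_eq!(idx.len(), out.len(), "idx and out must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && table.len() <= i32::MAX as usize {
            // SAFETY: AVX2 was detected and the table length fits in i32.
            unsafe { gather_f32_avx2(table, idx, out) };
            return;
        }
    }
    gather_f32_scalar(table, idx, out);
}
```

Whether the hardware gather beats the scalar loop depends on the microarchitecture and access pattern, so it would be worth benchmarking before wiring it into the hash-based lookups.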
Exists: arrow_bridge.rs has schema constants (s_binary, p_binary, o_binary, node_id), GateState lifecycle (Form→Flow→Freeze), SoakingBuffer { data: Vec<i8>, n_entries, n_dims }.
Gap: Missing the one-liner: unsafe { ArrayView1::from_shape_ptr(len, arrow_buf.as_ptr()) }.
Exists: kernels.rs K0→K1→K2 fused cascade (1589 lines). packed.rs stroke-aligned cascade query. Both fuse multiple passes into one traversal.
Gap: Fusion is hardcoded for binary Hamming. Need generic version accepting Fn(F32x16, F32x16) -> F32x16.
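
The generic version might take roughly the following shape. To keep the sketch compilable on its own, Lanes16 is a stand-in for the crate's F32x16 (an assumption; the real type wraps an AVX-512 register), and the name zip_simd_apply and the separate scalar closure for the tail are hypothetical design choices.

```rust
/// Stand-in for the crate's F32x16 so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Lanes16(pub [f32; 16]);

impl Lanes16 {
    /// Loads the first 16 elements of `s`.
    pub fn from_slice(s: &[f32]) -> Self {
        let mut lanes = [0.0f32; 16];
        lanes.copy_from_slice(&s[..16]);
        Lanes16(lanes)
    }

    /// Stores all 16 lanes into the start of `out`.
    pub fn copy_to_slice(self, out: &mut [f32]) {
        out[..16].copy_from_slice(&self.0);
    }
}

/// One fused traversal: `vec_op` handles full 16-lane chunks,
/// `scalar_op` handles the tail of fewer than 16 elements.
pub fn zip_simd_apply<V, S>(a: &[f32], b: &[f32], out: &mut [f32], vec_op: V, scalar_op: S)
where
    V: Fn(Lanes16, Lanes16) -> Lanes16,
    S: Fn(f32, f32) -> f32,
{
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    let split = a.len() / 16 * 16;
    let (a_body, a_tail) = a.split_at(split);
    let (b_body, b_tail) = b.split_at(split);
    let (out_body, out_tail) = out.split_at_mut(split);

    let chunks = out_body
        .chunks_exact_mut(16)
        .zip(a_body.chunks_exact(16))
        .zip(b_body.chunks_exact(16));
    for ((o, x), y) in chunks {
        vec_op(Lanes16::from_slice(x), Lanes16::from_slice(y)).copy_to_slice(o);
    }
    for ((o, &x), &y) in out_tail.iter_mut().zip(a_tail).zip(b_tail) {
        *o = scalar_op(x, y);
    }
}
```

Because the closures are generic parameters rather than trait objects, each call site is monomorphized, which gives the compiler the chance to inline the lane operation into the loop.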
DONE. Two complementary systems:
- bitwise.rs: 65+ is_x86_feature_detected! sites with 4-tier fallback
- simd.rs polyfill: compile-time dispatch via #[cfg(target_arch)] with scalar fallback types

MISSING. bnn_causal_trajectory.rs, deepnsm.rs, clam_search.rs have neighbor traversal patterns but nothing 3D-stencil-specific.
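
For orientation, a 3D stencil API could start from something as small as the following scalar sketch. Grid3, stencil6, and the x-fastest memory layout are all assumptions made for illustration; nothing here mirrors existing code in the neighbor-traversal modules.

```rust
/// Hypothetical dense 3D grid, x-fastest: index = x + nx * (y + ny * z).
pub struct Grid3 {
    pub nx: usize,
    pub ny: usize,
    pub nz: usize,
    pub data: Vec<f32>,
}

impl Grid3 {
    pub fn filled(nx: usize, ny: usize, nz: usize, value: f32) -> Self {
        Grid3 { nx, ny, nz, data: vec![value; nx * ny * nz] }
    }

    pub fn index(&self, x: usize, y: usize, z: usize) -> usize {
        x + self.nx * (y + self.ny * z)
    }

    /// Applies `f(center, [x-1, x+1, y-1, y+1, z-1, z+1])` to every interior
    /// cell and returns a new grid. Boundary cells are copied unchanged.
    pub fn stencil6<F>(&self, f: F) -> Grid3
    where
        F: Fn(f32, [f32; 6]) -> f32,
    {
        // Strides to the neighbor along each axis in the flat array.
        let (sx, sy, sz) = (1, self.nx, self.nx * self.ny);
        let d = &self.data;
        let mut out = self.data.clone();
        for z in 1..self.nz.saturating_sub(1) {
            for y in 1..self.ny.saturating_sub(1) {
                for x in 1..self.nx.saturating_sub(1) {
                    let i = self.index(x, y, z);
                    let neighbors =
                        [d[i - sx], d[i + sx], d[i - sy], d[i + sy], d[i - sz], d[i + sz]];
                    out[i] = f(d[i], neighbors);
                }
            }
        }
        Grid3 { nx: self.nx, ny: self.ny, nz: self.nz, data: out }
    }
}
```

The innermost x loop reads contiguous memory for the center and the two x neighbors, so it is the natural place to substitute F32x16 loads later.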
PARTIAL. Three relevant modules:
- palette_distance.rs: 256-entry Palette codebook with precomputed pairwise L1 distance matrix
- quantized.rs: f32→u8 quantization with scale/zero-point
- bitwise.rs: vpshufb nibble lookup (4-bit table, proven in SIMD)

Gap: No variable-width (4-15 bit) pack/unpack for Minecraft block state encoding.
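
A scalar sketch of variable-width packing, assuming palette indices are stored in u64 words and that an entry never straddles a word boundary. That layout is an assumption of this example and should be checked against the block-state format actually targeted; the function names are hypothetical.

```rust
/// Packs `values` at `bits` bits each into u64 words, lowest bits first.
/// Entries do not straddle words: each word holds floor(64 / bits) entries
/// and any leftover high bits stay zero.
pub fn pack_indices(values: &[u16], bits: u32) -> Vec<u64> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mask = (1u64 << bits) - 1;
    let n_words = (values.len() + per_word - 1) / per_word;
    let mut words = vec![0u64; n_words];
    for (i, &v) in values.iter().enumerate() {
        assert!(u64::from(v) <= mask, "value does not fit in the given width");
        let shift = (i % per_word) as u32 * bits;
        words[i / per_word] |= u64::from(v) << shift;
    }
    words
}

/// Inverse of `pack_indices`; `len` is the number of entries to read back.
pub fn unpack_indices(words: &[u64], bits: u32, len: usize) -> Vec<u16> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mask = (1u64 << bits) - 1;
    (0..len)
        .map(|i| {
            let shift = (i % per_word) as u32 * bits;
            ((words[i / per_word] >> shift) & mask) as u16
        })
        .collect()
}
```

For bits = 4 the unpack step reduces to a nibble extraction, which is where the existing vpshufb lookup in bitwise.rs could plausibly be reused.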
PARTIAL. packed.rs uses stroke-aligned layout for hardware prefetcher ("the prefetcher handles sequential access"). No explicit _mm_prefetch or _mm_stream_ps.
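
If explicit prefetching turns out to be worthwhile, it could be introduced along these lines. This sketch uses the standard library's `_mm_prefetch` with the `_MM_HINT_T0` hint on x86_64; the function name, the prefetch distance, and the once-per-cache-line cadence are illustrative assumptions that would need tuning against real workloads.

```rust
/// Sums `data` while hinting the cache about elements a fixed distance ahead.
/// On non-x86_64 targets the loop compiles to a plain sum.
pub fn sum_with_prefetch(data: &[f32]) -> f32 {
    // Prefetch distance in elements; a tuning parameter, not a measured value.
    const AHEAD: usize = 64;
    let mut acc = 0.0f32;
    for (i, &v) in data.iter().enumerate() {
        #[cfg(target_arch = "x86_64")]
        {
            // One hint per 64-byte cache line (16 f32 values).
            if i % 16 == 0 && i + AHEAD < data.len() {
                use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                // SAFETY: the pointer stays inside `data`; prefetch is only a
                // hint and does not dereference the address architecturally.
                unsafe {
                    _mm_prefetch::<_MM_HINT_T0>(data.as_ptr().add(i + AHEAD) as *const i8);
                }
            }
        }
        acc += v;
    }
    acc
}
```

For purely sequential scans like this one the hardware prefetcher usually suffices, which matches the comment quoted from packed.rs; explicit hints tend to pay off for strided or indexed access. Non-temporal stores via `_mm_stream_ps` additionally require 16-byte-aligned destinations.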

| New Module | Lines | Wishlist Impact |
|---|---|---|
| src/simd.rs | 829 | #1 #6 #7: portable SIMD types with scalar fallback |
| src/simd_avx512.rs | 1399 | #1: F32x16/F64x8/U8x64 with FMA, sqrt, reduce |
| src/simd_avx2.rs | 618 | #1: f32x8/f64x4 dot product, GEMM tile sizes |
| hpc/holo.rs | new | Phase + focus + carrier (94 tests) |
| hpc/zeck.rs | new | Zeckendorf encoding + batch/top_k |
| hpc/palette_distance.rs | new | #9: 256-entry palette with O(1) distance |
| hpc/parallel_search.rs | new | #2: dual-path HHTL + CLAM search |
| hpc/layered_distance.rs | new | O(1) distance via palette index + precomputed matrix |
| hpc/bgz17_bridge.rs | new | Base17 bridge for palette interop |

Audit generated 2026-03-23. AdaWorldAPI/ndarray master @ 11633d06.