diff --git a/SIMD_COMPARISON.md b/SIMD_COMPARISON.md
new file mode 100644
index 00000000..99f72e97
--- /dev/null
+++ b/SIMD_COMPARISON.md
@@ -0,0 +1,130 @@
+# SIMD Wishlist Audit: AdaWorldAPI/ndarray (March 2026)
+
+## Codebase Snapshot
+
+- **57 HPC modules** in `src/hpc/` (52K+ lines)
+- **2,846 lines** of portable SIMD polyfill: `src/simd.rs` → `src/simd_avx512.rs` → scalar fallback
+- SIMD types: `F32x16`, `F64x8`, `U8x64`, `I32x16`, `U32x16`, `U64x8` (AVX-512) + `f32x8`, `f64x4` (AVX2)
+- Full operator overloading, `mul_add` (FMA), `sqrt`, `reduce_sum/min/max`, `simd_clamp`
+- AVX2 dot product with 4× unrolled accumulators (`src/simd_avx2.rs:52`)
+- Runtime dispatch via `is_x86_feature_detected!` (65+ sites in `bitwise.rs`)
+
+---
+
+## Wishlist Scorecard
+
+| # | Item | Status | Key Evidence |
+|---|---|---|---|
+| 1 | `simd_map()` | **PARTIAL** | SIMD types exist (`F32x16` etc.), VML scalar. Missing: generic lane iteration |
+| 2 | `SpatialArray3<T>` | **PARTIAL** | `cam_index.rs` CAM + `dn_tree.rs` spatial tree. Missing: f32 3D coordinates |
+| 3 | `xor_diff()` | **DONE** | `bitwise.rs` AVX-512BW/AVX2/scalar XOR + popcount |
+| 4 | `gather_scatter()` | **MISSING** | Only vpshufb nibble gathers in bitwise.rs |
+| 5 | `columnar_view()` | **PARTIAL** | `arrow_bridge.rs` schema + `SoakingBuffer`. Missing: `ArrayView` bridge |
+| 6 | `Zip::simd_apply()` | **PARTIAL** | `kernels.rs` K0→K1→K2 fusion. Missing: generic over closures |
+| 7 | `runtime_dispatch()` | **DONE** | 65+ `is_x86_feature_detected!` sites + scalar fallbacks |
+| 8 | `stencil()` | **MISSING** | BNN has neighbor patterns but no 3D stencil API |
+| 9 | `compact_palette()` | **PARTIAL** | `palette_distance.rs` 256-entry codebook + `quantized.rs` + vpshufb |
+| 10 | `prefetch/stream` | **PARTIAL** | `packed.rs` layout-for-prefetch. No explicit `_mm_prefetch` |
+
+---
+
+## Per-Item Detail
+
+### 1. `simd_map()` — Lane-Native SIMD Iteration
+
+**Exists:** `F32x16::from_slice()`, `copy_to_slice()`, `mul_add()`, `sqrt()`, all operators. AVX2 `dot_f32()` with 4-accumulator unrolling in `simd_avx2.rs:52-90`.
+
+**Exists:** `src/hpc/vml.rs` has `vsexp`, `vssqrt`, `vsln`, `vsabs`, `vsadd`, `vsmul`, `vsdiv` — but ALL are scalar loops.
+
+**Gap:** Need `vml.rs` to use `F32x16`/`f32x8` types. The types exist, the functions exist, they just aren't connected. Example:
+```rust
+// Current vml.rs:
+pub fn vssqrt(x: &[f32], out: &mut [f32]) {
+    for (o, &v) in out.iter_mut().zip(x.iter()) { *o = v.sqrt(); }
+}
+// Should be:
+pub fn vssqrt(x: &[f32], out: &mut [f32]) {
+    let chunks = x.len() / 16;
+    for i in 0..chunks {
+        let v = F32x16::from_slice(&x[i*16..]);
+        v.sqrt().copy_to_slice(&mut out[i*16..]);
+    }
+    // scalar remainder
+}
+```
+
+### 2. `SpatialArray3<T>` — Content-Addressable Memory
+
+**Exists:** `cam_index.rs` — multi-probe LSH CAM for 49,152-bit `GraphHV` binary vectors. `dn_tree.rs` — hierarchical spatial partitioning (739 lines). `parallel_search.rs` — dual-path HHTL + CLAM tree search.
+
+**Gap:** All CAM infrastructure operates on binary hypervectors, not f32 spatial coordinates. Need a `SpatialCam3D` adapter that uses spatial hashing (floor(x/cell_size)) for the Pumpkin entity bind/unbind pattern.
+
+### 3. `xor_diff()` — SIMD XOR Change Detection
+
+**DONE.** `bitwise.rs`:
+- `hamming_avx2()` (line 62): 32 bytes/iter via vpshufb
+- `hamming_avx512bw()` (line 117): 64 bytes/iter via vpshufb-512
+- `hamming_avx512_vpopcnt()`: native VPOPCNTDQ when available
+- Runtime dispatch (line 234): `avx512vpopcntdq` → `avx512bw` → `avx2` → scalar
+- `hamming_query_batch()`: batch mode for tick N vs N+1 comparison
+
+**Only gap:** No sparse `nonzero_iter()` returning positions of changed elements.
+
+### 4. `gather_scatter()` — Vectorized Gather
+
+**MISSING.** `tekamolo.rs` and `cam_index.rs` use hash-based lookups (conceptually gather) but no `VPGATHERDD`/`VGATHERDPS` intrinsics anywhere.
+
+### 5. `columnar_view()` — Zero-Copy Arrow Interop
+
+**Exists:** `arrow_bridge.rs` has schema constants (`s_binary`, `p_binary`, `o_binary`, `node_id`), `GateState` lifecycle (Form→Flow→Freeze), `SoakingBuffer { data: Vec<i8>, n_entries, n_dims }`.
+
+**Gap:** Missing the one-liner: `unsafe { ArrayView1::from_shape_ptr(len, arrow_buf.as_ptr()) }`.
+
+### 6. `Zip::simd_apply()` — Multi-Array Fused SIMD Kernel
+
+**Exists:** `kernels.rs` K0→K1→K2 fused cascade (1589 lines). `packed.rs` stroke-aligned cascade query. Both fuse multiple passes into one traversal.
+
+**Gap:** Fusion is hardcoded for binary Hamming. Need generic version accepting `Fn(F32x16, F32x16) -> F32x16`.
+
+### 7. `runtime_dispatch()` — CPU Feature Detection
+
+**DONE.** Two complementary systems:
+1. `bitwise.rs`: 65+ `is_x86_feature_detected!` with 4-tier fallback
+2. `simd.rs` polyfill: compile-time dispatch via `#[cfg(target_arch)]` with scalar fallback types
+
+### 8. `stencil()` — 3D Neighbor-Aware SIMD
+
+**MISSING.** `bnn_causal_trajectory.rs`, `deepnsm.rs`, `clam_search.rs` have neighbor traversal patterns but nothing 3D-stencil-specific.
+
+### 9. `compact_palette()` — Bit-Packed SIMD
+
+**PARTIAL.** Three relevant modules:
+- `palette_distance.rs`: 256-entry `Palette` codebook with precomputed pairwise L1 distance matrix
+- `quantized.rs`: f32→u8 quantization with scale/zero-point
+- `bitwise.rs`: vpshufb nibble lookup (4-bit table, proven in SIMD)
+
+**Gap:** No variable-width (4-15 bit) pack/unpack for Minecraft block state encoding.
+
+### 10. `prefetch_region()` + `stream_store()`
+
+**PARTIAL.** `packed.rs` uses stroke-aligned layout for hardware prefetcher ("the prefetcher handles sequential access"). No explicit `_mm_prefetch` or `_mm_stream_ps`.
+
+---
+
+## What Changed Since Last Audit
+
+| New Module | Lines | Wishlist Impact |
+|---|---|---|
+| `src/simd.rs` | 829 | #1 #6 #7 — portable SIMD types with scalar fallback |
+| `src/simd_avx512.rs` | 1399 | #1 — F32x16/F64x8/U8x64 with FMA, sqrt, reduce |
+| `src/simd_avx2.rs` | 618 | #1 — f32x8/f64x4 dot product, GEMM tile sizes |
+| `hpc/holo.rs` | new | Phase + focus + carrier (94 tests) |
+| `hpc/zeck.rs` | new | Zeckendorf encoding + batch/top_k |
+| `hpc/palette_distance.rs` | new | #9 — 256-entry palette with O(1) distance |
+| `hpc/parallel_search.rs` | new | #2 — dual-path HHTL + CLAM search |
+| `hpc/layered_distance.rs` | new | O(1) distance via palette index + precomputed matrix |
+| `hpc/bgz17_bridge.rs` | new | Base17 bridge for palette interop |
+
+---
+
+*Audit generated 2026-03-23. AdaWorldAPI/ndarray master @ 11633d06.*
diff --git a/WISHLIST_PROMPT.md b/WISHLIST_PROMPT.md
new file mode 100644
index 00000000..ac044c08
--- /dev/null
+++ b/WISHLIST_PROMPT.md
@@ -0,0 +1,379 @@
+# Claude Code Session Prompt: ndarray SIMD Wishlist Implementation
+
+Copy everything below the line into a new Claude Code session.
+
+---
+
+You are working on `AdaWorldAPI/ndarray` — a fork with 57 HPC modules (`src/hpc/`, 52K+ lines) and a 2,846-line portable SIMD polyfill (`src/simd.rs` + `src/simd_avx512.rs` + `src/simd_avx2.rs`). Develop on branch `claude/compare-simd-implementations-btTgj`. Push with `git push -u origin claude/compare-simd-implementations-btTgj`.
+
+## What Already Exists (DO NOT REBUILD)
+
+- **SIMD types:** `F32x16` (AVX-512), `f32x8`/`f64x4` (AVX2), scalar fallback on non-x86. Full ops: `+`, `-`, `*`, `/`, `mul_add` (FMA), `sqrt`, `reduce_sum/min/max`, `simd_clamp`, bitwise ops. File: `src/simd_avx512.rs`.
+- **Portable polyfill:** `src/simd.rs` — `#[cfg(target_arch = "x86_64")]` re-exports real SIMD, `#[cfg(not(...))]` provides scalar `[f32; 16]` fallback with identical API. Drop-in `std::simd` replacement.
+- **AVX2 dot:** `src/simd_avx2.rs:52` — 4-accumulator unrolled `dot_f32` using `f32x8`.
+- **Runtime dispatch:** `src/hpc/bitwise.rs` — 65+ `is_x86_feature_detected!` calls. 4-tier: `avx512vpopcntdq` → `avx512bw` → `avx2` → scalar. Full Hamming/popcount/XOR kernels at each tier.
+- **XOR diff:** `src/hpc/bitwise.rs` — `BitwiseOps` trait: `hamming_distance()`, `popcount()`, `hamming_distance_batch()`, `hamming_query_batch()`, `hamming_top_k()`. All SIMD-dispatched.
+- **Fused cascade:** `src/hpc/kernels.rs` (1589 lines) — LIBXSMM-inspired K0 Probe → K1 Stats → K2 Exact pipeline with `SliceGate` integer thresholds.
+- **Packed layout:** `src/hpc/packed.rs` — `PackedDatabase` with stroke-aligned memory for hardware prefetcher.
+- **CAM index:** `src/hpc/cam_index.rs` — Multi-probe LSH for 49,152-bit `GraphHV` vectors.
+- **Arrow bridge:** `src/hpc/arrow_bridge.rs` — Schema constants, `GateState`, `SoakingBuffer`.
+- **Palette distance:** `src/hpc/palette_distance.rs` — 256-entry codebook with precomputed pairwise L1 matrix.
+- **Quantization:** `src/hpc/quantized.rs` — f32↔u8/bf16 with scale/zero-point.
+- **VML (scalar):** `src/hpc/vml.rs` — `vsexp`, `vssqrt`, `vsln`, `vsabs`, `vsadd`, `vsmul`, `vsdiv` — ALL scalar loops.
+
+## 10 Implementation Tasks
+
+### TIER 1: Wire existing SIMD types into existing scalar code
+
+**Task 1: SIMD VML — connect `vml.rs` to `simd.rs` types**
+
+`src/hpc/vml.rs` has 7 scalar functions. Rewrite each to use `F32x16`/`f32x8`:
+
+```rust
+use crate::simd::{f32x16, f32x8};
+
+pub fn vssqrt(x: &[f32], out: &mut [f32]) {
+    let n = x.len().min(out.len());
+    let mut i = 0;
+
+    // AVX-512: 16 elements per iteration
+    while i + 16 <= n {
+        let v = f32x16::from_slice(&x[i..]);
+        v.sqrt().copy_to_slice(&mut out[i..]);
+        i += 16;
+    }
+    // Scalar remainder
+    while i < n {
+        out[i] = x[i].sqrt();
+        i += 1;
+    }
+}
+```
+
+Do the same for `vsexp`/`vdexp` (polynomial approx in SIMD lanes — Cephes or minimax), `vsln`/`vdln`, `vsabs` (bitwise AND with sign mask via `F32x16` bitcast), `vsadd`/`vsmul`/`vsdiv` (trivial: load, op, store).
+
+For `exp` and `ln`: implement the polynomial approximation using `mul_add` chains:
+```rust
+// Fast exp(x) via range reduction + degree-5 polynomial
+// x = n*ln(2) + r, exp(x) = 2^n * exp(r), exp(r) ≈ polynomial
+fn vsexp_simd(x: &[f32], out: &mut [f32]) {
+    let ln2_inv = f32x16::splat(1.4426950408889634);
+    let ln2 = f32x16::splat(0.6931471805599453);
+    // ... range reduce, polynomial via mul_add chain, reconstruct
+}
+```
+
+**Task 2: `nonzero_iter()` for sparse XOR diff**
+
+In `src/hpc/bitwise.rs`, add to `BitwiseOps`:
+```rust
+fn xor_diff_positions(&self, other: &Self) -> Vec<(usize, u8, u8)>;
+```
+SIMD: XOR 64 bytes (AVX-512), test zero with `_mm512_test_epi64_mask`, skip zero chunks, collect nonzero byte positions. Scalar fallback: XOR byte-by-byte, skip zeros.
+
+**Task 3: Arrow zero-copy view bridge**
+
+In `src/hpc/arrow_bridge.rs`, add:
+```rust
+use crate::prelude::*;
+
+/// Zero-copy 1D view from raw pointer (Arrow Buffer → ndarray).
+/// SAFETY: ptr must be valid, aligned, and live for lifetime 'a.
+pub unsafe fn array_view_f32<'a>(ptr: *const f32, len: usize) -> ArrayView1<'a, f32> {
+    ArrayView1::from_shape_ptr(len, ptr)
+}
+
+/// Zero-copy 2D columnar view: n_rows × n_cols.
+pub unsafe fn columnar_view_f32<'a>(
+    ptr: *const f32, n_rows: usize, n_cols: usize,
+) -> ArrayView2<'a, f32> {
+    ArrayView2::from_shape_ptr((n_rows, n_cols).f(), ptr)  // F-order for column-major
+}
+```
+Add safe wrappers that take `&[f32]` for testing.
+
+**Task 4: `simd_apply` generic batch processor**
+
+Create `src/hpc/simd_apply.rs`:
+```rust
+use crate::simd::{f32x16, f32x8};
+
+/// Apply f(lane) to every 16-element chunk of a slice, scalar remainder.
+#[inline]
+pub fn map_f32x16<F: Fn(f32x16) -> f32x16>(x: &[f32], out: &mut [f32], f: F) {
+    let n = x.len().min(out.len());
+    let mut i = 0;
+    while i + 16 <= n {
+        let v = f32x16::from_slice(&x[i..]);
+        f(v).copy_to_slice(&mut out[i..]);
+        i += 16;
+    }
+    while i < n {
+        // Scalar: load 1, broadcast to lane, apply, extract
+        let v = f32x16::splat(x[i]);
+        out[i] = f(v).to_array()[0];
+        i += 1;
+    }
+}
+
+/// Two-input version (Zip pattern).
+#[inline]
+pub fn zip_f32x16<F: Fn(f32x16, f32x16) -> f32x16>(
+    a: &[f32], b: &[f32], out: &mut [f32], f: F
+) {
+    let n = a.len().min(b.len()).min(out.len());
+    let mut i = 0;
+    while i + 16 <= n {
+        let va = f32x16::from_slice(&a[i..]);
+        let vb = f32x16::from_slice(&b[i..]);
+        f(va, vb).copy_to_slice(&mut out[i..]);
+        i += 16;
+    }
+    while i < n {
+        let va = f32x16::splat(a[i]);
+        let vb = f32x16::splat(b[i]);
+        out[i] = f(va, vb).to_array()[0];
+        i += 1;
+    }
+}
+
+/// In-place version.
+#[inline]
+pub fn map_inplace_f32x16<F: Fn(f32x16) -> f32x16>(data: &mut [f32], f: F) {
+    let n = data.len();
+    let mut i = 0;
+    while i + 16 <= n {
+        let v = f32x16::from_slice(&data[i..]);
+        f(v).copy_to_slice(&mut data[i..]);
+        i += 16;
+    }
+    while i < n {
+        let v = f32x16::splat(data[i]);
+        data[i] = f(v).to_array()[0];
+        i += 1;
+    }
+}
+```
+Then rewrite `vml.rs` to use these (one-liner per function).
+
+### TIER 2: New structures using existing infrastructure
+
+**Task 5: `SpatialCam3D` — 3D spatial CAM**
+
+Create `src/hpc/spatial_cam.rs`:
+```rust
+use std::collections::HashMap;
+
+pub struct SpatialCam3D<T: Clone> {
+    cells: HashMap<(i32, i32, i32), Vec<usize>>,
+    data: Vec<Option<T>>,
+    positions: Vec<[f32; 3]>,  // SoA-friendly: could split into 3 Vec<f32>
+    free_list: Vec<usize>,
+    cell_size: f32,
+    inv_cell_size: f32,
+}
+
+impl<T: Clone> SpatialCam3D<T> {
+    pub fn new(cell_size: f32) -> Self;
+    pub fn bind(&mut self, pos: [f32; 3], data: T) -> usize;
+    pub fn unbind(&mut self, handle: usize);
+    pub fn get(&self, handle: usize) -> Option<&T>;
+    pub fn update_position(&mut self, handle: usize, new_pos: [f32; 3]);
+
+    /// Query all entities in axis-aligned box.
+    pub fn region(&self, min: [f32; 3], max: [f32; 3]) -> Vec<usize>;
+
+    /// Extract contiguous position slices for SIMD batch processing.
+    /// Returns (xs, ys, zs) for entities in the region.
+    pub fn region_positions_soa(&self, min: [f32; 3], max: [f32; 3])
+        -> (Vec<f32>, Vec<f32>, Vec<f32>);
+
+    /// Batch distance check: which entities are within radius of point?
+    /// Uses F32x16 for 16-entity-at-a-time distance computation.
+    pub fn within_radius(&self, center: [f32; 3], radius: f32) -> Vec<usize>;
+}
+```
+The `within_radius` method should use `F32x16` for distance computation:
+```rust
+// dx*dx + dy*dy + dz*dz < r*r, 16 entities at once
+let dx = f32x16::from_slice(&xs[i..]) - cx;
+let dy = f32x16::from_slice(&ys[i..]) - cy;
+let dz = f32x16::from_slice(&zs[i..]) - cz;
+let dist_sq = dx.mul_add(dx, dy.mul_add(dy, dz * dz));
+// Compare with radius_sq, collect matching indices
+```
+
+**Task 6: `PaletteCodec` — variable-width bit packing**
+
+Create `src/hpc/palette_codec.rs` (distinct from existing `palette_distance.rs`):
+```rust
+pub struct PaletteCodec {
+    bits_per_entry: u8,   // 1-15
+    entries_per_word: u8,  // 64 / bits_per_entry
+    mask: u64,
+}
+
+impl PaletteCodec {
+    pub fn new(bits_per_entry: u8) -> Self;
+
+    /// Unpack N entries from packed u64 words into u16 output.
+    pub fn unpack(&self, packed: &[u64], count: usize, out: &mut [u16]);
+
+    /// Pack u16 values into u64 words.
+    pub fn pack(&self, values: &[u16], out: &mut [u64]);
+
+    /// Single-element get (scalar, for random access).
+    pub fn get(&self, packed: &[u64], index: usize) -> u16;
+
+    /// Single-element set.
+    pub fn set(&mut self, packed: &mut [u64], index: usize, value: u16);
+}
+```
+For 4-bit palettes (most common in Minecraft), SIMD unpack path: load 64 bytes (U8x64), mask low/high nibbles, expand to u16. Use existing vpshufb pattern from `bitwise.rs`.
+
+**Task 7: AVX-512 gather wrapper**
+
+Create `src/hpc/gather.rs`:
+```rust
+/// Gather f32 values from table at given u32 indices.
+/// AVX-512: VPGATHERDD, AVX2: VGATHERDPS, scalar fallback.
+pub fn gather_f32(table: &[f32], indices: &[u32], out: &mut [f32]) {
+    let n = indices.len().min(out.len());
+    let mut i = 0;
+
+    #[cfg(target_arch = "x86_64")]
+    {
+        if is_x86_feature_detected!("avx512f") {
+            while i + 16 <= n {
+                unsafe { gather_f32_avx512(table, &indices[i..], &mut out[i..]) };
+                i += 16;
+            }
+        } else if is_x86_feature_detected!("avx2") {
+            while i + 8 <= n {
+                unsafe { gather_f32_avx2(table, &indices[i..], &mut out[i..]) };
+                i += 8;
+            }
+        }
+    }
+    // Scalar remainder
+    while i < n {
+        out[i] = table[indices[i] as usize];
+        i += 1;
+    }
+}
+
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx512f")]
+unsafe fn gather_f32_avx512(table: &[f32], indices: &[u32], out: &mut [f32]) {
+    use core::arch::x86_64::*;
+    let idx = _mm512_loadu_si512(indices.as_ptr() as *const _);
+    let gathered = _mm512_i32gather_ps::<4>(idx, table.as_ptr() as *const u8);
+    _mm512_storeu_ps(out.as_mut_ptr(), gathered);
+}
+
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx2")]
+unsafe fn gather_f32_avx2(table: &[f32], indices: &[u32], out: &mut [f32]) {
+    use core::arch::x86_64::*;
+    let idx = _mm256_loadu_si256(indices.as_ptr() as *const __m256i);
+    let gathered = _mm256_i32gather_ps::<4>(table.as_ptr(), idx);
+    _mm256_storeu_ps(out.as_mut_ptr(), gathered);
+}
+```
+This unblocks Perlin noise permutation table vectorization.
+
+### TIER 3: Architecture
+
+**Task 8: 3D Von Neumann stencil**
+
+Create `src/hpc/stencil.rs`:
+```rust
+use crate::Array3;
+
+/// Apply 3D Von Neumann stencil (6 face neighbors) to every interior cell.
+/// Boundary cells use `boundary_val` for out-of-bounds neighbors.
+pub fn stencil_vonneumann_3d<T, F>(
+    input: &Array3<T>,
+    output: &mut Array3<T>,
+    boundary_val: T,
+    f: F,
+) where
+    T: Copy,
+    F: Fn(T, [T; 6]) -> T,  // center, [+x, -x, +y, -y, +z, -z]
+{
+    let (nx, ny, nz) = input.dim();
+    for x in 0..nx {
+        for y in 0..ny {
+            for z in 0..nz {
+                let center = input[[x, y, z]];
+                let neighbors = [
+                    if x + 1 < nx { input[[x+1, y, z]] } else { boundary_val },
+                    if x > 0 { input[[x-1, y, z]] } else { boundary_val },
+                    if y + 1 < ny { input[[x, y+1, z]] } else { boundary_val },
+                    if y > 0 { input[[x, y-1, z]] } else { boundary_val },
+                    if z + 1 < nz { input[[x, y, z+1]] } else { boundary_val },
+                    if z > 0 { input[[x, y, z-1]] } else { boundary_val },
+                ];
+                output[[x, y, z]] = f(center, neighbors);
+            }
+        }
+    }
+}
+
+/// SIMD-optimized version for u8 (redstone signals): 64 cells per AVX-512 op.
+/// Processes interior XY planes along Z axis using U8x64.
+pub fn stencil_vonneumann_3d_u8_max_decay(
+    input: &Array3<u8>,
+    output: &mut Array3<u8>,
+) {
+    // For each cell: output = max(neighbors).saturating_sub(1)
+    // This is redstone signal propagation.
+    // SIMD: load Z-strips of 64 u8s, compute 6-neighbor max, subtract 1
+    // ... implementation using U8x64 from crate::simd ...
+}
+```
+
+**Task 9: SIMD FFT**
+
+Rewrite `src/hpc/fft.rs` butterfly to use `F32x16`:
+```rust
+// Current: scalar butterfly per element
+// Target: 16 butterflies per AVX-512 instruction
+// Radix-2 Cooley-Tukey with SIMD twiddle factor multiplication
+```
+The butterfly is: `(u + t, u - t)` where `t = w * x`. With `F32x16`, process 16 frequency bins simultaneously.
+
+**Task 10: Explicit prefetch for `SpatialCam3D`**
+
+In `spatial_cam.rs`, add:
+```rust
+/// Hint the CPU to prefetch the next region's data into L2.
+/// Call while processing current region for pipeline overlap.
+pub fn prefetch_region(&self, min: [f32; 3], max: [f32; 3]) {
+    let handles = self.region(min, max);
+    if handles.is_empty() { return; }
+    #[cfg(target_arch = "x86_64")]
+    unsafe {
+        use core::arch::x86_64::*;
+        let base = self.positions.as_ptr().add(handles[0]);
+        _mm_prefetch(base as *const i8, _MM_HINT_T1);  // L2
+        if let Some(&last) = handles.last() {
+            let end = self.positions.as_ptr().add(last);
+            _mm_prefetch(end as *const i8, _MM_HINT_T1);
+        }
+    }
+}
+```
+
+## Rules
+
+1. **Use `crate::simd::` types** — `f32x16`, `f32x8`, `u8x64` etc. Never raw intrinsics in new code (except gather which has no polyfill wrapper yet).
+2. **Follow `bitwise.rs` dispatch pattern** for any function needing runtime CPU detection.
+3. **Scalar fallback always works** — the polyfill in `simd.rs` guarantees this on non-x86.
+4. **Register every new module in `src/hpc/mod.rs`** with `#[allow(missing_docs)] pub mod name;`.
+5. **Tests must pass on all platforms.** Use `#[cfg(target_arch = "x86_64")]` + feature detection to skip hardware-specific assertions.
+6. **No new crate dependencies.** Use `core::arch::x86_64` for anything not in the polyfill.
+7. **Run `cargo test --lib` and `cargo clippy` after each task.**
+8. **Commit after each task.** Push to `claude/compare-simd-implementations-btTgj`.
+9. **Read existing code before modifying.** VML, bitwise, kernels, packed — understand the patterns.
+10. **Do not touch `src/simd.rs`, `src/simd_avx512.rs`, `src/simd_avx2.rs`** unless adding new type wrappers needed by your task.