feat: hhtl_cascade_search score_fn callback. LEAF-level pluggable score_fn(row, col) -> f32 replaces the hardcoded 1.0 placeholder. lance-graph passes LanceDB VectorSearch as the LEAF backend. ndarray provides the cascade logic. lance-graph provides the data. 23 p64 tests passing. https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK #73

Merged
AdaWorldAPI merged 7 commits into master from claude/qwen-claude-reverse-eng-vHuHv
Mar 31, 2026

Conversation

@AdaWorldAPI
Owner

No description provided.

claude added 7 commits March 31, 2026 09:35
Base17::l1(): 16 of 17 dims via I32x16 (sub, abs, reduce_sum), 17th scalar.
Base17::l1_weighted(): same + I32x16 multiply for PCDVQ weights [20,3,3,3,3,3,1,1,1,1,1,1,1,1,1,1].
Non-x86 fallback preserved (scalar loop).

Before: 611M lookups/sec, 1.8 ns/lookup, 17K tokens/sec
After:  719M lookups/sec, 1.4 ns/lookup, 22K tokens/sec

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
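
For reference, the scalar fallback described in the commit above might look roughly like the sketch below. The `Base17` layout (17 `i16` dimensions) is inferred from the 16-bit to 32-bit widening mentioned elsewhere in this PR, and the `tail_weight` parameter is purely hypothetical: the commit message lists only 16 weights and does not say how the 17th dimension is weighted.

```rust
/// Hypothetical stand-in for the PR's Base17: 17 signed 16-bit dimensions.
pub struct Base17 {
    pub dims: [i16; 17],
}

/// Weights for the first 16 dimensions, copied from the commit message.
pub const PCDVQ_WEIGHTS: [u32; 16] = [20, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];

impl Base17 {
    /// Absolute difference of dimension `i`, widened to avoid i16 overflow.
    fn abs_diff(&self, other: &Base17, i: usize) -> u32 {
        (self.dims[i] as i32 - other.dims[i] as i32).unsigned_abs()
    }

    /// Plain L1 distance over all 17 dimensions (scalar loop).
    pub fn l1(&self, other: &Base17) -> u32 {
        (0..17).map(|i| self.abs_diff(other, i)).sum()
    }

    /// Weighted L1: first 16 dimensions use PCDVQ_WEIGHTS.
    /// `tail_weight` for the 17th dimension is an assumption of this sketch.
    pub fn l1_weighted(&self, other: &Base17, tail_weight: u32) -> u32 {
        let head: u32 = (0..16)
            .map(|i| PCDVQ_WEIGHTS[i] * self.abs_diff(other, i))
            .sum();
        head + tail_weight * self.abs_diff(other, 16)
    }
}
```

The SIMD version in the commit processes the first 16 lanes in one I32x16 register and handles the 17th dimension separately, which is why the sketch splits the weighted sum into a 16-lane head and a scalar tail.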
All Base17 hot-path ops now SIMD:
  l1()             → I32x16 sub + abs + reduce_sum
  l1_weighted()    → I32x16 sub + abs + mul + reduce_sum
  sign_agreement() → I32x16 xor + simd_min + count
  xor_bind()       → I32x16 xor

.cargo/config.toml: target-cpu=native (not x86-64-v4)
  → GitHub CI gets AVX2/SSE4.2 fallback automatically
  → Local dev gets AVX-512 if available
  → cfg(target_feature = "avx512f") handles compile-time dispatch

728M lookups/sec, 22K tokens/sec. 19 tests passing.

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
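
The compile-time dispatch mentioned in this commit can be illustrated with a small sketch. The function name `compiled_tier` is hypothetical and not taken from the PR; it only shows how `cfg!(target_feature = ...)` resolves at build time according to the `target-cpu` the compiler was given.

```rust
/// Reports which kernel tier the current build would select.
/// The decision is made at compile time, so a binary built with
/// `target-cpu=native` on an AVX-512 machine always reports "avx512".
pub fn compiled_tier() -> &'static str {
    if cfg!(target_feature = "avx512f") {
        "avx512"
    } else if cfg!(target_feature = "avx2") {
        "avx2"
    } else {
        "scalar"
    }
}
```

Because the choice is baked into the binary, a build made on one machine may not run on another with fewer features, which is presumably what the later commits address with runtime detection.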
Per-function #[target_feature(enable = "avx512f")] / "avx2".
LazyLock runtime detection, one binary for all ISAs.

l1_avx512: _mm512_cvtepi16_epi32 + _mm512_sub + _mm512_abs + reduce_add
l1_avx2:   _mm256_cvtepi16_epi32 + _mm256_sub + _mm256_abs + horizontal sum
l1_scalar: for i in 0..17 (non-x86 fallback)

605M lookups/sec (LazyLock) vs 728M (hardcoded AVX-512).
19 tests passing. .cargo/config.toml: no global target-cpu.

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
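
A minimal sketch of the per-function `#[target_feature]` plus `LazyLock` pattern described above, assuming 17 `i16` dimensions. Only an AVX2 and a scalar path are shown; all function and static names are illustrative rather than the PR's actual code, and the horizontal sum is done through a temporary array for simplicity.

```rust
use std::sync::LazyLock;

type L1Fn = fn(&[i16; 17], &[i16; 17]) -> u32;

/// Scalar fallback: works on every architecture.
pub fn l1_scalar(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    (0..17).map(|i| (a[i] as i32 - b[i] as i32).unsigned_abs()).sum()
}

/// AVX2 kernel: two passes of 8 lanes, then the 17th dimension as a scalar.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l1_avx2_impl(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    use std::arch::x86_64::*;
    let mut total = 0u32;
    for half in 0..2 {
        unsafe {
            // Load 8 i16 values (elements 0..8, then 8..16) and widen to i32.
            let pa = a.as_ptr().add(half * 8) as *const __m128i;
            let pb = b.as_ptr().add(half * 8) as *const __m128i;
            let va = _mm256_cvtepi16_epi32(_mm_loadu_si128(pa));
            let vb = _mm256_cvtepi16_epi32(_mm_loadu_si128(pb));
            let diff = _mm256_abs_epi32(_mm256_sub_epi32(va, vb));
            // Horizontal sum via a temporary array.
            let mut lanes = [0i32; 8];
            _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, diff);
            total += lanes.iter().map(|&v| v as u32).sum::<u32>();
        }
    }
    total + (a[16] as i32 - b[16] as i32).unsigned_abs()
}

/// Safe wrapper; only installed after AVX2 has been detected at runtime.
#[cfg(target_arch = "x86_64")]
fn l1_avx2(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    unsafe { l1_avx2_impl(a, b) }
}

/// Kernel chosen once, on first use.
static L1_KERNEL: LazyLock<L1Fn> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return l1_avx2 as L1Fn;
        }
    }
    l1_scalar as L1Fn
});

/// Public entry point: one binary, kernel picked at runtime.
pub fn l1(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    (*L1_KERNEL)(a, b)
}
```

Each call goes through the `LazyLock` and a function pointer, which is a plausible source of the throughput gap reported above between the runtime-dispatched and the hardcoded AVX-512 builds.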
…azyLock

4 functions converted to multi-versioned kernels:
  l1_weighted:     I32x16 mul(abs_diff, weights) + reduce_sum
  sign_agreement:  I32x16 xor + cmpge_mask + count_ones
  xor_bind:        I32x16 xor + cvtepi32_epi16 pack-back
  inject_noise:    I32x16 add(dims, prng_noise) + clamp

Pattern: #[target_feature(enable = "avx512f")] per-function,
LazyLock runtime detection, one binary serves all ISAs.

No global target-cpu in .cargo/config.toml.
CI (AVX2) and Production (AVX-512) use same binary.
629M lookups/sec, 19K tokens/sec, 19 tests passing.

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
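
As a rough scalar reference for two of the kernels listed above, the sketch below reads "xor + cmpge_mask + count_ones" as: two dimensions agree in sign when their XOR is non-negative. That reading, the 17 × `i16` layout, and the treatment of zero as non-negative are assumptions of this sketch, not confirmed by the PR.

```rust
/// Counts dimensions whose sign bits match.
/// `x ^ y` has its sign bit clear exactly when x and y share a sign bit,
/// so zero counts as non-negative here.
pub fn sign_agreement(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b.iter())
        .filter(|&(&x, &y)| (x ^ y) >= 0)
        .count() as u32
}

/// Element-wise XOR binding. Applying it twice with the same key
/// recovers the original vector.
pub fn xor_bind(a: &[i16; 17], b: &[i16; 17]) -> [i16; 17] {
    std::array::from_fn(|i| a[i] ^ b[i])
}
```

For example, binding a vector with a key and then binding the result with the same key returns the original vector, since XOR is its own inverse.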
p64 multi-versioned kernels (AVX-512/AVX2/scalar via LazyLock):
  attend():    8 rows/iter via _mm512_and_si512 + scalar popcnt
  nearest_k(): 8 XORs/iter via _mm512_xor_si512
  moe_gate():  all 8 planes in one zmm register

palette_distance nearest():
  4-way unrolled loop, inner l1() already SIMD-dispatched

All scalar loops from the audit now have SIMD versions:
  bgz17_bridge: l1, l1_weighted, sign_agreement, xor_bind, inject_noise
  palette_distance: nearest (4-way unroll)
  p64: attend, nearest_k, moe_gate

78 tests passing. 695M lookups/sec. 21K tokens/sec.
One universal binary — LazyLock runtime detects AVX-512/AVX2.

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
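
A scalar sketch of what the `attend()` and `nearest_k()` kernels might compute, assuming each p64 row is a 64-bit mask, that `attend` scores a row by the popcount of its AND with the query, and that `nearest_k` ranks rows by Hamming distance (popcount of the XOR). These semantics and signatures are guesses from the commit message, not the PR's actual API.

```rust
/// Overlap score per row: number of bits shared with the query.
pub fn attend(rows: &[u64], query: u64) -> Vec<u32> {
    rows.iter().map(|&row| (row & query).count_ones()).collect()
}

/// Indices and Hamming distances of the `k` rows closest to the query.
/// Ties are broken by the lower row index.
pub fn nearest_k(rows: &[u64], query: u64, k: usize) -> Vec<(usize, u32)> {
    let mut scored: Vec<(usize, u32)> = rows
        .iter()
        .enumerate()
        .map(|(i, &row)| (i, (row ^ query).count_ones()))
        .collect();
    scored.sort_by_key(|&(i, dist)| (dist, i));
    scored.truncate(k);
    scored
}
```

The AVX-512 versions described above would process eight such rows per 512-bit register; the scalar form is useful mainly as a correctness baseline for the multi-versioned kernels.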
score_fn(row, col) -> f32 replaces the hardcoded 1.0 placeholder.
lance-graph passes LanceDB VectorSearch as the LEAF backend.
ndarray provides the cascade logic. lance-graph provides the data.

23 p64 tests passing.

https://claude.ai/code/session_01M3at4EuHVvQ8S95mSnKgtK
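
The callback shape described in this commit could be sketched as follows. The real signature of `hhtl_cascade_search` is not shown in this PR, so `score_leaf_candidates` and `placeholder_score` are hypothetical names; the sketch only illustrates how a caller-supplied `score_fn(row, col) -> f32` can replace a constant 1.0 at the LEAF level.

```rust
/// Scores each (row, col) candidate with the caller's function and
/// returns the candidates sorted by descending score.
pub fn score_leaf_candidates<F>(
    candidates: &[(usize, usize)],
    score_fn: F,
) -> Vec<(usize, usize, f32)>
where
    F: Fn(usize, usize) -> f32,
{
    let mut scored: Vec<(usize, usize, f32)> = candidates
        .iter()
        .map(|&(row, col)| (row, col, score_fn(row, col)))
        .collect();
    // Stable sort: equal scores keep their original order.
    scored.sort_by(|a, b| b.2.total_cmp(&a.2));
    scored
}

/// Mirrors the old behaviour: every leaf scores 1.0.
pub fn placeholder_score(_row: usize, _col: usize) -> f32 {
    1.0
}
```

Under this design a caller such as lance-graph would pass a closure that queries its own backend for the (row, col) pair, while the cascade logic stays unaware of where the score comes from.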

@AdaWorldAPI AdaWorldAPI merged commit 9462291 into master Mar 31, 2026
4 of 10 checks passed
AdaWorldAPI pushed a commit that referenced this pull request Apr 19, 2026
…ed PRs

Bookkeeping ledger pairing each prompt brief in .claude/prompts/ with its
matching PR (by filename keyword). 4 mapped to merged PRs #72/#73/#74/#105;
21 marked `none` where no keyword match existed.
AdaWorldAPI pushed a commit that referenced this pull request May 16, 2026
…only)

`tests/par_azip.rs` has `use itertools::{assert_equal, cloned, enumerate}`
under `#[cfg(feature = "approx")]`, but `test_par_azip9` called
`assert_equal(cloned(&a), x)` at line 85 unconditionally. With approx OFF
the import is excluded and compile fails with E0425.

Latent for as long as the file existed (since PR #73 era); never surfaced
because no CI matrix combination ran `--features rayon` without approx.
W-I4's new `hpc-stream-parallel` job exercises exactly that combination
and tripped the failure.

Fix: replace `assert_equal(cloned(&a), x)` with `assert_eq!(a, x)` —
both `a` and `x` are `Array<i32, _>` and the file's other tests already
use direct `assert_eq!`. Trim the now-dead `assert_equal, cloned` from
the still-needed (test_par_azip3) `enumerate` import.

Verified clean compile + 6/6 tests pass under both:
  cargo test --features rayon --test par_azip
  cargo test --features "rayon approx" --test par_azip

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS