
Cache-Aware Block-Transposed Chamfer/MaxSim Distance for f32 and f16 #863

Open
suri-kumkaran wants to merge 18 commits into main from users/suryangupta/multi-vector-distance-impl

Conversation

@suri-kumkaran
Contributor

@suri-kumkaran suri-kumkaran commented Mar 25, 2026

What

SIMD-accelerated MaxSim / Chamfer distance for f32 and f16 multi-vector queries, using block-transposed layout with L2/L1 cache-aware tiling. Introduces QueryComputer<T> — a runtime-dispatched type that hides GROUP and architecture behind a vtable.

The focus is proving the Kernel / tiled_reduce / ConvertTo abstraction is solid and type-agnostic. f32 and f16 share the same tiling loop and micro-kernel body.

Why

The fallback kernel iterates query×doc in a flat nested loop, causing repeated cache evictions. Block-transposing the query and tiling both sides to fit in L2/L1 keeps hot data resident and feeds the FMA pipeline efficiently. f16 comes for free: ConvertTo converts f16→f32 once per tile, then the f32 micro-kernel runs unchanged.
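The flat traversal described above can be sketched as a scalar reference model. This is a hypothetical helper for illustration, not the crate's actual `FallbackKernel`; it shows why the doc matrix gets re-streamed once per query row, evicting hot data for large inputs:

```rust
// Scalar reference for Chamfer/MaxSim over row-major matrices, mirroring the
// fallback's access pattern (illustrative helper, not the crate's API).
// `query` is q_rows x k, `doc` is d_rows x k, both flattened row-major.
fn max_sim_reference(query: &[f32], doc: &[f32], q_rows: usize, d_rows: usize, k: usize) -> f32 {
    assert_eq!(query.len(), q_rows * k);
    assert_eq!(doc.len(), d_rows * k);
    let mut total = 0.0f32;
    for qi in 0..q_rows {
        let qrow = &query[qi * k..(qi + 1) * k];
        let mut best = f32::NEG_INFINITY;
        // The entire doc matrix is streamed once per query row: for matrices
        // larger than cache, every pass re-fetches from memory.
        for di in 0..d_rows {
            let drow = &doc[di * k..(di + 1) * k];
            let ip: f32 = qrow.iter().zip(drow).map(|(a, b)| a * b).sum();
            best = best.max(ip); // MaxSim keeps the best inner product per query row
        }
        total += best; // Chamfer/MaxSim sums the per-row maxima
    }
    total
}
```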

Changed Files

All paths relative to diskann-quantization/src/multi_vector/.

New: distance/kernels/

File Purpose
mod.rs Kernel<A> unsafe trait (Left/Right layouts, full_panel/partial_panel), cache budget helpers.
layouts.rs Layout marker trait. BlockTransposed/RowMajor ZST markers. DescribeLayout bridge. ConvertTo<A, To> with blanket identity and f16→f32 specializations.
tiled_reduce.rs 5-level tiling loop (K: Kernel, LA/LB: ConvertTo). FullReduce tile planner. Reduce unroll trait.
f32/mod.rs F32Kernel<GROUP>, max_ip_kernel, Target3 dispatch, tests.
f32/v3.rs V3 (AVX2+FMA) 16×4 micro-kernel. V4 delegates via retarget().
f32/scalar.rs Scalar 8×2 micro-kernel (mul+add, no libm fma()). Neon delegates via retarget().
f16.rs F16Entry<GROUP>, Target3 dispatch, tests — drives tiled_reduce with F32Kernel + f16→f32 ConvertTo. No Kernel impl.

New: distance/query_computer/

File Purpose
mod.rs QueryComputer<T> (Box<dyn DynQueryComputer<T>>). chamfer/max_sim methods. Chamfer/MaxSim trait impls. Tests.
f32.rs QueryComputer<f32>::new via dispatch1_no_features. BuildComputer Target1 impls.
f16.rs QueryComputer<half::f16>::new — same pattern, delegates through F16Entry.

Modified

File Change
block_transposed.rs Added padded_nrows().
matrix.rs Added as_matrix_view().
multi_vector/mod.rs Re-exports QueryComputer, Chamfer, MaxSim, MaxSimError, QueryMatRef.
distance/mod.rs Module wiring, QueryComputer re-export, doc example.
distance/max_sim.rs MaxSim/Chamfer types, MaxSimError enum.
distance/fallback.rs FallbackKernel (was SimpleKernel), QueryMatRef, fallback trait impls.

Renamed: simple.rs → fallback.rs

Design Decisions

Kernel trait

Kernel<A> declares Left/Right layout types and full_panel/partial_panel. The kernel receives already-converted pointers — it knows nothing about storage formats. V4→V3 and Neon→Scalar delegate via retarget(). GROUP const generic (16 for V3/V4, 8 for Scalar/Neon) acts as a closed-world filter.
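A rough sketch of the trait shape described here, with hypothetical names and simplified signatures (the real trait carries additional layout plumbing this sketch omits):

```rust
// Illustrative sketch of the Kernel<A> shape; names and signatures are
// assumptions, not the crate's exact API. The kernel receives already-
// converted raw pointers; GROUP is fixed per implementation (16 for V3/V4,
// 8 for Scalar/Neon) and acts as a closed-world filter on valid pairings.
pub trait Layout {}

pub unsafe trait Kernel<A> {
    /// Layout marker for the block-transposed left operand.
    type Left: Layout;
    /// Layout marker for the right operand.
    type Right: Layout;
    /// Rows handled per A-panel.
    const GROUP: usize;

    /// # Safety
    /// `a` and `b` must point to fully populated panels of `k` columns and
    /// `r` to `GROUP` writable `f32` accumulators, valid for this call.
    unsafe fn full_panel(arch: A, a: *const f32, b: *const f32, k: usize, r: *mut f32);

    /// # Safety
    /// Same contract, with `rows <= GROUP` for the ragged trailing panel.
    unsafe fn partial_panel(arch: A, a: *const f32, b: *const f32, k: usize, rows: usize, r: *mut f32);
}
```

Under this shape, a V4 implementation can satisfy the same contract by delegating to V3 without re-declaring the panel invariants.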

Layout markers and ConvertTo

BlockTransposed<T, GROUP, PACK> and RowMajor<T> are ZST markers. Layout impl requires T: Copy (micro-kernels load via raw pointers); Copy/Clone on the markers themselves are unconditional (PhantomData wrappers). DescribeLayout bridges matrix types to markers for type inference.

ConvertTo<A, To>: blanket identity (Buffer = (), zero cost) + f16→f32 specializations (Vec<f32> buffer, SIMD-accelerated SliceCast). Conversion is once per tile, not per panel. SliceCast dispatches through the runtime architecture token via arch.run2() — the same SIMD level used by the micro-kernel.
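A minimal model of the ConvertTo shape, with the architecture parameter omitted for brevity and f64→f32 standing in for f16→f32 (`half` is a third-party crate); trait and method names here are assumptions, not the crate's exact API:

```rust
// Two converters: an identity with Buffer = () (zero cost, reads in place)
// and a widening converter staged into a per-tile buffer.
use core::marker::PhantomData;

pub trait ConvertTo<To> {
    /// Source element type.
    type Elem;
    /// Scratch reused across tiles; `()` for the identity case.
    type Buffer;
    fn new_buffer(max_tile_rows: usize, k: usize) -> Self::Buffer;
    /// Convert `rows * k` source elements; returns the pointer the kernel reads.
    fn convert(src: &[Self::Elem], rows: usize, k: usize, buf: &mut Self::Buffer) -> *const To;
}

/// Identity: the kernel reads the source in place, zero cost.
pub struct SameAs<T>(PhantomData<T>);
impl<T> ConvertTo<T> for SameAs<T> {
    type Elem = T;
    type Buffer = ();
    fn new_buffer(_: usize, _: usize) {}
    fn convert(src: &[T], _rows: usize, _k: usize, _buf: &mut ()) -> *const T {
        src.as_ptr()
    }
}

/// Widening: convert once per tile into an owned buffer, not once per panel.
pub struct Widen;
impl ConvertTo<f32> for Widen {
    type Elem = f64;
    type Buffer = Vec<f32>;
    fn new_buffer(max_tile_rows: usize, k: usize) -> Vec<f32> {
        vec![0.0; max_tile_rows * k]
    }
    fn convert(src: &[f64], rows: usize, k: usize, buf: &mut Vec<f32>) -> *const f32 {
        for (dst, s) in buf[..rows * k].iter_mut().zip(src) {
            *dst = *s as f32;
        }
        buf.as_ptr()
    }
}
```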

Tiling loop (reducing-GEMM)

5-level loop: L2 A-tiles → L1 B-tiles → A-panels → B-panels → micro-kernel. ConvertTo::convert runs at tile boundaries. Cache budgets from kernel layout element sizes (~625 KB L2, ~36 KB L1). Geometry: 16×4 (V3/V4) or 8×2 (Scalar/Neon). Source strides and kernel strides differ when conversion is active (f16→f32).
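The loop structure can be modeled with a scalar stand-in micro-kernel and toy tile sizes (illustrative only; the real kernel processes 16×4 or 8×2 panels and derives tile budgets from cache sizes):

```rust
// Minimal runnable model of the 5-level nest: L2 A-tiles -> L1 B-tiles ->
// A-panels -> B-panels -> micro-kernel, with a 1x1 scalar "panel" so the
// loop shape is visible. Tile sizes are toy values, not real cache budgets.
fn tiled_max_sim(a: &[f32], b: &[f32], (ar, br, k): (usize, usize, usize)) -> f32 {
    const A_TILE: usize = 4; // rows of A kept hot ("L2" budget)
    const B_TILE: usize = 4; // rows of B kept hot ("L1" budget)
    let mut best = vec![f32::NEG_INFINITY; ar]; // per-query running maxima
    for a0 in (0..ar).step_by(A_TILE) {         // level 1: L2 A-tile
        let a1 = (a0 + A_TILE).min(ar);
        for b0 in (0..br).step_by(B_TILE) {     // level 2: L1 B-tile
            let b1 = (b0 + B_TILE).min(br);
            // ConvertTo::convert would run here, once per tile, for f16 input.
            for ai in a0..a1 {                  // level 3: A-panels
                for bi in b0..b1 {              // level 4: B-panels
                    // level 5: micro-kernel (here a scalar dot product)
                    let ip: f32 = (0..k).map(|j| a[ai * k + j] * b[bi * k + j]).sum();
                    best[ai] = best[ai].max(ip);
                }
            }
        }
    }
    best.iter().sum() // Chamfer/MaxSim: sum of per-query maxima
}
```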

f16 path

F16Entry<GROUP> is a dispatch adapter, not a Kernel impl. Calls tiled_reduce with F32Kernel and f16→f32 ConvertTo impls. Zero SIMD code. Extends naturally to PQ/SQ/MinMax via new ConvertTo impls.

QueryComputer

QueryComputer<T> wraps Box<dyn DynQueryComputer<T>>. CPU detection once at construction via dispatch1_no_features; hot path uses Architecture::run3 with #[target_feature] — no re-dispatch. Turbofish: QueryComputer::<f32>::new(q). Per-type BuildComputer dispatch in f32.rs/f16.rs; mod.rs is generic.
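A toy model of the dispatch shape, with a scalar stand-in where the real constructor would pick a SIMD backend; the method signatures are simplified assumptions, not the crate's API:

```rust
// Detection happens once at construction; the hot path goes through a single
// vtable call with no re-dispatch. Names mirror the PR (QueryComputer,
// DynQueryComputer) but the bodies are illustrative stand-ins.
trait DynQueryComputer<T> {
    fn max_sim(&self, doc: &[T], rows: usize, k: usize) -> f32;
}

struct ScalarComputer {
    query: Vec<f32>,
    k: usize,
}

impl DynQueryComputer<f32> for ScalarComputer {
    fn max_sim(&self, doc: &[f32], rows: usize, k: usize) -> f32 {
        assert_eq!(k, self.k);
        self.query
            .chunks_exact(k)
            .map(|q| {
                doc.chunks_exact(k)
                    .take(rows)
                    .map(|d| q.iter().zip(d).map(|(x, y)| x * y).sum::<f32>())
                    .fold(f32::NEG_INFINITY, f32::max)
            })
            .sum()
    }
}

pub struct QueryComputer<T>(Box<dyn DynQueryComputer<T>>);

impl QueryComputer<f32> {
    /// Construction point: the real impl would detect CPU features here
    /// and box a V3/V4/Neon/Scalar-backed computer.
    pub fn new(query: Vec<f32>, k: usize) -> Self {
        QueryComputer(Box::new(ScalarComputer { query, k }))
    }
    pub fn max_sim(&self, doc: &[f32], rows: usize, k: usize) -> f32 {
        self.0.max_sim(doc, rows, k) // one virtual call, no re-dispatch
    }
}
```

Construction then uses the turbofish form, e.g. `QueryComputer::<f32>::new(...)`.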

Suggested Review Order

  1. distance/kernels/mod.rs — Kernel<A> trait.
  2. distance/kernels/layouts.rs — markers, ConvertTo, blanket identity, f16→f32.
  3. distance/kernels/tiled_reduce.rs — tiling loop, source vs kernel strides.
  4. distance/kernels/f32/v3.rs, f32/scalar.rs — micro-kernels.
  5. distance/kernels/f32/mod.rs — F32Kernel, max_ip_kernel, dispatch.
  6. distance/kernels/f16.rs — F16Entry, no Kernel impl, no submodules.
  7. distance/query_computer/mod.rs — QueryComputer<T>, tests.
  8. distance/query_computer/f32.rs, f16.rs — per-type dispatch.
  9. distance/max_sim.rs, distance/fallback.rs — types, fallback kernel.
  10. block_transposed.rs, matrix.rs, distance/mod.rs, multi_vector/mod.rs — supporting.

Future Work

  1. Generalize the accumulator/reduction strategy. The current design hard-codes f32 scratch and max-reduce, which fits Chamfer/MaxSim but may over-fit. Brute-force search would need arg-max (tracking indices, not just values), and u8/i8 kernels would naturally accumulate into u32/i32 rather than f32.
  2. Dedicated Neon micro-kernels (replace Scalar delegation).
  3. Dedicated V4/AVX-512 micro-kernels (replace V3 delegation).
  4. Kernel + ConvertTo for PQ, SQ, MinMax.
  5. Cache size detection — Figure out L1/L2 sizes (env vars, platform detection, etc.) instead of hardcoded budgets.
  6. Row-major × row-major Chamfer via ConvertTo transpose.
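The accumulator generalization in item 1 could plausibly be factored behind a reduction trait; a speculative sketch follows (all names here are hypothetical, and nothing like this exists in the PR yet):

```rust
// Pull the reduction behind a trait so that max (Chamfer/MaxSim), arg-max
// (brute-force search), and integer accumulators (u8/i8 kernels summing into
// i32) could share one loop nest.
pub trait Reduce {
    /// Per-comparison input (f32 inner product, i32 for integer kernels, ...).
    type In;
    /// Running accumulator; arg-max carries an index alongside the value.
    type Acc: Copy;
    fn identity() -> Self::Acc;
    fn step(acc: Self::Acc, x: Self::In, idx: usize) -> Self::Acc;
}

/// MaxSim-style reduction: keep only the best value.
pub struct MaxVal;
impl Reduce for MaxVal {
    type In = f32;
    type Acc = f32;
    fn identity() -> f32 { f32::NEG_INFINITY }
    fn step(acc: f32, x: f32, _idx: usize) -> f32 { acc.max(x) }
}

/// Brute-force-search-style reduction: keep the best value and its index.
pub struct ArgMax;
impl Reduce for ArgMax {
    type In = f32;
    type Acc = (f32, usize);
    fn identity() -> (f32, usize) { (f32::NEG_INFINITY, usize::MAX) }
    fn step(acc: (f32, usize), x: f32, idx: usize) -> (f32, usize) {
        if x > acc.0 { (x, idx) } else { acc }
    }
}

/// Stand-in for the inner reduction loop the kernel would drive.
pub fn reduce<R: Reduce, I: IntoIterator<Item = R::In>>(xs: I) -> R::Acc {
    xs.into_iter()
        .enumerate()
        .fold(R::identity(), |acc, (i, x)| R::step(acc, x, i))
}
```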

@codecov-commenter

codecov-commenter commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 94.37280% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.47%. Comparing base (cfb5927) to head (e06c48a).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...on/src/multi_vector/distance/query_computer/f16.rs 66.66% 11 Missing ⚠️
...on/src/multi_vector/distance/query_computer/f32.rs 66.66% 11 Missing ⚠️
...on/src/multi_vector/distance/query_computer/mod.rs 92.42% 10 Missing ⚠️
...ation/src/multi_vector/distance/kernels/f32/mod.rs 87.50% 6 Missing ⚠️
...ation/src/multi_vector/distance/kernels/layouts.rs 90.16% 6 Missing ⚠️
...-quantization/src/multi_vector/block_transposed.rs 91.66% 1 Missing ⚠️
...on/src/multi_vector/distance/kernels/f32/scalar.rs 97.29% 1 Missing ⚠️
...zation/src/multi_vector/distance/kernels/f32/v3.rs 97.82% 1 Missing ⚠️
.../src/multi_vector/distance/kernels/tiled_reduce.rs 99.70% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #863      +/-   ##
==========================================
+ Coverage   89.31%   89.47%   +0.16%     
==========================================
  Files         447      460      +13     
  Lines       83250    84627    +1377     
==========================================
+ Hits        74354    75723    +1369     
- Misses       8896     8904       +8     
Flag Coverage Δ
miri 89.47% <94.37%> (+0.16%) ⬆️
unittests 89.32% <94.37%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...quantization/src/multi_vector/distance/fallback.rs 98.43% <100.00%> (ø)
...ntization/src/multi_vector/distance/kernels/f16.rs 100.00% <100.00%> (ø)
...ntization/src/multi_vector/distance/kernels/mod.rs 100.00% <100.00%> (ø)
...zation/src/multi_vector/distance/kernels/reduce.rs 100.00% <100.00%> (ø)
...-quantization/src/multi_vector/distance/max_sim.rs 100.00% <ø> (ø)
diskann-quantization/src/multi_vector/matrix.rs 95.59% <100.00%> (+0.33%) ⬆️
...-quantization/src/multi_vector/block_transposed.rs 96.88% <91.66%> (-0.07%) ⬇️
...on/src/multi_vector/distance/kernels/f32/scalar.rs 97.29% <97.29%> (ø)
...zation/src/multi_vector/distance/kernels/f32/v3.rs 97.82% <97.82%> (ø)
.../src/multi_vector/distance/kernels/tiled_reduce.rs 99.70% <99.70%> (ø)
... and 5 more

... and 28 files with indirect coverage changes


Contributor

@hildebrandmw hildebrandmw left a comment

This is on the right track, but there are some claims made that I don't think are fully backed up yet.

First, this is not exactly type agnostic. While an implementation does exist for f32, one does not exist for another data type as proof of generality. It would be nice to see at least an f16 implementation, which requires functionality not present in this PR (i.e., lazily unpacking f16 panels into f32 panels before entering micropanel loops to hoist the conversion out of the micro-kernel).

It would also be nice to see this used to implement row-major x row-major kernels again with packing on each tile load. The packing algorithms need not be super optimized (SIMD shuffles can be added later), but it would be great to see that this is possible within the kernel abstraction.

Second, we really should make kernel implementations micro-architecture aware from the get-go. Passing around diskann_wide::arch::Current isn't super helpful as that type is always available. Rather, we should parameterize something (perhaps at the trait level) on the Architecture to enable a clean extension point for AVX-512, Neon, etc..

Finally, since the original experimentation for this contained implementations for 8-bit integers, showing that this abstraction layer works there too would make a strong case for the abstraction.

@suri-kumkaran
Contributor Author

suri-kumkaran commented Apr 1, 2026

This is on the right track, but there are some claims made that I don't think are fully backed up yet.

First, this is not exactly type agnostic. While an implementation does exist for f32, one does not exist for another data type as proof of generality. It would be nice to see at least an f16 implementation, which requires functionality not present in this PR (i.e., lazily unpacking f16 panels into f32 panels before entering micropanel loops to hoist the conversion out of the micro-kernel).

It would also be nice to see this used to implement row-major x row-major kernels again with packing on each tile load. The packing algorithms need not be super optimized (SIMD shuffles can be added later), but it would be great to see that this is possible within the kernel abstraction.

Second, we really should make kernel implementations micro-architecture aware from the get-go. Passing around diskann_wide::arch::Current isn't super helpful as that type is always available. Rather, we should parameterize something (perhaps at the trait level) on the Architecture to enable a clean extension point for AVX-512, Neon, etc..

Finally, since the original experimentation for this contained implementations for 8-bit integers, showing that this abstraction layer works there too would make a strong case for the abstraction.

Thanks for the thorough and insightful feedback — addressed in the latest push.

f16 implementation: Done. F16Kernel lazily unpacks f16→f32 in prepare_a/prepare_b before the micro-panel loops. The micro-kernel body is unchanged f32 SIMD — zero additional micro-kernel code. This required adding APrepared/BPrepared associated types and self-owned staging buffers to the Kernel trait. The f32 kernel stays zero-sized with identity pass-throughs.

Architecture parameterization: Kernel<A: Architecture> is parameterized at the trait level with concrete impls for V3, V4, Scalar, and Neon. Dedicated arch-specific micro-kernels are future work.

Row-major × row-major: Not in this PR, but the prepare_a hook can transpose a row-major panel on the fly using the same staging buffer mechanism the f16 kernel already uses. Future work.

8-bit integers: Same story — an i8 kernel would dequantize in prepare_a/prepare_b and delegate to existing micro-kernels. The prepare hooks are designed for this. Future work alongside PQ/SQ/MinMax.

The goal of this PR is proving the abstraction with two concrete types (f32 identity + f16 lazy unpacking) sharing one tiling loop and micro-kernel. Remaining implementations are mechanical from here.

@suri-kumkaran suri-kumkaran changed the title from "Cache-Aware Block-Transposed Chamfer/MaxSim Distance for f32" to "Cache-Aware Block-Transposed Chamfer/MaxSim Distance for f32 and f16" on Apr 1, 2026
@suri-kumkaran suri-kumkaran marked this pull request as ready for review April 1, 2026 21:36
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new SIMD-accelerated, cache-tiled kernel framework for multi-vector MaxSim/Chamfer distance, targeting block-transposed query layouts and supporting both f32 and f16 (via staged f16→f32 preparation).

Changes:

  • Added a new distance::kernels module with an unsafe Kernel<A> abstraction and a shared tiled_reduce 5-level cache tiling loop.
  • Implemented f32 (AVX2+FMA + scalar/Neon delegation) and f16 (SIMD/scalar conversion + delegation to f32 microkernels) kernel families with correctness tests vs the fallback.
  • Extended matrix and block-transposed utilities (MatRef::as_matrix_view, BlockTransposedRef::available_rows) and renamed the previous “simple” implementation to fallback.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
diskann-quantization/src/multi_vector/matrix.rs Adds MatRef<Standard<T>>::as_slice() and as_matrix_view() plus a roundtrip test.
diskann-quantization/src/multi_vector/distance/mod.rs Wires in fallback and new kernels module; updates re-exports and docs.
diskann-quantization/src/multi_vector/distance/fallback.rs Renames Simple→Fallback kernel and disambiguates a test conversion.
diskann-quantization/src/multi_vector/block_transposed.rs Adds available_rows() and validates it in tests.
diskann-quantization/src/multi_vector/distance/kernels/mod.rs Introduces kernel framework module and cache-budget helpers.
diskann-quantization/src/multi_vector/distance/kernels/tiled_reduce.rs Adds generic tiling loop + reduction helper trait + planner tests.
diskann-quantization/src/multi_vector/distance/kernels/f32/mod.rs Adds f32 kernel entrypoint and MaxSim/Chamfer impls + tests.
diskann-quantization/src/multi_vector/distance/kernels/f32/v3.rs Adds AVX2+FMA 16×4 microkernel and V4→V3 delegation.
diskann-quantization/src/multi_vector/distance/kernels/f32/scalar.rs Adds scalar 8×2 microkernel and Neon→Scalar delegation.
diskann-quantization/src/multi_vector/distance/kernels/f16/mod.rs Adds f16 kernel entrypoint and MaxSim/Chamfer impls + tests.
diskann-quantization/src/multi_vector/distance/kernels/f16/v3.rs Adds SIMD f16→f32 prepare hooks and delegates to f32 V3 microkernel.
diskann-quantization/src/multi_vector/distance/kernels/f16/scalar.rs Adds scalar f16→f32 prepare hooks and delegates to scalar f32 microkernel.


Contributor

@arrayka arrayka left a comment

The PR description suggests that the existing simple kernel is slower than the proposed solution. Could you please support this claim with easy‑to‑reproduce benchmark results?

That would allow others to replicate the numbers to verify the performance claims.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.



@suri-kumkaran suri-kumkaran self-assigned this Apr 13, 2026
Contributor

@hildebrandmw hildebrandmw left a comment

Thanks Suryansh - this is getting close. The big thing I noticed is that the data preparation step is happening in the wrong place. It should be done at the tile level, not the panel level, to maximize the reuse of the preparation step.

a: *const Self::APrepared,
b: *const Self::BPrepared,
k: usize,
r: *mut f32,
Contributor

With this type of accumulator, are we over-fitting for Chamfer/MaxSim? What if we wanted to implement arg max for brute-force search? Also, in the case of u8/i8, we wouldn't want f32 as the result. We'd want u32/i32, right?

Contributor Author

As discussed offline, we’ll address this in a follow-up. I’ll spend some time thinking it through, outline an approach, and add the proposed direction to the PR description/comments.

Contributor

@hildebrandmw hildebrandmw left a comment

Thanks Suryansh - this is getting there. In addition to the inline nits, I have some larger overall concerns:

Documentation: While documentation is good, verbose documentation that is redundant (e.g. repeating what the type signature already says, what rustdoc can retrieve, or what the function name or a quick glance at the code conveys) is more harmful than helpful. Documentation of intent, invariants, surprises, etc. is great! But please look through many of the comments/docs and prune out the ones that are low signal.

Testing: Testing a lot of edge cases for matrix sizes is great. However, the test cases exercised here do not actually hit the main body of the loop nest; they all fall into the peeled section. This means that, as-is, one of the main loops in this PR is not being tested in any way. Please fix this. Unfortunately, matrices of a realistic dimension relative to the L1 and L2 cache sizes are needed to exercise these paths, which is slow and means Miri tests will be especially lethargic. The best way I see to remedy that is to make the cache sizes configurable or overridable (which we will need in any case). But this really needs coverage.

///
/// The blanket identity impl covers every layout converting to itself
/// with `Buffer = ()` and zero cost. Explicit impls handle f16→f32 via
/// [`SliceCast`].
Contributor

The terminology around row * k and being contiguous makes me worry a little bit about future scenarios for strided access. What are the plans for systematically and safely updating the code when that lands?

Contributor Author

Yes, let’s address it in a follow-up once the design for strided access is clearer. I’m still working out the right level at which to handle K-splitting and perform the conversion.

Contributor

@arkrishn94 arkrishn94 left a comment

Thanks Suryansh, I think this is almost ready to merge! Left some small comments but other than that, the things that I wanted to highlight -

  1. Testing. From what I can tell, the tiled_reduce loop is not tested across architectures. If that's the case, we should definitely add that.
  2. On specializing the output type for the scratch to support other reductions like argmax and inputs like i8/u8 - am I misunderstanding, or should this be easy by augmenting Kernel with an associated return type and wiring that through?
  3. This is probably because I don't understand all the details very well, but I still don't see how supporting metadata for quantized vectors will be wired through, in terms of where it will live and how it'll be accessed in the main micro-kernel for post-op processing.

///
/// * `a` must point to `A_PANEL * k` contiguous `APrepared` values.
/// * `b` must point to `B_PANEL * k` contiguous `BPrepared` values.
/// * `r` must point to at least `A_PANEL` writable `f32` values.
Contributor

And be valid for the lifetime of this function execution only?

Contributor Author

Lifetime-of-call validity is the implicit raw-pointer convention (stdlib's pointer APIs don't spell it out either) - the kernel doesn't store any of the pointers across the call. Left the contracts in their concise form to keep them scannable; happy to add it explicitly if you'd rather have the trait be paranoid-self-contained.

/// # Safety
///
/// * `src` must point to `rows * k` valid elements in `Self`'s layout.
/// * `buf` must come from [`new_buffer`](Self::new_buffer) with
Contributor

I'm guessing buf has to be created with the same k as used in convert?

Contributor Author

Correct - added to the safety contract: buf must come from new_buffer with the same k (and max_tile_rows >= rows). The f16→f32 impls allocate max_tile_rows * k and convert writes rows * k via &mut buf[..count], so a smaller k would short-write or panic on the slice bound. The blanket identity impl ignores both, so this is purely a contract for non-identity converters.

//! | [`BlockTransposedRef`] | Immutable view of a block-transposed matrix |
//! | [`BlockTransposedMut`] | Mutable view of a block-transposed matrix |
//! | [`QueryMatRef`] | Query wrapper for asymmetric distances |
//! | [`QueryComputer`] | Architecture-dispatched SIMD query computer |
Contributor

nit: Can we separate the matrix types from the computer type in the documentation? Might be worth adding separate documentation for it here, since it's a core type in the new distance computation path.

Contributor Author

Good call - I'd lean toward keeping the table flat. It's meant as a fast inventory of what multi_vector re-exports, and MaxSim/Chamfer are equally first-class on the new distance path; pulling QueryComputer out into its own section while leaving them in the table would be inconsistent. The detailed docs already live on the type itself in query_computer/mod.rs (dispatch model, build cost, usage). Happy to expand that type-level doc if you feel anything's missing — just don't think the module-level overview is the right place for it. WDYT?


Development

Successfully merging this pull request may close these issues.

Add tiled_reduce function with Kernel and ConvertTo abstraction (f16 and f32 impls with hardcoded reduce)

8 participants