Support for inline-beta filtered search with expressions#782

Open
gopalrs wants to merge 61 commits into main from sync-from-cdb-diskann

Contributor

@gopalrs gopalrs commented Feb 16, 2026

This PR has the following changes:

  • Add support for inline-beta search with filter expressions that combine equality comparisons using AND and OR.

  • Add a benchmark that evaluates performance and recall on a small dataset and also serves as an example of how to set up filtered search with expressions.

- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
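
To make the kind of filter expression described above concrete, here is a minimal sketch of an AND/OR/equality expression evaluated against a document's attributes. The names FilterExpr, eq, and evaluate are hypothetical and chosen for illustration only; this is not the ASTExpr type or the evaluation path used by diskann-label-filter, and attributes are assumed to be simple string key/value pairs.

```rust
use std::collections::HashMap;

/// Hypothetical filter AST (not the crate's `ASTExpr`):
/// equality leaves combined with AND / OR nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterExpr {
    Eq { field: String, value: String },
    And(Vec<FilterExpr>),
    Or(Vec<FilterExpr>),
}

/// Convenience constructor for an equality leaf.
pub fn eq(field: &str, value: &str) -> FilterExpr {
    FilterExpr::Eq {
        field: field.to_string(),
        value: value.to_string(),
    }
}

/// Returns true when `attrs` satisfies `expr`.
/// An empty `And` is vacuously true; an empty `Or` is false.
pub fn evaluate(expr: &FilterExpr, attrs: &HashMap<String, String>) -> bool {
    match expr {
        FilterExpr::Eq { field, value } => attrs.get(field) == Some(value),
        FilterExpr::And(children) => children.iter().all(|c| evaluate(c, attrs)),
        FilterExpr::Or(children) => children.iter().any(|c| evaluate(c, attrs)),
    }
}
```

With this sketch, a predicate such as color = red AND (size = small OR size = large) becomes one And node with an Eq leaf and a nested Or node.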
Contributor

Copilot AI left a comment

Pull request overview

This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.

Changes:

  • Added DocumentInsertStrategy and supporting public types to insert/query Document objects (vector + attributes) through DocumentProvider.
  • Extended inline beta filter search to handle predicate encoding failures and added a constructor for InlineBetaStrategy.
  • Added a new benchmark input/backend (document-index-build) plus example config for running document + filter benchmarks.
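
The second bullet mentions handling predicate encoding failures. The sketch below shows one way such a fallback could look, under two assumptions this page does not confirm: that a candidate matching the filter has its distance multiplied by beta, and that an unencodable filter degrades to plain unfiltered distances. BetaQueryComputer, adjust, and the HashSet-based encoding are hypothetical; only the name is_valid_filter appears in the review summary.

```rust
use std::collections::HashSet;

/// Hypothetical query-side state. `matching` is `None` when the predicate
/// could not be encoded (the "invalid filter" case).
pub struct BetaQueryComputer {
    beta: f32,
    matching: Option<HashSet<u32>>,
}

impl BetaQueryComputer {
    pub fn new(beta: f32, matching: Option<HashSet<u32>>) -> Self {
        Self { beta, matching }
    }

    /// True when the filter was encoded successfully.
    pub fn is_valid_filter(&self) -> bool {
        self.matching.is_some()
    }

    /// Assumed behavior: scale the distance by `beta` for candidates that
    /// match the filter; otherwise (no match, or no usable filter) return
    /// the raw distance unchanged.
    pub fn adjust(&self, id: u32, raw_distance: f32) -> f32 {
        match &self.matching {
            Some(ids) if ids.contains(&id) => raw_distance * self.beta,
            _ => raw_distance,
        }
    }
}
```

Representing the encoded filter as an Option keeps the failure case explicit at the type level, so the distance path can skip filter checks entirely when nothing was encoded.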

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| test_data/disk_index_search/data.256.label.jsonl | Updates LFS pointer for label test data used in filter benchmarks. |
| diskann-tools/src/utils/ground_truth.rs | Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation. |
| diskann-tools/Cargo.toml | Adds serde_json dependency (and adjusts manifest metadata). |
| diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs | Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage). |
| diskann-label-filter/src/lib.rs | Exposes the new document_insert_strategy module under encoded_attribute_provider. |
| diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs | Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic. |
| diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs | Adjusts filter encoding to be optional and threads is_valid_filter into the query computer. |
| diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs | Makes RoaringAttributeStore public for cross-crate use. |
| diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs | Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior. |
| diskann-label-filter/src/encoded_attribute_provider/document_provider.rs | Allows vector types used in documents to be ?Sized. |
| diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs | New strategy wrapper enabling insertion/search over Document values. |
| diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs | Simplifies lookup error messaging and signature for attribute→id mapping. |
| diskann-label-filter/src/document.rs | Makes Document generic over ?Sized vectors. |
| diskann-benchmark/src/utils/tokio.rs | Adds a reusable multi-thread Tokio runtime builder. |
| diskann-benchmark/src/utils/recall.rs | Re-exports knn recall helper for benchmark use. |
| diskann-benchmark/src/inputs/mod.rs | Registers a new document_index input module. |
| diskann-benchmark/src/inputs/document_index.rs | New benchmark input schema for document-index build + filtered search runs. |
| diskann-benchmark/src/backend/mod.rs | Registers new document_index backend benchmarks. |
| diskann-benchmark/src/backend/index/result.rs | Extends search result reporting with query count and wall-clock summary columns. |
| diskann-benchmark/src/backend/document_index/mod.rs | New backend module entrypoint for document index benchmarks. |
| diskann-benchmark/src/backend/document_index/benchmark.rs | New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy. |
| diskann-benchmark/example/document-filter.json | Adds example job configuration for document filter benchmark runs. |
| Cargo.lock | Adds serde_json to the lockfile dependencies. |

Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-tools/Cargo.toml
Comment thread diskann-tools/src/utils/ground_truth.rs Outdated
Comment thread diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Outdated
Comment thread diskann-benchmark/src/utils/tokio.rs Outdated
Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Outdated
Comment thread diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Outdated
Comment thread diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs Outdated
Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs Outdated
Comment thread diskann-benchmark/src/inputs/document_index.rs
Comment thread diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs Outdated
Comment thread diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Outdated
Comment thread diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs Outdated
Comment thread diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs Outdated
Comment thread test_data/disk_index_search/data.256.label.jsonl Outdated
@sampathrg sampathrg requested a review from hildebrandmw March 23, 2026 13:44
Comment thread diskann-tools/src/utils/ground_truth.rs Outdated
Comment thread diskann-tools/src/utils/ground_truth.rs Outdated
Comment thread diskann-label-filter/src/query.rs Outdated
- pub struct FilteredQuery<V> {
-     query: V,
+ pub struct FilteredQuery<'a, V: ?Sized> {
+     query: &'a V,
Contributor

Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of storing query: &'a V, FilteredQuery could hold V directly:

pub struct FilteredQuery<V> {
    query: V,
    filter_expr: ASTExpr,
}

impl<V> FilteredQuery<V> {
    fn query<'a>(&'a self) -> V::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }
}

And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous).

This does a couple things. First, it lets FilteredQuery have an owned query if needed and gets rid of the repeated lifetime bound.

Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).
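
A self-contained sketch of this suggestion may help show why it accepts both owned and borrowed queries. The Reborrow trait below is a local stand-in written for this example (the real trait's definition is not shown on this page), and filter_expr is a plain String placeholder rather than ASTExpr.

```rust
/// Hypothetical stand-in for the `Reborrow` trait named in the review.
pub trait Reborrow<'a> {
    type Target;
    fn reborrow(&'a self) -> Self::Target;
}

/// An owned vector reborrows as a slice.
impl<'a, T: 'a> Reborrow<'a> for Vec<T> {
    type Target = &'a [T];
    fn reborrow(&'a self) -> &'a [T] {
        self.as_slice()
    }
}

/// A borrowed slice reborrows as a (possibly shorter-lived) slice.
impl<'a, 'b: 'a, T: 'a> Reborrow<'a> for &'b [T] {
    type Target = &'a [T];
    fn reborrow(&'a self) -> &'a [T] {
        *self
    }
}

/// Query wrapper that can own or borrow its vector.
pub struct FilteredQuery<V> {
    query: V,
    filter_expr: String, // placeholder for the real expression type
}

impl<V> FilteredQuery<V> {
    pub fn new(query: V, filter_expr: String) -> Self {
        Self { query, filter_expr }
    }

    /// Hands out the query in whatever borrowed form `V` reborrows to.
    pub fn query<'a>(&'a self) -> <V as Reborrow<'a>>::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }

    pub fn filter_expr(&self) -> &str {
        &self.filter_expr
    }
}
```

In this sketch FilteredQuery<Vec<f32>> and FilteredQuery<&[f32]> both yield &[f32] from query(), and the struct itself carries no lifetime parameter.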

Comment thread diskann-benchmark/src/backend/document_index/benchmark.rs
let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());

// Use a concrete IdDistance scratch buffer so that both the IDs and distances
// are captured. Afterwards, the valid IDs are forwarded into the framework buffer.
Contributor

Perhaps diskann-benchmark-core should be updated to capture distances as well. I think this can be done in a non-breaking way (not a blocker for this PR).

Copy link
Copy Markdown

Okay, I can do it in a separate PR.
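
For readers unfamiliar with the pattern in the code comment under discussion, here is a minimal sketch of capturing IDs together with distances in a scratch buffer and then forwarding only the valid IDs to an output buffer. IdDistance here is a local type defined for the example and forward_valid_ids is a hypothetical helper; neither is taken from diskann-benchmark-core.

```rust
/// Local stand-in for an (id, distance) search hit.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct IdDistance {
    pub id: u32,
    pub distance: f32,
}

/// Copies the IDs of the first `valid` hits in `scratch` into `out`.
/// Returns how many IDs were written, which is capped by the lengths of
/// both `scratch` and `out`. Distances stay available in `scratch`.
pub fn forward_valid_ids(scratch: &[IdDistance], valid: usize, out: &mut [u32]) -> usize {
    let n = valid.min(scratch.len()).min(out.len());
    for (slot, hit) in out.iter_mut().zip(&scratch[..n]) {
        *slot = hit.id;
    }
    n
}
```

Keeping the distances in the scratch buffer lets the caller report or check them without changing the ID-only output buffer's type.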

let query_vec = self.queries.row(index);
let (_, ref ast_expr) = self.predicates[index];
let strategy = InlineBetaStrategy::new(self.beta, common::FullPrecision);
let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());
Contributor

One theme I've been observing throughout diskann-label-filter is the design kind of inherently forces patterns like cloning the ast_expr for the query.

I'm not reviewing the benchmark code in too much detail, but I strongly encourage looking for patterns like forced clones in loops as opportunities for making the underlying implementation better.

Copy link
Copy Markdown

I looked through the benchmark code. This clone right here seems like the only one that would cause performance issues. Avoiding it would mean holding a reference to the ast_expr instead of owning it.
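
A rough sketch of what holding a reference could look like is below. Expr is a placeholder for ASTExpr, and BorrowedFilteredQuery and build_queries are hypothetical names; the point is only that each per-query wrapper borrows its expression from the predicate list instead of cloning it inside the loop.

```rust
/// Placeholder for the real expression type.
#[derive(Debug, Clone, PartialEq)]
pub struct Expr(pub String);

/// Query wrapper that borrows its filter expression instead of owning it.
pub struct BorrowedFilteredQuery<'e, V> {
    query: V,
    filter_expr: &'e Expr,
}

impl<'e, V> BorrowedFilteredQuery<'e, V> {
    pub fn new(query: V, filter_expr: &'e Expr) -> Self {
        Self { query, filter_expr }
    }

    pub fn query(&self) -> &V {
        &self.query
    }

    pub fn filter_expr(&self) -> &'e Expr {
        self.filter_expr
    }
}

/// Pairs each query row with its predicate without cloning any expression.
pub fn build_queries<'q, 'e>(
    queries: &'q [Vec<f32>],
    predicates: &'e [(usize, Expr)],
) -> Vec<BorrowedFilteredQuery<'e, &'q [f32]>> {
    queries
        .iter()
        .zip(predicates)
        .map(|(q, (_, expr))| BorrowedFilteredQuery::new(q.as_slice(), expr))
        .collect()
}
```

The trade-off is an extra lifetime parameter on the query type, which ties each query to the storage that owns the predicates.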

@sampathrg sampathrg requested a review from hildebrandmw April 8, 2026 16:27

8 participants