Skip to content

feat: external IVF-PQ vector index over parquet (Phase 1) — RFC draft#1

Draft
sezruby wants to merge 23 commits into
mainfrom
external-index-rfc-draft
Draft

feat: external IVF-PQ vector index over parquet (Phase 1) — RFC draft#1
sezruby wants to merge 23 commits into
mainfrom
external-index-rfc-draft

Conversation

@sezruby

@sezruby sezruby commented May 22, 2026

Copy link
Copy Markdown
Owner

Draft PR for tracking — see sezruby/lance-spark#4 for the full RFC + design + benchmark numbers.

Summary

Phase 1 of the External Vector Index RFC. Lets Lance build and query an IVF-PQ vector index directly over a list of caller-supplied parquet files, with no Lance dataset required for either build or query — including full refinement via Lance-owned parquet random access.

The user-facing primitive is "a Lance index file over a list of parquet files" — a self-contained artifact that doesn't depend on a Lance dataset and doesn't push parquet random-access code into the engine layer. Unlocks Lance vector search for Spark over parquet/Delta, DuckDB over parquet, in-memory engines, embedded use cases.

What's in this PR

Rust surface

rust/lance/src/index/vector/external/:

  • mod.rsExternalIvfPqIndex handle: build / open / search / fetch_rows
  • types.rsParquetFileSpec, SearchResult, ParquetRowKey, RowFilter trait
  • params.rsExternalIvfPqIndexParams builder
  • parquet_source.rsParquetVectorSource (training sample + rid-annotated batch stream); also handles List<Float32> from JVM parquet writers (Spark's parquet writer doesn't preserve FixedSizeList even when all rows have the same length)
  • manifest.rs — sidecar manifest.json with parquet file list + build params
  • build.rs — orchestrates KMeans → PQ codebook → IvfTransformer → shuffle → write
  • open.rs — reads manifest + index file, decodes IVF + PQ from existing protobuf
  • search.rs — IVF probe + per-file parquet refinement via PageIndexPolicy::Required; pluggable RowFilter
  • fetch.rs — post-topK random row fetch with arbitrary projection (the killer feature for engine integration)

rust/lance/src/index/vector/ivf.rs:

  • New pub async fn write_ivf_pq_file_external(&ObjectStore, &Path, ..., dataset_version: u64, ivf, pq, streams) — parallel to the existing write_ivf_pq_file_from_existing_index but with the &Dataset parameter dropped. Existing dataset-backed path becomes a thin wrapper.

rust/lance/Cargo.toml:

  • parquet promoted from dev-dep to runtime dep.

JNI + Java surface

java/lance-jni/src/external_index.rs:

  • nativeBuild, nativeOpen, nativeClose, nativeSearch, nativeFetchRows, plus accessors for numPartitions, numFiles, vectorColumn.
  • RowFilter exposed as a packed LE byte[] of (file_id << 32) | row_index deleted rids — sidesteps cross-language callbacks while still supporting Delta DV / Iceberg position deletes.
  • fetchRows returns Arrow IPC stream bytes (decoded JVM-side via ArrowStreamReader).

java/src/main/java/org/lance/index/external/:

  • ExternalIvfPqIndex — handle, AutoCloseable, JavaBean accessors per java/CLAUDE.md style
  • ExternalIvfPqIndexParams — builder pattern with Metric enum
  • SearchResult(filePath, rowIndex, distance) value class
  • ParquetRowKey(filePath, rowIndex) value class

Tests

  • rust/lance/tests/external_index_phase1.rs — end-to-end integration test: build → open → search (recall above threshold) → fetchRows projection in input order with duplicates → RowFilter exclusion. 16 left vectors × 3 parquet files of 320 rows.
  • 14 unit tests across the new module (params builder, types, manifest round-trip, parquet source dim/sample/iter_batches).
  • 5 JNI smoke tests (handle load, exception bridge, packed-rid round-trip, params builder).

All passing.

Headline benchmark numbers

Cluster Spark 3.5 (8 × 4 cores × 16 GB executor pods). Configs:

  • B-narrow / B-wide: per-query temp-Lance write + kNearestJoin (the existing fallback path for non-Lance R)
  • C-indexed-narrow / C-indexed-wide: Lance-native R + IVF-PQ index (apples-to-apples reference: same algorithm, R already in Lance)
  • E: external-index over parquet via this PR's ExternalIvfPqIndex

Per-query latency at K=10, |L|=100, dim=128, IVF=256, PQ=16:

Scale B-narrow C-indexed-narrow* E warm E build (one-time)
wide-tiny (|R|=100K, 16 cols) 5,696 ms (running) 1,875 ms 7,230 ms
wide-medium (|R|=1M, 16 cols) 18,128 ms (running) 570 ms 29,617 ms
wide-large (|R|=1M, 64 cols) 17,187 ms 521 ms 30,140 ms

* Earlier benchmark used un-indexed Lance-native R as "C". Re-running with indexed C this round; numbers will be folded into the issue.

E-warm is 30-33× faster than B-narrow at scale. E pays a fixed index build (~30s on cluster for 1M × dim=128 IVF=256 PQ=16) that amortizes after ~2 queries vs B-narrow.

The full benchmark methodology + cross-compile workflow (cargo zigbuild for linux x86-64 from macOS arm64) lives in sezruby/lance-spark#4.

What's NOT in this PR

  • HNSW and IVF-FLAT external builds (RFC Phase 2/3)
  • append() / compact() for incremental index updates (RFC Phase 4)
  • Manifest fingerprint validation (mtime/size/footer-hash) for staleness detection — Phase 1 manifest stores only (file_path, num_rows). Cross-Spark-session persistent index reuse needs this; in-process driver-side cache is sufficient for the current lance-spark integration.
  • SQL Catalyst integration on the lance-spark side (Phase 3 follow-up; DataFrame API entry point ships today).

Status

Marked DRAFT — this is for review of the design + API surface + integration approach. Not a merge candidate to upstream lance-format/lance until RFC discussion lands and any agreed-upon scope/API changes are folded in.

References

touch-of-grey and others added 23 commits May 19, 2026 12:14
…rmat#6848)

## Summary
- allow MemWAL initialization on append-only tables without unenforced
primary key metadata
- keep bucket sharding constrained to the primary key when a primary key
exists, but allow no-PK tables to bucket by a non-nested column
- update Rust, Python, and Java tests plus MemWAL docs

Fixes lance-format#6846

## Tests
- cargo fmt --all
- cargo test -p lance test_initialize_mem_wal_bucket_sharding
- uv run make build
- uv run pytest
python/tests/test_mem_wal.py::test_initialize_mem_wal_bucket_sharding_without_primary_key
- git diff --check
…-format#6752)

Carries on lance-format#5997 (and the benchmarking in discussion lance-format#5947), and follows
up on lance-format#6728 where moving S3 Express away from O(n) manifest listing to a
version hint was raised — picking that up here.

## What

On object stores where `list` is **not** lexicographically ordered (e.g.
S3 Express, the local filesystem), resolving the latest manifest version
is O(n) in the number of versions. To avoid this, after every successful
commit on such a store we write a small JSON file
`_versions/latest_version_hint.json` with content `{"version":N}`. A
reader then does a GET on the hint file plus a few HEAD probes (O(k),
where k = versions added since the hint was written), and falls back to
a full listing if the hint is missing (older datasets) or stale.

- The hint is written/read **only on non-lexically-ordered stores**. On
S3 Standard / GCS / Azure / OSS / Tencent / DynamoDB / memory the
ordered listing already resolves the latest version in roughly one
request, so the hint would only add a PUT per commit for nothing.
- `current_manifest_path` uses the hint for non-lexically-ordered,
non-local stores (the local filesystem keeps its existing
single-directory-read fast path);
`CommitHandler::list_manifest_locations_since` (used by
`load_new_transactions`) follows the same strategy.
- The hint write is **awaited** as part of the commit (no
fire-and-forget mode). It is best-effort: failures are logged and
ignored, since the hint only accelerates reads and never affects
correctness — readers always verify the hinted version and probe upward
from it. Detached versions are never written to the hint.
- A transient (non-`NotFound`) object-store error while probing abandons
the hint path so the caller falls back to a full listing rather than
trust a possibly-stale or incomplete result. The gap-fill HEADs are
bounded by `io_parallelism()`, and a far-behind reader (gap > 1000)
falls back to a single paginated listing.

## Differences from lance-format#5997

- Only the JSON hint format is kept (the alternative file-size-encoded
format and its env var are dropped).
- The fire-and-forget / async hint-write mode is removed — the hint is
always written synchronously, which keeps concurrent writes simpler with
no meaningful latency cost.
- The hint is gated to non-lexically-ordered stores, where it's actually
read.
- `current_manifest_path` picks one strategy based on the store rather
than racing a HEAD-probe against a listing, keeping IO behavior
deterministic.

A `manifest_commit` benchmark is included to measure commit/load latency
growth with many small fragments.

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
…ter (lance-format#6818)

## Problem

`test_ann_prefilter` is flaky and failed on CI (`linux-arm`, Rust) —
e.g. on the unrelated PR lance-format#6757 — with the HNSW+SQ parametrization
returning a near-miss neighbor (row 10 instead of 6).

## Root cause

HNSW node-level assignment uses an **unseeded** thread RNG
(`rand::rng()`) in both the offline (`HnswBuilder`) and online
(`OnlineHnswBuilder`) builders, so every index build produces a
different random graph. On a tiny 300-vector dataset, an approximate
HNSW+SQ search over a different graph each run can return a near
neighbor instead of the exact one. `main` was green by luck of the RNG,
not correctness.

This is **not** caused by lance-format#6757 (the `String`→`Uuid` index-id refactor):
index cache keys and on-disk index paths are byte-identical before/after
that change; the test only surfaced the pre-existing flakiness.

## Fix

- Seed both level-assignment sites with a shared fixed constant
(`HNSW_LEVEL_RNG_SEED`) via `SmallRng`, making graph construction
reproducible. Recall is statistically unaffected (identical level
distribution; only the draws are fixed). A constant — rather than a new
`HnswBuildParams` field — keeps the change contained (no
serde/proto/binding changes).
- Harden `test_ann_prefilter` to assert the property it actually
validates (prefilter honored: `filterable > 5`) instead of an exact
nearest-neighbor id, per the repo guideline that vector-index tests
assert recall, not exact matches.

Co-authored-by: Vova Kolmakov <wombatukun@apache.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… rows (lance-format#6774)

## Summary

* Fixes lance-format#6735.
* `transaction.rs` — `resolve_update_version_metadata`: in the per-row
`created_at`
  mapping, check `row_id_to_source.contains_key(&rid)` before calling
`resolve_created_at_version`. Rows not in the map are INSERT branch rows
(no source
in existing fragments); they now receive `new_version` as `created_at`
instead of the
  previous fallback of `UNKNOWN_CREATED_AT_VERSION` (1).
* `resolve_created_at_version` doc comment updated to clarify it is only
called for
UPDATE branch rows (source confirmed present). The unmapped-row-ID
branch inside the
function is unused when called from `resolve_update_version_metadata`;
`UNKNOWN` (1)
  still applies for UPDATE rows whose source fragment has missing or bad
`created_at_version_meta` (cache miss, decode failure, or out-of-range
offset).
* Two existing tests updated to assert the corrected behavior; one new
test added.


## Background

MERGE INTO commits through `Operation::Update` and produces both UPDATE
branch rows
(rewritten into `new_fragments` with a source row in the previous
manifest, stable row
ID present in `row_id_to_source`) and INSERT branch rows (new rows also
in
`new_fragments`, stable row ID assigned fresh, not present in any
existing fragment).

Before this change, `resolve_update_version_metadata` built the per-row
`created_at_versions` vector by calling `resolve_created_at_version` for
every row ID.
For UPDATE branch rows that function correctly copies `created_at` from
the source
fragment. For INSERT branch rows the map lookup fails and the function
returns
`UNKNOWN_CREATED_AT_VERSION = 1`, producing a wrong historical version
for every newly
inserted row. CDF consumers cannot distinguish merge-inserted rows from
updated rows via
`_row_created_at_version`, and the value 1 is meaningless for rows that
first appeared
in a recent commit.

The fix is a single guard at the call site: only call
`resolve_created_at_version` for
rows confirmed to have a source (UPDATE branch); for all other rows use
`new_version`
directly.


## Implementation notes

* The guard uses `row_id_to_source.contains_key(&rid)`, which is an O(1)
hash lookup
on the same map already built for the UPDATE branch path — no additional
data
  structures or iteration.
* No lance-spark changes are needed. The Spark commit path
(SparkPositionDeltaWrite)
already attaches `RowIdMeta` to new fragment rows. This change activates
the correct
behavior automatically for all callers of `Operation::Update`, including
lance-spark
  MERGE INTO.
* lane-spark test update
lance-format/lance-spark#530


## Test plan

* `test_update_version_tracking_insert_branch_gets_new_version` (renamed
from
`test_update_version_tracking_unknown_row_id_defaults_to_1`): new
fragment with one
UPDATE branch row (ID 10, source `created_at = 5`) and one INSERT branch
row (ID 999);
asserts `created_at = [5, 5]` — UPDATE branch copies from source, INSERT
branch gets
  `new_version` (5).
*
`test_update_version_tracking_merge_into_distinguishes_insert_and_update_branch`
(new):
new fragment interleaves UPDATE branch rows (IDs 10, 11, source
`created_at = 3`) and
INSERT branch rows (IDs 500, 501); asserts `created_at = [3, 5, 3, 5]`
to verify
  per-row correctness across both branches in the same fragment.
* `test_update_version_tracking_no_row_id_meta_fallback`: assertion
updated from
`[1, 1, 1]` to `[5, 5, 5]` — a fragment with no `row_id_meta` gets fresh
stable IDs
assigned by `assign_row_ids`; those IDs have no source and are INSERT
branch rows,
  so `created_at` equals `new_version`.
*
`test_update_version_tracking_source_fragment_no_created_at_defaults_to_1`
(unchanged):
confirms that UPDATE branch rows whose source fragment has no
`created_at_version_meta`
still fall back to `UNKNOWN` (1) — the remaining reachable path through
  `resolve_created_at_version`.

Co-authored-by: Jing chen He <jingh@adobe.com>
## Problem

Legacy branches, i.e. branches whose `BranchContents` were written
without a persisted `branch_identifier`, currently deserialize through
`BranchIdentifier::none()`. That fallback generates a fresh random UUID
on each read, so the same unchanged branch can surface a different
`branch_identifier` across repeated loads.

This makes branch identity unstable in both Python and Java for legacy
datasets. On the Python side, `branches.list()` / `branches_ordered()`
expose `branch_identifier` directly, so callers that diff, cache, or
snapshot branch metadata can observe false changes even when the branch
itself has not changed. On the Java side, the same legacy branch can
also appear with a different identifier across refreshes, which makes
equality-style comparisons unstable as well.

## Summary
- stabilize fallback branch identifiers for legacy branch metadata by
replacing the missing-identifier sentinel with a deterministic synthetic
UUID during branch metadata reads
- keep the fallback logic localized to Rust branch metadata loading so
Python and Java continues returning stable `branch_identifier` values
without API shape changes
- add a lightweight Rust regression test that exercises
`BranchContents::from_path` on in-memory branch metadata and verifies
stable repeated reads plus distinct identifiers for different branch
names
Document the PR publishing requirements in `AGENTS.md` so agent-created
PRs match the checks enforced by CI before they are opened or updated.

This records two required gates:

- PR titles must follow Conventional Commits because
`.github/workflows/pr-title.yml` validates the title and body with
commitlint.
- PRs must run lint checks for every touched language surface before
creation or update. Rust changes require `cargo fmt --all` and `cargo
clippy --all --tests --benches -- -D warnings`; Python changes require
the `python/AGENTS.md` environment workflow and `uv run make lint` from
`python/`.

If a required lint check cannot be run, the blocker must be stated
explicitly in the PR summary.
…mat#6793)

Makes BTree scalar index cache entries serializable, so a persistent
cache backend can store and reload them without re-reading from storage.

Previously the whole BTree index was cached as `Arc<dyn ScalarIndex>`
under an `UnsizedCacheKey`, which can never carry a codec, and each
`FlatIndex` page was cached in-memory only.

Changes:
- `CacheCodecImpl for FlatIndex` (one BTree page) and
`BTreePageKey::codec()`.
- Top-level scalar index caching becomes a plugin implementation detail
via the existing `ScalarIndexPlugin::get_from_cache`/`put_in_cache`
hooks. The default impl preserves today's in-memory unsized caching
(backwards compatible); the BTree plugin overrides it with a sized,
codec-backed `BTreeIndexState` (the lookup `RecordBatch` + `batch_size`
+ `ranges_to_files`, from which `try_from_serialized` rebuilds the index
with no IO).
- Caching moves into `scalar::open_scalar_index` (get → miss → load →
put); the dataset-level `ScalarIndexCacheKey` logic is removed from
`Dataset::open_scalar_index`.

This keeps index-type-specific knowledge in `lance-index` rather than
leaking a state trait + dispatch into `lance/src/index.rs`.

Adds an integration test asserting that after prewarming with a
serializing cache backend, an indexed-filter query does 0 read IOPS.

Bitmap index will follow the same pattern in a separate PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cache vector index configuration within the index metadata, such as the
distance type and build parameters.

Previously, to determine things like the distance type or index type of
a vector index, the index file itself had to be opened. This PR stores
that information in `VectorIndexDetails` within the manifest's
`index_details` field, which is fetched and cached eagerly when loading
the manifest.

Old indexes have this field left blank. When blank, the details are
extracted from the index files and cached. This migration happens on the
first write with a new library version.

## What's stored in VectorIndexDetails

**Core build parameters** (typed fields — required for any runtime to
build the index):
- `metric_type`
- `target_partition_size` (IVF)
- `hnsw_index_config` — `max_connections`, `construction_ef`,
`max_level` (HNSW)
- `compression` — PQ/SQ/RQ/flat, including `num_bits`,
`num_sub_vectors`, `rotation_type`

**Runtime hints** (`map<string, string> runtime_hints`):
Optional build preferences that don't affect index structure. Stored so
a background rebuild process can reproduce the original configuration.
Runtimes that don't recognize a key must silently ignore it. Only
non-default values are written.

Keys use reverse-DNS namespacing: `lance.*` for core Lance hints, other
prefixes for runtime-specific hints (e.g., `lancedb.accelerator` for GPU
acceleration in LanceDB Enterprise).

Current `lance.*` hints: `lance.ivf.max_iters`, `lance.ivf.sample_rate`,
`lance.ivf.shuffle_partition_batches`,
`lance.ivf.shuffle_partition_concurrency`, `lance.pq.max_iters`,
`lance.pq.sample_rate`, `lance.pq.kmeans_redos`, `lance.sq.sample_rate`,
`lance.hnsw.prefetch_distance`, `lance.skip_transpose`.

Also adds `apply_runtime_hints()` to read hints back into build params
for future rebuild logic.

Closes lance-format#5963

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…es (lance-format#6874)

Adds `CacheCodec` impls so Bitmap and LabelList index cache entries
survive through a persistent cache backend, mirroring the BTree work in
lance-format#6793.

- `CacheCodecImpl for RowAddrTreeMap` (delegates to existing
`serialize_into`/`deserialize_from`), so per-value bitmap entries cached
under `BitmapKey` are codec-backed.
- `BitmapIndexState` captures the value→offset map (Arrow IPC), the null
bitmap, and the value type. `BitmapIndexPlugin` overrides
`get_from_cache`/`put_in_cache` to store this sized state.
- `LabelListIndexState` wraps an inner `BitmapIndexState` plus
`list_nulls` and gets the same plugin-level codec treatment.
- `open_scalar_index` skips the LabelList compatibility check on cache
hits, so a fully-cached LabelList query no longer pays an extra
`bitmap_page_lookup.lance` open per call.

## Tests

- Unit codec round-trip for `BitmapIndexState` (empty + populated).
- Integration tests
`test_{bitmap,label_list}_prewarm_with_serializing_backend_serves_query_with_no_io`
asserting zero IOPS after prewarm through a serializing cache backend.

Closes lance-format#6744
…mat#6816)

## Problem

In the LSM scanner, every query against an L0 (frozen/flushed)
generation re-opens that generation's Lance dataset from object storage.
There are **three** identical cold-open sites — scan (`planner.rs`),
point lookup (`point_lookup.rs`), and vector search (`vector_search.rs`)
— each doing `DatasetBuilder::from_uri(path).load()` with no session.
Per query, per flushed generation, this pays: manifest version discovery
+ manifest read + decode, file-metadata decode, and scalar/vector index
load. For an LSM tree, frozen generations are the single best caching
target, yet they were the only data source paying full cold-open cost on
every query.

## Key invariant

Flush writes each generation **once** to a globally-unique,
content-addressed path (`memtable/flush.rs`). Same path ⟹ same bytes,
forever — a cache hit can never be stale. This is the rare cache that
needs **no TTL and no invalidation for correctness**; pruning is
desirable only to reclaim memory.

## Changes (OSS `lance`)

Two complementary, independently-useful, opt-in pieces — defaults
preserve existing behavior exactly:

1. **`with_session` plumbing** — thread an existing `Arc<Session>` into
the scanner/planners so the first open of each generation populates and
reuses the shared index + file-metadata caches. `LsmScanner::new`
defaults this to the base table's session; `without_base_table` defaults
to `None`.

2. **`FlushedDatasetCache`** — a `moka`-backed, single-flight cache of
`Arc<Dataset>` keyed by resolved flushed path, owned and sized by the
consumer and injected per-request. After the first open, every
subsequent query for that generation is a pure `Arc::clone` with zero
object-store I/O. `retain_paths(live_paths)` prunes retired generations
at compaction (memory-only; correctness never depends on it).

A single shared `open_flushed_dataset(path, session, cache)` helper
replaces all three cold-open sites (repo rule: dedupe logic in 2+
places). `None`/`None` reproduces the original behavior precisely, so no
existing test changes.

`data_source.rs` / `collector.rs` are untouched — opening stays lazy
inside the planner, preserving bloom-filter pruning on point lookups.
Planner wiring uses chainable `with_session`/`with_flushed_cache`
builder methods rather than constructor changes, keeping `new()`
signatures (and every existing test/bench) untouched.

## Testing

- New unit tests for `FlushedDatasetCache`: miss opens once; hit returns
the same `Arc` (pointer eq); 16-way concurrent `get_or_open` opens
exactly once (single-flight); `retain_paths` drops the right keys;
no-cache path cold-opens each call.
- Regression: full `mem_wal::scanner` suite (78 tests) passes untouched.
- `cargo clippy -p lance --tests --benches` clean; `cargo fmt` clean.

## Notes

The `sophon` consumer side (process-bootstrap cache ownership, scanner
wiring, compaction `retain_paths`) is out of scope for this PR. Phase 1
(`with_session`) is independently shippable ahead of the cache.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an Arrow-native MemWAL sharding evaluator and exposes it through
the Java API/JNI.

- Evaluates MemWAL sharding specs against Arrow RecordBatch values for
bucket, identity, and unsharded fields.
- Resolves sharding source IDs through a Java-provided
source-id-to-column map.
- Adds Java-facing ShardingEvaluator returning an Arrow reader for the
evaluated sharding key batch.

This is needed by lance-spark to route writes using Lance's sharding
semantics instead of duplicating Spark-side bucket logic.
## Summary
- Add granular tracing targets for Lance event categories under
`lance::events::...`.
- Preserve the original tracing target in Python log passthrough so
`LANCE_LOG` can filter individual event types directly.
- Document the new event targets and add Python regression coverage for
target-specific log filtering.

## Testing
-
`PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH
cargo fmt --all -- --check`
- `git diff --check`
-
`PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH
cargo check -p lance-core -p lance-io -p lance-index -p lance -p
lance-datafusion`
-
`PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH
cargo check --manifest-path python/Cargo.toml`
- `uv sync --python 3.12 --no-install-project`
- `.venv/bin/maturin develop --uv -v`
- `uv run --no-sync --python 3.12 pytest -q
python/tests/test_log.py::test_lance_log_filters_trace_event_targets
python/tests/test_tracing.py::test_tracing_callback`
- `uv run --no-sync --python 3.12 ruff format --check
python/tests/test_log.py python/tests/test_tracing.py`
- `uv run --no-sync --python 3.12 ruff check python/tests/test_log.py
python/tests/test_tracing.py`

---------

Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
Split from lance-format#6856 — vector-search portion.

A primary key written multiple times into one memtable, or into both a
memtable and an older generation, used to leak through as distinct rows:
HNSW indexes every insert as its own graph node, so KNN could return
both V1 and V2 of the same PK from a single source.

Each per-source KNN now runs through `LsmSourceTagExec`, which appends
`(_memtable_gen, _freshness)`. A single `LsmGlobalPkDedupExec` over the
union keeps the row with the largest tuple per PK — newer generations
win, ties fall to the normalized within-source order. This replaces the
bloom-based `FilterStaleExec` design and is exact (no false-positive
recall loss, no top-k under-fill).

After global dedup + sort + top-k, a `TakeExec` materializes any
user-projected columns not in the per-source KNN output by fetching from
the base dataset via `_rowid`. `plan_search()` also accepts
`refine_factor` so callers can enable base-table refine. Exposed in the
Python and Java bindings.

Removes `FilterStaleExec` and `GenerationBloomFilter`.

Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995.

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…e-format#6880)

Split from lance-format#6856 — point-lookup portion.

A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: `FilterExec + LIMIT 1` over
an insert-ordered scan returned the *oldest* match among duplicates.

The active arm now runs `WithinSourceDedupExec(KeepMaxFreshness)`, which
collapses by PK and keeps the freshest row. Flushed and base arms still
rely on `LIMIT 1` under the reverse-write / forward-write conventions.

Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995.

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…ance-format#6799)

## Summary

Fixes lance-format#5155.

`InstrumentedRecordBatchStreamAdapter` measures a node's
`elapsed_compute` by timing its outer `poll_next`. For an
`ExecutionPlan` node with child inputs, that poll transitively polls
every child — so `EXPLAIN ANALYZE` shows the parent's CPU as `parent +
child + grandchild + ...`, and each ancestor double-counts its
descendants.

This PR fixes five nodes that have child inputs and were using the
broken wrapper. The fix takes two shapes depending on the node:

1. **Four nodes** with a clean "child input -> per-batch transform ->
output" shape are converted to use a new helper,
`InstrumentedChildInputStream<F, Fut>` (in
`rust/lance/src/io/exec/utils.rs`), modeled on DataFusion's
`FilterExecStream`. The helper pulls from a child input **without** the
timer running, then drives a per-batch async transform **with** the
timer running.

2. **`FlatMatchQueryExec`** doesn't fit that shape — its work all
happens inside `flat_bm25_search_stream`, which consumes the child input
and `spawn_cpu`s the per-batch tokenize/count work internally. For this
node, the fix instruments `flat_bm25_search_stream` directly so it can
report CPU time on a metric handle supplied by the caller.

## Nodes fixed

| Node | File | Mechanism |
|---|---|---|
| `AddRowAddrExec` | `rust/lance/src/io/exec/rowids.rs` | helper |
| `MapIndexExec` | `rust/lance/src/io/exec/scalar_index.rs` | helper |
| `FlatMatchFilterExec` | `rust/lance/src/io/exec/fts.rs` | helper |
| `KNNVectorDistanceExec` | `rust/lance/src/io/exec/knn.rs` | helper +
caller-side `Instant::now()` around `compute_distance(...).await` to
capture `spawn_blocking` work |
| `FlatMatchQueryExec` | `rust/lance/src/io/exec/fts.rs` +
`rust/lance-index/src/scalar/inverted/index.rs` | new
`flat_bm25_search_stream_with_metrics` that records CPU on a supplied
`Time` handle |

For `KNNVectorDistanceExec`, this also addresses the spawned-CPU
undercount from lance-format#5155: the helper's timer doesn't measure work happening
on `spawn_blocking` worker threads. KNN's transform closure wraps
`compute_distance(...).await` with `Instant::now()` and adds the elapsed
duration to `elapsed_compute` from the caller side. `compute_distance`'s
public signature is unchanged.

For `FlatMatchQueryExec`, the analogous mechanism lives in
`lance-index`: `flat_bm25_search_stream_with_metrics` accepts an
`Option<Time>` and records CPU around both the `spawn_cpu` body in
`tokenize_and_count` (phase 1) and the synchronous `initialize_scorer` +
`flat_bm25_score` (phase 2). The original `flat_bm25_search_stream` is
preserved as a `#[deprecated]` thin wrapper that delegates with `None` —
no breaking API change.

## Test plan

### Unit tests

-
**`utils.rs::instrumented_child_input_stream_excludes_child_poll_time`**:
a child stream sleeps 60ms per `poll_next`; the transform sleeps 30ms.
The helper's `elapsed_compute` should be `~3 × 30ms`, not `~3 × 90ms`.
Verified to **fail** on the pre-fix implementation.
-
**`utils.rs::instrumented_child_input_stream_propagates_child_error`**:
an error mid-stream from the child propagates through the helper without
losing preceding OK batches.
- **`fts.rs::test_flat_match_filter_find_matches_large_utf8`**:
exercises the `i64`-offset specialization of `find_matches` (production
path for long text columns).
-
**`lance-index/scalar/inverted/index.rs::flat_bm25_search_stream_with_metrics_records_elapsed_compute`**:
runs `_with_metrics` with `Some(time)`, asserts `time.value() > 0` after
drain.

### Existing tests

```
cargo test -p lance --lib io::exec::utils io::exec::knn io::exec::rowids io::exec::scalar_index io::exec::fts
cargo test -p lance-index --lib scalar::inverted::index
```

All pass. `cargo fmt --all --check` and `cargo clippy -p lance
--no-deps` are clean for changed files.

### End-to-end EXPLAIN ANALYZE before/after

1M-row synthetic dataset with BTREE + INVERTED + IVF_PQ indexes,
comparing `elapsed_compute` between this branch's parent (pre-fix) and
the branch tip:

| Scenario | Node | Before | After | Change |
|---|---|---|---|---|
| KNN + BTREE scalar prefilter | `KNNVectorDistance` (493 rows out) |
623µs | 32µs | **~20× smaller** |
| KNN + FTS post-filter | `FlatMatchFilter` (0 rows out) | 102µs | 38µs
| **~2.7× smaller** |
| KNN + FTS prefilter | `FlatMatchQuery` (2.45K rows out) | 932µs |
116ms | **was severely under-counted** (spawn_cpu worker CPU was
invisible to the poll-based timer the new value reflects real
cross-thread CPU and naturally exceeds wall-clock when work is parallel)
|
| KNN + FTS prefilter (control) | `MatchQuery` | unchanged | unchanged |
noise only |

The first two rows mirror the issue's described shape: heavy synchronous
upstream inflating a downstream node's metric. The third row
demonstrates the spawned-CPU half of the fix — `FlatMatchQuery`'s
tokenize/score work was nearly invisible pre-fix and is now correctly
attributed.
* API and namespace stubs for new create_materialized_view API
* New REST route for materialized view refresh

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unreleased version after creating v7.0.0-rc.1
…e-format#6871)

The function `mask_to_offset_ranges` is used at scan planning time to
determine which rows to read from the file. This was a bottleneck when
the mask was the result of a zonal index search because the old
implementation materialized all of the offsets only to convert them back
into ranges.

Luckily, roaring recently implemented a range-based iterator. Using this
we can skip the materialization step. On my zonemap benchmark this
doubles the speed of the search and, perhaps more importantly, removes a
penalty I observed when the index is used even on queries that are not
highly selective.

Generated with the assistance of Claude code.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Phase 1 of the External Vector Index RFC: build, open, search, and
post-topK fetch_rows over a list of parquet files, with no Lance dataset
dependency. Full design at sezruby/lance-spark#4.

Rust surface (rust/lance/src/index/vector/external/):
  - ExternalIvfPqIndex: handle with build/open/search/fetch_rows
  - ParquetFileSpec, SearchResult, ParquetRowKey value types
  - ExternalIvfPqIndexParams builder
  - RowFilter trait (Delta DV / Iceberg position-deletes extensibility)
  - ParquetVectorSource: training sample + rid-annotated batch stream
  - manifest.json sidecar (parquet file list + build params)
  - parquet refinement via PageIndexPolicy::Required random access
  - List<Float32> coercion for JVM parquet writers (Spark emits List, not
    FixedSizeList, even when all rows have the same length)

ivf.rs: new pub fn write_ivf_pq_file_external(&ObjectStore, &Path, ...)
parallel to write_ivf_pq_file_from_existing_index but with no &Dataset
parameter. Existing dataset-backed path becomes a thin wrapper.

JNI surface (java/lance-jni/src/external_index.rs):
  - nativeBuild, nativeOpen, nativeClose, nativeSearch, nativeFetchRows
  - RowFilter exposed as a packed LE byte[] of (file_id<<32)|row_index
    deleted rids (sidesteps cross-language callbacks while still supporting
    Delta DV / Iceberg position deletes)
  - fetchRows returns Arrow IPC stream bytes

Java surface (java/src/main/java/org/lance/index/external/):
  - ExternalIvfPqIndex (handle, AutoCloseable, JavaBean accessors)
  - ExternalIvfPqIndexParams (builder pattern)
  - SearchResult, ParquetRowKey value classes

Tests:
  - rust/lance/tests/external_index_phase1.rs: end-to-end integration
    test (build → open → search recall → fetchRows projection in input
    order with duplicates → RowFilter exclusion). 16 left vectors × 3
    parquet files of 320 rows; passing.
  - 14 unit tests across the new module. All passing.
  - 5 JNI smoke tests (handle load, exception bridge, packed-rid round
    trip, params builder). All passing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.