Tracking: External Lance vector index — Phase 1 PRs filed #6

@sezruby

Tracker for the External Lance vector index delivery. RFC + design + cluster benchmark numbers + architecture deep-dive live in:

Phase 1 PRs (DRAFT, gated on RFC review)

Two parallel PRs implement Phase 1 end-to-end:

| Repo | PR | What it ships |
|---|---|---|
| sezruby/lance | sezruby/lance#1 | Rust core + JNI bindings + Java surface for ExternalIvfPqIndex |
| sezruby/lance-spark | #5 | Scala wrapper, IndexedNearestJoinExternal, fused probe+materialize stage, benchmark vs temp-Lance / Lance-native-indexed |

The lance-spark PR depends on the lance-core PR — lance.version is bumped from 6.0.0-beta.4 → 7.1.0-beta.1 to pull in the new external-index API. Both PRs need RFC review before becoming merge candidates.

Test status

  • ✅ lance-core: 14 unit tests + 1 integration test (tests/external_index_phase1.rs) — build → open → search recall → fetchRows projection → RowFilter exclusion. All passing.
  • ✅ JNI: 5 smoke tests (handle load, exception bridge, packed-rid round trip, params builder). All passing.
  • ✅ lance-spark: IndexedNearestJoinExternalTest end-to-end (16 left vectors × 2 parquet files of 320 rows). Passing locally.
  • ✅ Cluster benchmark on Spark 3.5: passing. Numbers are in #4 (External Lance vector index over parquet, RFC + tracking issue).

Highlights from #4

The headline finding from the cluster benchmark (full table + methodology in #4):

At wide-medium (|R|=1M, dim=128, 16 string payload columns, K=10, |L|=100), the external-index path's warm per-query latency is 570 ms vs the temp-Lance path's 18,128 ms — a 31.8× speedup. The external-index pays a one-time 30-second index build that amortizes after ~2 queries at this scale.
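The speedup and amortization figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 30-second build time is the rounded figure quoted above, and the model assumes a constant warm per-query latency on both paths.

```rust
/// Number of queries after which a one-time index build has paid for itself.
/// All inputs are in milliseconds. Returns None when the indexed path is not
/// faster per query, in which case the build never amortizes.
fn break_even_queries(build_ms: f64, indexed_ms: f64, baseline_ms: f64) -> Option<u64> {
    let saved_per_query = baseline_ms - indexed_ms;
    if saved_per_query <= 0.0 {
        return None;
    }
    // Round up: a fraction of a query still means one more full query.
    Some((build_ms / saved_per_query).ceil() as u64)
}

/// Per-query speedup of the indexed path over the baseline path.
fn speedup(indexed_ms: f64, baseline_ms: f64) -> f64 {
    baseline_ms / indexed_ms
}
```

With the wide-medium numbers (build 30,000 ms, 570 ms indexed, 18,128 ms temp-Lance), each query saves 17,558 ms, so the build is recovered during the second query.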

External-index is the right answer when:

  1. R is a stable parquet/Delta table on storage (sidecar pattern).
  2. The same R is queried many times — index build amortizes immediately.
  3. R has wide payload columns: temp-Lance write cost grows linearly with width, while external-index reads only top_K × |L| × projection_cols from source parquet.
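To make the width argument concrete, here is a rough cell-count comparison. The first function is the expression from item 3; the second is an assumed simplification (every row of R written once per payload column) and is not taken from the benchmark methodology. Both function names are hypothetical.

```rust
/// Payload cells fetched from source parquet by the external-index path:
/// top_K × |L| × projection_cols.
fn external_index_cells_read(top_k: u64, num_queries: u64, projection_cols: u64) -> u64 {
    top_k * num_queries * projection_cols
}

/// Assumed model for temp-Lance: every row of R is materialized once
/// for each payload column.
fn temp_lance_cells_written(num_rows: u64, payload_cols: u64) -> u64 {
    num_rows * payload_cols
}
```

At wide-medium (K=10, |L|=100, 16 payload columns, |R|=1M) this gives 16,000 cells read against 16,000,000 cells written, a factor of 1,000 under this simplified model.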

Temp-Lance (#3) stays the right answer when R is an arbitrary subplan — joins, projections, computed columns. The two paths are additive, not competing.

Phase 1.6 — cloud-storage perf optimizations: ✅ shipped

Optimization layer on top of Phase 1.5. Same external-index API, same recall; warm latency on the abfss/cloud-storage path drops by ~75% (12 s → 3 s at wide-tiny on a 2-worker × 4-core DBR cluster). See #7 §7.4 and #4 "Cloud-storage (abfss) — Phase 1.6 follow-up" for the timeline + per-batch breakdown.

Surface changes:

  • ExternalIvfPqIndex.search_batch (Rust core) / nativeSearchBatch (JNI) / searchBatch(float[][], ...) (Java) / probeBatch(Array[Array[Float]], ...) (Scala wrapper)
  • ExternalFusedStage.fusedPartition issues one batched probe call per Spark task instead of N per-query calls
  • ParquetMetaCache on OpenedExternalIndex reuses footer + page-index across queries within a task
  • CoalescingParquetReader wraps ParquetObjectReader with a tunable coalesce gap + parallelism
  • Per-batch timing diagnostics gated on LANCE_LOG=lance::index::vector::external=info
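The idea behind a tunable coalesce gap can be sketched as follows. This is a minimal illustration of range coalescing in general, not the CoalescingParquetReader implementation; the function name and the `max_gap` parameter are assumptions made for the example.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes, so that nearby
/// reads become one storage request. Input need not be sorted; the output
/// is sorted by start offset and non-overlapping.
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.retain(|r| r.start < r.end); // drop empty ranges
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(prev) = merged.last_mut() {
            // Overlapping, adjacent, or within `max_gap` of the previous
            // range: extend the previous range instead of starting a new one.
            if r.start <= prev.end.saturating_add(max_gap) {
                prev.end = prev.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

A larger gap trades extra bytes read for fewer round trips, which tends to matter more on cloud storage than on local-fs. The merged ranges can then be fetched concurrently up to whatever parallelism limit the reader is configured with.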

The headline 570 ms wide-medium and 8.6 sec mega-medium numbers are local-fs uniform-pool measurements. Cloud storage is a different operating point; see the linked sections for those numbers.

What's NOT in Phase 1

Tracked for follow-up; covered in #4:

  • Persistent index across Spark sessions (Phase 2). Phase 1 has an in-process driver-side cache; cross-application reuse needs manifest fingerprint validation (mtime/size/footer-hash) at open() to detect stale indexes after parquet rewrites.
  • SQL Catalyst integration (Phase 3 on the lance-spark side). Phase 1 ships the DataFrame API entry point only.
  • HNSW / IVF-FLAT external builds (Phase 2/3 lance-core). Same shape as IVF-PQ, separate work.
  • append() / compact() for incremental index updates (Phase 4 lance-core).
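For the Phase 2 staleness check, one possible shape of a per-file fingerprint comparison is sketched below. Everything here is hypothetical: the type and function names are invented for illustration, only size and mtime are compared (the footer hash is omitted), and it reads local-filesystem metadata, whereas a cloud-storage path would need the object store's metadata instead.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical per-file fingerprint recorded at index-build time.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileFingerprint {
    size: u64,
    mtime: SystemTime,
}

impl FileFingerprint {
    /// Capture the current size and modification time of `path`.
    fn capture(path: &Path) -> io::Result<Self> {
        let meta = fs::metadata(path)?;
        Ok(Self {
            size: meta.len(),
            mtime: meta.modified()?,
        })
    }
}

/// True when the file on disk no longer matches the recorded fingerprint,
/// i.e. the index built from it should be treated as stale.
fn is_stale(recorded: &FileFingerprint, path: &Path) -> io::Result<bool> {
    Ok(FileFingerprint::capture(path)? != *recorded)
}
```

Size and mtime alone can miss a rewrite that preserves both, which is presumably why the footer hash is listed as a third component.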

Cross-compile notes for contributors

Cluster runs require nativelib/linux-x86-64/liblance_jni.so. Local cargo build on macOS arm64 produces only darwin-aarch64/liblance_jni.dylib. Workflow that works:

brew install zig
cargo install cargo-zigbuild
rustup target add x86_64-unknown-linux-gnu

cd $LANCE_REPO/java/lance-jni
cargo zigbuild --target x86_64-unknown-linux-gnu --release

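# Graft into the lance-core JAR before building the lance-spark fat JAR.
# Stage the .so under a scratch dir so its path inside the JAR is nativelib/linux-x86-64/:
mkdir -p /tmp/jar-patch/nativelib/linux-x86-64
cd /tmp/jar-patch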
cp $LANCE_REPO/java/lance-jni/target/x86_64-unknown-linux-gnu/release/liblance_jni.so \
   nativelib/linux-x86-64/
jar uf ~/.m2/repository/org/lance/lance-core/$VERSION/lance-core-$VERSION.jar \
   nativelib/linux-x86-64/liblance_jni.so

cross (Docker) hits the aws-lc-sys GCC blocklist on its default image; the centos image has GCC 10, but its bundled Rust toolchain is too old for the workspace's edition = "2024". cargo zigbuild sidesteps both: it uses the host's rustup toolchain plus zig-supplied glibc-targeted clang.

Update (Phase 1.6): cross also works with the Ubuntu-based ghcr.io/cross-rs/x86_64-unknown-linux-gnu:main image, given a small Cross.toml that pre-installs protoc 25.1 and a [patch.crates-io] entry pointing to a vendored copy of aws-lc-sys 0.41.0 with the gcc-bug-95189 self-check assertion stripped (the assertion misfires under QEMU emulation on darwin-arm64; the actual crypto code is fine). Either path produces an equivalent linux .so.

Upstream lance-core CI cross-builds for all 3 architectures and publishes the multi-arch JAR; this workflow is only needed for in-flight branches not yet published.
