Tracking: External Lance vector index — Phase 1 PRs filed

# Tracking: External Lance vector index — Phase 1 PRs filed

Tracker for the **External Lance vector index** delivery. The RFC, design, cluster benchmark numbers, and architecture deep-dive live in:

- **RFC + benchmark**: #4

## Phase 1 PRs (DRAFT, gated on RFC review)

Two parallel PRs implement Phase 1 end-to-end:

| Repo | PR | What it ships |
|---|---|---|
| `sezruby/lance` | https://github.com/sezruby/lance/pull/1 | Rust core + JNI bindings + Java surface for `ExternalIvfPqIndex` |
| `sezruby/lance-spark` | #5 | Scala wrapper, `IndexedNearestJoinExternal`, fused probe+materialize stage, benchmark vs temp-Lance / Lance-native-indexed |

The lance-spark PR depends on the lance-core PR — `lance.version` is bumped from 6.0.0-beta.4 → 7.1.0-beta.1 to pull in the new external-index API. Both PRs need RFC review before becoming merge candidates.

## Test status

- ✅ lance-core: 14 unit tests + 1 integration test (`tests/external_index_phase1.rs`) — build → open → search recall → fetchRows projection → RowFilter exclusion. All passing.
- ✅ JNI: 5 smoke tests (handle load, exception bridge, packed-rid round trip, params builder). All passing.
- ✅ lance-spark: `IndexedNearestJoinExternalTest` end-to-end (16 left vectors × 2 parquet files of 320 rows). Passing locally.
- ✅ Cluster benchmark on Spark 3.5 — passing, numbers in #4.

## Highlights from #4

The headline finding from the cluster benchmark (full table + methodology in #4):

> At wide-medium (|R|=1M, dim=128, 16 string payload columns, K=10, |L|=100), the external-index path's **warm per-query latency is 570 ms** vs the temp-Lance path's **18,128 ms** — a **31.8× speedup**. The external-index pays a one-time **30-second** index build that amortizes after **~2 queries** at this scale.

External-index is the right answer when:

1. R is a stable parquet/Delta table on storage (sidecar pattern).
2. The same R is queried many times — index build amortizes immediately.
3. R has wide payload columns: temp-Lance write cost grows linearly with width, while external-index reads only `top_K × |L| × projection_cols` from source parquet.

Temp-Lance (#3) stays the right answer when R is an arbitrary subplan — joins, projections, computed columns. The two paths are **additive, not competing**.
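
The amortization and read-volume arithmetic in this section can be sanity-checked with a few lines of Rust. This is a minimal sketch, not code from either PR: the function names are hypothetical, and the only inputs are the figures quoted from #4 (30-second build, 18,128 ms vs 570 ms warm latency, K=10, |L|=100).

```rust
/// Smallest number of queries n for which
/// `build_ms + n * indexed_ms <= n * baseline_ms`.
/// Returns `None` when the indexed path is not faster per query,
/// because the build cost then never amortizes.
pub fn break_even_queries(build_ms: f64, baseline_ms: f64, indexed_ms: f64) -> Option<u64> {
    let saved_per_query = baseline_ms - indexed_ms;
    if saved_per_query <= 0.0 {
        return None;
    }
    Some((build_ms / saved_per_query).ceil() as u64)
}

/// Warm per-query speedup of the indexed path over the baseline.
pub fn speedup(baseline_ms: f64, indexed_ms: f64) -> f64 {
    baseline_ms / indexed_ms
}

/// Rows materialized from source parquet for one batch of left-side
/// vectors: top_K matches for each of the |L| query vectors.
pub fn rows_fetched(top_k: u64, left_rows: u64) -> u64 {
    top_k * left_rows
}
```

With the quoted numbers, `break_even_queries(30_000.0, 18_128.0, 570.0)` returns `Some(2)`, which matches the "~2 queries" claim, and `rows_fetched(10, 100)` is 1,000 rows, or 0.1% of |R|=1M.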

## Phase 1.6 — cloud-storage perf optimizations: ✅ shipped

Optimization layer on top of Phase 1.5. Same external-index API, same recall; the abfss/cloud-storage path now has ~75% lower warm latency (12 s → 3 s, about 4×, at wide-tiny on a 2-worker × 4-core DBR cluster). See [#7 §7.4](https://github.com/sezruby/lance-spark/issues/7) and [#4 "Cloud-storage (abfss) — Phase 1.6 follow-up"](https://github.com/sezruby/lance-spark/issues/4) for the timeline and per-batch breakdown.

Surface changes:
- `ExternalIvfPqIndex.search_batch` (Rust core) / `nativeSearchBatch` (JNI) / `searchBatch(float[][], ...)` (Java) / `probeBatch(Array[Array[Float]], ...)` (Scala wrapper)
- `ExternalFusedStage.fusedPartition` issues one batched probe call per Spark task instead of N per-query calls
- `ParquetMetaCache` on `OpenedExternalIndex` reuses footer + page-index across queries within a task
- `CoalescingParquetReader` wraps `ParquetObjectReader` with a tunable coalesce gap + parallelism
- Per-batch timing diagnostics gated on `LANCE_LOG=lance::index::vector::external=info`
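
To illustrate what a "coalesce gap" means for the reader wrapper listed above, here is a minimal sketch of range coalescing: byte ranges whose gap is at most `max_gap` are merged into one read, trading some wasted bytes for fewer object-store round trips. This is an assumption about the general technique, not the actual `CoalescingParquetReader` code; the function name and signature are hypothetical, and the parallelism knob is not modeled.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or sit within `max_gap` bytes of each
/// other, so each merged range can be served by a single read request.
/// Input order does not matter; empty ranges are dropped.
pub fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.retain(|r| r.start < r.end);
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

For example, with `max_gap = 64`, the ranges `0..100`, `150..200`, and `1000..1100` collapse to `0..200` and `1000..1100`: two requests instead of three, at the cost of 50 unneeded bytes. A larger gap suits high-latency stores such as abfss, a smaller one suits local disk.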

The headline numbers (570 ms wide-medium above, 8.6 s mega-medium) are local-fs uniform-pool measurements. Cloud storage is a different operating point; see the linked sections for those numbers.


## What's NOT in Phase 1

Tracked for follow-up; covered in #4:

- **Persistent index** across Spark sessions (Phase 2). Phase 1 has an in-process driver-side cache; cross-application reuse needs manifest fingerprint validation (mtime/size/footer-hash) at `open()` to detect stale indexes after parquet rewrites.
- **SQL Catalyst integration** (Phase 3 on the lance-spark side). Phase 1 ships the DataFrame API entry point only.
- **HNSW / IVF-FLAT external builds** (Phase 2/3 lance-core). Same shape as IVF-PQ, separate work.
- **`append()` / `compact()`** for incremental index updates (Phase 4 lance-core).
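
The fingerprint validation mentioned for the persistent index could look roughly like the sketch below. Everything here is hypothetical (`FileFingerprint`, `fingerprint`, `is_stale`, `TAIL_BYTES`) and is not the planned Phase 2 API. It assumes a local file path, approximates the "footer hash" by hashing the file tail (Parquet stores its footer at the end of the file), and uses FNV-1a only because it is short and stable across builds; a real implementation might pick a different hash and would need an object-store equivalent of `stat`.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Hypothetical per-file fingerprint recorded at index build time.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileFingerprint {
    pub size: u64,
    pub mtime_nanos: u128,
    pub tail_hash: u64,
}

/// How many trailing bytes to hash as a stand-in for the Parquet footer.
const TAIL_BYTES: u64 = 64 * 1024;

/// 64-bit FNV-1a hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Compute size, mtime, and tail hash for the file at `path`.
pub fn fingerprint(path: &Path) -> io::Result<FileFingerprint> {
    let mut file = File::open(path)?;
    let meta = file.metadata()?;
    let size = meta.len();
    let mtime_nanos = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);

    let tail_len = size.min(TAIL_BYTES);
    file.seek(SeekFrom::Start(size - tail_len))?;
    let mut tail = Vec::with_capacity(tail_len as usize);
    file.take(tail_len).read_to_end(&mut tail)?;

    Ok(FileFingerprint { size, mtime_nanos, tail_hash: fnv1a64(&tail) })
}

/// True when the file no longer matches the fingerprint recorded at
/// build time, meaning the index should be rebuilt rather than opened.
pub fn is_stale(recorded: &FileFingerprint, path: &Path) -> io::Result<bool> {
    Ok(fingerprint(path)? != *recorded)
}
```

Comparing all three fields is deliberately conservative: a rewrite that preserves size and mtime is still caught by the tail hash, and a touched but unchanged file triggers a rebuild rather than risking a stale index.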

## Cross-compile notes for contributors

Cluster runs require `nativelib/linux-x86-64/liblance_jni.so`. Local `cargo build` on macOS arm64 produces only `darwin-aarch64/liblance_jni.dylib`. Workflow that works:

```bash
# Assumes LANCE_REPO points at the lance checkout and VERSION is the
# lance-core version already installed in ~/.m2.
brew install zig
cargo install cargo-zigbuild
rustup target add x86_64-unknown-linux-gnu

cd "$LANCE_REPO/java/lance-jni"
cargo zigbuild --target x86_64-unknown-linux-gnu --release

# Graft into the lance-core JAR before building the lance-spark fat JAR:
mkdir -p /tmp/jar-patch/nativelib/linux-x86-64
cd /tmp/jar-patch
cp "$LANCE_REPO/java/lance-jni/target/x86_64-unknown-linux-gnu/release/liblance_jni.so" \
   nativelib/linux-x86-64/
jar uf "$HOME/.m2/repository/org/lance/lance-core/$VERSION/lance-core-$VERSION.jar" \
   nativelib/linux-x86-64/liblance_jni.so
```

`cross` (Docker) hits the [aws-lc-sys GCC blocklist](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189) on its default image. The centos image has GCC 10, but its bundled Rust toolchain is too old for the workspace's `edition = "2024"`. `cargo zigbuild` sidesteps both: it uses the host's rustup toolchain plus zig-supplied glibc-targeted clang.

**Update (Phase 1.6):** `cross` also works with the Ubuntu-based `ghcr.io/cross-rs/x86_64-unknown-linux-gnu:main` image, given two additions: a small `Cross.toml` that pre-installs protoc 25.1, and a `[patch.crates-io]` entry pointing at a vendored copy of `aws-lc-sys` 0.41.0 with the gcc-bug-95189 self-check assertion stripped. The assertion misfires under QEMU emulation on darwin-arm64; the actual crypto code is fine. Either path produces an equivalent Linux `.so`.

Upstream lance-core CI cross-builds for all 3 architectures and publishes the multi-arch JAR; this workflow is only needed for in-flight branches not yet published.

