# SPIP: Lance-backed approximate nearest-neighbor join for Apache Spark — three complementary paths

> A proposal to add Lance-backed indexed `NearestByJoin` execution to lance-spark, covering all three real-world R-side shapes: data already in Lance, data in a generic Spark subplan, and data in direct parquet / Delta scans. Includes per-component justification for why the hot paths must live in Lance (Rust + SIMD), not in Spark JVM code.

---

## 📊 Measured results summary (added 2026-06-16)

> Headline across every configuration we measured on a real cluster (Spark 3.5.4, Azure ABFSS storage) and locally: **native Lance (Rust) search is the performance floor, and the less work done in the Spark JVM, the faster the join runs.** Distribution only helps when single-machine search is genuinely too slow (unindexed brute-force at scale) — and even then the minimal-JVM `mapPartitions` shape beats the staged probe→shuffle→merge→materialize pipeline.

### Single-query retrieval (M=1, driver-side native `Dataset.newScan(...nearest)`, IVF-PQ)

| Dataset | \|R\| | dim | p50 | p90 | p99 |
|---|---|---|---|---|---|
| Cohere `wikipedia-2023-11-embed-v3` | 10M | 1024 | **56 ms** | 182 ms | 514 ms |
| Synthetic uniform (worst case for IVF) | 100M | 128 | 157 ms | 776 ms | 1,213 ms |
| Synthetic uniform | 250M | 128 | 1,084 ms | 1,528 ms | 2,323 ms |

With p50 at 157 ms for 100M rows (and 56 ms at 10M), **distributing a single indexed query through Spark cannot win**: Spark's task-scheduling floor (~100–300 ms warm) is already comparable to or larger than the single-machine time. Online retrieval should stay single-machine / native.

### Batch KNN-join, indexed (synthetic 100M × dim=128, IVF-PQ, M=5000 queries, 8×4 cluster)

| Config | Median | vs SINGLE |
|---|---|---|
| SINGLE (driver loop) | ~490 s | 1.00× |
| **SPARK_DIST** (`mapPartitions` → `LanceProbe.probe`, no shuffle) | **~42 s** | **~13× faster** |
| SPARK (`kNearestJoin` staged pipeline, `probeParallelism=1`) | ~472 s | ~1.04× (break-even) |

At indexed scale the staged 3-stage pipeline never beat the dumb `mapPartitions` shape — its shuffle + merge + second-scan materialize is pure overhead because per-query Lance work is already tiny.

### Batch KNN-join, **no index** (Cohere wiki 10M × dim=1024, brute-force, M=100, 8×4 cluster)

| Config | Median (100 queries) | vs SINGLE |
|---|---|---|
| SINGLE (driver loop) | ~50 min | 1.00× |
| SPARK (`kNearestJoin`, `probeParallelism=8`) | ~19.4 min | 2.6× |
| SPARK (`kNearestJoin`, `probeParallelism=32`) | ~13.9 min | 3.6× |
| **SPARK_DIST** (`mapPartitions`) | **~5.0 min** | **~10× faster** |

Even where distribution genuinely pays (single-machine is 30 s/query brute-force), the minimal-JVM `mapPartitions` shape still beats the staged pipeline by ~2.8×. The staged pipeline's per-query fan-out (replicate L × fragments, shuffle, merge, materialize) creates `M × N_frag × K` intermediate JVM rows that don't otherwise need to exist.

### Native batch-query API (lance-core [#6828](https://github.com/lance-format/lance/pull/6828), shared flat-KNN scan)

Local (1M × dim=512, flat KNN): batched `Scanner::nearest` over M queries shares one scan/decode and gives **~1.5× at M=10, ~1.7–1.8× plateau at M≥100** vs M independent queries. The win is bounded by the point where per-query SIMD distance compute matches the saved I/O; it grows on slower storage where the shared scan is the expensive component.
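
The "shares one scan/decode" idea can be sketched in a few lines. This is an illustrative Java sketch under simplifying assumptions, not Lance's implementation: the class `BatchedFlatScanSketch` and its `nearest` method are hypothetical names, it uses k = 1 for brevity, and it works on in-memory arrays rather than a Lance scan.

```java
import java.util.Arrays;

/** Hypothetical sketch: one shared pass over the corpus serves every query (k = 1). */
final class BatchedFlatScanSketch {
    private BatchedFlatScanSketch() {}

    /** Returns, for each query, the index of its nearest corpus row by squared L2. */
    static int[] nearest(float[][] corpus, float[][] queries) {
        int[] best = new int[queries.length];
        float[] bestDist = new float[queries.length];
        Arrays.fill(best, -1);
        Arrays.fill(bestDist, Float.POSITIVE_INFINITY);
        for (int r = 0; r < corpus.length; r++) {        // single shared pass over R
            float[] row = corpus[r];                       // the row is "decoded" once
            for (int q = 0; q < queries.length; q++) {     // and scored against every query
                float[] query = queries[q];
                float dist = 0f;
                for (int i = 0; i < row.length; i++) {
                    float diff = row[i] - query[i];
                    dist += diff * diff;
                }
                if (dist < bestDist[q]) {
                    bestDist[q] = dist;
                    best[q] = r;
                }
            }
        }
        return best;
    }
}
```

The outer loop touches each corpus row once regardless of M, which is where the saving comes from. The inner distance work still scales with M, which is consistent with the plateau described above.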

**Takeaway for design:** keep the hot path in Lance Rust. Spark should be a thin fragment-dispatch + top-K-merge layer (à la PR [lance-format/lance-spark#608](https://github.com/lance-format/lance-spark/pull/608)); every Arrow→`InternalRow`→shuffle→`InternalRow`→Arrow round-trip in the JVM is measured overhead, not speedup.

---

## Outline

1. [Q1: What are we trying to do?](#q1-what-are-we-trying-to-do)
2. [Q2: What this is NOT designed to solve](#q2-what-this-is-not-designed-to-solve)
3. [Q3: How is it done today, and what are the limits?](#q3-how-is-it-done-today-and-what-are-the-limits)
4. [Q4: What is new in this approach?](#q4-what-is-new-in-this-approach)
5. [Q4.6: Beyond join — single-query and batched-query retrieval](#46-beyond-join-single-query-and-batched-query-retrieval)
6. [Q5: Why is each hot-path piece in Lance, not Spark?](#q5-why-is-each-hot-path-piece-in-lance-not-spark)
7. [Q6: Who cares?](#q6-who-cares)
8. [Q7: Benchmarks](#q7-benchmarks)
9. [Q8: Risks](#q8-risks)
10. [Q9: Implementation status](#q9-implementation-status)
11. [Q10: Phasing](#q10-phasing)
12. [Open questions](#open-questions)
13. [References](#references)

---

## Q1: What are we trying to do?

We want **`leftDf.kNearestJoin(rightDf, leftVec, rightVec, k, ...)` in Spark to run fast and at production scale**, regardless of what `rightDf` actually is. "Fast" means single-digit seconds at million-row R, not minutes. "Regardless" means: a Lance dataset, a parquet table, a Delta table, a wide-payload table, or an arbitrary upstream Spark subplan — the user shouldn't have to learn three different APIs.

Concretely, the proposal adds three execution paths backed by Lance, plus a routing layer that picks the right one:

| If R is... | The path is... |
|---|---|
| A Lance dataset (with or without a vector index) | **Path A: Lance-native KNN** (existing — [#541 upstream](https://github.com/lance-format/lance-spark/issues/541)) |
| A direct parquet or Delta scan | **Path C: external Lance vector index over the source files** (new — [#4](https://github.com/sezruby/lance-spark/issues/4) / [PR #5](https://github.com/sezruby/lance-spark/pull/5)) |
| Anything else (joins, projections, filters, computed columns, subqueries) | **Path B: per-query temp Lance materialization** (new — [#2](https://github.com/sezruby/lance-spark/issues/2) / PR #3) |

All three paths share the same probe-and-refine algorithm internally. The differences are about **where the data lives** and **what migration cost the user can tolerate**.

Beyond the bulk join, the same Lance primitives also power **single-query retrieval** — the RAG / "find the 10 nearest docs to this query" shape. The Rust core and JNI surface already accept a single query and return top-k; the Scala wrapper `LanceParquetIndex` (PR #5) adds a driver-side `searchToDF` so the result composes with the rest of a Spark pipeline without paying Spark's ~100-300 ms scheduler floor. Section 4.6 covers all three caller patterns: any-JVM service (no `SparkSession`), Spark-driver single-query, Spark-distributed batched queries.

The proposal also explains, component by component, why every piece of the hot path must live in **Lance Rust** (with SIMD distance kernels and page-index-aware parquet random access), not in **Spark JVM code** — and why a pure-Spark approach hits a wall well before production scale.

## Q2: What this is NOT designed to solve

- **Generic ANN library replacement.** We reuse Lance's existing IVF-PQ implementation (with HNSW and IVF-FLAT as later phases). We don't propose new index types. Users who want HNSW-on-everything have other libraries.
- **Real-time index updates.** Builds happen offline or at query time; live `INSERT`/`UPDATE` propagation into the index is a Phase 4 problem, not Phase 1.
- **Distributed index build.** Lance has a [distributed indexing](https://lancedb.github.io/lance/guide/distributed_indexing/) primitive that can be invoked from each Spark task, but coordination is the caller's job today. (Phase 4 candidate.)
- **Data migration tooling.** This proposal does NOT push users toward "rewrite everything as Lance." Path C explicitly lets users keep parquet as the source of truth.
- **Cross-engine portability.** This is a Spark integration. Velox / Gluten / DuckDB are separate efforts ([Velox PR #16556](https://github.com/facebookincubator/velox/pull/16556), lance-c work) that this proposal complements but does not subsume.

## Q3: How is it done today, and what are the limits?

### 3.1 Spark's brute-force `crossJoin` — the natural-but-wrong default

```scala
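import org.apache.spark.sql.functions._

// What a Spark user writes today on parquet R.
// l2Udf: a user-supplied UDF that computes L2 distance between two float arrays.
left.crossJoin(right)
    .withColumn("__dist", l2Udf(col("lvec"), col("rvec")))
    .groupBy("lid")
    .agg(slice(sort_array(collect_list(struct("__dist", "rid")), asc = true), 1, k))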
```

This is what Spark 4.2's `RewriteNearestByJoin` lowers to. The work it does:

1. **Cartesian product of `|L| × |R|` rows.** At `|L|=1000, |R|=10M` this is 10 billion (vector, vector) pairs.
2. **One UDF call per pair** to compute L2 — 10B JVM-level distance evaluations.
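3. **One groupBy + sort_array per left row.** `collect_list` buffers all `|R|` candidates for a left row before `slice` trims them to `K`, so memory pressure scales with `|R|` per group, not with `K`.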

For a real workload (`|L|=1000, |R|=10M, dim=128`), this is ~30+ minutes of cluster time at minimum, dominated by the 10B distance evaluations and the giant shuffle.

**This is the baseline every Spark user hits today.** It's correct, it's exact, and it's unusable past tens of millions of pairs.

### 3.2 Spark MLlib `BucketedRandomProjectionLSH`

The "obvious" approximate alternative built into Spark MLlib. Sketch:

```scala
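import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
// Assumes both sides carry an MLlib Vector column named "vec" (illustrative name).
val lsh = new BucketedRandomProjectionLSH().setInputCol("vec").setOutputCol("hashes")
  .setBucketLength(2.0).setNumHashTables(5)
val model = lsh.fit(rightDfWithVecCol)
val pairs = model.approxSimilarityJoin(left, right, threshold = 1e9, distCol = "__dist")
// Then: groupBy lid, take top-K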
```

Reading [LSH.scala](https://github.com/apache/spark/blob/v3.5.5/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala) reveals what `approxSimilarityJoin` actually does:

```scala
val explodedA = processDataset(datasetA, ...)   // |L| × numHashTables rows
val explodedB = processDataset(datasetB, ...)   // |R| × numHashTables rows
val joined = explodedA.join(explodedB, "hashValue").drop(...).distinct()
val withDist = joined.withColumn("__dist", l2Udf(...))
withDist.filter("__dist < threshold")
```

The cost analysis:

- **Explode**: `|R| × numHashTables` rows. At `|R|=1M, numHashTables=5` that's 5M rows in `explodedB`.
- **Hash join on `hashValue`**: a Spark **shuffle join** keyed on the LSH bucket. Number of (L, R) collision pairs scales with `|L| × |R| / num_buckets`, summed across hash tables.
- **Distance UDF on every collision pair.** Each is ~100s of nanoseconds in JIT-compiled Scala.
- **Plus a `distinct()`** that costs another shuffle.

At our `wide-medium` config (`|L|=100, |R|=1M, dim=128, numHashTables=5`) this is **30-60 seconds per query** dominated by the shuffle. At `mega-medium` (`|R|=10M`) it's **minutes per query**.

**The fundamental problem isn't tuning.** LSH was designed for streaming similarity joins, where you receive a continuous stream of pairs and want approximate threshold filtering. It is the wrong algorithm for the "given a query vector, return top-K from a static R" workload, which is exactly the kNN-by-join shape. For that workload, IVF-PQ visits roughly `nprobes × partition_size` candidates per query, which grows sub-linearly in `|R|` as long as the number of IVF partitions grows with `|R|`, while LSH visits hash-bucket collisions that scale with `|R|` (linear at best, quadratic worst-case if buckets collapse).

We left LSH in the bench harness behind `BENCH_SKIP_LSH=true` for honesty, but omitted it from the cluster runs as uninformative: the comparison would just confirm "LSH is much slower."

### 3.3 The Spark-native KNN wall

A natural question is: "Can't we just write IVF-PQ probe + refine in Spark?" The honest answer is no — the wall is built from six independent costs, each of which alone is enough to disqualify the approach.

#### 3.3.1 JVM row population overhead

Spark's row representation for a float-vector column (`ArrayType(FloatType)`, the counterpart of Arrow's `FixedSizeList[Float]`) is an `UnsafeArrayData` stored inside an `UnsafeRow`. Reading a single dim=128 vector means:

- 1 outer-row dereference
- 1 array-pointer dereference into off-heap UnsafeRow memory
- 128 `getFloat(i)` calls, each computing an offset and reading 4 bytes from UnsafeRow
- A `Vectors.dense(arr.toArray.map(_.toDouble))` conversion if calling MLlib (allocates a new `double[]` and a `DenseVector` object)

For a probe that touches `K × refine_factor = 80` candidate vectors per query, that's **80 × 128 = 10,240 boxed reads + 80 fresh DenseVector allocations PER QUERY**. At `|L|=1000` queries × this work, it's ~10M boxed reads and ~80K allocations per query batch — and that's before doing any actual distance computation.

Lance, by contrast, reads the entire candidate set as a flat `Float32Array` (one allocation, contiguous memory) and feeds it directly to a SIMD distance kernel that processes 16 floats per AVX2 cycle. The throughput difference is roughly 50-100×.

#### 3.3.2 Shuffle bandwidth for partition-grouping

If we tried to implement IVF probe in Spark, the natural shape would be:

```
left.rdd
   ├── FOR EACH left row, compute partition_id = nearest_centroid(query)
   └── shuffle by partition_id, joining against R-partitioned-by-centroid
```

But R doesn't naturally partition by IVF centroid — you'd have to repartition R first, which means **shuffling all of R's vector data once before any query runs**. At wide-medium that's ~500 MB; at mega it's ~5 GB. **This shuffle is per-query** unless R is pre-shuffled into a persistent dataset — which is exactly what a Lance dataset already does internally for free.

In other words: doing IVF-PQ in Spark requires recreating Lance's on-disk partition layout via Spark shuffles. You'd duplicate Lance's storage layout work in shuffle code that doesn't survive across queries.

#### 3.3.3 Distance kernel performance

A 128-dim L2 distance computation in optimal native code:

```
# Lance lance-linalg AVX2 path:
16× vsubps       (128 floats / 8 floats per ymm register)
16× vfmadd231ps  (accumulate squared differences)
1× horizontal sum
≈ 20-30 CPU cycles per pair on modern Skylake/Zen
```

A 128-dim L2 distance in JVM Scala with `Math.fma` UDF:

```
# JVM JIT'd path (assuming UDF gets JIT compiled):
128× scalar fmsub (no auto-vectorization for boxed floats)
128× boxed Float access (or unboxed if Tungsten kicks in, but the UDF wrapper doesn't help)
≈ 500-1000 CPU cycles per pair
```

Plus UDF call overhead per invocation (~50-200ns per call), Scala closure deserialization, codec inflation. Spark's Tungsten can vectorize *some* expression-level ops but it does NOT vectorize user-supplied UDFs over `ArrayType`.

**At 80 candidate distances per query × 1000 queries** (wide-medium), Spark's 50-100× per-distance penalty turns ~5 ms of native L2 into ~250-500 ms of JVM L2. That alone disqualifies a Spark-native refine path.
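
For reference, the scalar loop that a JVM-side refine would run per candidate pair looks roughly like the sketch below. It is illustrative only and is not lance-spark code: `ScalarL2Sketch` is a hypothetical name, and it uses primitive `float[]` inputs, which is the best case for the JVM (no boxing, no UDF wrapper).

```java
/** Hypothetical illustration: scalar squared-L2 over primitive float arrays. */
final class ScalarL2Sketch {
    private ScalarL2Sketch() {}

    /** Returns the squared L2 distance; both vectors must have the same length. */
    static float squaredL2(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " vs " + b.length);
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum = Math.fma(d, d, sum); // one fused multiply-add per dimension
        }
        return sum;
    }
}
```

Each iteration depends on the previous value of `sum`, so the loop is a serial floating-point reduction. Whether the JIT turns it into SIMD code is not guaranteed, whereas a hand-written AVX2 kernel processes 8 floats per instruction by construction.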

#### 3.3.4 No page-index-aware parquet random access in Spark

Refinement needs to read 80 specific (file, row_index) pairs out of a 1M-row parquet file with **no full scan**. parquet-rs ≥58.3 supports this via `PageIndexPolicy::Required` + `RowSelection::from_consecutive_ranges` + the parquet column-and-page index. The parquet reader:

1. Reads the page index out of the parquet footer
2. For each requested row, finds the (row group, page) it lives in
3. Issues a coalesced byte-range read for just those pages
4. Decodes only the requested rows, not the surrounding 100K-row pages

**parquet-mr (Spark's JVM parquet library) does NOT have this primitive.** It supports row-group-level pruning (skip whole row groups based on column min/max) but not page-level random access. A "fetch 80 rows by row index" operation on parquet-mr decompresses entire pages, which means reading and decompressing ~100K rows to extract 1.

We'd need to add page-index-aware random access to parquet-mr to do refinement in Spark. That's a substantial parquet-mr contribution — and even if it lands, it's still JVM-level decompression + JVM row population (see 3.3.1) on the hot path.
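
Step 2 of the reader's work above (mapping a requested row to the page that holds it) is essentially a binary search over per-page first-row indexes. A minimal Java sketch, assuming the first-row index of every page in a column chunk is already available as a sorted array; `PageLookupSketch` and its methods are hypothetical and are not a parquet-rs or parquet-mr API.

```java
import java.util.Arrays;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: locate the page for each requested row via binary search. */
final class PageLookupSketch {
    private PageLookupSketch() {}

    /** firstRowOfPage must be sorted ascending; returns the index of the page holding rowIndex. */
    static int pageOf(long[] firstRowOfPage, long rowIndex) {
        if (firstRowOfPage.length == 0 || rowIndex < firstRowOfPage[0]) {
            throw new IllegalArgumentException("row " + rowIndex + " precedes the first page");
        }
        int pos = Arrays.binarySearch(firstRowOfPage, rowIndex);
        // Not found: pos == -(insertionPoint) - 1, and the page is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Groups requested rows by page so each page is read and decoded at most once. */
    static SortedMap<Integer, SortedSet<Long>> groupByPage(long[] firstRowOfPage, long[] rows) {
        SortedMap<Integer, SortedSet<Long>> byPage = new TreeMap<>();
        for (long row : rows) {
            byPage.computeIfAbsent(pageOf(firstRowOfPage, row), p -> new TreeSet<>()).add(row);
        }
        return byPage;
    }
}
```

Grouping by page before issuing reads is what makes the byte-range requests coalescable: 80 candidate rows that land in a handful of pages cost a handful of reads, not 80.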

#### 3.3.5 No vector index file format in Spark

Spark has no concept of "an index file alongside a parquet table." There's no Catalyst plan node for "open this index file"; there's no DataSource v2 hook for "the table has an associated index"; there's no catalog metadata for index existence. We'd need to invent all of this in lance-spark code, or persuade upstream Spark to accept a PR with fundamental schema additions.

What we have today (Path C in this proposal) sidesteps all of that: the index file is a Lance artifact opened directly from the JNI handle, completely outside Catalyst's plan tree.

#### 3.3.6 GC pressure under multi-query batches

The natural Spark code shape is "for each left row, run a probe, collect refs, then materialize." Each probe call allocates:

- A `double[]` for the query vector
- An ArrayList of `(distance, rid)` candidate tuples
- A sorted `ArrayList[K]` for top-K

At `|L|=1000` × per-query allocations of ~1-10 KB, that is roughly 1-10 MB of short-lived object churn per query batch. Multiplied across the cluster, this triggers minor GC every few hundred ms in the executor JVMs and major GC every few seconds. Lance's Rust path has zero JVM heap allocation on the hot path: distance scoring is in-place on contiguous `Vec<f32>` buffers.

**Combined**: any single one of these costs is ~10× per-query overhead vs the Lance-native approach. Stacked together they're 100× to 1000× — easily the difference between a one-second per-query latency and a one-minute one. This is the wall.

## Q4: What is new in this approach?

### 4.1 The three-path architecture

We don't propose a single hot-path implementation. We propose **three implementations of the same logical operation**, with a Catalyst rule choosing the right one based on what `rightDf` is:

```
leftDf.kNearestJoin(rightDf, ...)
    │
    ├──┬─ rightDf is a Lance scan
    │  └─ ▶ Path A: Lance-native KNN (existing)
    │
    ├──┬─ rightDf is a direct parquet/Delta file scan (no joins, computed columns)
    │  └─ ▶ Path C: external Lance vector index over those files
    │
    └──┬─ Anything else (subplans, filters, joins, projections)
       └─ ▶ Path B: per-query temp Lance materialization, then Path A
```

The user writes the same API in every case. Routing is automatic via a Catalyst `postHocResolutionRule`. Three paths because **R has three meaningfully different shapes** in real Spark workloads, and a single hot-path for all of them necessarily compromises somewhere.

### 4.2 Path A: Lance-native KNN (R is already in Lance)

Reuses lance-spark's existing `IndexedNearestJoin` ([upstream issue #541](https://github.com/lance-format/lance-spark/issues/541)).

**Hot path inside Lance:**
1. For each left vector, JNI call to `Dataset.scan().nearest(query, k, refine_factor).execute()`
2. Lance's IVF probe walks `nprobes` partitions, scores PQ codes, keeps top `K × refine_factor` candidates
3. Lance refines by reading actual vectors from the Lance dataset (Lance random access, not parquet) and computing exact L2 in lance-linalg SIMD
4. Top-K returned across JNI as `(rowid, distance)` pairs

**Spark side** (3 staged execs):

```
LanceProbeExec        (open Lance dataset, run nearest()/refine() per left row, emit refs)
   ↓ Exchange(_leftId)   ← Catalyst-inserted shuffle, AQE-engaged
LanceMergeExec        (TopKHeap.merge across fragment-grouped contributions)
   ↓
LanceMaterializeExec  (per task: open Lance dataset, point-fetch surviving rids, emit final Row)
```

The shuffle exists because Phase 1.5 supports `probeParallelism > 1` (fragment-grouped probe), where multiple Spark tasks contribute partial top-K for the same left row. The merge stage's `reduceByKey` combines them.

**When this path applies:** R is already a Lance dataset. Optimal latency, no temp write, no extra storage.
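
The merge stage's job, combining partial top-K lists from several fragment-grouped probes into one global top-K per left row, can be sketched with a bounded max-heap. This is an illustrative sketch, not the `TopKHeap` used by lance-spark: `TopKMergeSketch` and `Hit` are hypothetical names, and smaller distance is assumed to mean "nearer".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge partial top-K lists for one left row into a global top-K. */
final class TopKMergeSketch {
    private TopKMergeSketch() {}

    static final class Hit {
        final long rowId;
        final float distance;

        Hit(long rowId, float distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    /** Returns at most k hits with the smallest distances, sorted ascending by distance. */
    static List<Hit> merge(List<List<Hit>> partials, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on distance: the head is the worst hit currently kept.
        PriorityQueue<Hit> kept =
            new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
        for (List<Hit> partial : partials) {
            for (Hit h : partial) {
                if (kept.size() < k) {
                    kept.add(h);
                } else if (h.distance < kept.peek().distance) {
                    kept.poll();   // evict the current worst
                    kept.add(h);
                }
            }
        }
        List<Hit> out = new ArrayList<>(kept);
        out.sort((a, b) -> Float.compare(a.distance, b.distance));
        return out;
    }
}
```

The heap never holds more than `k` entries, so the merge cost per left row is bounded by the number of partial contributions rather than by `|R|`.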

### 4.3 Path B: per-query temp Lance materialization (R is an arbitrary subplan)

The general-purpose answer for "R is anything else." See [#2](https://github.com/sezruby/lance-spark/issues/2) for the full motivation.

**Hot path:**

1. Catalyst rule sees `kNearestJoin(left, right)` and `right` doesn't unwrap to a Lance scan or a direct parquet scan
2. Project `right` to `(rid := monotonically_increasing_id(), rvec, ...payload)`, write to a temp Lance dataset under `spark.lance.knn.tempR.dir`
3. Proceed exactly as Path A against the temp Lance URI
4. SparkListener cleans up the temp dir on `onApplicationEnd` + JVM shutdown hook

**When this path applies:** any non-trivial R — parquet that has been filtered, joined, transformed, or computed; in-memory DataFrames; subqueries; SQL scenarios with views.

**Cost:** the temp Lance write is real. At `|R|=1M, dim=128, narrow projection`, it's ~15-20 sec on a cluster (shared object store + Lance encoding). At `|R|=10M` it's ~2.5 min. **This cost dominates Path B**, and is the load-bearing reason Path C exists for the parquet case.

### 4.4 Path C: external index over parquet (R is a direct parquet/Delta scan)

The new path. Lance builds an IVF-PQ index *directly over the source parquet files*, no temp Lance dataset, no column copies. Lance owns the parquet reader.

**Hot path:**

1. **Build (once per logical "table version", amortized across queries):**
   - Lance reads the vector column from each parquet file via parquet-rs with `PageIndexPolicy::Required`
   - Trains kmeans + PQ codebook (lance-index existing APIs)
   - Streams (vec, rid) batches through `IvfTransformer` + `shuffle_dataset`
   - Writes `index.idx` (standard Lance IVF-PQ encoding) + `manifest.json` (parquet file list)
   - rid encoding: `(file_id_u32 << 32) | row_index_u32`

2. **Search (per query):**
   - IVF probe → top `K × refine_factor` candidates with PQ-approx scores
   - Decode rids back to `(file_path, row_index)`
   - Per-file: open parquet with `PageIndexPolicy::Required`, `RowSelection::from_consecutive_ranges` for the requested rows, read actual vectors
   - Compute exact L2 in lance-linalg SIMD
   - Return top-K as `Vec<SearchResult { file_path, row_index, distance }>`

3. **Materialize (post-topK fetch):**
   - Caller passes `(file_path, row_index)` pairs + projection columns
   - Lance batches by file, issues page-index-aware reads per file
   - Returns Arrow IPC stream → JVM decodes to Spark Rows
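
The rid encoding `(file_id_u32 << 32) | row_index_u32` used in the build and search steps can be illustrated with a small codec. A minimal Java sketch, assuming both components are unsigned 32-bit values carried in a `long`; `RidCodecSketch` is a hypothetical name and not the class shipped in the PR.

```java
/** Hypothetical sketch of the (file_id << 32) | row_index rid layout. */
final class RidCodecSketch {
    private static final long MAX_U32 = 0xFFFFFFFFL;

    private RidCodecSketch() {}

    /** Packs an unsigned 32-bit file id and row index into one 64-bit rid. */
    static long encode(long fileId, long rowIndex) {
        if (fileId < 0 || fileId > MAX_U32 || rowIndex < 0 || rowIndex > MAX_U32) {
            throw new IllegalArgumentException("fileId and rowIndex must fit in 32 unsigned bits");
        }
        return (fileId << 32) | rowIndex;
    }

    /** Upper 32 bits, read as unsigned. */
    static long fileId(long rid) {
        return rid >>> 32;
    }

    /** Lower 32 bits, read as unsigned. */
    static long rowIndex(long rid) {
        return rid & MAX_U32;
    }
}
```

One consequence of this layout is a cap of 2^32 rows per parquet file and 2^32 files per index. The manifest's file list is what maps a decoded `fileId` back to a `file_path`.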

**Spark side** (1 fused exec — no shuffle):

```
left.rdd
   ↓ ExternalFusedStage (per task: open external index, probe each left row,
                         batched fetch_rows for the partition, emit final Row)
```

Why no shuffle? Lance's `idx.search()` returns the **already-merged global top-K** for one query — the cross-partition merge happens inside Lance, not Spark. So the Path A shuffle (which exists to merge partial contributions from fragment-grouped probes) is vestigial here. The shuffle was inherited from the Path A pipeline; we removed it after benchmarking confirmed it was a passthrough.

**When this path applies:** R is `spark.read.parquet(path)` or `spark.read.format("delta").load(path)`, possibly with a passthrough `Project` and/or supported `Filter` on top. Anything else (joins, computed columns, unsupported predicates) falls through to Path B.

### 4.5 The routing decision tree

```
Catalyst rule (postHocResolutionRule, opt-in via spark.lance.knn.indexedNearestByJoin.enabled):

NearestByJoin(L, R, approx=true, k, ranking, direction) seen in plan
    │
    ├── unwrap R: SubqueryAlias / View / passthrough Project / supported Filter
    │
    ├── R unwraps to Lance DSv2 relation
    │   └── ▶ Path A (existing IndexedNearestJoin pipeline)
    │
    ├── R unwraps to parquet/Delta LogicalRelation (Phase 3, not yet shipped)
    │   ├── spark.lance.knn.externalIndex.enabled = true
    │   │   └── ▶ Path C (IndexedNearestJoinExternal)
    │   └── else fall through to Path B
    │
    ├── spark.lance.knn.tempRForSqlRule.enabled = true  (existing)
    │   └── ▶ Path B (LanceTempR.materialize → Path A)
    │
    └── else: fall through to Spark's default crossJoin rewrite
```

User-visible: a single `df.kNearestJoin(rightDf, ...)` API. Three execution strategies under the hood.
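
The decision tree above can be condensed into a pure function. The sketch below is a simplified Java model for illustration only: the real rule operates on Catalyst plans, and `KnnRoutingSketch`, `RightShape`, and `Path` are hypothetical names. The three booleans stand in for `spark.lance.knn.indexedNearestByJoin.enabled`, `spark.lance.knn.externalIndex.enabled`, and `spark.lance.knn.tempRForSqlRule.enabled`.

```java
/** Hypothetical model of the routing decision; not the actual Catalyst rule. */
final class KnnRoutingSketch {
    private KnnRoutingSketch() {}

    /** What R looks like after unwrapping aliases, views, passthrough projects, and filters. */
    enum RightShape { LANCE_SCAN, PARQUET_OR_DELTA_SCAN, OTHER }

    enum Path { A_LANCE_NATIVE, B_TEMP_LANCE, C_EXTERNAL_INDEX, SPARK_DEFAULT }

    static Path route(RightShape right,
                      boolean ruleEnabled,
                      boolean externalIndexEnabled,
                      boolean tempREnabled) {
        if (!ruleEnabled) {
            return Path.SPARK_DEFAULT;          // the rule is opt-in
        }
        if (right == RightShape.LANCE_SCAN) {
            return Path.A_LANCE_NATIVE;
        }
        if (right == RightShape.PARQUET_OR_DELTA_SCAN && externalIndexEnabled) {
            return Path.C_EXTERNAL_INDEX;
        }
        if (tempREnabled) {
            return Path.B_TEMP_LANCE;           // parquet without Path C also lands here
        }
        return Path.SPARK_DEFAULT;              // Spark's crossJoin rewrite
    }
}
```

The ordering matters: Path C is only tried for direct parquet/Delta scans, and Path B is the catch-all before falling back to Spark's default rewrite.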

### 4.6 Beyond join: single-query and batched-query retrieval

The same primitives that power the bulk join also serve a second use case that came up in feedback: **single-query retrieval** — one (or a small batch of) query vector(s) producing top-k rows. This is the RAG / search-bar / "nearest 10 docs to this prompt" shape. The bulk join is the wrong tool for it: shipping a 1-row DataFrame to executors, running one probe, shipping one row back is pure scheduler overhead.

The Rust core and JNI already accept a single query: `ExternalIvfPqIndex::search(query, k, nprobes, refine_factor, filter) -> Vec<SearchResult>`. PR #5 ships a thin Scala wrapper, `LanceParquetIndex`, that adapts the JNI surface to `DataFrame`-returning entry points so the result composes with downstream Spark transforms. Three caller patterns, in increasing order of "Spark-ness":

#### Pattern 1: any JVM caller, no Spark dependency

Index files are self-contained in object storage (`manifest.json` + `index.idx`). Any JVM process can open and search:

```java
try (ExternalIvfPqIndex idx = ExternalIvfPqIndex.open(indexUri)) {
  List<SearchResult> hits = idx.search(qvec, /*k*/ 10, /*nprobes*/ 16, /*refineFactor*/ 8, null);
  byte[] arrowIpc = idx.fetchRows(
      hits.stream().map(h -> ParquetRowKey.of(h.getFilePath(), h.getRowIndex())).toList(),
      List.of("doc_id", "title", "url"));
  // decode arrowIpc with ArrowStreamReader
}
```

No `SparkSession`, no `org.apache.spark` dependency on the classpath. Build the index once via Spark or an offline job, then serve queries from a Trino UDF, Presto plugin, a Java request-handler, a Jupyter kernel — anywhere the JVM runs. Per-query latency is bounded by IVF probe + 80 candidate refinement reads (~1-5 ms warm, ~50 ms cold-mmap).

This pattern is **fully shipped today** as part of Phase 1; no Spark wrapper is needed for it. We call it out in the SPIP because it's the load-bearing reason the index file format is engine-independent — Spark is the integration we're proposing here, but Lance's vector index over parquet is a primitive that other engines can reuse.

#### Pattern 2: Spark notebook / job, one query at a time

A user is in a Spark session (notebook, structured streaming sink, batch job) and wants top-k rows for a single query vector to feed into downstream Spark transforms (joins, projections, aggregations, UDFs). PR #5 ships `LanceParquetIndex` for this:

```scala
import org.lance.spark.knn.LanceParquetIndex

implicit val s: SparkSession = spark
val idx = LanceParquetIndex.buildIfMissing(
  spark,
  filePaths    = Seq("/data/embeddings-0.parquet", "/data/embeddings-1.parquet"),
  vectorColumn = "vec")
try {
  // 10 nearest rows to qvec, payload cols materialized via Lance's parquet reader
  val topK: DataFrame = idx.searchToDF(qvec, k = 10, projection = Seq("doc_id", "title", "url"))

  // Composes with normal Spark transforms downstream:
  val enriched = topK
    .join(spark.table("metadata"), Seq("doc_id"))
    .filter("category = 'tech'")
} finally idx.close()
```

**Why driver-side, not a Spark stage.** A single probe is ~1-5 ms warm (the JNI call + IVF probe + 80 candidate refinements + Arrow IPC encode). Wrapping that in a `parallelize(Seq(qvec)).mapPartitions { idx.search(...) }` would cost:

| Cost | Magnitude |
|---|---|
| Spark task launch | ~50-200 ms |
| Stage submission overhead | ~50-100 ms |
| Result block fetch back to driver | ~10-50 ms |
| **Spark scheduler floor** | **~110-350 ms total** |

versus ~1-5 ms for the driver-local call. The Spark wrapping isn't just "no benefit" — it's strictly slower by 30-100×. So `LanceParquetIndex.search` and `searchToDF` both run on the driver. The returned `DataFrame` has 1 partition (because k is bounded; nothing to parallelize), and from that point on it composes with the rest of the Spark plan — that's the only reason it's a `DataFrame` at all rather than a `Seq`.

**Returned schema:**

| `searchToDF` form | Schema |
|---|---|
| `searchToDF(qvec, k)` (no projection) | `(file_path STRING, row_index LONG, score FLOAT)` — keys + exact distance |
| `searchToDF(qvec, k, projection = Seq("doc_id", "title"))` | `(doc_id ..., title ..., score FLOAT)` — payload columns from parquet + score |

When `projection` is non-empty, the wrapper issues `idx.fetchRows(keys, projection)` after search and joins the payload into the row. The schema for the projection columns is read from the parquet footer (cached on first call). One Arrow IPC round-trip; no Spark shuffle.

**Index lifecycle is shared with the bulk-join path.** `buildIfMissing` keys the index file on `sha256(sorted file paths + vector column + params)`. If the bulk-join path (`IndexedNearestJoinExternal`) already built an index over the same files in this Spark application, `LanceParquetIndex.buildIfMissing` returns immediately with the existing URI. Cleanup is application-scoped (`SparkListenerApplicationEnd` + JVM shutdown hook).
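
A cache key of the form `sha256(sorted file paths + vector column + params)` can be sketched as follows. The exact serialization used by the PR is not specified here, so the field order, the UTF-8 encoding, the zero-byte separator, and the string form of `params` are all assumptions, and `IndexCacheKeySketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of an order-insensitive SHA-256 cache key for an index build. */
final class IndexCacheKeySketch {
    private IndexCacheKeySketch() {}

    static String cacheKey(List<String> filePaths, String vectorColumn, String params) {
        List<String> sorted = new ArrayList<>(filePaths);
        Collections.sort(sorted);                 // same file set in any order gives the same key
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
        for (String path : sorted) {
            update(sha256, path);
        }
        update(sha256, vectorColumn);
        update(sha256, params);
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    private static void update(MessageDigest md, String field) {
        md.update(field.getBytes(StandardCharsets.UTF_8));
        md.update((byte) 0);                      // separator: ("ab","c") differs from ("a","bc")
    }
}
```

Sorting the paths first is what lets the bulk-join path and `buildIfMissing` agree on a key even when they enumerate the files in different orders.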

**Cluster requirement.** `spark.lance.knn.externalIndex.dir` must point at a shared filesystem (s3://, abfss://, hdfs://, ...) when running on a non-local master, because the index file lives there and any Spark task that later reads it needs cross-executor visibility. The wrapper fails fast at `buildIfMissing` time if this isn't set on a non-local master.

**Filtering and deletes.** `search` accepts a packed `byte[]` of deleted `(file_id << 32) | row_index` rids — useful for honoring Delta deletion vectors / Iceberg position deletes without rebuilding the index. The `RowFilter` plumbing is the same as the bulk-join path (Phase 1).

**What this pattern is NOT.** This is not for batched query workloads — see Pattern 3. The driver call is sequential; running 10K queries through it on the driver is `10K × ~3 ms = 30 sec single-threaded`. For batches, use the bulk join, which distributes across executors.

#### Pattern 3: Spark-cluster caller, many independent queries

A *batch* of query vectors, e.g., 10K user queries to score against a 100M-row corpus, or "for each row in a small probe table, find its top-10 neighbors." This is **already Path C**: the queries play the role of L (distributed), and the indexed corpus plays the role of R.

```scala
val queries: DataFrame = ...          // small, distributed
val corpus:  DataFrame = spark.read.parquet(corpusPath)

// Shipping today (Phase 1.5):
val topK = queries.kNearestJoin(corpus, "qvec", "vec", k = 10)
```

Nothing new is required for this — the join already covers it, and the bulk-join path uses the same `ExternalIndexLifecycle` cache as `LanceParquetIndex.buildIfMissing`, so an index built once is shared across both. The only sugar we'd add later is a method-on-index alias for callers who prefer that shape:

```scala
// Phase 2 sugar (NOT shipped — sketch only):
val idx = LanceParquetIndex.buildIfMissing(spark, filePaths, "vec")
val topK: DataFrame = idx.searchBatch(queries, qCol = "qvec", k = 10)
// Equivalent to: queries.kNearestJoin(corpus, "qvec", "vec", k) under Path C
```

#### Index lifetime is shared across all three patterns

An index built once, by any caller, can be opened by any of the three patterns. The manifest (`manifest.json` next to `index.idx`) is self-describing: the caller doesn't need to know how it was built. This is the load-bearing reason the index lives in object storage as a self-contained directory, not inside a Lance dataset directory tree. A maintenance script can build the index nightly; a Spark job, a Java service, and a Trino UDF can all read it concurrently.

#### Surface area (in PR #5 today unless noted)

| Pattern | Class | Method | Status | Purpose |
|---|---|---|---|---|
| 1 | `org.lance.index.external.ExternalIvfPqIndex` | `build` / `open` / `search` / `fetchRows` | shipped (Phase 1) | JNI handle. No Spark dependency. Any JVM caller. |
| 2 | `LanceParquetIndex` | `build` / `buildIfMissing` / `open` | shipped | Scala factory. `buildIfMissing` shares the SHA-256-keyed cache with `IndexedNearestJoinExternal`. |
| 2 | `LanceParquetIndex` | `search(query, k, ...): Seq[SearchResult]` | shipped | Driver-side single-query top-k. |
| 2 | `LanceParquetIndex` | `searchToDF(query, k, projection?): DataFrame` | shipped | 1-partition DataFrame for pipeline composition. With `projection`, materializes payload via Lance's parquet reader. |
| 2 | `LanceParquetIndex` | `fetchRowsToDF(refs, projection): DataFrame` | shipped | Random-access fetch in caller order. |
| 3 | `IndexedNearestJoinExternal` | `apply(left, right, k, ...)` | shipped | Bulk join. Pattern 3 today via `df.kNearestJoin(...)`. |
| 3 | `LanceParquetIndex` | `searchBatch(queries, qCol, k, ...): DataFrame` | proposed (Phase 2 sugar) | Method-on-index alias for the bulk join. Not load-bearing — `kNearestJoin` already covers the use case. |

The driver-side surface (Pattern 2) is exercised by `LanceParquetIndexTest`: 5 tests covering JNI round-trip, DataFrame schema shape, projection materialization, `fetchRowsToDF` ordering, and cache reuse with `IndexedNearestJoinExternal`. Recall is exercised in `external_index_phase1.rs` on the Rust side; the Scala test focuses on what the wrapper itself can break.

## Q5: Why is each hot-path piece in Lance, not Spark?

This section answers the question more precisely than Q3.3 by walking through each step of the probe-refine-materialize loop and showing what Spark would have to do to replicate it.

### 5.1 SIMD distance kernels

Lance's `lance-linalg` crate ships AVX2 / NEON / FP16 implementations of L2, dot, and cosine distance. These are the kernels every probe call inside Lance hits. Examples from the Rust source:

- `lance-linalg/src/distance/l2.rs::l2()` — auto-detects target CPU features at runtime, dispatches to AVX2 unrolled loop on x86_64
- 8 `f32` lanes per AVX2 ymm register × 2 flops per FMA × 2 FMA ports = 32 flops per cycle on Skylake+
- Tested at 30-50 GFLOPS sustained on a single core at dim=128

The equivalent in Spark JVM:

- Best case: `udf((a: Seq[Float], b: Seq[Float]) => l2(a, b))` — Tungsten cannot vectorize this UDF. Each pair is ~500-1000 cycles in JIT'd code.
- With Java Vector API (Spark 3.5+ optionally): some auto-vectorization, but only for 128-bit lanes safely; higher-width incubator API isn't enabled by default in Spark builds.
- Spark Catalyst expressions for vector ops (Spark 4.2 added `VectorL2Distance`, `VectorCosineSimilarity`, `VectorInnerProduct`) — these CAN go to Tungsten codegen, but the kernels are not SIMD-tuned and run in the JVM's bytecode interpreter or JIT'd scalar paths.

**Honest comparison at dim=128**: Lance's native kernel is 30-50 GFLOPS; Spark's UDF path is 0.5-2 GFLOPS depending on JIT. **Roughly 15-100× slower per distance across those ranges, ~30× at the 30 vs 1 GFLOPS figures used below.**

For one query × `K × refine_factor = 80` distances at dim=128, that's:
- Lance: ~80 × 128 × 2 / 30 GFLOPS ≈ **0.7 microseconds**
- Spark UDF: ~80 × 128 × 2 / 1 GFLOPS ≈ **20 microseconds**

Per query the gap is 20μs — small. But at `|L|=1000` queries × refine, it's **20 ms vs 0.7 ms** of pure distance work, plus the UDF-call and row-population overhead per evaluation. And this is just one component of the wall.
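To make the arithmetic above concrete, here is a minimal Rust sketch of a squared-L2 kernel plus the flops-to-time conversion used in this section. It is an illustration only, not the `lance-linalg` implementation: the function names `l2_squared` and `distance_time_us` are hypothetical, and the eight-accumulator layout is just one common way to give the compiler room to auto-vectorize.

```rust
/// Squared L2 distance between two equal-length f32 slices.
/// Eight independent accumulators mirror the 8 f32 lanes of one AVX2 register;
/// this is plain scalar Rust, not Lance's hand-tuned kernel.
pub fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        for i in 0..LANES {
            let d = ca[i] - cb[i];
            acc[i] += d * d;
        }
    }
    // Tail elements when the dimension is not a multiple of LANES.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| (x - y) * (x - y))
        .sum();
    acc.iter().sum::<f32>() + tail
}

/// Microseconds needed for `n` distances of dimension `dim` at a sustained
/// `gflops` rate, counting 2 flops per element as the text does.
pub fn distance_time_us(n: u64, dim: u64, gflops: f64) -> f64 {
    (n * dim * 2) as f64 / (gflops * 1e9) * 1e6
}
```

With these definitions, `distance_time_us(80, 128, 30.0)` is about 0.68 and `distance_time_us(80, 128, 1.0)` is 20.48, matching the 0.7 μs and 20 μs figures for 20,480 flops per query.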

### 5.2 Page-index-aware parquet random access

Refinement reads ~80 specific rows per query out of a 1M-row parquet file. The two ways to do this:

| Method | What it costs | Where it's available |
|---|---|---|
| **Page index + RowSelection** (the right way) | ~16 KB read for the page index + ~80 × ~4 KB page reads = ~330 KB total per query | parquet-rs ≥58.3 with `PageIndexPolicy::Required`. Lance uses this. |
| **Row group filter + full decompression** | Decompress 1-10 row groups (~50K-500K rows × 512B vec = 25-250 MB) to extract 80 rows | parquet-mr (Spark's JVM library) today. **No fix in flight upstream.** |

**Lance's path**: ~330 KB I/O per query, decoded to ~80 vectors in microseconds.

**Spark's parquet-mr path**: ~25-250 MB I/O per query, decoded to full row groups of vectors, of which 80 are kept. **Roughly 75-750× more bytes read per query.**

This isn't a Spark configuration issue. parquet-mr genuinely doesn't support page-level row selection. Building it is a months-long parquet-mr contribution + Spark integration. Lance just ships parquet-rs as a dep and gets the right behavior immediately.
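parquet-rs expresses "read these rows, skip the rest" as a `RowSelection` made of alternating skip and select runs. The sketch below shows how a set of candidate row indices could be turned into such runs. It uses only the standard library: `Run` and `to_runs` are hypothetical stand-ins for illustration, not the parquet-rs types or the code Lance ships.

```rust
/// One run in a row selection: either skip `rows` rows or read them.
/// Hypothetical stand-in for a parquet-rs row selector.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Run {
    pub rows: usize,
    pub skip: bool,
}

/// Turn candidate row indices into alternating skip/select runs covering
/// `total_rows` rows. The input may be unsorted and may contain duplicates.
pub fn to_runs(mut wanted: Vec<usize>, total_rows: usize) -> Vec<Run> {
    wanted.sort_unstable();
    wanted.dedup();
    let mut runs: Vec<Run> = Vec::new();
    let mut cursor = 0usize; // first row not yet covered by a run
    for row in wanted {
        assert!(row < total_rows, "row index out of range");
        if row > cursor {
            runs.push(Run { rows: row - cursor, skip: true });
        }
        // Extend the previous select run when this row is adjacent to it.
        let extend = matches!(runs.last(), Some(last) if !last.skip);
        if extend {
            runs.last_mut().unwrap().rows += 1;
        } else {
            runs.push(Run { rows: 1, skip: false });
        }
        cursor = row + 1;
    }
    if cursor < total_rows {
        runs.push(Run { rows: total_rows - cursor, skip: true });
    }
    runs
}
```

For example, wanting rows 2, 3, and 7 out of 10 yields skip 2, select 2, skip 3, select 1, skip 2. A page-index-aware reader can then fetch only the pages that overlap a select run.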

### 5.3 PQ-encoded vector storage

Product Quantization (PQ) is an approximate vector encoding scheme:

- A `dim=128, num_sub_vectors=16` PQ codebook stores 256 representative codewords per 8-dim sub-vector
- Each vector is encoded as 16 bytes (one byte per sub-vector, 8 bits = 256 codewords)
- Distance to a query is approximated via lookup tables: precompute distance from query's sub-vector slice to each of 256 codewords, then sum 16 lookups per candidate

This compresses dim=128 vectors from 512 bytes to 16 bytes **and** makes distance scoring 32× faster (SIMD lookup vs full-vector L2).
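The lookup-table scoring described above can be sketched in a few lines of Rust. This is a simplified illustration under stated assumptions, not Lance's `ProductQuantizer`: the codebook is a plain nested `Vec` (`codebook[s][c]` is codeword `c` of sub-space `s`), and the names `pq_encode`, `build_table`, and `adc_distance` are hypothetical.

```rust
/// Squared L2 distance between two equal-length slices.
fn sq_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Encode `v` as one codeword index per sub-vector.
/// Assumes at most 256 codewords per sub-space so an index fits in a u8.
pub fn pq_encode(codebook: &[Vec<Vec<f32>>], v: &[f32]) -> Vec<u8> {
    let sub_dim = v.len() / codebook.len();
    codebook
        .iter()
        .enumerate()
        .map(|(s, words)| {
            let slice = &v[s * sub_dim..(s + 1) * sub_dim];
            let mut best = 0usize;
            let mut best_d = f32::INFINITY;
            for (c, w) in words.iter().enumerate() {
                let d = sq_l2(slice, w);
                if d < best_d {
                    best = c;
                    best_d = d;
                }
            }
            best as u8
        })
        .collect()
}

/// Per-query lookup table: table[s][c] is the squared distance from the
/// query's s-th slice to codeword c of sub-space s.
pub fn build_table(codebook: &[Vec<Vec<f32>>], query: &[f32]) -> Vec<Vec<f32>> {
    let sub_dim = query.len() / codebook.len();
    codebook
        .iter()
        .enumerate()
        .map(|(s, words)| {
            let slice = &query[s * sub_dim..(s + 1) * sub_dim];
            words.iter().map(|w| sq_l2(slice, w)).collect()
        })
        .collect()
}

/// Approximate distance to an encoded vector: one table lookup per sub-vector.
pub fn adc_distance(table: &[Vec<f32>], code: &[u8]) -> f32 {
    table.iter().zip(code).map(|(row, &c)| row[c as usize]).sum()
}
```

The table is built once per query. Every candidate after that costs one lookup and one add per sub-vector, which is why scoring PQ codes is so much cheaper than a full-vector distance.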

**Spark has no PQ.** To replicate, we'd need:

- A PQ codebook training algorithm (k-means × 16 sub-vectors, ~100 LoC)
- An on-disk format for PQ codes
- A lookup-table distance scorer
- Catalyst plan nodes for the IVF probe + PQ scoring

All of this exists in Lance. Reimplementing it in Spark is a 6-month project even if no one cares about SIMD performance, and at production scale you very much do care.

### 5.4 IVF posting list traversal

IVF stores K-means cluster centroids and a posting list per cluster. A query:

1. Scores against all `num_partitions` centroids (one big distance kernel call)
2. Picks the closest `nprobes`
3. For each probed cluster, walks its posting list applying PQ scoring to candidates

This is a tight loop. Lance does it in a single JNI call (`idx.search(query, k, ...)` returns a `Vec<SearchResult>`).
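The three steps above can be sketched as a single function. This is a toy model for illustration, not Lance's search code: `IvfIndex` and `ivf_search` are hypothetical names, and the posting lists hold raw vectors scored with exact L2 where a real IVF-PQ index would hold PQ codes scored through lookup tables.

```rust
/// Toy IVF index: `postings[p]` holds the (row id, vector) pairs assigned
/// to `centroids[p]`.
pub struct IvfIndex {
    pub centroids: Vec<Vec<f32>>,
    pub postings: Vec<Vec<(u64, Vec<f32>)>>,
}

/// Squared L2 distance between two equal-length slices.
fn sq_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Return the `k` closest (row id, distance) pairs found in the `nprobes`
/// partitions nearest to the query.
pub fn ivf_search(index: &IvfIndex, query: &[f32], nprobes: usize, k: usize) -> Vec<(u64, f32)> {
    // 1. Score the query against every centroid.
    let mut order: Vec<(usize, f32)> = index
        .centroids
        .iter()
        .enumerate()
        .map(|(p, c)| (p, sq_l2(query, c)))
        .collect();
    // 2. Keep the closest `nprobes` partitions.
    order.sort_by(|a, b| a.1.total_cmp(&b.1));
    order.truncate(nprobes);
    // 3. Walk each probed posting list and score its candidates.
    let mut hits: Vec<(u64, f32)> = Vec::new();
    for (p, _) in order {
        for (rid, v) in &index.postings[p] {
            hits.push((*rid, sq_l2(query, v)));
        }
    }
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}
```

Step 3 is the tight inner loop the text refers to: its cost scales with the number of rows in the probed posting lists, and it has no natural expression as a batched columnar operator.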

In Spark, doing this would require:

- A Catalyst plan node that opens the index file
- A custom physical operator that runs the loop (no Catalyst expression maps to "walk a posting list")
- Or: do it in a UDF, paying JVM-overhead per row in the posting list (~10K-100K rows per probed cluster)

The posting list is fundamentally **a tight inner loop**, not a Catalyst expression. Spark's architecture is built for big batched columnar operations; per-row tight loops with branches (like "is this candidate worth refining?") are precisely what falls off Tungsten codegen and into slow JVM interpretation.

### 5.5 Refinement — needs all of the above

Refinement is where 5.1, 5.2, 5.3, 5.4 stack:

```
For each query in left:
    1. IVF probe (5.4) → 80 candidate rids with PQ-approx scores
    2. Decode rids to (file_path, row_index)
    3. Per-file batch: parquet random access (5.2) → 80 actual vectors
    4. Compute 80 exact L2 distances (5.1) on contiguous Vec<f32> buffer
    5. Sort by distance, return top-K
```

If any one of these steps is in Spark (UDF, JVM, parquet-mr), the whole pipeline is bottlenecked by that step. There's no "do most of it in Lance and one piece in Spark" middle ground that performs well — the slowest step dominates.

This is why the Lance + Spark integration looks like "give Lance the input data, get the final results back" — not "do half the algorithm in one and half in the other."
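Steps 2 through 5 of the loop above can be sketched as follows. This is a hedged illustration rather than Lance's refinement code: `RowRef` as a `(file id, row index)` pair and the `fetch` callback that stands in for the parquet random-access read are both assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// A candidate row located by (file id, row index in file). Hypothetical layout.
pub type RowRef = (u32, u32);

/// Re-rank candidates by exact squared L2 distance and return the top `k`.
/// `fetch(file, rows)` must return one vector per requested row, in order.
pub fn refine<F>(query: &[f32], candidates: &[RowRef], k: usize, mut fetch: F) -> Vec<(RowRef, f32)>
where
    F: FnMut(u32, &[u32]) -> Vec<Vec<f32>>,
{
    // Group candidate rows per file so each file is read once.
    let mut per_file: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for &(file, row) in candidates {
        per_file.entry(file).or_default().push(row);
    }
    let mut scored = Vec::with_capacity(candidates.len());
    for (file, mut rows) in per_file {
        rows.sort_unstable();
        rows.dedup();
        let vectors = fetch(file, &rows);
        assert_eq!(vectors.len(), rows.len(), "fetch must return one vector per row");
        // Exact distances on the fetched vectors.
        for (row, v) in rows.iter().zip(&vectors) {
            let d: f32 = query.iter().zip(v).map(|(x, y)| (x - y) * (x - y)).sum();
            scored.push(((file, *row), d));
        }
    }
    // Sort by exact distance and keep the top k.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Because the fetch, the distance kernel, and the sort all sit inside one function, moving any one of them across the JVM boundary would put a per-candidate crossing inside the loop.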

### 5.6 Catalyst's reach: what we DID push into Spark planning

Spark contributes the parts it's actually good at:

- **Plan-shape recognition.** The Catalyst rule (`IndexedNearestByJoinRule`) detects `NearestByJoin(L, R, approx=true, ...)`, unwraps R through `SubqueryAlias`/`View`/passthrough `Project`, translates supported `Filter` predicates to a Lance-side prefilter SQL string, decides routing among Paths A/B/C.
- **Distribution + AQE.** Spark inserts the `Exchange` between probe and merge stages (Path A's pipeline). AQE's `CoalesceShufflePartitions`, `OptimizeSkewJoin`, `OptimizeShuffleWithLocalRead` engage on this exchange.
- **Final output assembly.** `LanceMaterializeExec` emits `RDD[InternalRow]` matching `NearestByJoin.output`, then a top-level `Project` strips the synthetic `__score` column. Catalyst handles the rest of the plan's lifecycle (caching, Catalyst optimizations on the SURROUNDING plan, `df.explain()`).
- **Filter pushdown.** Right-side `WHERE` predicates that translate to Lance SQL get threaded through to Lance's index-side prefilter. Predicates that don't translate cause the rule to refuse the rewrite entirely (refusal, not partial pushdown), and the query falls through to Spark's brute-force plan.

Spark does what it's good at (high-level distributed plan management, optimization rules, AQE, lifecycle); Lance does what it's good at (SIMD distance, vector index, parquet random access). The boundary between them is a thin RDD-level interface. This is the **right** decomposition.
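The "refusal, not partial pushdown" behavior in the filter-pushdown bullet can be illustrated with a small all-or-nothing translator. The real rule is a Scala Catalyst rule over Spark expressions; the `Pred` enum and `to_prefilter_sql` below are hypothetical and only model the control flow.

```rust
/// A tiny predicate tree. `Opaque` stands in for any predicate that has no
/// Lance SQL translation, such as a UDF call.
pub enum Pred {
    Eq(String, i64),
    Gt(String, i64),
    And(Box<Pred>, Box<Pred>),
    Opaque(String),
}

/// Translate the whole predicate or nothing. `None` means the caller should
/// refuse the rewrite rather than push down only the translatable part.
pub fn to_prefilter_sql(p: &Pred) -> Option<String> {
    match p {
        Pred::Eq(col, v) => Some(format!("{col} = {v}")),
        Pred::Gt(col, v) => Some(format!("{col} > {v}")),
        Pred::And(l, r) => {
            // `?` propagates a failure from either side to the whole tree.
            let left = to_prefilter_sql(l)?;
            let right = to_prefilter_sql(r)?;
            Some(format!("({left}) AND ({right})"))
        }
        Pred::Opaque(_) => None,
    }
}
```

A conjunction with one untranslatable leaf returns `None` as a whole. This avoids filtering on only part of the predicate before the top-K is taken, which could drop rows that the full predicate would have kept in the result.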

## Q6: Who cares?

Concrete users / use cases this proposal unlocks:

- **Spark KNN over parquet/Delta** — the primary case. Vector embeddings stored in parquet/Delta tables (the standard for Spark warehouses) get sub-second similarity search without migrating to Lance. (Path C.)
- **Spark on Lance datasets** — the existing case ([lance-format/lance-spark#541](https://github.com/lance-format/lance-spark/issues/541)). For users who already have data in Lance, no change. (Path A.)
- **Vector search on arbitrary subplans** — `Filter(Join(parquet, delta), ...)` — supports any DataFrame. (Path B.)
- **Iterative ML workflows** — train a model, score against existing embedding tables, write results, repeat. Each query benefits from the ~30s amortized index build instead of paying per-query temp-write costs.
- **Real-time inference services** running on Spark: warm batch latency of 600-900 ms (wide-medium, |L|=100) and ~5-9 sec (mega-medium, |L|=1000) makes interactive vector search practical.
- **Multi-modal retrieval** — text + image embeddings stored alongside payload columns. Path C reads only projected payload at materialize time, so wide R doesn't bottleneck.

The bigger picture: **vector data is becoming a first-class column type in data warehouses** (Snowflake's `VECTOR`, BigQuery's `EMBEDDING_DISTANCE`, Postgres pgvector). Spark needs to be in this story or it gets routed around. This proposal puts Spark on competitive footing without forcing users to migrate to a vector-native format.

## Q7: Benchmarks

### 7.1 Methodology

Cluster: Spark 3.5, 16 executors × 4 cores requested, shared file:// scratch volume. The bench runs an **executor-pool CPU probe** before timing — a fixed-cost compute loop on every task slot — and prints per-executor median compute. It aborts (or warns) if `slowest_executor_median / fastest_executor_median > 1.25×`, so reviewers can confirm the run was on a uniform pool before trusting the deltas. The numbers below were collected on a confirmed-uniform pool (1.01× spread on both scales).
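The uniformity gate reduces to a ratio of per-executor medians. The sketch below shows that check in isolation. The bench itself is Scala (`ExecutorCpuCheck`), so the function names here are hypothetical and only the 1.25× threshold comes from the text.

```rust
/// Median of a non-empty sample; sorts the slice in place.
pub fn median(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.total_cmp(b));
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

/// Ratio of the slowest executor's median to the fastest executor's median.
/// Each inner Vec holds one executor's timings for the fixed-cost loop.
pub fn pool_spread(per_executor: &mut [Vec<f64>]) -> f64 {
    let medians: Vec<f64> = per_executor.iter_mut().map(|s| median(s)).collect();
    let slowest = medians.iter().cloned().fold(f64::MIN, f64::max);
    let fastest = medians.iter().cloned().fold(f64::MAX, f64::min);
    slowest / fastest
}

/// True when the pool passes the gate, e.g. `max_spread = 1.25`.
pub fn pool_is_uniform(per_executor: &mut [Vec<f64>], max_spread: f64) -> bool {
    pool_spread(per_executor) <= max_spread
}
```

Two executors with medians of 100 ms and 101 ms give a spread of 1.01×, which passes. Medians of 100 ms and 130 ms give 1.30×, which fails the gate.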

Workload parameters:
- `dim=128`, `K=10`, `metric=L2`
- `nprobes=16`, `refine_factor=8` (search-time)
- IVF: 256 partitions, PQ: 16 sub-vectors × 8 bits, `max_iters=50`
- 16 string payload columns × 64 bytes each on R

Configs (all run on the SAME data in the SAME job per scale):

- **A**: Spark crossJoin + min_by_k baseline (skipped on wide payload — pair count dominates regardless)
- **B-narrow**: Path B (temp-Lance write) + `kNearestJoin`, project `rid` only
- **B-wide**: Path B + `kNearestJoin`, project `rid + 16 payload columns`
- **C-indexed-narrow**: Path A + IVF-PQ index (R already in Lance), project `rid` only — **apples-to-apples reference**
- **C-indexed-wide**: Path A + IVF-PQ index, project `rid + 16 payload`
- **E**: Path C (`IndexedNearestJoinExternal` over the same parquet files), project `rid` only
- **F**: MLlib `BucketedRandomProjectionLSH` baseline (skipped via `BENCH_SKIP_LSH=true` for cluster runs — see 3.2)

### 7.2 Results

Both scales were collected on a single cluster session (one job per scale) with 6 executors active and the pool-uniformity gate passing (≤1.01× spread). Numbers are from 2 measured runs per config (1 warmup discarded). With only two measured runs, the "Median" column reports the slower of the two.

#### Wide-medium (|R|=1M, |L|=100)

| Config | Run 1 | Run 2 | Median |
|---|---|---|---|
| B-narrow | 13,999 ms | 11,517 ms | **13,999 ms** |
| C-indexed-narrow | 913 ms | 1,212 ms | **1,212 ms** |
| **E warm** | 689 ms | 670 ms | **689 ms** |
| E build (one-time) | — | — | 29,366 ms |

**Speedups:** E vs B-narrow = **20.3×**. E vs C-indexed = **1.76× faster**.

#### Mega-medium (|R|=10M, |L|=1000)

| Config | Run 1 | Run 2 | Median |
|---|---|---|---|
| B-narrow | 98,334 ms | 107,227 ms | **107,227 ms (~1.8 min)** |
| C-indexed-narrow | 8,358 ms | 8,229 ms | **8,358 ms** |
| **E warm** | 8,602 ms | 8,454 ms | **8,602 ms** |
| E build (one-time) | — | — | 103,806 ms |
| C-indexed pre-work | — | — | 80,055 ms (Lance write + index build, outside timing) |

**Per-query** (dividing by |L|=1000):
- B-narrow: ~107 ms/query
- C-indexed-narrow: ~8 ms/query
- E warm: ~9 ms/query

**Speedups:** E vs B-narrow = **12.5×**. E vs C-indexed = **1.03× slower** (~equivalent — both runs of each config land within ~2% of each other).

#### Stages and tasks (mega-medium, observed in run logs)

| Config | Spark stages | Tasks per stage |
|---|---|---|
| B-narrow | ~5-6 stages | parquet read of R: 32+ tasks; Lance temp write: variable; probe: 1-2 tasks; shuffle exchange: 128 tasks; merge + materialize: 128 tasks |
| C-indexed-narrow | ~3-4 stages | probe: 5-10 tasks (= number of Lance fragments); shuffle: 128 tasks; merge + materialize: 128 tasks |
| **E warm** | **1 stage** (no shuffle) | **60 tasks** (left's natural partitioning for |L|=1000 with the source-size parallelism heuristic) |

E is the simplest pipeline. The fused stage (Path C) was a deliberate architectural choice after benchmarking showed the inherited Path A shuffle was vestigial here.

#### Per-run consistency (uniform-pool runs)

| Scale / config | Run 1 vs Run 2 spread |
|---|---|
| wide-medium B-narrow | 18% |
| wide-medium C-indexed | 25% |
| wide-medium E warm | 2.7% |
| mega-medium B-narrow | 8% |
| mega-medium C-indexed | 1.5% |
| mega-medium E warm | 1.7% |

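With the pool-uniformity gate in place, run-to-run spread stays under 3% for E warm at both scales and under 2% for C-indexed at mega-medium, tight enough for sub-second deltas to be meaningful. The wide-medium B-narrow and C-indexed figures (18-25% spread) should be read more coarsely.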
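The speedup and spread figures in this section can be recomputed from the raw run times. The sketch below assumes spread is defined as the gap between the two runs divided by the slower run, which reproduces the table to within rounding. The function names are hypothetical.

```rust
/// Speedup of a config taking `fast_ms` over one taking `slow_ms`.
pub fn speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}

/// Run-to-run spread as a fraction of the slower run.
pub fn spread(run1_ms: f64, run2_ms: f64) -> f64 {
    let slower = run1_ms.max(run2_ms);
    let faster = run1_ms.min(run2_ms);
    (slower - faster) / slower
}
```

For example, `speedup(13999.0, 689.0)` is about 20.3 (E vs B-narrow at wide-medium), and `spread(913.0, 1212.0)` is about 0.25 (wide-medium C-indexed).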

### 7.3 What the numbers mean — and what they don't

**Reproducible findings (uniform-pool, single session per scale):**

1. **B-narrow's per-join cost grows roughly linearly with |R|.** Temp-Lance write of (rid + rvec) is bandwidth-bound on shared scratch. At 1M rows it's ~14s; at 10M rows it's ~107s. Predictable, dominant.
2. **E warm's per-query cost is sub-linear in |R|.** Per-query went from ~7 ms at wide-medium to ~9 ms at mega-medium despite |R| growing 10×, because each query scores only `nprobes × partition_size` PQ codes (16 of the 256 partitions) plus `K × refine_factor` exact distances, never all of `|R|`.
3. **E vs B-narrow speedup is stable at 12-20×** across both scales. The headline finding: external-index avoids the temp-Lance write, and that write is what dominates B's cost.
4. **E vs C-indexed depends on scale.** At wide-medium E is ~1.76× faster (689 ms vs 1,212 ms), though C-indexed's two runs differ by ~300 ms, so the ratio falls between roughly 1.3× and 1.8× depending on which runs are compared. At mega-medium they're indistinguishable (E is 1.03× slower, within run-to-run noise of each config). **The "stay in parquet" path is not a perf penalty.**

**Why E might be slightly faster than C-indexed at smaller scale:**
- Path A (C-indexed) has 3-4 Spark stages with ~50-200 ms scheduler overhead each (probe → shuffle → merge → materialize). Path C (E) has 1 fused stage. At wide-medium where the per-query work is small, that overhead is a meaningful fraction of total time.
- Path A's shuffle serializes `(leftId, ScoredRowRef)` for `K × probeParallelism × |L|` refs through Spark's network stack. Path C keeps refs in-JVM.
- At mega-medium the per-query work is large enough that these constant-cost differences vanish into the noise.

**What these numbers do NOT prove:**

- **Cold-cache parquet read performance.** All numbers above are warm. Production S3/ABFS cold-cache is unmeasured and likely ~2-3× slower for E's first-after-cold-restart query.
- **LSH numbers.** Skipped from cluster runs; estimated 30-60s per query at |R|=1M based on `approxSimilarityJoin`'s O(|R|) shuffle cost.
- **Long-tail latency.** 2 measured runs is enough to distinguish 12× speedups from 1× speedups but not enough to characterize the p99 distribution. For latency-SLO-bound workloads, more runs are needed.

**Methodology footnote.** The pool-uniformity probe (`ExecutorCpuCheck`) is part of `IndexedNearestJoinExternalBenchmark` and prints a per-executor median table at the start of every run. Re-running the bench in your own cluster will produce a similar artifact you can use to decide whether the run's numbers are trustworthy.

### 7.4 Cloud-storage (abfss) follow-up — Phase 1.6 perf optimizations

The 7.1/7.2 numbers above are uniform-pool runs against `file://` cluster scratch — no cloud-storage round-trip per refinement read. When Path C runs against `abfss://` source parquet on Databricks (DBR 16.4 LTS, 2 workers × 4 cores), the per-query refinement I/O dominates: each query's 80 candidate vectors land on ~30-40 parquet pages within the row group, the page-index-aware reader coalesces them into one ~50 MB byte-range fetch, and the abfss round-trip cost shows up directly in wall-clock.

Tracking this on `wide-tiny-l10` (|R|=100K, |L|=100, dim=128, 16 payload cols) on a 2-worker × 4-core DBR cluster:

| Optimization shipped (cumulative) | E warm median | Δ vs baseline |
|---|---|---|
| abfss baseline (before any cloud-specific tuning) | ~12,000 ms | — |
| `CoalescingParquetReader` with 64 MiB gap + 32-parallel `coalesce_ranges` | 8,682 ms | -28% |
| `ParquetMetaCache` (per-task footer + page-index reuse across queries) | 7,471 ms | -38% |
| Tuned coalesce gap to 4 MiB + dispatched PQ scoring through `compute_distances` | 7,201 ms | -40% |
| **`search_batch` — per-task unioned refinement read** | **2,978 ms** | **-75%** |

local-fs reference for the same scale: ~1,200 ms. After the batch fix the abfss path is ~2.5× over local-fs (vs. ~10× before), which matches what we expected: the residual gap is now CPU-bound on probe+PQ, not I/O.
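The coalescing rows in the table above rest on one idea: merge byte ranges that sit close together, so fewer and larger requests reach the object store. A minimal sketch of that merge follows, using only the standard library. `coalesce` is a hypothetical name, not the `CoalescingParquetReader` code, and the real reader also bounds fetch concurrency.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap to the previous range is at most `max_gap`
/// bytes. The input may be unsorted; the output is sorted and non-overlapping.
pub fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                // Close enough: extend the previous request over the gap.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

A larger gap trades wasted bytes for fewer round trips. At 30-50 ms per abfss round trip that trade usually pays off, up to the point where the merged read approaches the whole column.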

#### What changed in each step

1. **CoalescingParquetReader**: `ParquetObjectReader` inherits the `object_store` crate's range-coalescing defaults (`OBJECT_STORE_COALESCE_DEFAULT = 1 MiB` and 10-way parallelism), which are tuned for cluster-local NVMe. On abfss where each round-trip is 30-50 ms, we wrap `ParquetObjectReader` to merge ranges with a tunable gap and fetch with a tunable concurrency.

2. **ParquetMetaCache** — every per-query `open_parquet_async` was re-doing `ObjectStore::from_uri` + `head` + footer + page-index fetch (3 abfss round trips). Cache lives on `OpenedExternalIndex` and stores `(ObjectStore, Path, file_size, ArrowReaderMetadata)`. On hit, `ParquetRecordBatchStreamBuilder::new_with_metadata` skips all three round trips. Same shape as Spark's `ParquetIOMetadataCache`, DuckDB's `parquet_metadata_cache`, and parquet-rs's own `new_with_metadata` example.

3. **SIMD PQ via `compute_distances`** — replaced the per-row scalar scoring loop in `search.rs` with `ProductQuantizer::compute_distances`, which dispatches to the SIMD distance kernels in `lance-index`. Required transposing PQ codes from row-major (on-disk) to column-major before scoring; one transpose per partition is far cheaper than N scalar scoring loops. Net win at 8-bit PQ was small (the SIMD path mostly fires for 4-bit) but the code is now dispatching to the right kernel for any future change in PQ width.

4. **`search_batch`** — the dominant fix. Adds `pub async fn search_batch(opened, queries, k, nprobes, refine_factor, filter) -> Vec<Vec<SearchResult>>` in `lance::index::vector::external::search`, with matching JNI (`nativeSearchBatch`), Java (`ExternalIvfPqIndex.searchBatch`), and Scala (`ExternalIndexProbe.probeBatch`) bindings. The fused Spark stage (`ExternalFusedStage.fusedPartition`) now collects the entire partition's left rows + queries first, then issues ONE batched probe call. Inside Lance, the batch path:
   - Probes + PQ-scores each query independently
   - **Unions the candidate `(file_path, row_in_file)` pairs across all queries**
   - Issues ONE refinement read per distinct file covering the unioned set
   - Looks each query's candidates up from the shared per-file buffer for exact distance + top-K

   Per-task cost goes from `N × (probe + pq + refine_io)` to `N × (probe + pq) + refine_io`. With a 12-13-query/task batch, the I/O term collapses by ~12×.
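The union step and the cost model above can be sketched as follows. This is an illustration under assumptions, not the shipped `search_batch`: `RowRef` as a `(file id, row in file)` pair and all function names are hypothetical. The cost functions simply encode the two formulas from the text.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A candidate located by (file id, row in file). Hypothetical layout.
pub type RowRef = (u32, u32);

/// Union the candidates of every query in the batch and group them per file,
/// so each file needs one refinement read covering the deduplicated rows.
pub fn union_per_file(per_query: &[Vec<RowRef>]) -> BTreeMap<u32, Vec<u32>> {
    let mut sets: BTreeMap<u32, BTreeSet<u32>> = BTreeMap::new();
    for candidates in per_query {
        for &(file, row) in candidates {
            sets.entry(file).or_default().insert(row);
        }
    }
    sets.into_iter()
        .map(|(file, rows)| (file, rows.into_iter().collect()))
        .collect()
}

/// Per-task cost before batching: N × (probe + pq + refine_io).
pub fn per_query_cost_ms(n: u32, probe_pq_ms: f64, refine_io_ms: f64) -> f64 {
    n as f64 * (probe_pq_ms + refine_io_ms)
}

/// Per-task cost after batching: N × (probe + pq) + refine_io.
pub fn batched_cost_ms(n: u32, probe_pq_ms: f64, refine_io_ms: f64) -> f64 {
    n as f64 * probe_pq_ms + refine_io_ms
}
```

Plugging in the per-batch log values (12 queries, ~150 ms probe+PQ each, ~470 ms for the refinement read) gives 2,270 ms for the batched form. If each per-query call paid a comparable read, the unbatched form would cost 7,440 ms, in line with the ~7.2 s to ~3.0 s drop in the table.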

#### Per-batch breakdown (from `RUST_LOG=lance::index::vector::external=info` driver/executor logs)

```
extidx_search_batch n_queries=12 total=~2300ms
  probe_pq    ≈ 1800 ms   (12 × ~150 ms — pure CPU, sequential within task)
  union/topk  ≈    0 ms   (in-memory hash-set + sort)
  refine_open ≈   40 ms   (one ParquetMetaCache miss per task, then hits)
  refine_io   ≈  470 ms   (ONE ~50 MB byte-range fetch covering all candidates)
```

`refine_files=1` always (one source parquet at this scale); `candidate_pairs=960` (12 queries × 80 refine candidates). After dedup the per-file refinement read covers ~920 unique rows.

#### Why we stopped at -75%

Per-task probe+PQ is now 1.8 s of the 2.3 s batch wall-clock — **CPU-bound, not I/O**. Spark's task-per-core execution model already saturates all 8 cluster cores with 8 concurrent tasks, so intra-task parallelism (rayon across a batch's queries) would just steal from the other 7 tasks. Further gains require either more cluster cores, dropping to 4-bit PQ (better SIMD utilization), or a recall trade-off via lower `nprobes`/`refine_factor`. None of those are appropriate as default tunings; they're per-workload knobs.

#### Code changes shipped under Phase 1.6

- `rust/lance/src/index/vector/external/parquet_source.rs` — `CoalescingParquetReader`, `ParquetMetaCache`, `open_parquet_cached`
- `rust/lance/src/index/vector/external/open.rs` — `OpenedExternalIndex` owns the cache for its handle lifetime
- `rust/lance/src/index/vector/external/search.rs` — `search_batch` + per-batch `log::info!` timing diagnostics; `search` reduced to a thin wrapper over `search_batch(&[query])`
- `rust/lance/src/index/vector/external/mod.rs` — `ExternalIvfPqIndex::search_batch` public API
- `java/lance-jni/src/external_index.rs` — `nativeSearchBatch`
- `java/src/main/java/org/lance/index/external/ExternalIvfPqIndex.java` — `searchBatch(float[][], k, nprobes, refineFactor, deletedRids)`
- `lance-spark-knn_2.12/src/main/scala/org/lance/spark/knn/internal/ExternalIndexProbe.scala` — `probeBatch`
- `lance-spark-knn_2.12/src/main/scala/org/lance/spark/knn/internal/ExternalFusedStage.scala` — `fusedPartition` collects all queries before one batch probe call
- New test: `search_batch_matches_per_query_search` asserts batched + per-query results agree exactly.

## Q8: Risks

### Architectural

| Risk | Mitigation |
|---|---|
| **Three paths are too many to maintain** | Each path has clearly distinct preconditions; the routing rule is mechanical. The shared probe + merge + materialize pipeline (Phase 1.5 staged execs) is reused across A and B. C is intentionally simpler (1 stage, no shuffle). Total LoC is comparable to a single path with all features, just split across files. |
| **Catalyst rule complexity** | The rule is opt-in (`spark.lance.knn.indexedNearestByJoin.enabled=false` by default). Users who don't enable get Spark's default crossJoin rewrite — no behavior change. Once enabled, the rule's pattern match is conservative (refusal not partial pushdown). |
| **Cross-language test surface** | Rust unit tests + JNI smoke tests + Scala integration tests cover the JNI surface. End-to-end tests in `IndexedNearestJoinExternalTest` exercise the full path. Cluster bench provides regression coverage. |

### Implementation

| Risk | Mitigation |
|---|---|
| **parquet-rs corner cases** (encrypted parquet, dictionary-encoded, INT96 timestamps) | Document supported feature set; clear errors for unsupported. Encrypted parquet specifically is out of scope for Phase 1. |
| **Concurrent reads on the same parquet file** require separate `Arc<ParquetObjectReader>` clones (parquet-rs `&mut self` on `get_bytes`) | Phase 2 reader cache keyed on `(file_id, session_id)`. Phase 1 constructs fresh per call — wasteful but simple. |
| **Snapshot consistency** drift between build and search | Phase 2 manifest fingerprint (size + mtime + footer hash) validation at `open()`. Documented "rebuild on snapshot change" workflow. |
| **JVM parquet writers emit `List<Float>` not `FixedSizeList<Float>`** | `parquet_source::coerce_to_fsl` accepts both; validates row-length uniformity. Phase 1. |
| **Cross-compile complexity** for the JNI .so | `cargo zigbuild --target x86_64-unknown-linux-gnu --release` works from macOS arm64. Documented for contributors. Upstream CI publishes multi-arch jars. |

### Operational

| Risk | Mitigation |
|---|---|
| **Disk quota on cluster scratch** | Bench's `cleanupSiblingScratchDirs` uses a strict pattern match and sweeps `knn-bench-data-*` siblings only. Production users size `spark.lance.knn.tempR.dir` and `spark.lance.knn.externalIndex.dir` per their cluster. |
| **Multi-tenant cluster variance** dominates fine-grained comparisons | Bench reports run-to-run ranges. Median across multiple runs for headline. Honest about which deltas are within noise. |
| **Index file format compatibility** across lance-rs versions | Index uses standard Lance IVF-PQ encoding (`pb::Index` protobuf). The manifest format is versioned (`manifest_version: 1`). Both are tested for round-trip across the open path. |

### Strategic

| Risk | Mitigation |
|---|---|
| **Lance-c ecosystem maturity** (Velox PR, Gluten support) overtakes the JNI path | The Catalyst rule's logical contract is engine-independent; lowering to a Velox-native exec when that ecosystem matures requires only a Strategy entry, not a redesign. The JVM JNI path keeps working alongside. |
| **Spark 4.x evolution** breaks the rule injection | The current rule targets Spark 4.2-SNAPSHOT for the `NearestByJoin` operator. For Spark 3.5, lance-spark uses a different Catalyst integration (the staged execs work in both). We test against multiple Spark versions in CI. |

## Q9: Implementation status

### Phase 1 (lance-rs): ✅ end-to-end shipped

- `rust/lance/src/index/vector/external/` — new module: `mod.rs`, `types.rs`, `params.rs`, `parquet_source.rs`, `manifest.rs`, `build.rs`, `open.rs`, `search.rs`, `fetch.rs` (~1,100 LoC)
- `rust/lance/src/index/vector/ivf.rs` — new `pub fn write_ivf_pq_file_external(...)` parallel to the existing `write_ivf_pq_file_from_existing_index` but with no `&Dataset` arg
- `rust/lance/Cargo.toml` — parquet promoted from dev-dep to runtime dep
- 14 unit tests + 1 integration test (`tests/external_index_phase1.rs`); all passing
- Branch: `external-index-rfc-draft` on [sezruby/lance](https://github.com/sezruby/lance/pull/1)

### Phase 1.5 (JNI + Java surface): ✅

- `java/lance-jni/src/external_index.rs` — `nativeBuild`, `nativeOpen`, `nativeClose`, `nativeSearch`, `nativeFetchRows` + accessors
- `java/src/main/java/org/lance/index/external/` — `ExternalIvfPqIndex`, `ExternalIvfPqIndexParams`, `SearchResult`, `ParquetRowKey`
- 5 JNI smoke tests; all passing

### Phase 1.5 (lance-spark integration): ✅

Bulk join (Pattern 3):
- `IndexedNearestJoinExternal.scala` — public DataFrame API for `df.kNearestJoin(parquetCorpus, ...)`
- `internal/ExternalIndexProbe.scala` — Scala wrapper around the JNI handle
- `internal/ExternalIndexLifecycle.scala` — driver-side cache (SHA-256-keyed by file paths + col + params) + cleanup hook
- `internal/ExternalFusedStage.scala` — single-stage probe + materialize, source-size-driven parallelism
- `internal/LanceVectorIndexBuilder.scala` — promoted from test/ for bench use
- `IndexedNearestJoinExternalTest.scala` — end-to-end test passing
- `benchmark/IndexedNearestJoinExternalBenchmark.scala` — A vs B-narrow vs B-wide vs C-indexed-narrow vs C-indexed-wide vs E vs F (LSH)

Driver-side single-query API (Pattern 2):
- `LanceParquetIndex.scala` — `build` / `buildIfMissing` / `open` / `search` / `searchToDF` / `fetchRowsToDF`. Hooks into `ExternalIndexLifecycle` so the index file is shared with the bulk-join path within one Spark application.
- `LanceParquetIndexTest.scala` — 5 tests: JNI round-trip, DataFrame schema shape, projection materialization, `fetchRowsToDF` caller order, and `buildIfMissing` cache reuse.

Branch: `knn-external-index` on [sezruby/lance-spark](https://github.com/sezruby/lance-spark/pull/5).

### Phase 1.6 (cloud-storage perf): ✅ shipped

Optimization layer on top of Phase 1.5: same external-index API, same recall, and abfss/cloud-storage wall-clock now ~75% lower than the baseline without cloud-specific tuning. See [Q7.4](#74-cloud-storage-abfss-follow-up--phase-16-perf-optimizations) for the timeline and per-batch breakdown.

- `parquet_source.rs` — `CoalescingParquetReader` (tunable coalesce gap + parallelism), `ParquetMetaCache` (per-handle footer reuse)
- `search.rs` — `search_batch` for unioned per-task refinement; `search` reduced to a thin wrapper
- JNI + Java + Scala bindings for `searchBatch` / `probeBatch`
- `ExternalFusedStage.fusedPartition` issues one batched probe call per Spark task instead of N per-query calls
- Per-batch timing logs gated on `LANCE_LOG=lance::index::vector::external=info` for production diagnostics

### What's NOT in Phase 1

- Catalyst rule routing for parquet/Delta direct scans → Path C (Phase 3)
- Persistent index reuse across Spark sessions (Phase 2 — needs manifest fingerprint validation in lance-rs)
- HNSW / IVF-FLAT external builders (Phase 2/3)
- `append()` / `compact()` for incremental index updates (Phase 4)
- Distributed index build via Spark executors (Phase 4)
- Reader cache keyed on `(file_id, session_id)` (Phase 2)

## Q10: Phasing

| Phase | Goal | LoC est | Status |
|---|---|---|---|
| **1 (lance-rs)** | IVF-PQ external index over parquet, with refinement | ~1,100 + ~250 tests | ✅ done |
| **1.5 (JNI + Spark)** | JNI bindings + Java surface + DataFrame API + bench | ~600 (Rust+Java) + ~700 (Scala) + ~450 tests | ✅ done |
| **1.6 (cloud perf)** | search_batch + ParquetMetaCache + CoalescingParquetReader; abfss path 12s→3s on wide-tiny | ~250 + tests | ✅ done |
| **2** | Manifest fingerprint validation + persistent-index reuse + reader cache | ~400 LoC + tests | pending |
| **3** | Catalyst rule routing for Path C (parquet/Delta auto-detect) | ~400 LoC + tests | pending |
| **4** | Distributed index build (Spark executors) + `append()` / `compact()` | ~1,200 LoC + tests | pending |

## Open questions

1. **Persistent index across Spark sessions.** Phase 2 ships fingerprint validation; the lifecycle is per-application today. Should there be a Hive-metastore-style registration (`spark.lance.knn.externalIndex.catalog`) so multiple apps share an index? Or is it sufficient to expose the URI and let users wire their own catalog?
2. **Should Path C support Delta deletion vectors natively?** Today: `RowFilter` byte-array of deleted rids — caller materializes the bitmap from Delta's snapshot. Native Delta integration would need a `lance-spark-delta` companion. Probably belongs as a follow-up rather than core.
3. **Path B's temp dir cleanup contention.** Multiple jobs writing to the same scratch root can race. Per-app namespacing exists; per-query namespacing would add safety. Worth it?
4. **`probeParallelism > 1` for Path C.** Path A supports this (fragment-grouped probe); Path C doesn't yet because Lance's `idx.search()` already merges across IVF partitions internally. But for very large `|L|` (millions), having multiple Spark tasks share the work for one left row would reduce per-row latency. Phase 3 candidate.
5. **HNSW external builder.** Same shape as IVF-PQ but different on-disk encoding. Phase 3 — defer until use case is concrete.
6. **What's the right "stay in parquet vs migrate to Lance" decision rule** for documentation? Current draft: "if R is queried >2 times and writes are infrequent, Path C wins on TCO; if R is queried once-and-discarded, Path B is fine; if you can migrate cleanly and keep R updated in Lance, Path A wins per-query latency."

## References

### This proposal's artifacts

- **lance-rs Phase 1 PR (DRAFT)**: https://github.com/sezruby/lance/pull/1
- **lance-spark integration PR (DRAFT)**: https://github.com/sezruby/lance-spark/pull/5
- **Tracking issue for the PRs**: https://github.com/sezruby/lance-spark/issues/6
- **External-index RFC + benchmark numbers (Path C)**: https://github.com/sezruby/lance-spark/issues/4
- **Per-query temp-Lance write (Path B)**: https://github.com/sezruby/lance-spark/issues/2 (PR #3)

### Upstream

- lance-format/lance-spark#541 — original indexed `NearestByJoin` work that's the foundation of Path A
- [Lance distributed indexing](https://lancedb.github.io/lance/guide/distributed_indexing/)
- [Velox Lance read connector](https://github.com/facebookincubator/velox/pull/16556)
- [lance-c (Lance-as-library)](https://github.com/lance-format/lance/pull/6254)
- Spark MLlib LSH source: [LSH.scala](https://github.com/apache/spark/blob/v3.5.5/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala)

### Background reading

- Spark `NearestByJoin` operator (Spark 4.2): the logical operator this work attaches a strategy to
- Spark Catalyst `postHocResolutionRule` injection point: where the rule lives in the analyzer/optimizer pipeline
- parquet-rs ≥58.3 page index APIs: [`PageIndexPolicy`](https://docs.rs/parquet/58/parquet/file/metadata/enum.PageIndexPolicy.html), [`RowSelection`](https://docs.rs/parquet/58/parquet/arrow/arrow_reader/struct.RowSelection.html)
- Lance's IVF-PQ index format: shared between Path A (Lance dataset) and Path C (external) — same on-disk encoding




