feat(knn): per-query temp Lance write for non-Lance R (issue #2) by sezruby · Pull Request #3 · sezruby/lance-spark

sezruby · 2026-05-21T15:32:49Z

Implements #2 — generalize the indexed NearestByJoin path to non-Lance right sides via per-query temp Lance materialization.

Summary

Today the indexed pipeline only fires when R is a Lance scan. With this PR, any DataFrame on the right side works — parquet, delta, in-memory, the result of an arbitrary upstream Spark plan. The right side is materialized once via df.write.format("lance").save() to a temp Lance dataset, then the existing probe/merge/materialize pipeline runs against the temp URI. Same data path on the wire; same Catalyst-visible plan shape.

Design

Catalyst rule sees: NearestByJoin(L, R, ...)

If R is already a Lance scan:
    existing path                                              (no change)

Else:
    LanceTempR.materialize(R, vecCol, projection, scratchDir)  (NEW)
       → temp Lance dataset on shared scratch storage
    Probe → Merge → Materialize against temp Lance URI         (existing)
    LanceTempLifecycle: cleanup on SparkListenerApplicationEnd (NEW)

See issue #2 for the materialize-options analysis (M1 carry-in-temp vs M2 hash-join vs M3 custom-shuffle) — only M1 handles the subplan-R case, which is the load-bearing constraint.

Stages (5 commits)

Commit	What	Tests
`caa3ad5`	`IndexedNearestJoinTempRBenchmark` (3-config end-to-end bench)	bench
`b904365`	Stage 1: `LanceTempR.materialize` + `resolveScratchDir` helper	11 new
`91e25ef`	Stage 2: `kNearestJoin` extension transparently handles non-Lance R	3 rewritten
`5adb387`	Stage 3: `LanceTempLifecycle` query-scoped cleanup	6 new
`6b6ea74`	Stage 4: SQL `IndexedNearestByJoinRule` extension for non-Lance R	3 new

Test status

lance-spark-knn_2.12: 77/77 pass (60 existing + 17 new)
lance-spark-knn-4.2_2.13: 20/20 pass (17 existing + 3 new)

Configuration

spark.lance.knn.tempR.dir — scratch path for temp Lance datasets. Required in cluster mode (any path every executor can read+write — s3://, abfss://, file:///shared-mount/..., hdfs://...). Local mode falls back to spark.local.dir.
spark.lance.knn.tempRForSqlRule.enabled — opt-in for the SQL rule's temp-R path (the DataFrame kNearestJoin extension does it automatically; the SQL path is gated separately because Catalyst rule → eager write is unusual).
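The scratch-dir rule above can be sketched as follows. This is a minimal illustration, not the PR's code: `ScratchDirSketch` is a hypothetical name, a plain `Map` stands in for the Spark conf so the block runs without Spark on the classpath, the error text is invented, and treating a missing `spark.master` as cluster mode is an assumption.

```scala
// Hypothetical stand-in for LanceTempR.resolveScratchDir.
// A Map[String, String] replaces SparkConf so the sketch is self-contained.
object ScratchDirSketch {
  val TempRDirKey = "spark.lance.knn.tempR.dir"

  def resolveScratchDir(conf: Map[String, String]): String = {
    // Assumption: anything that is not an explicit local master counts as cluster mode.
    val isLocal = conf.get("spark.master").exists(_.startsWith("local"))
    val explicit = conf.get(TempRDirKey).map(_.trim).filter(_.nonEmpty)

    explicit match {
      case Some(dir) =>
        dir // an explicit setting wins in every mode
      case None if isLocal =>
        // Local mode: fall back to spark.local.dir plus a lance-temp-r subdirectory.
        // Assumption: use the JVM temp dir when spark.local.dir is unset.
        val base = conf.getOrElse("spark.local.dir", System.getProperty("java.io.tmpdir"))
        base.stripSuffix("/") + "/lance-temp-r"
      case None =>
        // Cluster mode: executors need a shared path, so refuse to guess one.
        throw new IllegalArgumentException(
          s"$TempRDirKey must be set when running in cluster mode")
    }
  }
}
```

Failing fast in cluster mode, rather than defaulting to a node-local directory, avoids a temp dataset that only the writing executor can read.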

Local validation (M5 Max, tiny scale)

From issue #2:

A: Spark crossJoin + min_by_k (parquet R)   28,231 ms   1.0×
B: temp Lance + kNearestJoin                   323 ms  87.4×   (36 ms write + 287 ms probe)
C: Lance-native R + kNearestJoin               267 ms  105.7×

(B − C) = 56 ms is the pure cost of the temp path versus already-Lance R. Cluster numbers are blocked on infra (OpenShift CSI / PVC reconciler errors); will be added once the cluster is unstuck.

Out of scope (issue #2 follow-ups)

Cached / 1:1 file-pair Lance sidecar over parquet/delta R (uses _metadata.row_index for cross-execution rid stability)
Random-access reader for parquet (alternative to a sidecar)
Native Rust parquet→Lance copier
Velox / Gluten path

Ready for

Initial review of the design choices and the rule extension's materializeNonLanceR shape. Once approved, ready to mark non-draft and squash for upstream submission per the existing 7-PR plan.

Three-config benchmark validating the per-query temp Lance design from #2: same data, same job, three execution paths.

A: Spark crossJoin + min_by_k on parquet R (brute-force baseline)
B: per-query temp Lance write + kNearestJoin against the temp URI
C: Lance-native R + kNearestJoin (already-Lance reference)

(B - C) = pure temp-write overhead. (B vs A) = headline speedup vs the naive parquet-R approach.

Tiny scale local (M5 Max, 5 repeats):

A   28,231 ms
B      323 ms   (36 ms temp write + 287 ms probe)
C      267 ms

B beats A 87x; (B - C) overhead = 56 ms.

Cluster mode supported via BENCH_CLUSTER_MODE=true + BENCH_DATA_PATH; cluster numbers blocked on infra (OpenShift CSI / PVC reconciler) and will be added as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on-Lance R

Per #2 stage 1: helper for materializing an arbitrary DataFrame to a temp Lance dataset before the existing indexed NearestByJoin pipeline. R can be parquet, delta, in-memory, or the result of an arbitrary upstream Spark plan; the helper writes it once and returns a URI the existing LanceProbeStage / LanceMaterializeStage consume unchanged.

- LanceTempR.materialize(right, vecCol, projection, scratchDir): String
  Synthesises a unique _rid via monotonically_increasing_id(), projects rid + vec + caller-requested payload cols, writes to a unique sub-path under scratchDir.
- LanceTempR.resolveScratchDir(spark): String
  Reads spark.lance.knn.tempR.dir; in cluster mode (master != local*), requires it to be set so the temp lands on a path every executor can see (s3://..., hdfs://..., file:///shared-mount/...). Local mode falls back to spark.local.dir + /lance-temp-r.

Validation:
- Round-trip: row count + rid uniqueness + vector column equality
- Projection: temp schema is exactly rid + vec + requested cols
- Subplan-backed sources (Filter+Project chain over parquet): same shape
- Empty source: empty Lance dataset, no error
- Validation: missing vec, unknown projection, reserved rid name → fail fast
- resolveScratchDir: conf-key honoured; local-mode fallback writes correctly

71/71 tests pass (60 existing + 11 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
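The fail-fast checks and the "rid + vec + requested cols" projection described in this commit might look roughly like the sketch below. It works on column names only so it runs without Spark; `TempRSchemaSketch` and `plannedColumns` are hypothetical names, the error messages are invented, and reading "reserved rid name" as "the source already has a `_rid` column" is an assumption.

```scala
// Hypothetical sketch of the validation done before any temp write starts.
object TempRSchemaSketch {
  val RidCol = "_rid"

  /** Returns the temp dataset's column order, or throws IllegalArgumentException. */
  def plannedColumns(
      sourceCols: Seq[String],
      vecCol: String,
      projection: Seq[String]): Seq[String] = {
    require(sourceCols.contains(vecCol), s"vector column '$vecCol' not found on the right side")
    // Assumption: a source column that collides with the synthesized rid is rejected.
    require(!sourceCols.contains(RidCol), s"right side already has reserved column '$RidCol'")

    val unknown = projection.filterNot(c => sourceCols.contains(c))
    val unknownNames = unknown.mkString(", ")
    require(unknown.isEmpty, s"unknown projection columns: $unknownNames")

    // rid first, then the vector, then payload columns without repeating the vector.
    (Seq(RidCol, vecCol) ++ projection.filterNot(_ == vecCol)).distinct
  }
}
```

Running every check before the write means a bad projection costs nothing on scratch storage.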

Per #2 stage 2: the df.kNearestJoin(rightDf, ...) extension now accepts any DataFrame on the right side, not only Lance scans. When the right side is not a Lance scan, the extension materializes it once via LanceTempR.materialize() and routes the existing probe pipeline against the temp URI. Same data path on the wire.

Behavior change:
    Before: parquet / in-memory / subplan R → IllegalArgumentException
    After: same inputs → temp Lance materialization → indexed kNN works

Lance scans still take the existing fast path (no temp write).

extractLanceUri now returns Option[(String, Option[Long])] instead of throwing on miss; callers fall through to materializeNonLanceR, which:
- Calls LanceTempR.resolveScratchDir to find a writable scratch dir (spark.lance.knn.tempR.dir is required in cluster mode; local mode falls back to spark.local.dir)
- Materializes via LanceTempR.materialize with the user-specified rightProjection (or all of R's non-vector columns if rightProjection is None)

Tests: replaced the three "throws on non-Lance R" cases with three positive oracle-equivalence tests covering parquet R, subplan-backed R (parquet → Filter → Project), and in-memory + alias-wrapped R. Lance-scan happy path and Filter-on-Lance unchanged.

71/71 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tasets

Per #2 stage 3: without lifecycle management, every kNearestJoin against a non-Lance R leaks a Lance dataset on whatever scratch storage spark.lance.knn.tempR.dir points at (local FS, S3, HDFS, ABFS) until the JVM dies.

This commit adds:
- LanceTempLifecycle.register(spark, tempUri)
  Tracks the URI for cleanup. Idempotent (dedupes via LinkedHashSet). Invoked automatically from LanceTempR.materialize at the end of every successful write.
- SparkListenerApplicationEnd cleanup path
  Per-app SparkListener; on application end, deletes all registered URIs for that app. Routes through Hadoop FileSystem.get(uri, conf) so it handles local/s3/hdfs/abfs uniformly. Best-effort: errors are logged and swallowed so cleanup can't break the user's session teardown.
- JVM shutdown-hook fallback
  Single hook installed once per JVM, runs every app's cleanup on Runtime.shutdown; covers crashes / hard kills.

Why not onJobEnd: a single kNearestJoin invocation runs multiple Spark jobs (write + probe + merge + materialize). onJobEnd would race the still-running probe and break correctness. onApplicationEnd is the right scope.

Tests (6 cases): explicit cleanup deletes from disk, multi-URI cleanup, idempotent registration, SparkListenerApplicationEnd-triggered cleanup, deleteUri on non-existent path is a no-op, deleteUri null/empty is a no-op.

77/77 tests pass overall (71 + 6 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
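The registry behaviour described in this commit (idempotent registration, best-effort deletion, no-op on missing or empty paths) can be illustrated with a local-filesystem-only sketch. `TempLifecycleSketch` is a hypothetical name; it uses `java.io.File` where the commit describes Hadoop `FileSystem`, and it leaves out the SparkListener and shutdown-hook wiring entirely.

```scala
import java.io.File
import scala.collection.mutable

// Hypothetical, local-FS-only stand-in for the registry part of LanceTempLifecycle.
object TempLifecycleSketch {
  // LinkedHashSet dedupes repeated registrations and keeps insertion order.
  private val registered = mutable.LinkedHashSet.empty[String]

  def register(path: String): Unit = synchronized { registered += path }

  def registeredCount: Int = synchronized { registered.size }

  /** Deletes every registered path, then forgets them. */
  def cleanupAll(): Unit = synchronized {
    registered.foreach(deletePath)
    registered.clear()
  }

  /** Best-effort delete: null, empty, and missing paths are no-ops; errors are logged. */
  def deletePath(path: String): Unit =
    try {
      if (path != null && path.nonEmpty) {
        val f = new File(path)
        if (f.exists()) deleteRecursively(f)
      }
    } catch {
      case e: Exception =>
        System.err.println(s"temp cleanup failed for $path: ${e.getMessage}")
    }

  private def deleteRecursively(f: File): Unit = {
    // listFiles() returns null for non-directories, hence the Option wrapper.
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }
}
```

Swallowing exceptions in `deletePath` mirrors the stated goal that cleanup must never break session teardown.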

…Lance

Per #2 stage 4: extend IndexedNearestByJoinRule (Spark 4.2 SQL path) so a NearestByJoin whose right side isn't a Lance scan can also be rewritten to the indexed path. The rule materializes the right plan to a temp Lance dataset at rule-application time via LanceTempR.materialize, then proceeds with the same staged-plan rewrite as for Lance R.

- New conf TempRForSqlEnabledConfKey = "spark.lance.knn.tempRForSqlRule.enabled". Off by default. Two reasons it's separate from the main enabled flag:
  1. The rule evaluates the right plan synchronously at analysis time, so users should consciously accept the cost
  2. Cluster mode requires spark.lance.knn.tempR.dir to be set; surfacing the failure behind an explicit opt-in is friendlier than failing on every NearestByJoin
- rewriteIfApplicable's for-comprehension changes: recognize ranking BEFORE attempting right-side resolution so we don't pay a temp materialization for queries that fall through anyway (wrong direction, mixed-side rank expression, etc.).

    unwrapLanceScan(right).orElse {
      if (tempRForSqlEnabled) materializeNonLanceR(right, rightVecCol)
      else None
    }

- materializeNonLanceR: wraps the right plan as a DataFrame via LanceKnnDatasetBridge.asDataFrame, calls LanceTempR.materialize with right.output.map(_.name) as projection (carry every right-side attribute the parent plan can reference), and synthesises a LanceScanInfo whose `output` reuses right.output's AttributeReferences so the top-level Project(j.output, ...) stays resolved. Any failure → return None and fall through to brute-force.

Tests (3 new, 17 total in IndexedNearestByJoinRuleTest):
- testTempRForSqlRewritesNonLanceR: parquet R + both flags on → rewrites to Project(LanceMaterialize(...))
- testTempRForSqlRequiresMainEnabledFlag: both flags must be on; the temp-R flag alone doesn't fire the rule
- testNonLanceRWithoutTempRConfFallsThrough: pins existing behavior; without the temp-R conf, parquet R falls through

77/77 tests pass in lance-spark-knn_2.12, 20/20 in lance-spark-knn-4.2_2.13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
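The ordering argument in this commit (recognize ranking first, then resolve the right side, and only materialize when opted in) can be shown with plain `Option` values. This is a toy model under stated assumptions: `RightPlan`, `LanceScan`, `OtherPlan`, and `resolveRightUri` are hypothetical stand-ins for the Catalyst types and rule internals, and the materializer is passed in as a function.

```scala
// Toy model of the rule's right-side resolution; not the actual Catalyst code.
object RightSideResolutionSketch {
  sealed trait RightPlan
  final case class LanceScan(uri: String) extends RightPlan
  final case class OtherPlan(description: String) extends RightPlan

  def unwrapLanceScan(right: RightPlan): Option[String] = right match {
    case LanceScan(uri) => Some(uri)
    case _              => None
  }

  /** None means "leave the plan alone and let the brute-force rewrite handle it". */
  def resolveRightUri(
      rankingRecognized: Boolean,
      right: RightPlan,
      tempRForSqlEnabled: Boolean,
      materialize: RightPlan => Option[String]): Option[String] =
    for {
      // Ranking is checked first so a query that cannot be rewritten
      // never pays for a temp materialization.
      _ <- if (rankingRecognized) Some(()) else None
      uri <- unwrapLanceScan(right).orElse {
        if (tempRForSqlEnabled) materialize(right) else None
      }
    } yield uri
}
```

Because `orElse` takes its argument by name, the materializer runs only when the right side is not already a Lance scan.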

Config D (`kNearestJoin(parquetDf, ...)`) exercises the full public-API code path added in stages 2-4: the extension internally hits LanceKnnImplicits.materializeNonLanceR -> LanceTempR.materialize -> existing probe pipeline.

Local validation, M5 Max, tiny scale (3 reps + 1 warmup):

A: Spark crossJoin + min_by_k (parquet R)      28,000 ms     1.0×
B: temp Lance write + kNearestJoin (manual)       322 ms    86.9×
C: Lance-native R + kNearestJoin (reference)      261 ms   107.3×
D: kNearestJoin(parquetDf), built-in temp         319 ms    87.8×

(B - D) = 3 ms, which is within run-to-run noise. The public API does the same work as the manually spelled-out B, with no extra overhead.

The new explain(extended=true) dump (head-scale only) confirms:
- Probe and Materialize URI both point at the temp Lance dir
- Full LanceProbe -> Exchange -> LanceMerge -> LanceMaterialize chain in the executed plan
- Wrapped by AdaptiveSparkPlan (AQE-visible merge shuffle)
- Left side is unmodified (only R goes through temp materialization)

Lifecycle: zero leakage observed. Earlier test runs from before stage 3 left orphaned temp dirs in spark.local.dir/lance-temp-r/ (no lifecycle existed yet to clean them); fresh runs of LanceTempRTest + LanceTempLifecycleTest after stage 3 produce delta=0 in that directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two improvements per review feedback on #2:

1. Schema validation before triggering the temp Lance write

   LanceTempR.checkSupported(schema) returns Some(reason) when any column in the projected schema (rid + vec + payload) is not Lance-writable. The conservative allow-list covers numerics, boolean, string, binary, date, timestamp, struct (recursive), array (recursive). Rejects MapType, NullType, and unrecognised types with a clear "column X has type Y" message.

   Caller-specific behaviour:
   - DataFrame API (kNearestJoin): LanceTempR.materialize throws IllegalArgumentException, which surfaces to the user. They asked for it explicitly, so a clear failure is the right answer.
   - Catalyst rule (SQL APPROX NEAREST): the rule's materializeNonLanceR calls checkSupported BEFORE doing any work and returns None on miss, making the rule fall through to Spark's brute-force RewriteNearestByJoin. The user's query still runs, just slowly. Same "refusal not partial" pattern as the existing prefilter-pushdown.

2. Same-path regression test in LanceKnnImplicitsTest

   testProbeAndMaterializeShareSameTempUri walks the analyzed plan of a kNearestJoin against a non-Lance R, finds the LanceProbeLogicalPlan and LanceMaterializeLogicalPlan nodes, and asserts both stage configs reference the SAME temp Lance URI. Future regressions where helper / implicits / IndexedNearestJoin.apply diverge produce a fast structural failure instead of silent wrong results from a probe-vs-materialize URI mismatch.

Tests added (8 new):
- 5 in LanceTempRTest: checkSupported accepts common types; rejects Map; rejects array-of-Map (recursive); rejects struct-with-Map (recursive); materialize() throws on unsupported projection
- 2 in LanceKnnImplicitsTest: testProbeAndMaterializeShareSameTempUri, testKNearestJoinRejectsUnsupportedColumnType
- 1 in IndexedNearestByJoinRuleTest: testTempRForSqlFallsThroughOnUnsupportedSchema

84/84 in lance-spark-knn_2.12 (was 77; 5 + 2 = 7 new). 21/21 in lance-spark-knn-4.2_2.13 (was 20; 1 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
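The recursive allow-list described for checkSupported can be sketched with a small stand-in type model. Everything here is illustrative: `ColType` and its cases are hypothetical replacements for Spark's `DataType` hierarchy (only a few of the allowed scalar types are modelled), and the nested-path naming in the message is an assumption beyond the "column X has type Y" wording quoted above.

```scala
// Hypothetical mini type model; the real check would walk Spark's DataType tree.
object CheckSupportedSketch {
  sealed trait ColType
  case object IntT extends ColType
  case object StringT extends ColType
  case object NullT extends ColType
  final case class ArrayT(element: ColType) extends ColType
  final case class MapT(key: ColType, value: ColType) extends ColType
  final case class StructT(fields: Seq[(String, ColType)]) extends ColType

  /** Some(reason) for the first unsupported column, None when everything is writable. */
  def checkSupported(schema: Seq[(String, ColType)]): Option[String] =
    firstReason(schema, prefix = "")

  private def firstReason(fields: Seq[(String, ColType)], prefix: String): Option[String] =
    fields.iterator
      .map { case (name, t) => reasonFor(prefix + name, t) }
      .collectFirst { case Some(reason) => reason }

  private def reasonFor(path: String, t: ColType): Option[String] = t match {
    case IntT | StringT  => None                             // allow-listed scalars
    case ArrayT(element) => reasonFor(path + ".element", element) // recurse into arrays
    case StructT(fields) => firstReason(fields, path + ".")  // recurse into structs
    case other           => Some(s"column $path has type $other") // Map, Null, unknown
  }
}
```

Returning `Option[String]` instead of throwing lets each caller pick its own policy: the DataFrame API can raise, while the Catalyst rule can quietly fall through.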

claude added 7 commits May 21, 2026 07:36

sezruby wants to merge 7 commits into knn-phase0 from knn-temp-r


2 participants
