feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs #4267

Draft

mbutrovich wants to merge 22 commits into apache:main
Conversation

mbutrovich (Contributor, Author) commented:

> There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward.
Draft while we discuss with #4233 and #4239.
Which issue does this PR close?
Closes #.
Rationale for this change
#4232 merged the JVM UDF bridge: a JNI path that lets native execution call `CometUDF` implementations on the JVM, with Arrow FFI for data exchange. One way to extend it is to write more `CometUDF`s per expression, as #4239 does for the remaining Spark regex family. Another way is to expose the bridge to end users so they can register their own, as #4233 does. Both paths require a hand-written Arrow-vector implementation per expression, and both require it for every arbitrary `ScalaUDF` or Catalyst expression that a user wants on the native path.

This PR proposes a different approach on top of #4232: a codegen-based dispatcher that compiles a batch-kernel `CometUDF` directly from a bound Catalyst `Expression` via Janino. Any expression whose children and output type are supported routes through native without a hand-written `CometUDF`. One dispatcher covers user `ScalaUDF`s, regex expressions, and any other Catalyst expression that Spark's codegen can already emit, with no per-expression glue.

Operating on bound Catalyst `Expression` trees affects three parts of the system:

- Extending coverage means extending `CometBatchKernelCodegen.canHandle`, not writing a new `CometUDF` class plus its serde.
- Built-in expressions, registered `ScalaUDF`s, and unregistered `ScalaUDF`s live in the same tree and are serialized together, so Comet keeps the surrounding native operators in place.
- A composed subtree (`f(g(a), h(b))` or `upper(udf(x))`) compiles into a single generated batch kernel with one per-row loop. Every expression in the subtree runs on row `i` before the loop advances to row `i+1`. Intermediate values are local variables on the stack rather than Arrow vectors sized to the full batch. The hand-coded `CometUDF` path cannot fuse across UDF boundaries because the `evaluate(inputs: Array[ValueVector], numRows: Int): ValueVector` signature requires each UDF to consume and return fully-materialized Arrow vectors, so stacking two hand-coded UDFs produces two JNI calls and one intermediate vector of `numRows` elements between them. The dispatcher avoids that because it sees the Catalyst subtree as source. Spark's `CodegenContext` and `doGenCode` inline each child's generated code into its parent's, so the whole subtree flattens into one straight-line sequence per row. This is the same machinery that powers WholeStageCodegen, so the per-row fusion is reused rather than written here (see the sketch after this list).
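To make the fusion concrete, here is a minimal, spark-shell-style illustration (not code from this PR) of how a bound Catalyst expression flattens into one straight-line per-row snippet through Spark's own codegen; the kernel generator then wraps that snippet in a single batch loop:

```scala
import org.apache.spark.sql.catalyst.expressions.{BoundReference, Upper}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
import org.apache.spark.sql.types.StringType

// Bind column 0 and wrap it in Upper; genCode inlines the child's generated
// code into the parent's, producing one fused per-row body.
val bound = Upper(BoundReference(0, StringType, nullable = true))
val ctx = new CodegenContext
val ev = bound.genCode(ctx)

// ev.code is the fused per-row snippet; ev.isNull / ev.value name its outputs.
// A batch kernel wraps it roughly as:
//   for (int i = 0; i < numRows; i++) { <ev.code>; <write ev.value to row i>; }
println(ev.code)
```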
The dispatcher is opt-in and gated by `spark.comet.exec.codegenDispatch.mode = auto | force | disabled`.

Intended scope is narrower than "any expression". The primary targets are string expressions, where JVM and Rust differ on collation and regex engine semantics, and custom `ScalaUDF`s, where no Rust implementation exists. For numeric and other expression families with native Rust kernels, the native path is almost certainly faster and this dispatcher is not meant to replace it.
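For example, the mode can be switched per session through the ordinary Spark conf (the key is the one introduced in this PR; `spark` is assumed to be a `SparkSession`):

```scala
// Route every supported expression through the codegen dispatcher.
spark.conf.set("spark.comet.exec.codegenDispatch.mode", "force")

// Or turn the dispatcher off so Spark (or Comet's native kernels) handles them.
spark.conf.set("spark.comet.exec.codegenDispatch.mode", "disabled")
```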
What changes are included in this PR?

Codegen dispatcher
- `CometBatchKernelCodegen` - compiles a bound `Expression` into a specialized `CometUDF` via Janino. Object Scaladoc covers caching, CSE variant choice, and the full optimization menu.
- `CometCodegenDispatchUDF` - the bridge's `CometUDF`. Carries the expression as serialized bytes. Three-layer cache (JVM-wide compile, per-thread UDF instance, per-partition kernel instance) described in its Scaladoc; a rough structural sketch follows this list.
- `CometInternalRow` - Arrow-vector-backed `InternalRow` that Spark's `BoundReference.genCode` reads through.
- `CometArrayData` - Arrow-vector-backed `ArrayData` shim that Spark's `BoundReference.genCode` uses for `getArray(ord)` calls. One codegen-emitted final subclass per array-typed input column, specialized on the element type.
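The sketch below only illustrates the layering idea behind that three-layer cache; the key, type, and lifecycle names here are hypothetical placeholders, not the PR's code.

```scala
import java.util.concurrent.ConcurrentHashMap

// Placeholder types for illustration only.
final case class CacheKey(exprBytes: Seq[Byte], schemaFingerprint: String)
trait CompiledKernel  // stands in for the Janino-compiled class
trait KernelInstance  // stands in for a per-partition kernel instance

object DispatchCachesSketch {
  // Layer 1: JVM-wide cache of compiled classes, keyed by (expression, schema).
  val compileCache = new ConcurrentHashMap[CacheKey, CompiledKernel]()

  // Layer 2: per-thread UDF-side state, so concurrent tasks never share mutable state.
  val perThread: ThreadLocal[java.util.HashMap[CacheKey, KernelInstance]] =
    ThreadLocal.withInitial(() => new java.util.HashMap[CacheKey, KernelInstance]())

  // Layer 3: a per-partition kernel instance would be created from the compiled
  // class when a partition starts and dropped when it finishes.
}
```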
Complex type support (ArrayType)

- `ArrowColumnSpec` is a sealed trait with `ScalarColumnSpec` and `ArrayColumnSpec(nullable, elementSparkType, element)`. The `element` is itself an `ArrowColumnSpec`, so nested shapes (`Array<Array<...>>`) fall out of the recursion. `Map` and `Struct` cases will plug into the same trait in follow-up work without disturbing callers. A minimal shape sketch follows this list.
- `emitWrite` emits a `ListVector.startNewValue` / element loop / `endValue` triple per row, with the per-element write recursing through `emitWrite` on the list's child vector.
- `allocateOutput` allocates the `ListVector` with its inner typed data vector, pre-sized from the input's data-buffer estimate.
- One `InputArray_colN` final class per array-typed input column, extending `CometArrayData`. Each class holds `startIndex` / `length` state reset per row from the outer `ListVector`'s offsets; element reads go through the typed child-vector field with zero allocation (`UTF8String.fromAddress` for string elements, decimal128 short-precision fast path for `DecimalType` with `p <= 18`, primitive direct for others). The kernel's `getArray(ord)` switch resets the pre-allocated instance and returns it.
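A minimal sketch of that trait shape, assuming constructor fields as described above (illustrative only; field names on `ScalarColumnSpec` are assumptions):

```scala
import org.apache.spark.sql.types.DataType

// Describes how one input/output column maps onto Arrow vectors.
sealed trait ArrowColumnSpec {
  def nullable: Boolean
}

// A flat column: one Spark type, one Arrow vector.
final case class ScalarColumnSpec(sparkType: DataType, nullable: Boolean) extends ArrowColumnSpec

// An array column: the element is itself a spec, so Array<Array<...>> recurses naturally.
final case class ArrayColumnSpec(
    nullable: Boolean,
    elementSparkType: DataType,
    element: ArrowColumnSpec) extends ArrowColumnSpec
```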
Optimizations applied in the generated kernel

Compile-time specialized per `(expression, input schema)` pair. The generated Java carries only the chosen path at each emission site. Full enumeration, triggers, and code anchors live in the object-level Scaladoc menu on `CometBatchKernelCodegen`. Categories:

- Input reads: zero-copy UTF8 reads from `VarCharVector` / `ViewVarCharVector`, non-nullable `isNullAt` elision, decimal short-value fast path for `p <= 18`.
- Output writes: decimal fast path for `p <= 18` (`toUnscaledLong` + `DecimalVector.setSafe(int, long)`), UTF8 on-heap shortcut (pass `UTF8String`'s backing `byte[]` directly, skip the redundant `getBytes()` allocation), pre-sized output buffers derived from input data-buffer sizes. See the decimal write sketch after this list.
- Evaluation: `NullIntolerant` short-circuit, non-nullable output short-circuit, subexpression elimination (class-field variant).
- Per-expression specialization: `RegExpReplace` with direct-column subject and foldable pattern / replacement bypasses the `UTF8String` round-trip that `java.util.regex.Matcher`'s `CharSequence` requirement would otherwise force.

Each optimization has a source-level activation assertion in `CometCodegenSourceSuite`. Smoke and fuzz tests cover correctness end-to-end.
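As an illustration of the decimal output fast path named above, here is a hedged, hand-written sketch (not the generated code), assuming a Spark `Decimal` value and an Arrow `DecimalVector` output:

```scala
import org.apache.arrow.vector.DecimalVector
import org.apache.spark.sql.types.Decimal

object DecimalWriteSketch {
  // Write one decimal to the output vector: the long-based path when the
  // precision fits in 18 digits, the BigDecimal path otherwise.
  def writeDecimal(out: DecimalVector, rowIdx: Int, d: Decimal, precision: Int): Unit = {
    if (precision <= 18) {
      // Unscaled value fits in a long: no BigDecimal object per row.
      out.setSafe(rowIdx, d.toUnscaledLong)
    } else {
      out.setSafe(rowIdx, d.toJavaBigDecimal)
    }
  }
}
```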
Per-expression specialized path

When the default `doGenCode` output pays a measurable penalty because of conversions an Arrow-aware byte-oriented loop would skip, the dispatcher supports emitting custom Java for that expression while staying inside the framework (same cache, same schema-keying, same serde entry). `RegExpReplace` is the current example. The infrastructure is structured so future specializers can land alongside. See `specializedRegExpReplaceBody` for the emit and the criteria for adding new specializers.
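To show the kind of shortcut such a specializer can take, here is a standalone, simplified sketch (not the emitted code, and ignoring Spark's exact `regexp_replace` replacement-escaping semantics): compile the foldable pattern once per batch and feed `Matcher` a `String` built directly from the Arrow bytes, instead of materializing a `UTF8String` per row first.

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern
import org.apache.arrow.vector.VarCharVector

object RegexReplaceBatchSketch {
  // Replace `regex` with `replacement` in every row of `subject`, writing to `out`.
  def replaceBatch(subject: VarCharVector, out: VarCharVector,
                   numRows: Int, regex: String, replacement: String): Unit = {
    val pattern = Pattern.compile(regex) // foldable pattern: compiled once
    var i = 0
    while (i < numRows) {
      if (subject.isNull(i)) {
        out.setNull(i)
      } else {
        // String straight from the Arrow bytes; no intermediate UTF8String.
        val s = new String(subject.get(i), StandardCharsets.UTF_8)
        val r = pattern.matcher(s).replaceAll(replacement)
        out.setSafe(i, r.getBytes(StandardCharsets.UTF_8))
      }
      i += 1
    }
    out.setValueCount(numRows)
  }
}
```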
Bridge contract

- `numRows` parameter on `CometUDF.evaluate`. Mirrors DataFusion's `ScalarFunctionArgs.number_rows`. Needed for zero-column expressions where no input vector carries batch size.
- `TaskContext` parameter. `CometExecIterator` captures the Spark task thread's `TaskContext` at `createPlan` time; it is stashed as a `GlobalRef` in the native `ExecutionContext`, threaded into each `JvmScalarUdfExpr`, and installed as the thread-local in the bridge in a try/finally so Tokio workers see a live `TaskContext` instead of null. Access to `protected[spark] TaskContext.setTaskContext` / `unset` goes through `CometTaskContextShim` in `org.apache.spark.comet`. Fixes correctness for partition-sensitive built-ins inside UDF trees (`Rand`, `Uuid`, `MonotonicallyIncreasingID`) and any user UDF that calls `TaskContext.get()`. An install/restore sketch follows this list.
- Updated JNI entry-point signature: `(Ljava/lang/String;[J[JJJILorg/apache/spark/TaskContext;)V`. Native call site updated.
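A minimal sketch of that install/restore pattern (the `Shim` trait below stands in for `CometTaskContextShim`; its method names are assumptions, and the real shim lives under `org.apache.spark` precisely so it can reach the `protected[spark]` setters):

```scala
import org.apache.spark.TaskContext

object TaskContextHandoffSketch {
  // Stand-in for CometTaskContextShim; set/unset are assumed names.
  trait Shim {
    def set(tc: TaskContext): Unit
    def unset(): Unit
  }

  // Run `body` with `captured` installed as the thread-local TaskContext on the
  // current (Tokio worker) thread, restoring afterwards even if `body` throws.
  def withTaskContext[T](shim: Shim, captured: TaskContext)(body: => T): T = {
    shim.set(captured)
    try body
    finally shim.unset()
  }
}
```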
Serde routing

- `scalaUdf.scala` - routes any `ScalaUDF` through the codegen dispatcher, no registration step required.
- `strings.scala` - `CodegenDispatchSerdeHelpers.pickWithMode` gives every regex-family expression (`rlike`, `regexp_replace`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `split` via `StringSplit`) a uniform `auto | force | disabled` switch. `regexp_replace`, `rlike`, and `StringSplit` fall through to their existing native Rust paths when `regexp.engine=rust`; `disabled` mode returns `None` and Spark runs the expression. A hedged sketch of the switch follows this list.
- `CometConf.COMET_CODEGEN_DISPATCH_MODE` - the mode knob.
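One plausible shape of that switch, purely for orientation (the real `pickWithMode` signature and its exact `auto` preference order are not reproduced here; `NativeExpr` stands in for the serialized native expression type):

```scala
sealed trait DispatchMode
case object Auto extends DispatchMode
case object Force extends DispatchMode
case object Disabled extends DispatchMode

object PickWithModeSketch {
  def pickWithMode[NativeExpr](
      mode: DispatchMode,
      nativeRust: => Option[NativeExpr],       // existing Rust kernel, if any
      codegenDispatch: => Option[NativeExpr]   // Janino-compiled dispatcher kernel
  ): Option[NativeExpr] = mode match {
    case Disabled => None                               // Spark runs the expression
    case Force    => codegenDispatch
    case Auto     => nativeRust.orElse(codegenDispatch) // assumed preference order
  }
}
```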
Docs

- `docs/source/contributor-guide/jvm_udf_dispatch.md`.
- Scaladoc menu on `CometBatchKernelCodegen` listing every optimization with its trigger and code anchor. Single source of truth; the categories above are intentionally terse pointers.
- Inline notes on `canHandle`, cache-key hashing cost on `CometCodegenDispatchUDF.CacheKey`, a 64KB method-size note on `generateSource`, the WSCG-variant CSE discussion on the object Scaladoc, and full zero-copy UTF8 output deferred with justification next to the `StringType` writer case.

How are these changes tested?
- `CometCodegenDispatchSmokeSuite` - type-coverage with vector-signature assertions, composed-UDF tests (3-deep and multi-column), zero-column `ScalaUDF`, decimal precisions on each side of the `p = 18` boundary, subquery-reuse test, three `TaskContext`-propagation tests (`TaskContext.get().partitionId()` via `spark.range`, the same probe via a fully-native Parquet source, multi-partition `rand(seed)` composition), and ArrayType input / output end-to-end tests (`Seq[String]`, `Seq[Int]`, `Seq[BigDecimal]`, and array-returning UDFs).
- `CometCodegenSourceSuite` - generated-source assertions for every optimization in the menu: zero-copy UTF8 reads, non-nullable `isNullAt` elision, decimal input fast-path and slow-path emission, decimal output fast-path and slow-path emission, UTF8 output on-heap shortcut, `NullIntolerant` short-circuit, non-nullable output short-circuit and its nullable counterpart, CSE collapse with `Length` marker, CSE filter leaving `Add(Rand, Rand)` alone, specialized `RegExpReplace` emitter. Array coverage: `ListVector.startNewValue` / `endValue` emission for `ArrayType` output; `InputArray_colN` nested class with the right element-type getter for `ArrayType(StringType)` / `ArrayType(IntegerType)` / `ArrayType(DecimalType)` on both sides of the `p = 18` boundary.
- `CometCodegenDispatchFuzzSuite` - multi-column fuzz across the supported type matrix, plus decimal identity fuzz over the 18-digit boundary at several null densities.
- `CometRegExpJvmSuite` - SQL-level Spark-vs-Comet correctness suite for the regex family. Passes unchanged with the dispatcher in `auto` and `force`.
- `CometScalaUDFCompositionBenchmark` - four modes (Spark, Comet native built-ins, dispatcher `disabled`, dispatcher `force`) over three shapes. Numbers in the design doc.