feat: add GroupsAccumulator for variance, stddev, covariance, correlation#4254
feat: add GroupsAccumulator for variance, stddev, covariance, correlation#4254andygrove wants to merge 12 commits intoapache:mainfrom
Conversation
beforeafter |
Should we try to wire DF aggregation instead of Comet implementation? or there are any Spark inconsistencies? |
They are not compatible the last time I checked - signed vs unsigned types, IIRC |
would be good to upstream these to datafusion-spark and then see if they could share common code with the core impl |
|
Thanks for picking this up, @andygrove! The speedups are really nice to see and the refactor into Correlation update_batch maskIn Correlation evaluate doing three finalizes
finalize duplication between variance and covariance
Stddev evaluate allocationIn Covariance argument count
Redundant null_on_divide_by_zero on Correlation
DataFusion accumulate helpersHave you looked at Stddev test coverageVariance has CometBenchmarkBase shuffle managerThe Overall very nice work, the unit tests do a good job of pinning down the Spark-specific edges and the state ordering comments on correlation are helpful for future readers. |
mbutrovich
left a comment
There was a problem hiding this comment.
I guess my comments above count as changes requested. Thanks @andygrove!
Which issue does this PR close?
Closes #1627
Closes #4249
Rationale for this change
Detailed benchmark results are posted in a comment on this PR.
DataFusion's grouped aggregate operator has a fast path that uses
GroupsAccumulatorinstead of oneAccumulatorper group. Comet'sexisting variance, stddev, covariance, and correlation aggregates only
implement the per-row
Accumulatortrait, so grouped queries fall backto a slower row-at-a-time path. Issue #1275 measured a roughly 4x
speedup from applying the same change to
avg. After this PR the fourremaining Comet-owned aggregates are on parity with
avg/sum_int/
sum_decimal/avg_decimal, which already implement the fast path.What changes are included in this PR?
agg_funcs/welford.rswith the Welford update/merge math used byboth the per-row and grouped paths. The existing
VarianceAccumulatorand
CovarianceAccumulatorper-row methods now route through thesehelpers (no behavior change).
VarianceGroupsAccumulator,StddevGroupsAccumulator,CovarianceGroupsAccumulator, andCorrelationGroupsAccumulator.Each
AggregateUDFImplnow returnstruefromgroups_accumulator_supportedand constructs the matching groupedaccumulator. Stddev wraps variance; correlation wraps one covariance
plus two variance group accumulators, mirroring the existing
Accumulatorcomposition.f64counts) andSpark semantics (
null_on_divide_by_zerofor thecount == 1branches).How are these changes tested?
correctness, null handling on each input column (correlation needs
both columns non-null per row),
opt_filtermasking, multi-batchmerge equals single-shot, and the empty-group-yields-null and
single-row-sample edge cases.
aggregate, so existing JVM coverage in
CometAggregateSuiteexercisesthe new code end-to-end. CI runs the full sweep across Spark 3.4,
3.5, and 4.0.
This PR was scaffolded with the project's superpowers:brainstorming and
superpowers:writing-plans skills.