Skip to content

feat: add coverage threshold to skip Comet for low-coverage queries#3816

Open
andygrove wants to merge 7 commits intoapache:mainfrom
andygrove:feat/coverage-threshold
Open

feat: add coverage threshold to skip Comet for low-coverage queries#3816
andygrove wants to merge 7 commits intoapache:mainfrom
andygrove:feat/coverage-threshold

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Mar 27, 2026

Which issue does this PR close?

Part of #833

Rationale for this change

When Comet can only convert a small fraction of a query's operators, the overhead from Spark to Comet transitions can outweigh the benefit of native execution. This adds a coarse safety valve so users can opt into "skip Comet entirely if coverage is below X" without having to disable Comet globally.

What changes are included in this PR?

  • Add spark.comet.exec.coverageThreshold (double, 0.0 to 1.0, default 0.0 = disabled). When the fraction of converted operators in the final plan is below the threshold, CometExecRule returns the original Spark plan instead of the converted one.
  • Decide once before AQE starts executing stages. The threshold check is gated on the canonical first pass (the injectQueryStagePrepRule registration in AQE mode, or preColumnarTransitions in non-AQE mode). AQE per-stage preColumnarTransitions re-entries skip the check.
  • Make the decision sticky. On fallback, every node of the plan tree is tagged with a new SKIP_COMET_PLAN_TAG so any subsequent rule application (e.g. AQE per-stage on a sub-plan) honors the prior decision instead of re-evaluating coverage on a smaller view of the plan.
  • Surface the reason via withInfo. When the threshold triggers, the reason string ("Comet native coverage X% is below threshold Y% (...)") is attached to the plan and shows up in extended explain output alongside other Comet fallback reasons.
  • Update the tuning guide with the AQE behavior and limitations.

Limitations

This is a coarse safety valve, not a cost model. Worth being explicit about what it does not capture:

  • Operator-count metric. Each operator counts equally regardless of cost. A plan with one expensive Spark scan feeding many cheap native projections looks well-covered even though the slow scan dominates runtime.
  • No row or byte weighting. A native node consuming millions of rows and a Spark node consuming a handful are counted the same.
  • Transitions are not part of the metric. A plan can have full operator coverage and still incur many transitions. The transition count is reported in the warning string for diagnostic use only.
  • Single global value. Same threshold applies to every query in the session. There is no per-query override.

The default of 0.0 keeps behavior unchanged for existing users.

How are these changes tested?

New tests in CometExecRuleSuite:

  • default 0.0 disables the check
  • coverage above the threshold passes through the conversion
  • coverage below the threshold returns the original Spark plan
  • the fallback reason is recorded in the plan's extension info for explain output
  • a prior fallback decision is sticky on AQE per-stage re-entry (re-applying the rule with applyThresholdCheck = false and the threshold reset to 0 still leaves the plan as Spark, demonstrating the AQE per-stage sentinel)

Add spark.comet.exec.coverageThreshold config (0.0-1.0, default 0.0)
that reverts to the original Spark plan when the fraction of converted
operators falls below the threshold. This avoids the overhead of
Spark-to-Comet transitions for queries where only a small percentage
of operators can run natively.
Extract classifyNode() to eliminate duplicated match logic between
generateTreeString and collectStats. Replace coveragePercent with
coverageFraction (0.0-1.0) to match the threshold config units and
avoid unnecessary multiply/divide conversions.
@andygrove andygrove marked this pull request as ready for review May 8, 2026 21:27
@andygrove andygrove marked this pull request as draft May 8, 2026 21:29
andygrove added 3 commits May 8, 2026 15:29
The merge with apache/main introduced two object CometCoverageStats
declarations (forPlan from main, fromPlan from this branch). Drop the
local fromPlan variant and reuse the existing forPlan helper.
…a withInfo

Apply the threshold check on the canonical first pass only: the
queryStagePrepRule registration in AQE mode, or preColumnarTransitions
in non-AQE mode. AQE per-stage re-entries skip the check, and a sticky
SKIP_COMET_PLAN_TAG on every node ensures any fallback decision survives
into per-stage rule applications so we don't flip-flop.

When falling back, attach the reason to the plan via withInfo so it
shows up in extended explain output alongside other Comet fallback
reasons.

Add CometExecRuleSuite tests covering: default disabled, fallback above
coverage, conversion below threshold, reason in extension info, and AQE
stickiness on re-entry.
Replace the in-method early-return on threshold fallback with a helper
that returns either the converted or the original plan as the value of
the if/else expression.

Expand the tuning guide with a section on AQE behavior and the known
limitations of an operator-count metric (no cost weighting, no row/byte
weighting, transitions excluded, single global value).
@andygrove andygrove marked this pull request as ready for review May 8, 2026 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant