feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan by jordepic · Pull Request #4251 · apache/datafusion-comet

jordepic · 2026-05-06T22:43:52Z

Which issue does this PR close?

Closes #4250.

Rationale for this change

Across the industry, a large share of query resources goes to rewriting data files with Spark procedures for Iceberg tables. Using native code here where possible can significantly speed up this process!

What changes are included in this PR?

Detect the Spark scans (SparkStagedScan) that are created during RewriteDataFilesSparkAction and replace them with Comet scans. Extract their associated tasks and pass along the fact that no filters apply to the scan (see SparkStagedScan line 50 in the Apache Iceberg project).
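The detect-and-replace step described above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `PlanNode` is a hypothetical stand-in for a physical plan node, `CometNativeScanPlaceholder` is a made-up operator name, and the fully qualified class name of `SparkStagedScan` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class StagedScanRewriteSketch {
    // Assumed fully qualified name of Iceberg's staged scan class (illustrative only).
    static final String STAGED_SCAN = "org.apache.iceberg.spark.source.SparkStagedScan";
    // Hypothetical name for the native replacement operator.
    static final String NATIVE_SCAN = "CometNativeScanPlaceholder";

    /** Toy stand-in for a physical plan node; not a Spark or Comet type. */
    static final class PlanNode {
        final String operator;
        final String scanClassName; // null for operators that do not wrap a scan
        final List<PlanNode> children;

        PlanNode(String operator, String scanClassName, List<PlanNode> children) {
            this.operator = operator;
            this.scanClassName = scanClassName;
            this.children = children;
        }
    }

    /** Rebuilds the tree bottom-up, swapping any node that wraps a staged scan. */
    static PlanNode rewrite(PlanNode node) {
        List<PlanNode> newChildren = new ArrayList<>();
        for (PlanNode child : node.children) {
            newChildren.add(rewrite(child));
        }
        // Match on the class name so the check needs no compile-time Iceberg dependency.
        String operator = STAGED_SCAN.equals(node.scanClassName) ? NATIVE_SCAN : node.operator;
        return new PlanNode(operator, node.scanClassName, newChildren);
    }

    /** Renders the tree as nested operator names, e.g. A(B(C)). */
    static String describe(PlanNode node) {
        if (node.children.isEmpty()) {
            return node.operator;
        }
        StringBuilder sb = new StringBuilder(node.operator).append("(");
        for (int i = 0; i < node.children.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(describe(node.children.get(i)));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        PlanNode scan = new PlanNode("BatchScan", STAGED_SCAN, List.of());
        PlanNode sort = new PlanNode("Sort", null, List.of(scan));
        PlanNode plan = new PlanNode("AppendData", null, List.of(sort));
        System.out.println(describe(rewrite(plan)));
        // prints: AppendData(Sort(CometNativeScanPlaceholder))
    }
}
```

Only the scan node changes; the write side of the plan stays on Spark, which matches the scope of this PR.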

Note what is NOT included in this PR:
DataFusion Comet does not yet support writing Iceberg data. Once it does, we can "comet-ify" this whole pipeline!

How are these changes tested?

We add two tests that inspect the Spark plan produced when rewriting data files and verify that the scan operators are replaced. Before this change is merged, I can also run it locally and collect some benchmarks for table compactions (on tables with only data files, and on tables with associated delete files).

I have also added tests that check the compaction itself, not just operator replacement: they verify runtime compaction correctness for both bin packing and sorting!

mbutrovich · 2026-05-07T12:40:45Z

Spark 3.4 failed for a reflection access. IIRC we use an older version of Iceberg there, so signatures might have changed. We might need different reflection logic for older versions of Iceberg. I had to do that somewhere else in the reflection class, but can't recall what right now.
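One way version-tolerant reflection could look is sketched below. This is a minimal illustration under stated assumptions, not the logic in Comet's reflection class: `findFirstMethod` is a hypothetical helper, and `java.lang.String` stands in for an Iceberg class whose accessor name differs between releases.

```java
import java.lang.reflect.Method;
import java.util.Optional;

final class ReflectionFallbackSketch {
    /**
     * Returns the first public no-arg method found among the candidate names,
     * so callers can list the newer signature first and older ones after it.
     */
    static Optional<Method> findFirstMethod(Class<?> type, String... candidateNames) {
        for (String name : candidateNames) {
            try {
                return Optional.of(type.getMethod(name));
            } catch (NoSuchMethodException e) {
                // Not present in this library version; try the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // "noSuchAccessor" mimics a method that exists only in a newer release;
        // "isEmpty" mimics the older name that is actually available.
        Method method = findFirstMethod(String.class, "noSuchAccessor", "isEmpty")
                .orElseThrow(() -> new IllegalStateException("no supported accessor found"));
        System.out.println(method.getName() + " -> " + method.invoke(""));
        // prints: isEmpty -> true
    }
}
```

`getMethod` only sees public methods; reaching non-public members would need `getDeclaredMethod` plus `setAccessible(true)`, which is omitted here.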

mbutrovich · 2026-05-07T14:01:12Z

Thanks @jordepic! I will take a proper pass through this today.

jordepic force-pushed the main branch 3 times, most recently from 352861a to f1366a1, May 7, 2026 02:18

feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan

27a21ed

jordepic force-pushed the main branch from f1366a1 to 27a21ed, May 7, 2026 13:54

mbutrovich self-requested a review May 7, 2026 14:59

jordepic wants to merge 1 commit into apache:main from jordepic:main
