
Support writing Arrow RecordBatchReader or Scanner to Iceberg tables #2152

@chitralverma

Description


Feature Request / Improvement

Summary

Please consider adding support in pyiceberg for writing data to Iceberg tables using streamable Arrow-native types such as:

  • pyarrow.RecordBatchReader
  • Iterator[pyarrow.RecordBatch]
  • pyarrow.RecordBatch
  • pyarrow.dataset.Scanner
  • pyarrow.Table (existing or fallback)

Operations could include:

table.append(record_batch_reader)
table.overwrite(scanner)
table.upsert(scanner, primary_keys=["id"])
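
A minimal sketch of how this could look end to end. The append-a-reader call is the proposed behavior (not current pyiceberg API), and the catalog name and table identifier are illustrative:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

def batches():
    # Yield batches one at a time so nothing is fully materialized
    for i in range(100):
        yield pa.record_batch([pa.array([i]), pa.array([f"row-{i}"])], schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, batches())

catalog = load_catalog("default")        # assumes a configured catalog named "default"
table = catalog.load_table("db.events")  # illustrative table identifier
table.append(reader)                     # proposed: accept a reader directly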

Motivation

Currently, writing data to Iceberg from Python requires materializing the entire dataset in memory (e.g., as a pyarrow.Table) before converting it to Parquet. This limits scalability and performance, especially for:

  • Large datasets that exceed memory
  • Incremental / streaming ingestion
  • Lazy pipelines using DuckDB, ADBC, or Scanner.from_batches(...)

RecordBatchReader and Scanner are both streamable abstractions ideal for these use cases.
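
For illustration, here is how such streamable sources typically arise today. fetch_record_batch is DuckDB's API for exporting results as a pyarrow.RecordBatchReader; the query path and batch size are just examples:

import duckdb
import pyarrow.dataset as ds

con = duckdb.connect()
# Stream query results as Arrow batches instead of one big table
res = con.execute("SELECT * FROM read_parquet('data/*.parquet')")
reader = res.fetch_record_batch(65_536)

# Wrap the same stream in a lazy Scanner (schema must be known up front)
scanner = ds.Scanner.from_batches(iter(reader), schema=reader.schema)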

Benefits

  • Enables lazy and streaming ingestion without requiring full materialization
  • Avoids intermediate Parquet files and temporary storage in large data pipelines
  • Enables clean integration with Arrow-native tools (e.g., ADBC, DuckDB, the pyarrow ecosystem)
  • Reduces unnecessary disk I/O and memory pressure

Related Context

  • delta-rs supports Arrow-native ingestion via ArrowStreamExportable, ArrowArrayExportable, and sequences of arrays
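
For comparison, a minimal sketch of the delta-rs equivalent: its write_deltalake function accepts a pyarrow.RecordBatchReader among other Arrow inputs (the table path and schema here are illustrative):

import pyarrow as pa
from deltalake import write_deltalake

schema = pa.schema([("id", pa.int64())])
reader = pa.RecordBatchReader.from_batches(
    schema,
    (pa.record_batch([pa.array([i])], schema=schema) for i in range(10)),
)
write_deltalake("./delta_table", reader, mode="append")  # streams batches into the table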

This feature would unlock efficient Python-native data ingestion workflows for Iceberg and align pyiceberg more closely with the rest of the Arrow ecosystem.
