Feature Request / Improvement
Summary
Please consider adding support in pyiceberg for writing data to Iceberg tables using streamable Arrow-native types such as:
- pyarrow.RecordBatchReader
- Iterator[pyarrow.RecordBatch]
- pyarrow.RecordBatch
- pyarrow.dataset.Scanner
- pyarrow.Table (existing support, kept as a fallback)
Operations could include (a usage sketch follows this list):
- table.append(record_batch_reader)
- table.overwrite(scanner)
- table.upsert(scanner, primary_keys=["id"])
Motivation
Currently, writing data into Iceberg via Python requires materializing data entirely in memory (e.g., via pyarrow.Table) and converting it to Parquet manually. This limits scalability and performance, especially for:
- Large datasets that exceed memory
- Incremental / streaming ingestion
- Lazy pipelines using DuckDB, ADBC, or Scanner.from_batches(...)
RecordBatchReader and Scanner are both streamable abstractions ideal for these use cases.
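As a concrete illustration, both abstractions can already be constructed lazily with existing pyarrow APIs; the limitation is that today the stream has to be fully materialized (e.g., via read_all()) before it can be handed to pyiceberg's existing pyarrow.Table-based write path:

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])

def batches():
    for i in range(3):
        yield pa.record_batch([pa.array([i]), pa.array([f"row-{i}"])], schema=schema)

# Both of these are lazy: no batch is produced until something consumes the stream.
reader = pa.RecordBatchReader.from_batches(schema, batches())
scanner = ds.Scanner.from_batches(batches(), schema=schema)

# Today the stream must be collapsed into a single in-memory pyarrow.Table
# before pyiceberg can write it, which defeats the purpose of a streaming source.
materialized = reader.read_all()  # pulls every batch into memory
# iceberg_table.append(materialized)
```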
Benefits
- Enables lazy and streaming ingestion, avoiding unnecessary disk I/O and memory pressure
- Avoids intermediate Parquet files and temporary storage, and supports large data pipelines without requiring full materialization
- Enables clean integration with Arrow-native tools (e.g., ADBC, DuckDB, the pyarrow ecosystem)
Related Context
- delta-rs supports Arrow-native ingestion via ArrowStreamExportable, ArrowArrayExportable, and sequences of arrays
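For comparison, a hedged sketch of the delta-rs pattern (the deltalake Python package), whose write_deltalake accepts Arrow stream exportables such as a RecordBatchReader; exact parameters may vary by version, and the table path is a placeholder:

```python
import pyarrow as pa
from deltalake import write_deltalake

schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])

def batches():
    for i in range(3):
        yield pa.record_batch([pa.array([i]), pa.array([f"row-{i}"])], schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, batches())

# delta-rs consumes the stream batch by batch; the data is never fully
# materialized as a single in-memory table on the Python side.
write_deltalake("path/to/delta_table", reader, mode="append")
```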
This feature would unlock efficient Python-native data ingestion workflows for Iceberg and align pyiceberg more closely with the rest of the Arrow ecosystem.