feat: column-major streaming Parquet writer primitive #6384
Summary
PR-2 of the streaming-merge stack. Lands a primitive in
`quickwit-parquet-engine/src/storage/streaming_writer.rs` that wraps
`SerializedFileWriter` directly (not `ArrowWriter`) and exposes a
per-column write API. Each column's chunk is appended and flushed
before the next column is written, so peak memory per row group is
bounded by one column's encoded chunk plus small bookkeeping (bloom
filters + page indexes) — not by the entire row group.
No production callers in this PR. PR-3 will cut ingest over to a
single-RG writer built on this primitive; PR-6 will cut the merge
engine over and add direct page copy.
Caller contract
The API is a 4-step state machine: `start_row_group` → `write_next_column` × N → `finish` → repeat per row group → `close`. Out-of-order calls (too many columns, `finish` before all columns are written, mismatched row counts across columns within an RG) return a structured `ParquetWriteError::SchemaValidation` rather than panicking. The compiler enforces single-RG-open via the lifetime on `RowGroupBuilder`.
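A minimal sketch of the happy path, under stated assumptions: `StreamingWriter` and the exact method signatures are stand-ins for this PR's API (only `try_new`, `RowGroupBuilder`, and the four contract methods are named in this description); the call sequence is the documented contract.

```rust
use std::fs::File;
use arrow_array::RecordBatch;

// Sketch only: `StreamingWriter` and the signatures below are assumed,
// not this PR's exact API.
fn write_batches(
    mut writer: StreamingWriter<File>,
    batches: &[RecordBatch], // one batch per row group
) -> Result<(), ParquetWriteError> {
    for batch in batches {
        // 1. start_row_group: the returned RowGroupBuilder borrows the
        //    writer, so opening a second row group cannot compile.
        let mut rg = writer.start_row_group()?;
        // 2. write_next_column × N, in schema order; each chunk is
        //    encoded and flushed before the next column is buffered.
        for column in batch.columns() {
            rg.write_next_column(column.clone())?;
        }
        // 3. finish: closes the row group, retaining only bloom-filter
        //    and page-index bookkeeping.
        rg.finish()?;
    }
    // 4. close: writes the parquet footer.
    writer.close()?;
    Ok(())
}
```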
Limitations (PR-2)
Top-level arrow fields must each map to exactly one parquet leaf —
"flat" schemas of primitive, byte-array, and dictionary types. The
metrics schema is flat in this sense. Nested types (Struct, List,
Map) are rejected at `start_row_group` with a clear error message; if
PR-6 ever needs them, the implementation can extend the per-call
contract to consume multiple writers per arrow column.
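For example, a schema like the following (real arrow-rs types; the rejection behavior is as described above) maps one top-level field to two parquet leaves and would therefore be rejected:

```rust
use arrow_schema::{DataType, Field, Fields, Schema};

// One top-level Struct field fans out to two parquet leaf columns, so
// start_row_group would return ParquetWriteError::SchemaValidation.
let nested = Schema::new(vec![Field::new(
    "labels",
    DataType::Struct(Fields::from(vec![
        Field::new("key", DataType::Utf8, false),
        Field::new("value", DataType::Utf8, true),
    ])),
    true,
)]);
```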
The streaming writer does not implicitly add the `ARROW:schema` IPC
entry that `ArrowWriter` writes by default. Callers wanting that
self-describing arrow round-trip pass their `WriterProperties` through
`parquet::arrow::add_encoded_arrow_schema_to_metadata` first. PR-3
ingest and PR-6 merge will do this in their own setup helpers. This is
documented on `try_new`.
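A sketch of such a setup helper: `props_with_arrow_schema` is a hypothetical name, while `add_encoded_arrow_schema_to_metadata` is the parquet-crate function named above, assumed here to take the schema and a mutable `WriterProperties`.

```rust
use arrow_schema::Schema;
use parquet::arrow::add_encoded_arrow_schema_to_metadata;
use parquet::file::properties::WriterProperties;

// Hypothetical helper: embed the ARROW:schema IPC entry up front so the
// file stays self-describing, as ArrowWriter would do implicitly.
fn props_with_arrow_schema(schema: &Schema) -> WriterProperties {
    let mut props = WriterProperties::builder().build();
    add_encoded_arrow_schema_to_metadata(schema, &mut props);
    props
}
```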
Tests (13)
- Round-trip: single RG, multi RG, nulls preserved.
- Metadata identity vs `ArrowWriter` under identical `WriterProperties`: single RG and multi RG. Asserts schema descriptor (column count + names + physical types), per-RG `sorting_columns`, all `qh.*` KV entries, per-column compression, per-column bloom filter presence, per-column `statistics_enabled` level, num row groups, and per-RG `num_rows`.
- Functional contracts (bounded-memory sketch follows this list):
  - per-column `statistics_enabled` propagation (`EnabledStatistics::Chunk`)
  - bloom filter round-trip: a value never written is rejected by the read-back filter
  - bounded-memory contract: `memory_size` is monotone non-increasing as columns are written
  - empty row group
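A sketch of the bounded-memory assertion, using the same assumed API as the earlier sketch; only `memory_size` is named in this PR.

```rust
// Write the first column, then check that flushing each subsequent
// chunk keeps in-flight buffering from growing with column count.
let mut rg = writer.start_row_group()?;
let columns = batch.columns();
rg.write_next_column(columns[0].clone())?;
let mut previous = rg.memory_size();
for column in &columns[1..] {
    rg.write_next_column(column.clone())?;
    let current = rg.memory_size();
    // Monotone non-increasing: chunks are flushed as they are written.
    assert!(current <= previous);
    previous = current;
}
rg.finish()?;
```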
- Caller-contract violations (see the sketch below): too many columns → `SchemaValidation`; `finish` before all columns are written → `SchemaValidation`; mismatched row counts across columns within an RG → `SchemaValidation`; nested schema rejected at `start_row_group`.
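One of these violation tests might be shaped like this (assumed API as above; the exact error variant shape is an assumption):

```rust
// finish() before every column is written: structured error, no panic.
let mut rg = writer.start_row_group()?;
rg.write_next_column(batch.column(0).clone())?; // schema has N > 1 columns
let err = rg.finish().unwrap_err();
assert!(matches!(err, ParquetWriteError::SchemaValidation { .. }));
```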
Position in stack
- PR-1: `qh.rg_partition_prefix_len` marker
- PR-2 (this PR): column-major streaming writer primitive
- PR-3: cut ingest over to a single-RG writer built on this primitive
- PR-6: cut `ParquetMergeExecutor` over to the streaming engine; delete the downloader

PR-2 is independent of PR-1 review — it branches off `main`.
Test plan
- `cargo +nightly fmt --all -- --check` (per files touched)
- `RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests`
- `cargo doc -p quickwit-parquet-engine --no-deps`
- `cargo machete`
- `bash quickwit/scripts/check_license_headers.sh`
- `bash quickwit/scripts/check_log_format.sh`
- `cargo nextest run -p quickwit-parquet-engine --all-features` — 369 tests, all pass

🤖 Generated with Claude Code