
feat: enable page-level Parquet stats + add rg_partition_prefix_len marker #6377

Open

g-talbot wants to merge 3 commits into main from gtt/parquet-page-stats-and-marker

Conversation


g-talbot (Contributor) commented May 4, 2026

Summary

PR-1 of a 7-PR stack producing a memory-bounded, streaming, column-major Parquet pipeline (the reformulated Phase 4a). This PR lays the on-disk format foundation and verifies it works end-to-end through the production query path. Subsequent PRs add the streaming reader, writer, merge engine, and pipeline wiring.

Three changes:

  1. Page-level statistics by default. Writer flips from EnabledStatistics::Chunk to EnabledStatistics::Page, so every new file carries Parquet Column Index + Offset Index in the footer. The level is now a config knob — ParquetWriterConfig::with_page_statistics(false) falls back to chunk-level for callers that want a smaller footer. A second knob (with_data_page_row_count_limit) exposes Parquet's per-page row count rollover threshold, useful for tests and for future writers that want to control page granularity directly.

  2. qh.rg_partition_prefix_len marker. A numeric KV in the Parquet footer (and a matching u32 field on ParquetSplitMetadata) recording how many leading sort schema columns row group boundaries align with. 0 (or absent) = no claim, the legacy default. N = aligned with the first N sort columns. The marker is read by the merge engine and added to the compaction scope so files with different prefix values stay in separate buckets. The merge engine preserves the inputs' prefix on a single-RG output (vacuously aligned) and demotes to 0 only when the output is genuinely multi-RG, so post-PR-3 ingest splits stay in their bucket through small merges.

  3. Developer tooling. inspect_parquet_page_stats(...) library + inspect_parquet CLI binary in quickwit-parquet-engine. Reads the footer (including the page indexes), pretty-prints a per-RG / per-column / per-page report, and supports --json, --all-pages, and --verify-prefix (which enforces the strong form of the prefix claim: every column in the prefix must be constant within each RG).
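
The writer-side pieces of items 1 and 2 boil down to three parquet-rs `WriterProperties` knobs. A minimal sketch of the mapping, assuming the stock parquet crate builder API (`ParquetWriterConfig` is this repo's wrapper around it; the `1024` is illustrative):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::format::KeyValue;

/// Illustrative only: how the writer-side changes map onto parquet-rs
/// `WriterProperties`; the names below are the parquet crate's own.
fn writer_props(prefix_len: u32) -> WriterProperties {
    WriterProperties::builder()
        // (1) Page-level stats: Column Index + Offset Index in the footer.
        .set_statistics_enabled(EnabledStatistics::Page)
        // (1b) Per-page row-count rollover; the new config knob defaults
        // to 0 = unbounded, i.e. size-based paging only.
        .set_data_page_row_count_limit(1024)
        // (2) The alignment marker, recorded as footer KV metadata.
        .set_key_value_metadata(Some(vec![KeyValue {
            key: "qh.rg_partition_prefix_len".to_string(),
            value: Some(prefix_len.to_string()),
        }]))
        .build()
}
```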

End-to-end verification (Gate-A)

A new integration test (test_page_index_pruning_via_query in quickwit-datafusion/tests/metrics.rs) confirms page-level pruning fires through the production query path:

  • Builds a single split with two metric_names interleaved in one row group, forced into ~16 pages.
  • Runs WHERE metric_name = 'cpu.usage' via the regular SQL session builder (no shortcuts; same path the REST API uses).
  • Walks the executed ExecutionPlan for the PruningMetrics::page_index_rows_pruned counter exposed by DataFusion's ParquetSource.
  • Asserts the counter ≥ 4096 (the rows from the other metric) AND that the query still returns exactly the cpu.usage rows.
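
The plan walk in the last two bullets looks roughly like this. A sketch, not the test's literal code: `sum_metric` is a hypothetical helper, and the calls (`metrics()`, `MetricValue::name` / `as_usize`) are DataFusion's stock metrics API:

```rust
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical helper: sum a named counter across every node of an
/// executed plan (DataFusion records metrics per operator).
fn sum_metric(plan: &dyn ExecutionPlan, name: &str) -> usize {
    let own: usize = plan
        .metrics()
        .map(|set| {
            set.iter()
                .filter(|m| m.value().name() == name)
                .map(|m| m.value().as_usize())
                .sum()
        })
        .unwrap_or(0);
    plan.children()
        .into_iter()
        .fold(own, |acc, child| acc + sum_metric(child.as_ref(), name))
}

// After the query has executed:
// assert!(sum_metric(plan.as_ref(), "page_index_rows_pruned") >= 4096);
```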

This proves that:

  • The new EnabledStatistics::Page writer config makes the column index + offset index land in the footer (already covered by inspect.rs unit tests).
  • The reader path (MetricsParquetTableProvider::scan) loads the page index — ParquetSource::with_enable_page_index(true) was already wired at line 207, no change needed.
  • DataFusion's predicate pushdown evaluates against the page-level stats and skips matching pages.

This is the gate that had to pass before PR-3 cuts ingest over to single-RG: without page-level pruning, single-RG files would collapse query pruning to one min/max per file. That gate now passes.

Why now

Single-row-group files — coming in PR-3 — have only one RG worth of chunk-level statistics, which collapses query pruning to one min/max per file unless page-level data is in the footer. Page indexes have to land in the footer before the writer cuts over to single-RG, which is what this PR sets up.

The marker is the contract between the streaming reader and the merge engine: only files that claim the same prefix length can be merged together. Defining and threading it through compaction scope + validation now means later PRs in the stack just turn on writes — no retroactive plumbing.

Footer overhead

Measured on production-sized files (zstd-3, default writer config) by writing the same data twice — once with EnabledStatistics::Chunk, once with Page — and comparing total file size:

| Shape | Compressed size | Page-index overhead |
| --- | --- | --- |
| 1.5M rows × 11 cols (realistic) | 22.6 MiB | +0.033% (7.6 KiB) |
| 3.5M rows × 11 cols (realistic) | 52.2 MiB | +0.033% (17.6 KiB) |
| 3.5M rows × 50 cols (wide-schema worst case) | 52.9 MiB | +0.162% (87.5 KiB) |

The page index scales with column count × page count, not row count, so on production-sized splits it's a rounding error. Pinned by an integration test (test_footer_size_delta_for_page_level_stats) that fails if the delta exceeds 30% on a small synthetic file — a generous bound that catches regressions without flapping on the absolute size.

Test plan

  • cargo nextest run -p quickwit-parquet-engine --all-features — 384 / 384 (17 new for this PR)
  • cargo nextest run -p quickwit-datafusion --test metrics — 14 / 14 (1 new: end-to-end page-index pruning)
  • cargo clippy -p quickwit-parquet-engine --tests — clean
  • cargo clippy -p quickwit-datafusion --tests — clean
  • cargo doc -p quickwit-parquet-engine --no-deps — clean
  • cargo machete — clean
  • bash quickwit/scripts/check_license_headers.sh — clean
  • bash quickwit/scripts/check_log_format.sh — clean
  • Manual: cargo run -p quickwit-parquet-engine --bin inspect_parquet -- <file>.parquet produces a useful human report; --json round-trips through serde; --verify-prefix errors with a per-RG diagnostic when the prefix claim is violated
  • Workspace cargo check --workspace --all-features — no breakage from the new ParquetSplitMetadata field (default 0, additive JSON Serde) or the new ParquetWriterConfig field (default 0 = unbounded, behavior unchanged)

What's not in this PR (deferred to later in the stack)

  • Any writer or reader changes beyond the config flip and the marker plumbing.
  • Setting rg_partition_prefix_len > 0 from any production code path. PR-3 (single-RG ingest) is the first writer that will set it; PR-6 (streaming column-major merge engine) is where multi-RG-by-metric_name output lands.

🤖 Generated with Claude Code

…arker

Foundation for the streaming column-major merge engine workstream.

Switches the writer's default from EnabledStatistics::Chunk to
EnabledStatistics::Page so every newly-written file carries a Column
Index and Offset Index in its footer. Without this, single-RG files
produced by future PRs would have one min/max per file — useless for
selective queries. The default is exposed as a knob
(`ParquetWriterConfig::with_page_statistics`) so callers can opt out
when the footer overhead isn't worth it.

Adds a numeric marker `qh.rg_partition_prefix_len` in the file's KV
metadata and a matching `rg_partition_prefix_len: u32` field on
`ParquetSplitMetadata`. The marker records how many leading sort
schema columns RG boundaries align with: 0 = no claim (legacy default),
N = aligned with the first N sort columns. Single-RG files vacuously
satisfy any prefix; future writers will set N = sort_schema.len().

Compaction scope now includes `rg_partition_prefix_len`. Splits with
different prefix values land in different buckets; the merge engine
validates input files agree on prefix and rejects mismatches at both
the metastore-struct layer and the on-disk KV layer. Until the
streaming engine lands, the merge writer demotes the output's prefix
to 0 because it cannot enforce alignment.
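
For the KV-layer check, reading the claim back out of a footer is a few lines. A sketch (hypothetical helper; the `ParquetMetaData` accessors are the parquet crate's):

```rust
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical helper mirroring the on-disk validation: an absent or
/// unparseable marker means 0, i.e. "no claim" (the legacy default).
fn read_prefix_len(meta: &ParquetMetaData) -> u32 {
    meta.file_metadata()
        .key_value_metadata()
        .and_then(|kvs| kvs.iter().find(|kv| kv.key == "qh.rg_partition_prefix_len"))
        .and_then(|kv| kv.value.as_deref())
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}
```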

New developer tooling:
- `quickwit_parquet_engine::storage::inspect_parquet_page_stats` library
  function returning a structured per-RG / per-column / per-page report,
  plus `verify_partition_prefix` for the strong-form alignment check.
- `inspect_parquet` binary in the parquet-engine crate with `--json`,
  `--all-pages`, and `--verify-prefix` flags.
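
Under the hood, the inspector's first step is just a footer load with page indexes enabled, which parquet-rs skips by default. A sketch assuming a recent parquet crate:

```rust
use std::fs::File;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

/// Sketch: load the footer *with* the page indexes (off by default).
fn load_footer(path: &str) -> Result<()> {
    let file = File::open(path)?;
    let meta = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;
    // Per RG, per column: min/max per page (Column Index) and page
    // locations + row counts (Offset Index) to report on.
    let _column_index = meta.column_index();
    let _offset_index = meta.offset_index();
    Ok(())
}
```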

Footer-size delta on a small synthetic shape (100K rows × 6 cols):
+19.5% (672 KB → 804 KB). The page index scales with column count,
not data volume, so production-sized 50 MB splits show < 0.3% overhead.

Test count: 367 → 382 (15 new). Clippy/doc/license/log/machete clean.
g-talbot and others added 2 commits May 5, 2026 10:06
Avoids a compaction-bucket leak that would otherwise appear once PR-3
ships single-RG ingest before PR-6 ships the streaming column-major
merge engine.

Previously, every merge unconditionally set the output's
`rg_partition_prefix_len` to 0, even when the writer happened to
produce a single-RG output that vacuously satisfies any alignment
claim. With single-RG ingest active and merge demoting on every
operation, post-PR-3 ingest splits would leak out of the
`prefix = sort_len` bucket on their first merge and never rejoin
it — newer ingests would not merge with merge outputs.

New rule: predict the output's row group count via
`num_rows.div_ceil(row_group_size)`. If ≤ 1 RG, propagate the inputs'
prefix; otherwise demote to 0. Both the metastore split metadata
(`merge_parquet_split_metadata`) and the file's KV metadata
(`build_merge_kv_metadata`) follow the same rule, so they always
agree about what's on disk.

A `debug_assert!` checks that the prediction matches the actual row
group count returned by `ArrowWriter::close()` — catches a future
config change that adds a byte-based RG threshold and silently
invalidates the KV claim.

`MergeOutputFile` gains a `num_row_groups: usize` field so the
metastore-side rule can be applied without re-parsing the file.
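
Condensed, the rule both layers follow (a sketch with assumed signatures; the real code lives in `merge_parquet_split_metadata` and `build_merge_kv_metadata`):

```rust
/// Sketch of the shared rule: a predicted single-RG output vacuously
/// satisfies any alignment claim, so the inputs' prefix survives;
/// a multi-RG output makes no alignment guarantee yet, so demote.
fn output_prefix_len(inputs_prefix_len: u32, num_rows: usize, row_group_size: usize) -> u32 {
    if num_rows.div_ceil(row_group_size) <= 1 {
        inputs_prefix_len
    } else {
        0
    }
}

// After `ArrowWriter::close()`, the commit's debug_assert! compares the
// prediction against the actual row group count so a future byte-based
// RG threshold can't silently invalidate the KV claim.
```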

Test changes:
- Rename `test_output_prefix_len_demoted_to_zero` to
  `test_output_prefix_len_demoted_when_multi_rg`; pin the demotion
  to the `num_row_groups > 1` case.
- New `test_output_prefix_len_preserved_when_single_rg` asserting
  the propagation case.
- New `test_merge_demotes_prefix_when_output_is_multi_rg`
  exercising the real writer with `row_group_size = 2` and verifying
  the file's KV records 0 via the inspector.
- Extend `test_merge_accepts_matching_rg_partition_prefix_len` to
  inspector-verify the single-RG output's KV preserves the prefix.

Test count: 382 → 384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gate-A verification before PR-3 (single-RG ingest cutover): proves that
page-level statistics written by PR-1 are actually consumed by the
production query path for pruning, not just embedded inertly in the
footer.

Findings:
- The metrics read path at `MetricsParquetTableProvider::scan` already
  calls `ParquetSource::with_enable_page_index(true)`, so DataFusion
  loads the column index + offset index when reading. No new wiring
  needed on the reader side.
- DataFusion's `PruningMetrics` (`page_index_rows_pruned`) counter on
  `DataSourceExec` is the testable signal — pruned > 0 means pages
  were eliminated using their min/max from the column index.

The new integration test
(`quickwit-datafusion/tests/metrics.rs::test_page_index_pruning_via_query`)
builds a single split with two metric_names interleaved, forces the
metric_name column into ~16 pages within one row group, runs
`WHERE metric_name = 'cpu.usage'`, walks the executed plan, and asserts
`page_index_rows_pruned >= 4096` (the rows from the *other* metric)
plus correctness of the returned rows.

Plumbing change: `ParquetWriterConfig::with_data_page_row_count_limit`
exposes Parquet's per-page row count rollover threshold. The size-based
`data_page_size` knob alone can't force multi-page output when
dictionary-encoded columns RLE-compress to a handful of bytes
regardless of row count. Default 0 = unbounded; production behavior
unchanged.
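
For intuition, with illustrative numbers (not the test's literal constants) consistent with the ~16 pages / ≥ 4096 rows above:

```rust
// Illustrative arithmetic, not lifted from the test:
let rows_total = 8_192usize;    // 4_096 rows per metric_name
let page_row_limit = 512usize;  // via with_data_page_row_count_limit
let pages = rows_total.div_ceil(page_row_limit);
assert_eq!(pages, 16);
// If rows are laid out so each page holds a single metric_name, pruning
// for `metric_name = 'cpu.usage'` can skip the other metric's 8 pages,
// i.e. 4_096 rows: exactly the counter the integration test checks.
```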

Tests: 14/14 metrics integration tests pass (was 13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>