
feat: enable page-level Parquet stats + add rg_partition_prefix_len marker #6377

Open

g-talbot wants to merge 3 commits into main from gtt/parquet-page-stats-and-marker

Conversation


g-talbot (Contributor) commented May 4, 2026

Summary

PR-1 of a 7-PR stack producing a memory-bounded, streaming, column-major Parquet pipeline (the reformulated Phase 4a). This PR lays the on-disk format foundation and verifies it works end-to-end through the production query path. Subsequent PRs add the streaming reader, writer, merge engine, and pipeline wiring.

Three changes:

  1. Page-level statistics by default. Writer flips from EnabledStatistics::Chunk to EnabledStatistics::Page, so every new file carries Parquet Column Index + Offset Index in the footer. The level is now a config knob — ParquetWriterConfig::with_page_statistics(false) falls back to chunk-level for callers that want a smaller footer. A second knob (with_data_page_row_count_limit) exposes Parquet's per-page row count rollover threshold, useful for tests and for future writers that want to control page granularity directly.

  2. qh.rg_partition_prefix_len marker. A numeric KV in the Parquet footer (and a matching u32 field on ParquetSplitMetadata) recording how many leading sort schema columns row group boundaries align with. 0 (or absent) = no claim, the legacy default. N = aligned with the first N sort columns. The marker is read by the merge engine and added to the compaction scope so files with different prefix values stay in separate buckets. The merge engine preserves the inputs' prefix on a single-RG output (vacuously aligned) and demotes to 0 only when the output is genuinely multi-RG, so post-PR-3 ingest splits stay in their bucket through small merges.

  3. Developer tooling. inspect_parquet_page_stats(...) library + inspect_parquet CLI binary in quickwit-parquet-engine. Reads the footer (including the page indexes), pretty-prints a per-RG / per-column / per-page report, and supports --json, --all-pages, and --verify-prefix (which enforces the strong form of the prefix claim: every column in the prefix must be constant within each RG).
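
The writer-side pieces of items 1 and 2 boil down to three parquet-rs `WriterProperties` knobs. A minimal sketch of the mapping, assuming the stock parquet crate builder API (`ParquetWriterConfig` is this repo's wrapper around it; the `1024` is illustrative):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::format::KeyValue;

/// Illustrative only: how the writer-side changes map onto parquet-rs
/// `WriterProperties`; the names below are the parquet crate's own.
fn writer_props(prefix_len: u32) -> WriterProperties {
    WriterProperties::builder()
        // (1) Page-level stats: Column Index + Offset Index in the footer.
        .set_statistics_enabled(EnabledStatistics::Page)
        // (1b) Per-page row-count rollover; the new config knob defaults
        // to 0 = unbounded, i.e. size-based paging only.
        .set_data_page_row_count_limit(1024)
        // (2) The alignment marker, recorded as footer KV metadata.
        .set_key_value_metadata(Some(vec![KeyValue {
            key: "qh.rg_partition_prefix_len".to_string(),
            value: Some(prefix_len.to_string()),
        }]))
        .build()
}
```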

End-to-end verification (Gate-A)

A new integration test (test_page_index_pruning_via_query in quickwit-datafusion/tests/metrics.rs) confirms page-level pruning fires through the production query path:

  • Builds a single split with two metric_names interleaved in one row group, forced into ~16 pages.
  • Runs WHERE metric_name = 'cpu.usage' via the regular SQL session builder (no shortcuts; same path the REST API uses).
  • Walks the executed ExecutionPlan for the PruningMetrics::page_index_rows_pruned counter exposed by DataFusion's ParquetSource.
  • Asserts the counter ≥ 4096 (the rows from the other metric) AND that the query still returns exactly the cpu.usage rows.
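
The plan walk in the last two bullets looks roughly like this. A sketch, not the test's literal code: `sum_metric` is a hypothetical helper, and the calls (`metrics()`, `MetricValue::name` / `as_usize`) are DataFusion's stock metrics API:

```rust
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical helper: sum a named counter across every node of an
/// executed plan (DataFusion records metrics per operator).
fn sum_metric(plan: &dyn ExecutionPlan, name: &str) -> usize {
    let own: usize = plan
        .metrics()
        .map(|set| {
            set.iter()
                .filter(|m| m.value().name() == name)
                .map(|m| m.value().as_usize())
                .sum()
        })
        .unwrap_or(0);
    plan.children()
        .into_iter()
        .fold(own, |acc, child| acc + sum_metric(child.as_ref(), name))
}

// After the query has executed:
// assert!(sum_metric(plan.as_ref(), "page_index_rows_pruned") >= 4096);
```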

This proves that:

  • The new EnabledStatistics::Page writer config makes the column index + offset index land in the footer (already covered by inspect.rs unit tests).
  • The reader path (MetricsParquetTableProvider::scan) loads the page index — ParquetSource::with_enable_page_index(true) was already wired at line 207, no change needed.
  • DataFusion's predicate pushdown evaluates against the page-level stats and skips matching pages.

This is the gate that had to pass before PR-3 cuts ingest over to single-RG: without page-level pruning, single-RG files would collapse query pruning to one min/max per file. That gate now passes.

Why now

Single-row-group files — coming in PR-3 — have only one RG worth of chunk-level statistics, which collapses query pruning to one min/max per file unless page-level data is in the footer. Page indexes have to land in the footer before the writer cuts over to single-RG, which is what this PR sets up.

The marker is the contract between the streaming reader and the merge engine: only files that claim the same prefix length can be merged together. Defining and threading it through compaction scope + validation now means later PRs in the stack just turn on writes — no retroactive plumbing.

Footer overhead

Measured on production-sized files (zstd-3, default writer config) by writing the same data twice — once with EnabledStatistics::Chunk, once with Page — and comparing total file size:

| Shape | Compressed size | Page-index overhead |
| --- | --- | --- |
| 1.5M rows × 11 cols (realistic) | 22.6 MiB | +0.033% (7.6 KiB) |
| 3.5M rows × 11 cols (realistic) | 52.2 MiB | +0.033% (17.6 KiB) |
| 3.5M rows × 50 cols (wide-schema worst case) | 52.9 MiB | +0.162% (87.5 KiB) |

The page index scales with column count × page count, not row count, so on production-sized splits it's a rounding error. Pinned by an integration test (test_footer_size_delta_for_page_level_stats) that fails if the delta exceeds 30% on a small synthetic file — a generous bound that catches regressions without flapping on the absolute size.

Test plan

  • cargo nextest run -p quickwit-parquet-engine --all-features — 384 / 384 (17 new for this PR)
  • cargo nextest run -p quickwit-datafusion --test metrics — 14 / 14 (1 new: end-to-end page-index pruning)
  • cargo clippy -p quickwit-parquet-engine --tests — clean
  • cargo clippy -p quickwit-datafusion --tests — clean
  • cargo doc -p quickwit-parquet-engine --no-deps — clean
  • cargo machete — clean
  • bash quickwit/scripts/check_license_headers.sh — clean
  • bash quickwit/scripts/check_log_format.sh — clean
  • Manual: cargo run -p quickwit-parquet-engine --bin inspect_parquet -- <file>.parquet produces a useful human report; --json round-trips through serde; --verify-prefix errors with a per-RG diagnostic when the prefix claim is violated
  • Workspace cargo check --workspace --all-features — no breakage from the new ParquetSplitMetadata field (default 0, additive JSON Serde) or the new ParquetWriterConfig field (default 0 = unbounded, behavior unchanged)

What's not in this PR (deferred to later in the stack)

  • Any writer or reader changes beyond the config flip and the marker plumbing.
  • Setting rg_partition_prefix_len > 0 from any production code path. PR-3 (single-RG ingest) is the first writer that will set it; PR-6 (streaming column-major merge engine) is where multi-RG-by-metric_name output lands.

🤖 Generated with Claude Code

…arker

Foundation for the streaming column-major merge engine workstream.

Switches the writer's default from EnabledStatistics::Chunk to
EnabledStatistics::Page so every newly-written file carries a Column
Index and Offset Index in its footer. Without this, single-RG files
produced by future PRs would have one min/max per file — useless for
selective queries. The default is exposed as a knob
(`ParquetWriterConfig::with_page_statistics`) so callers can opt out
when the footer overhead isn't worth it.

Adds a numeric marker `qh.rg_partition_prefix_len` in the file's KV
metadata and a matching `rg_partition_prefix_len: u32` field on
`ParquetSplitMetadata`. The marker records how many leading sort
schema columns RG boundaries align with: 0 = no claim (legacy default),
N = aligned with the first N sort columns. Single-RG files vacuously
satisfy any prefix; future writers will set N = sort_schema.len().

Compaction scope now includes `rg_partition_prefix_len`. Splits with
different prefix values land in different buckets; the merge engine
validates input files agree on prefix and rejects mismatches at both
the metastore-struct layer and the on-disk KV layer. Until the
streaming engine lands, the merge writer demotes the output's prefix
to 0 because it cannot enforce alignment.
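
For the KV-layer check, reading the claim back out of a footer is a few lines. A sketch (hypothetical helper; the `ParquetMetaData` accessors are the parquet crate's):

```rust
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical helper mirroring the on-disk validation: an absent or
/// unparseable marker means 0, i.e. "no claim" (the legacy default).
fn read_prefix_len(meta: &ParquetMetaData) -> u32 {
    meta.file_metadata()
        .key_value_metadata()
        .and_then(|kvs| kvs.iter().find(|kv| kv.key == "qh.rg_partition_prefix_len"))
        .and_then(|kv| kv.value.as_deref())
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}
```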

New developer tooling:
- `quickwit_parquet_engine::storage::inspect_parquet_page_stats` library
  function returning a structured per-RG / per-column / per-page report,
  plus `verify_partition_prefix` for the strong-form alignment check.
- `inspect_parquet` binary in the parquet-engine crate with `--json`,
  `--all-pages`, and `--verify-prefix` flags.
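
Under the hood, the inspector's first step is just a footer load with page indexes enabled, which parquet-rs skips by default. A sketch assuming a recent parquet crate:

```rust
use std::fs::File;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

/// Sketch: load the footer *with* the page indexes (off by default).
fn load_footer(path: &str) -> Result<()> {
    let file = File::open(path)?;
    let meta = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;
    // Per RG, per column: min/max per page (Column Index) and page
    // locations + row counts (Offset Index) to report on.
    let _column_index = meta.column_index();
    let _offset_index = meta.offset_index();
    Ok(())
}
```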

Footer-size delta on a small synthetic shape (100K rows × 6 cols):
+19.5% (672 KB → 804 KB). The page index scales with column count,
not data volume, so production-sized 50 MB splits show < 0.3% overhead.

Test count: 367 → 382 (15 new). Clippy/doc/license/log/machete clean.
g-talbot and others added 2 commits May 5, 2026 10:06
Avoids a compaction-bucket leak that would otherwise appear once PR-3
ships single-RG ingest before PR-6 ships the streaming column-major
merge engine.

Previously, every merge unconditionally set the output's
`rg_partition_prefix_len` to 0, even when the writer happened to
produce a single-RG output that vacuously satisfies any alignment
claim. With single-RG ingest active and merge demoting on every
operation, post-PR-3 ingest splits would leak out of the
`prefix = sort_len` bucket on their first merge and never rejoin
it — newer ingests would not merge with merge outputs.

New rule: predict the output's row group count via
`num_rows.div_ceil(row_group_size)`. If ≤ 1 RG, propagate the inputs'
prefix; otherwise demote to 0. Both the metastore split metadata
(`merge_parquet_split_metadata`) and the file's KV metadata
(`build_merge_kv_metadata`) follow the same rule, so they always
agree about what's on disk.

A `debug_assert!` checks that the prediction matches the actual row
group count returned by `ArrowWriter::close()` — catches a future
config change that adds a byte-based RG threshold and silently
invalidates the KV claim.

`MergeOutputFile` gains a `num_row_groups: usize` field so the
metastore-side rule can be applied without re-parsing the file.
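
Condensed, the rule both layers follow (a sketch with assumed signatures; the real code lives in `merge_parquet_split_metadata` and `build_merge_kv_metadata`):

```rust
/// Sketch of the shared rule: a predicted single-RG output vacuously
/// satisfies any alignment claim, so the inputs' prefix survives;
/// a multi-RG output makes no alignment guarantee yet, so demote.
fn output_prefix_len(inputs_prefix_len: u32, num_rows: usize, row_group_size: usize) -> u32 {
    if num_rows.div_ceil(row_group_size) <= 1 {
        inputs_prefix_len
    } else {
        0
    }
}

// After `ArrowWriter::close()`, the commit's debug_assert! compares the
// prediction against the actual row group count so a future byte-based
// RG threshold can't silently invalidate the KV claim.
```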

Test changes:
- Rename `test_output_prefix_len_demoted_to_zero` to
  `test_output_prefix_len_demoted_when_multi_rg`; pin the demotion
  to the `num_row_groups > 1` case.
- New `test_output_prefix_len_preserved_when_single_rg` asserting
  the propagation case.
- New `test_merge_demotes_prefix_when_output_is_multi_rg`
  exercising the real writer with `row_group_size = 2` and verifying
  the file's KV records 0 via the inspector.
- Extend `test_merge_accepts_matching_rg_partition_prefix_len` to
  inspector-verify the single-RG output's KV preserves the prefix.

Test count: 382 → 384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gate-A verification before PR-3 (single-RG ingest cutover): proves that
page-level statistics written by PR-1 are actually consumed by the
production query path for pruning, not just embedded inertly in the
footer.

Findings:
- The metrics read path at `MetricsParquetTableProvider::scan` already
  calls `ParquetSource::with_enable_page_index(true)`, so DataFusion
  loads the column index + offset index when reading. No new wiring
  needed on the reader side.
- DataFusion's `PruningMetrics` (`page_index_rows_pruned`) counter on
  `DataSourceExec` is the testable signal — pruned > 0 means pages
  were eliminated using their min/max from the column index.

The new integration test
(`quickwit-datafusion/tests/metrics.rs::test_page_index_pruning_via_query`)
builds a single split with two metric_names interleaved, forces the
metric_name column into ~16 pages within one row group, runs
`WHERE metric_name = 'cpu.usage'`, walks the executed plan, and asserts
`page_index_rows_pruned >= 4096` (the rows from the *other* metric)
plus correctness of the returned rows.

Plumbing change: `ParquetWriterConfig::with_data_page_row_count_limit`
exposes Parquet's per-page row count rollover threshold. The size-based
`data_page_size` knob alone can't force multi-page output when
dictionary-encoded columns RLE-compress to a handful of bytes
regardless of row count. Default 0 = unbounded; production behavior
unchanged.
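
For intuition, with illustrative numbers (not the test's literal constants) consistent with the ~16 pages / ≥ 4096 rows above:

```rust
// Illustrative arithmetic, not lifted from the test:
let rows_total = 8_192usize;    // 4_096 rows per metric_name
let page_row_limit = 512usize;  // via with_data_page_row_count_limit
let pages = rows_total.div_ceil(page_row_limit);
assert_eq!(pages, 16);
// If rows are laid out so each page holds a single metric_name, pruning
// for `metric_name = 'cpu.usage'` can skip the other metric's 8 pages,
// i.e. 4_096 rows: exactly the counter the integration test checks.
```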

Tests: 14/14 metrics integration tests pass (was 13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>