A command-line tool for inspecting, transforming, and comparing tabular files.
parq focuses on the workflows that come up most often when working with .parquet, .csv, .tsv, and .xlsx files:
- inspect metadata and schema
- preview the first or last rows
- count rows
- split large files
- compute lightweight column stats (with cardinality and top-values for string columns)
- convert between supported formats
- diff two datasets by key
- merge compatible files
The CLI keeps startup light with lazy imports, preserves plain and json output modes for automation, and avoids unnecessary full-table materialization for large CSV/XLSX workflows where possible.
pip install parq-cli
Enable .xlsx support with the optional dependency:
pip install "parq-cli[xlsx]"
Project facts for AI-assisted development live in .trellis/spec/. Start with .trellis/spec/README.md, then read .trellis/spec/backend/index.md and .trellis/spec/shared/verification.md before changing CLI commands, readers, format modules, output contracts, scripts, or release workflows.
# Inspect metadata
parq meta data.parquet
parq meta --fast data.csv
# Show schema
parq schema data.xlsx
# Preview rows
parq head data.parquet
parq head -n 10 --columns id,name data.csv
parq tail -n 20 data.csv
# Count rows
parq count data.parquet
# Split files
parq split data.csv --record-count 100000 -n "chunks/part-%03d.csv"
parq split data.parquet --file-count 4 -n "chunks/part-%02d.parquet"
parq split data.csv --record-count 100000 -n "out/part-%03d.csv" --force # overwrite existing
# Column statistics (string columns include cardinality and top values)
parq stats sales.parquet --columns amount,category --limit 10
parq stats sales.parquet --columns category --top-n 10 # show top 10 most frequent values
# Format conversion (with live progress bar)
parq convert raw.xlsx cleaned.parquet
parq convert source.parquet export.csv --columns id,name,status
parq convert source.parquet export.csv --force # overwrite if exists
# Read TSV files or use a custom delimiter
parq head data.tsv
parq head --delimiter ";" data.csv
# Read a specific XLSX sheet
parq head --sheet Sheet2 report.xlsx
parq head --sheet 1 report.xlsx # 0-based index
# Dataset diff
parq diff old.parquet new.parquet --key id --columns status,amount
parq diff left.csv right.csv --key id --summary-only
# Merge compatible inputs (with live progress bar)
parq merge part-001.parquet part-002.parquet merged.parquet
parq merge chunks/*.parquet merged.parquet --force # overwrite if exists

| Command | Parquet | CSV | TSV | XLSX |
|---|---|---|---|---|
| meta | yes | yes | yes | yes |
| schema | yes | yes | yes | yes |
| head / tail | yes | yes | yes | yes |
| count | yes | yes | yes | yes |
| split | yes | yes | yes | yes |
| stats | yes | yes | yes | yes |
| convert | yes | yes | yes | yes |
| diff | yes | yes | yes | no, convert first |
| merge | yes | yes | yes | yes |

XLSX support requires openpyxl. TSV files are auto-detected by the .tsv extension; a custom delimiter can be supplied with --delimiter.
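The delimiter rule above can be pictured with a few lines of Python. This is only a sketch, not parq's code: the function name `pick_delimiter` is hypothetical, and it assumes that an explicit delimiter takes precedence over the extension and that the extension check is case-insensitive.

```python
from pathlib import Path
from typing import Optional


def pick_delimiter(path: str, override: Optional[str] = None) -> str:
    """Return the field delimiter to use for a delimited text file.

    Hypothetical helper: assumes an explicit override beats the extension.
    """
    # An explicitly supplied delimiter is used as-is.
    if override is not None:
        return override
    # Otherwise choose by extension: tab for .tsv, comma for everything else.
    return "\t" if Path(path).suffix.lower() == ".tsv" else ","


print(repr(pick_delimiter("data.tsv")))
print(repr(pick_delimiter("data.csv")))
print(repr(pick_delimiter("data.csv", ";")))
```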
parq meta FILE
parq meta --fast FILE
Shows file-level metadata such as path, format, column count, file size, row-group count, and when available, row count and Parquet-specific metadata.
Use --fast when you want a cheap metadata pass on CSV/XLSX files. In fast mode, expensive fields such as full row counts are skipped.
parq schema FILE
Shows column names, types, and nullable information.
parq head FILE
parq head -n 20 FILE
parq head -n 20 --columns id,name FILE
parq tail FILE
parq tail -n 20 FILE
parq tail -n 20 --columns id,name FILE
Notes:
- default preview size is 5
- --columns accepts a comma-separated list
- missing files return a friendly error with exit code 1
- empty header-only CSV/XLSX files return an empty preview with detected columns
- an empty CSV with no header raises a friendly Empty CSV file error
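The two empty-file cases in the notes above are easy to confuse, so here is a minimal sketch of how a head-style preview might tell them apart. It uses only the standard library and is not parq's implementation; the name `preview_head` and the choice of `ValueError` are assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def preview_head(text, n=5, columns=None):
    """Return (column_names, rows) for the first n data rows of CSV text.

    Hypothetical helper; `columns` is a comma-separated string like "id,name".
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        # No header line at all, so there are no columns to detect.
        raise ValueError("Empty CSV file")
    wanted = columns.split(",") if columns else header
    # Map the requested names to positions in the header row.
    idx = [header.index(name) for name in wanted]
    # islice stops after n rows, so the rest of the input is never read.
    rows = [[row[i] for i in idx] for row in islice(reader, n)]
    return wanted, rows


# A header-only file yields the detected columns and an empty preview.
print(preview_head("id,name\n"))
```

A header-only input returns the column names with no rows, while a completely empty input raises the error.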
parq count FILE
Returns the total row count.
parq split FILE --file-count N
parq split FILE --record-count N
parq split FILE --record-count 100000 -n "chunks/part-%03d.parquet"
parq split FILE --record-count 100000 -n "chunks/part-%03d.csv" --force
Splits one input file into multiple output files.
Rules:
- specify exactly one of --file-count or --record-count
- output format is inferred from --name-format
- by default, existing target files raise an error; use --force/-F to overwrite
- in --record-count mode, CSV/XLSX now stream in a single pass instead of pre-counting the entire file
- a live progress bar is shown during the split
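To make the single-pass behaviour of record-count mode concrete, the sketch below splits a CSV while reading it exactly once. It is an illustration under stated assumptions, not parq's code: `split_csv` is a hypothetical name, chunk numbering is assumed to start at 0, and each chunk is assumed to repeat the header row.

```python
import csv
import os


def split_csv(src, record_count, name_format, force=False):
    """Stream src into chunk files holding at most record_count rows each.

    Hypothetical helper; name_format is a printf-style pattern such as
    "chunks/part-%03d.csv". Returns the list of paths written.
    """
    outputs = []
    out = None
    writer = None
    try:
        with open(src, newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader, None)
            if header is None:
                raise ValueError("Empty CSV file")
            for i, row in enumerate(reader):
                if i % record_count == 0:
                    # Chunk boundary: close the previous file, open the next.
                    if out is not None:
                        out.close()
                        out = None
                    path = name_format % len(outputs)
                    if os.path.exists(path) and not force:
                        raise FileExistsError(path)
                    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
                    out = open(path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    outputs.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Because the chunk boundary is decided from the running row index, the total row count is never needed up front. A call such as `split_csv("data.csv", 100000, "chunks/part-%03d.csv")` would mirror the command shown above.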
parq stats FILE
parq stats FILE --columns amount,category
parq stats FILE --limit 20
parq stats FILE --columns category --top-n 10
Computes simple per-column statistics.
- numeric columns include count, null_count, min, max, mean
- string, boolean, and date columns additionally include cardinality and top_values (top N most frequent values with their occurrence counts)
- default --top-n is 5; set to 0 to suppress top-values output entirely
- default --limit is 50 to avoid flooding the terminal on very wide tables
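As a rough picture of what cardinality and top_values mean for a string column, here is a standard-library sketch. It is not how parq computes them; `string_column_stats` is a hypothetical name, and treating count as the number of non-null values is an assumption.

```python
from collections import Counter


def string_column_stats(values, top_n=5):
    """Compute simple statistics for one string column.

    Hypothetical helper; None stands in for a null cell, and "count" is
    assumed to mean the number of non-null values.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    stats = {
        "count": len(present),
        "null_count": len(values) - len(present),
        "cardinality": len(counts),  # number of distinct non-null values
    }
    if top_n > 0:
        # most_common returns (value, occurrences) pairs, highest first.
        stats["top_values"] = counts.most_common(top_n)
    return stats


print(string_column_stats(["a", "b", "a", None, "c", "a", "b"], top_n=2))
```

With `top_n=0` the sketch leaves out the `top_values` entry, matching the suppression behaviour described above.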
parq convert SOURCE OUTPUT
parq convert SOURCE OUTPUT --columns id,name,status
parq convert SOURCE OUTPUT --force
Converts a supported input file to another supported output format. The output format is determined by the OUTPUT suffix.
Notes:
- current targets are .parquet, .csv, .tsv, and .xlsx
- conversion is streaming-based where possible
- a live progress bar is shown during the conversion
- by default, existing output files raise an error; use --force/-F to overwrite
parq diff LEFT RIGHT --key id
parq diff LEFT RIGHT --key id1,id2 --columns status,amount
parq diff LEFT RIGHT --key id --summary-only
Compares two datasets by key and reports:
- row count delta
- rows only present on the left
- rows only present on the right
- changed rows for the selected columns
- schema-only columns and same-name type mismatches
Notes:
- --key is required
- diff currently supports Parquet and CSV inputs
- XLSX files should be converted first
- duplicate keys on either side are treated as an error
- --summary-only keeps the counts and omits sample payloads
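The keyed comparison described above can be sketched over plain lists of dicts. This is an illustration of the idea only, not parq's implementation: the names `index_by_key` and `keyed_diff`, the result keys, and the sign of the row count delta (right minus left) are all assumptions, and schema comparison is left out.

```python
def index_by_key(rows, key):
    """Map each row's key tuple to the row, rejecting duplicate keys."""
    indexed = {}
    for row in rows:
        k = tuple(row[name] for name in key)
        if k in indexed:
            raise ValueError(f"duplicate key: {k}")
        indexed[k] = row
    return indexed


def keyed_diff(left, right, key, columns):
    """Compare two row lists by key (hypothetical helper).

    key is a list of column names; columns lists the columns to compare.
    """
    l = index_by_key(left, key)
    r = index_by_key(right, key)
    # Rows present on both sides count as changed if any compared column differs.
    changed = sorted(
        k for k in l.keys() & r.keys()
        if any(l[k][c] != r[k][c] for c in columns)
    )
    return {
        "row_count_delta": len(r) - len(l),  # assumed sign: right minus left
        "left_only": sorted(l.keys() - r.keys()),
        "right_only": sorted(r.keys() - l.keys()),
        "changed": changed,
    }


old = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
new = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
print(keyed_diff(old, new, key=["id"], columns=["status"]))
```

Keys are kept as tuples so that composite keys such as `id1,id2` work the same way as a single key column.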
parq merge INPUT1 INPUT2 OUTPUT
parq merge chunks/*.parquet merged.parquet
parq merge chunks/*.parquet merged.parquet --force
Merges multiple compatible input files into a single output file. The last positional argument is the output path.
Notes:
- schemas must be identical or safely unifiable by Arrow
- by default, existing output files raise an error; use --force/-F to overwrite
- output format is inferred from the output suffix
- a live progress bar is shown during the merge
Global options:
- --version, -v: show version information
- --output, -o: select output format (rich|plain|json)
- --delimiter, -d: field delimiter for CSV/TSV input (default: ,); .tsv files default to \t automatically
- --sheet: XLSX sheet name or 0-based index to read (default: active sheet)
- --help: show command help
Available output modes:
- rich: human-friendly terminal rendering
- plain: low-overhead tabular output for shell pipelines
- json: machine-readable structured output
Examples:
parq meta data.parquet --output json
parq --output plain stats data.csv
parq --delimiter ";" head semicolon_data.csv
parq --sheet "Sales" head report.xlsx
parq diff left.parquet right.parquet --key id --summary-only --output json
On Windows terminals that cannot safely render emoji or extended characters, Rich headings automatically fall back to a safe plain style instead of crashing.

- Parquet metadata, row counts, and previews use Arrow metadata and row-group shortcuts where available.
- CSV tail uses a fixed-size column window instead of materializing every row as Python dicts.
- CSV/XLSX split --record-count streams in one pass.
- meta --fast is the best option when you need quick metadata from large CSV/XLSX inputs.
- XLSX schema inference samples the first 1000 rows instead of scanning the entire sheet up front.
For repeated heavy workflows, converting large CSV/XLSX files to Parquet is still the best path for throughput.
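The bounded-memory idea behind the CSV tail note can be illustrated with a fixed-size buffer. The sketch below uses `collections.deque` with `maxlen`, which is a simpler stand-in and not the column window parq is described as using; `tail_rows` is a hypothetical name.

```python
import csv
import io
from collections import deque


def tail_rows(text, n=5):
    """Return (header, last n data rows) of CSV text in one pass.

    Hypothetical helper; memory use is bounded by n rows, not the file size.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        raise ValueError("Empty CSV file")
    # A deque with maxlen drops the oldest row whenever a new one arrives.
    window = deque(reader, maxlen=n)
    return header, list(window)


print(tail_rows("id\n1\n2\n3\n4\n", n=2))
```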
Install development dependencies:
uv sync --extra dev
or:
pip install -e ".[dev]"
Useful commands:
python -m parq --help
pytest -m "not performance"
pytest tests/test_performance.py -m performance -q -s
ruff check parq tests
ruff check --fix parq tests
pytest --cov=parq --cov-report=html
Implemented:
- metadata and schema inspection
- head and tail preview
- row counting
- file splitting (with progress bar, --force overwrite)
- column statistics (numeric + string cardinality/top-values, --top-n)
- format conversion (with progress bar, --force overwrite)
- keyed dataset diff
- compatible file merge (with progress bar, --force overwrite)
- TSV auto-detection and custom delimiter support (--delimiter)
- XLSX multi-sheet selection (--sheet)
Planned improvements are now centered on deeper performance tuning, richer diff workflows, and broader reporting capabilities rather than adding the core commands from scratch.