HyperQuant is an open-source, memory-first compression stack for teams that are hitting a real resident-memory limit in vector, long-context, replay, or session-state workloads.
It is not positioned as a universal “better quantizer.” The product goal is narrower and more useful:
reduce resident memory in a way that can be measured, reproduced, and rolled out safely.
Most infrastructure teams do not need another synthetic compression demo. They need a clear answer to one of these questions:
- Why does resident state cap session density per node?
- How many sessions fit into the current RAM budget if only a hot window stays resident?
- Can long-context artifacts be tiered out of RAM without hiding quality loss?
- Can route selection stay explicit instead of silently over-claiming wins?
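The session-density question above is arithmetic once the hot-window shape is fixed. A minimal back-of-envelope sketch (the function name and formula are illustrative, not the hyperquant API):

```python
# Back-of-envelope session-density model (illustrative; not the hyperquant API).
# Resident bytes per session = hot pages kept in RAM x tokens per page
#                              x values per token x bytes per value.

def sessions_per_node(ram_budget_bytes: int,
                      hot_pages: int,
                      page_size_tokens: int,
                      values_per_token: int,
                      value_bytes: int) -> int:
    """How many sessions fit if only the hot window stays resident."""
    per_session = hot_pages * page_size_tokens * values_per_token * value_bytes
    return ram_budget_bytes // per_session

# 64 GiB budget; 8 hot pages of 64 tokens; 128 values per token at 2 bytes each.
budget = 64 * 1024**3
print(sessions_per_node(budget, hot_pages=8, page_size_tokens=64,
                        values_per_token=128, value_bytes=2))  # -> 524288
```

The parameters mirror the `resident-plan` CLI flags shown in the quick start (`--hot-pages`, `--page-size`, `--runtime-value-bytes`); the real planner accounts for metadata and side-channel overhead that this sketch omits.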
HyperQuant ships four explicit routes:
- `conservative_codebook`: trained codebook route for arbitrary numeric tensors.
- `vector_codec`: training-free rotated-scalar route for vector-like data, with a residual rescue side-channel for the hardest coefficients.
- `context_codec`: page-aware route for structured long-context data, with an explicit fail-closed contract.
- `resident_tier`: resident-memory planning and tiered artifact store for bounded hot-window serving.
These routes are deliberately separate. The repository is easier to trust when it says where each path fits.
The main reason to evaluate HyperQuant is not headline ratio. It is operating leverage:
- fewer resident bytes per session,
- higher session density under a fixed RAM budget,
- reproducible route-level trade-offs,
- offline tiered artifacts that can be reopened, verified, and read lazily by slice.
The flagship path is resident_tier. The other routes exist to make that path measurable and explainable.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
pytest -q
```

Generate a live-like dataset and project resident memory:
```shell
python -m hyperquant generate-live-data --scenario online --output /tmp/online.npy --n-vectors 16384 --dim 128
python -m hyperquant resident-plan --input /tmp/online.npy --page-size 64 --group-size 128 --residual-topk 1 --hot-pages 8 --active-window-tokens 256 --concurrent-sessions 8 --runtime-value-bytes 2
```

Build, verify, and read from a tiered resident store:
```shell
python -m hyperquant build-resident-store --input /tmp/online.npy --output /tmp/resident_store --page-size 64 --group-size 128 --residual-topk 1 --hot-pages 8
python -m hyperquant verify-resident-store --store /tmp/resident_store
python -m hyperquant read-resident-slice --store /tmp/resident_store --start 0 --end 256 --output /tmp/window.npy
```

Compare the vector route against the dense rotation baseline:
```shell
python -m hyperquant vector-benchmark --input /tmp/online.npy --bits 3 --group-size 128 --residual-topk 1 --iterations 5 --warmup 1
python -m hyperquant dense-baseline-benchmark --input /tmp/online.npy --bits 3 --group-size 128 --residual-topk 1 --iterations 5 --warmup 1
```

Evaluate the structured context route with a fail-closed contract:
```shell
python -m hyperquant context-benchmark --input /tmp/online.npy --with-guarantee --fail-closed
```

Regenerate the checked-in evidence pack:
```shell
python scripts/build_proof_pack.py
```

Start with:

- `resident-plan`
- `build-resident-store`
- `verify-resident-store`
- `read-resident-slice`
- `resident-benchmark`
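The lazy slice read that `read-resident-slice` performs can be illustrated with plain numpy memory mapping. This is a sketch of the idea only; the real resident store uses its own paged, quantized layout rather than a raw `.npy` file:

```python
import numpy as np

# Write a full array to disk once, then reopen it lazily (illustrative sketch;
# the real resident store uses its own paged, quantized layout).
data = np.random.default_rng(0).standard_normal((16384, 128)).astype(np.float32)
np.save("/tmp/online_demo.npy", data)

# mmap_mode="r" maps the file instead of loading it; only touched pages are
# faulted into RAM, so reading a slice costs bytes proportional to the slice.
lazy = np.load("/tmp/online_demo.npy", mmap_mode="r")
window = np.asarray(lazy[0:256])          # materialize just the hot window
assert window.shape == (256, 128)
assert np.array_equal(window, data[0:256])
```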
Start with:
- `vector-benchmark`
- `dense-baseline-benchmark`
- `route-benchmark`
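The general idea behind group-wise scalar quantization with a residual top-k rescue channel can be sketched in a few lines of numpy. This is a simplified illustration, not the actual `vector_codec` (which also applies a rotation and packs codes):

```python
import numpy as np

def compress_group(x: np.ndarray, bits: int = 3, topk: int = 1):
    """Affine scalar quantization of one group, plus an exact side channel
    for the topk coefficients with the largest quantization error."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = max((hi - lo) / levels, 1e-12)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    approx = codes * scale + lo
    # Residual rescue: remember the hardest coefficients exactly.
    worst = np.argsort(np.abs(x - approx))[-topk:]
    return codes, lo, scale, worst, x[worst]

def decompress_group(codes, lo, scale, worst, exact):
    out = codes * scale + lo
    out[worst] = exact                     # patch the rescued coefficients
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(128)
restored = decompress_group(*compress_group(g, bits=3, topk=8))
err = float(np.max(np.abs(g - restored)))
print(f"max abs error: {err:.4f}")
```

With 3 bits the worst-case rounding error is half a quantization step; the rescue channel removes the outlier coefficients that would otherwise dominate reconstruction error.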
Start with:
- `context-benchmark --with-guarantee --fail-closed`
- `route-benchmark`
- `context-audit-input`
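The fail-closed contract means the codec refuses an input it cannot compress within the stated bound, instead of silently degrading. A minimal sketch with hypothetical names, using float16 down-conversion as a stand-in for the real page codec:

```python
import numpy as np

class GuaranteeViolation(Exception):
    """Raised when a compressed page cannot meet its stated bound."""

def compress_fail_closed(page: np.ndarray, max_abs_error: float) -> np.ndarray:
    """Compress (here: cast to float16) and verify the round-trip bound
    before accepting the artifact; fail closed instead of hiding loss."""
    artifact = page.astype(np.float16)
    err = float(np.max(np.abs(page - artifact.astype(np.float64))))
    if err > max_abs_error:
        raise GuaranteeViolation(f"round-trip error {err:.3g} > {max_abs_error:.3g}")
    return artifact

smooth = np.linspace(0.0, 1.0, 64)        # well-behaved page: accepted
ok = compress_fail_closed(smooth, max_abs_error=1e-3)

spiky = smooth * 1e6                      # overflows float16: refused
try:
    compress_fail_closed(spiky, max_abs_error=1e-3)
except GuaranteeViolation as e:
    print("refused:", e)
```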
The repository ships checked-in evidence under docs/evidence/ and a script that regenerates it from source.
Start with:
- `docs/evidence/route-benchmark.md`
- `docs/evidence/resident-benchmark.md`
- `docs/evidence/capacity-example.md`
- `GET /healthz`
- `GET /metrics`
- `POST /v1/codebook/compress`
- `POST /v1/codebook/decompress`
- `POST /v1/vector/compress`
- `POST /v1/vector/decompress`
- `POST /v1/context/compress`
- `POST /v1/context/decompress`
- `POST /v1/resident/plan`
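A client can drive planning over HTTP. The sketch below builds a `POST /v1/resident/plan` request with the standard library; the payload field names are assumptions mirroring the CLI flags, not the documented request schema:

```python
import json
import urllib.request

# Hypothetical payload for POST /v1/resident/plan -- field names are
# illustrative assumptions based on the CLI flags, not the real schema.
payload = {
    "page_size": 64,
    "group_size": 128,
    "residual_topk": 1,
    "hot_pages": 8,
    "active_window_tokens": 256,
    "concurrent_sessions": 8,
    "runtime_value_bytes": 2,
}

def plan_request(base_url: str) -> urllib.request.Request:
    """Build the request; send it with urllib.request.urlopen(req)."""
    return urllib.request.Request(
        f"{base_url}/v1/resident/plan",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = plan_request("http://localhost:8080")
print(req.full_url, req.method)
```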
What the repository demonstrates:

- the four routes work end to end;
- resident-memory planning is reproducible from source;
- the resident store can be built, reopened, verified, and read lazily by slice;
- route trade-offs can be measured with the included harness;
- the context route can run with a fail-closed contract.
What it does not claim:

- universal superiority across every model or runtime;
- end-task quality on your production traces without a pilot;
- vendor-runtime integration such as auth, admission control, or multi-node cache coherence;
- a drop-in replacement for serving engines or vector indexes.
That boundary is deliberate. It is more credible than pretending to be a universal winner.
- `docs/problem-solution-profit.md`
- `docs/architecture.md`
- `docs/evaluation.md`
- `docs/guarantees.md`
- `docs/pilot-playbook.md`
Licensed under the Apache License, Version 2.0.
Copyright © 2026 Сацук Артём Венедиктович (Satsuk Artem).
Synchatica is the project brand. See LICENSE and NOTICE.