sovereign embedded vector database. single-file .nest container with content-addressable citations, reproducible builds, and offline-first model verification.
one file carries chunks, embeddings, source spans, optional ANN and BM25 indices, and a search contract. all hash-verified, memory-mapped, SIMD-friendly.
python builds. rust serves. nest ships. agents/llms read, that's it.
a portable binary format plus a runtime that opens it. the format is open and frozen at v1. the runtime is rust, mmaps the file, runs SIMD dot products straight off disk, and returns hits with stable citation IDs. there is no server. no API call. no central index. ship a curated knowledge base with the application instead of pointing at a hosted vector DB. some run on phones, some on edge nodes, some inside regulated environments where queries must not leave the box.
the file is the database.
(∂μfμν = jν)
four properties, all enforced by the format itself, not by policy.
self-contained. a .nest file is the entire knowledge base. chunks, embeddings, source spans, search contract, indices. you copy it like you copy a SQLite db. no schema migrations, no companion files, no external service to look up.
cryptographically verifiable. every section has its own SHA-256 over the physical bytes. the file has a SHA-256 footer. the canonical content has a third hash, computed over the decoded bytes, that survives compression and lets two builds of the same logical content prove they are the same content. every hit comes back with a nest://content_hash/chunk_id citation that reopens the exact byte span it came from.
reproducible. same chunks plus same model fingerprint plus reproducible=True produce byte-identical files. two operators on two machines build the same file_hash. this is what makes the citation URI useful: it points at content, not at a copy.
offline-first by construction. the runtime never opens a socket. queries are answered from mmap. the embedding model the corpus was built with is identified by a granular fingerprint (config plus tokenizer plus weights plus pooling plus dim plus normalize), and nest search-text --model-path /path/to/local/snapshot lets you operate without ever touching huggingface. mismatches between the runtime model and the corpus model fail loudly, not silently.
put together: nobody outside the operator's machine has to be online, trusted, or even reachable for the database to work. that is what "sovereign" means here.
python builds, rust serves. the .nest file is the contract between them. one writer pipeline produces a deterministic container; one mmap-backed runtime opens it and answers queries through a SIMD-dispatched search path. the cli and the python api are thin surfaces over the same runtime.
sequenceDiagram
autonumber
participant B as builder.py
participant F as nest-format
participant N as .nest file
participant R as nest-runtime
participant E as embed_query.py
participant Q as nest cli / api
note over B,N: build · offline · reproducible
B->>B: chunk + embed + model_hash
B->>F: nest.build(chunks, model_hash)
F->>N: deterministic emit + four hashes
note over R,Q: query · offline · mmap
Q->>E: search-text spawns embedder
E-->>Q: query vector + model_hash
Q->>R: search(vector)
R->>N: mmap open + validate hashes
R->>R: model_hash vs manifest
R->>R: hnsw / bm25 candidates
R->>R: mandatory exact cosine rerank
R-->>Q: hits + nest://content_hash/chunk_id
the sequence traces the two flows that define nest. build (python, offline, reproducible) chunks, embeds, fingerprints the model, and writes a deterministic .nest file with four hashes. query (rust, offline, mmap) opens the file, validates the embedder's model_hash against the manifest, gathers hnsw or bm25 candidates, and reranks them with exact cosine before returning citations. the .nest file is the only artifact that crosses between python and rust.
four crates plus a python tooling layer:
nest-formatowns the frozen v1 container: layout, manifest, sections, encodings, hashes.nest-runtimeowns the mmap, the simd dispatcher, hnsw, bm25, and the search path with mandatory exact rerank.nest-cliis a thin clap surface with eight subcommands.nest-pythonis the pyo3 bridge that exposes the runtime to python.python/holds the writer pipeline, model fingerprint, and the query-time embedder used bysearch-text.
full visual map: doc/arc/arc.mmd. human reference: doc/arc/arc.md. machine map: doc/arc/arc.yaml.
requires rust edition 2024 and python 3.12+.
cargo build --release --workspace
cp target/release/lib_nest.dylib python/_nest.so # macOS
cp target/release/lib_nest.so python/_nest.so # linux
nest inspect <file> header, manifest, hashes (--json available)
nest validate <file> full integrity check
nest stats <file> sizes, counts, dtype, model, simd backend
nest search <file> <qvec> -k K exact top-k, query is a JSON array of f32
nest search-ann <file> <qvec> -k K --ef N HNSW path with exact rerank
nest search-text <file> "query" -k K [--model-path PATH] [--skip-model-hash-check]
nest benchmark <file> -q N -k K [--ann EF] [--madvise-cold]
nest cite <file> nest://<content_hash>/<chunk_id>
search takes a vector. search-text shells out to a python embedder, validates the model fingerprint against the manifest, then routes to the declared index_type (exact, hnsw, hybrid). a model mismatch fails with a typed error, never silently.
import sys; sys.path.insert(0, "python"); import nest
db = nest.open(path)
hits = db.search(qvec, k=5)
hits[0].citation_id # nest://content_hash/chunk_id
hits[0].source_uri
hits[0].offset_start, hits[0].offset_end
hits[0].score # real cosine, not an ANN proxy
# with HNSW present:
hits = db.search_ann(qvec, k=5, ef=100)
# hybrid (BM25 union vector, then exact rerank):
hits = db.search_hybrid(qvec, query_text, k=5, candidates=200)
db.validate()
db.inspect()
build a file:
nest.build(
output_path,
embedding_model,
embedding_dim,
chunker_version,
model_hash, # sha256(canonical_json(fingerprint)), see python/model_fingerprint.py
chunks, # [{canonical_text, source_uri, byte_start, byte_end, embedding}]
reproducible=True,
preset="exact", # "compressed" | "tiny" | "hybrid"
)
or via Pipeline in python/builder.py with chunker, SQLite cache, and auto-validate.
| preset | text encoding | embeddings | ANN | BM25 | size ratio | recall@10 |
|---|---|---|---|---|---|---|
exact |
raw | float32 | no | no | 1.000 | 1.0000 |
compressed |
zstd | float16 | no | no | 0.350 | 1.0000 |
tiny |
zstd | int8 | yes | no | 0.283 | 0.9920 |
hybrid |
zstd | float32 | yes | yes | 0.668 | 1.0000 |
numbers measured on a 30,725-chunk PT-BR corpus, dim=384, NEON. tiny is the smallest distributable form, hybrid recovers lexical recall on rare terms, exact is the recall-1.0 ground truth.
flat exact search remains the recall-1.0 baseline.
added on top, all inside v1 (no format break):
- wire encodings 1, 2, 3: zstd for text sections, float16 and int8 for embeddings.
content_hashis over decoded bytes, so citations are stable across encodings. - HNSW (section 0x07) and BM25 (section 0x08) as optional sections, with mandatory exact rerank so the
scorefield is real cosine, not an ANN proxy. - SIMD dispatcher: AVX2 on x86_64, NEON on aarch64, scalar fallback.
NEST_FORCE_SCALAR=1for A/B benchmarks. accumulators are always f32 regardless of dtype. nest search-textwith reproducible model fingerprint and--model-pathfor fully offline operation. supersedes the v1 "vector only" CLI restriction.madvise-coldbenchmark for first-hit-after-boot latency bound.- four presets that bundle the above into named tradeoffs.
builds with reproducible=True are byte-identical for the same input.
cargo test --release --workspace
python tests/test_e2e.py
python tests/test_builder.py
python tests/test_search_text_model_hash.py
./scripts/release_check.sh
release_check.sh runs the full pipeline: build, test, clippy, fmt, line-count guard, python tests, ruff, measure_presets.py, compare_measure.py against the committed baseline. exits non-zero on any failure.
dat/corpus_next.v1.nest is built from seven public PT-BR fake-news datasets unified by truw: FakeBr-hf, FakeTrue.Br-hf, Fake.br-Corpus, FakeRecogna, FakeTrue.Br, factck-br, and portuguese-fake-news-classifier-bilstm-combined. raw datasets and the embedding cache live under dat/demo/ as local-only, gitignored assets. rebuild from source with:
python python/tools/nest_build_corpus.py
with reproducible=True (the script default) two operators get byte-identical file_hash. see dat/demo/README.md for download commands, what each subdirectory contains, the schema of the v2 truw canonical CSV, and per-dataset license notes.
dat/demo/README.mdfor what each upstream dataset is and how to rebuild the corpusdoc/arc/arc.mdfor architecture, binary layout, API surface, errors, and versioningdoc/arc/arc.yamlfor the machine-readable architecture map used by agents and toolingdoc/arc/arc.mmdfor the mermaid sequence diagram of the build and query flowsdoc/usage.mdfor the eight commands, presets, offline mode, citationsdoc/changelog.mdfor v0.1 to v0.2 deltas
made it simple, but signifcant. Brenner Cruvinel.
MIT.
