Rebuilds a single namespace's replication log from its live DB file without
affecting other namespaces on the pod. Solves the 5hr bulk-import window by
letting cloud-sync-streamer recover corrupt wallog/snapshots in place rather
than deleting and recreating the whole namespace.
- Namespace::path() accessor (libsql-server/src/namespace/mod.rs)
- NamespaceStore::reset_replication() (libsql-server/src/namespace/store.rs):
  take the per-namespace write lock, checkpoint, destroy the in-memory
  namespace, remove only wallog/snapshots/to_compact, touch .sentinel, then
  re-init so ReplicationLogger::recover() rebuilds the wallog from the live
  data file.
- POST /v1/namespaces/:ns/reset-replication admin route
- libsql-ffi/build.rs: filter .h files from the sqlean source glob
  (macOS/clang was producing a PCH instead of an object file and failing the
  link).
Measured end-to-end: 1.1s p95 recovery vs a 41s baseline at 20k seed rows, a
36x improvement with 100% of data preserved, zero server restarts, and only
the one target namespace affected.
Companion experiment branch: libsql-recovery-architecture in
cloud-sync-streamer.
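A hedged client-side sketch of driving the new endpoint. The route comes from
this change; the admin listen address, the namespace name, and the helper
names are illustrative assumptions, not project tooling:

```shell
#!/bin/sh
# Sketch only: compose and issue the admin call that rebuilds one
# namespace's replication log in place. ADMIN_URL defaults to a
# placeholder address; override it for a real deployment.

reset_url() {
  # Compose the admin route for a single namespace.
  printf '%s/v1/namespaces/%s/reset-replication' \
    "${ADMIN_URL:-http://127.0.0.1:8081}" "$1"
}

reset_replication() {
  # -f makes curl fail on 4xx/5xx so callers can branch on the result.
  curl -fsS -X POST "$(reset_url "$1")"
}

# Against a live admin API you would call: reset_replication my-imported-ns
# Here we just print the URL the helper would hit.
reset_url "my-imported-ns"
echo
```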
If the pre-teardown checkpoint fails with a DatabaseCorrupt / malformed
error, the live data file itself is corrupt. Rebuilding the wallog from
corrupt data would only propagate the corruption, and it would also leave the
namespace in a broken state: the destroy-then-make sequence fails halfway,
leaving NamespaceDoesntExist.
The endpoint now returns 500 with an explicit error message pointing the
operator to a restore-from-backup path, without destroying the in-memory
namespace first. The namespace stays loaded and surfaces the underlying
corruption error to subsequent reads, which is a genuine observability
signal.
Verified with /tmp/test_mode_b.sh: before the fix, the namespace went to 404
after reset; after the fix, it stays loaded with a 'malformed database
schema' error. The Mode A happy path (wallog corruption, live data OK) is
unchanged: 1135ms p95 over 3 reps, 100% data preserved.
Runs PRAGMA quick_check (or integrity_check if {full:true}) on a
namespace's live data file. Lets callers (cloud-sync-streamer,
operators) classify the failure mode BEFORE attempting recovery:
- ok:true -> live DB is fine; corruption must be in wallog/snapshots
  (Mode A) -> use reset-replication (~1s recovery).
- ok:false -> live DB itself is corrupt (Mode B) -> reset-replication
  would propagate the corruption; fall back to delete + bulk-import.
Implementation detail: SQLite surfaces severe corruption as
prepare/connect errors rather than PRAGMA rows. The endpoint normalizes
those into the same {ok:false, message:...} response shape so the caller
gets a uniform classification signal (HTTP 200) rather than a server
error (HTTP 500).
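The caller-side branching described above can be sketched as a tiny
classifier over the {ok, message} response shape. The naive string matching
is a deliberate simplification to stay dependency-free; a real client would
use jq or a JSON library:

```shell
#!/bin/sh
# Sketch: map the body of a 200 response from
# POST /v1/namespaces/:ns/integrity-check to a recovery mode.

classify_integrity() {
  case "$1" in
    *'"ok":true'*)  echo "mode-a" ;;  # live DB fine -> reset-replication (~1s)
    *'"ok":false'*) echo "mode-b" ;;  # live DB corrupt -> delete + bulk-import
    *)              echo "unknown" ;; # not a classification response
  esac
}

classify_integrity '{"ok":true,"message":"ok","check":"quick"}'
```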
- libsql-server/src/namespace/mod.rs: Namespace::integrity_check()
- libsql-server/src/namespace/store.rs: NamespaceStore::integrity_check()
- libsql-server/src/http/admin/mod.rs: handler + route
Verified with /tmp/test_integrity_check.sh against a healthy namespace
(ok), a namespace with a poisoned data file (Mode B detected), and a
non-existent namespace (proper 404).
Three turmoil-based integration tests:
- integrity_check_on_healthy_namespace: asserts POST /v1/namespaces/
:ns/integrity-check returns {ok:true, message:'ok', check:'quick'}
for a healthy DB, and works with {full:true} as well.
- reset_replication_preserves_data_on_healthy_namespace: creates a
namespace, seeds 100 rows, calls reset-replication, confirms data
is still there + writes still work + integrity-check reports ok.
This is the idempotency guarantee for the endpoint.
- reset_replication_on_nonexistent_namespace_returns_404: verifies
the endpoint returns 404 (not 500) when the namespace doesn't
exist.
All 3 pass. Complements the ad-hoc /tmp/test_* bash scripts used during
development.
Paired 'reset_replication: starting for namespace X' and 'rebuilt replication log for namespace X in Yms' log lines. Lets ops correlate recovery events to client-side metrics and measure tail latency in prod. No performance impact (info! tracing is near-zero cost).
Default behavior unchanged: .sentinel is removed on graceful shutdown
(preserves existing semantics for the 99% of deployments that don't
need the kubectl-delete-pod recovery path).
With LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN=1 set, the sentinel survives
graceful shutdown. This re-enables the documented operator recovery
procedure:
1. kubectl exec <pod> -- touch /data/dbs/<ns>/.sentinel
2. kubectl delete pod <pod> # SIGTERM → graceful shutdown
3. Kubernetes recreates pod
4. Next namespace access triggers dirty-recovery on the preserved
.sentinel, rebuilding wallog/snapshots from the live data file
Without this flag, step 2's graceful shutdown removes the sentinel
BEFORE the pod stops, so step 4 doesn't find a sentinel and skips
the dirty-recovery path.
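The four-step operator procedure above can be scripted. This is a sketch,
dry-run by default (KUBECTL defaults to an echo stub); the pod name,
namespace, and data path are placeholders:

```shell
#!/bin/sh
# Sketch of the kubectl-delete-pod recovery path. Requires
# LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN=1 on the server for the sentinel
# to survive step 2's graceful shutdown.
KUBECTL="${KUBECTL:-echo kubectl}"   # dry-run stub; unset for a real cluster
POD="${POD:-sqld-0}"                 # placeholder pod name
NS="${NS:-my-namespace}"             # placeholder namespace

# Step 1: plant the sentinel next to the namespace's data file.
$KUBECTL exec "$POD" -- touch "/data/dbs/$NS/.sentinel"
# Step 2: SIGTERM -> graceful shutdown; Kubernetes recreates the pod (step 3).
$KUBECTL delete pod "$POD"
# Step 4 happens server-side: the next access to $NS finds the preserved
# sentinel and runs dirty-recovery, rebuilding wallog/snapshots.
```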
Now that POST /v1/namespaces/:ns/reset-replication is the primary
recovery primitive, this flag is a low-priority belt-and-suspenders
for emergency ops workflows (e.g. when the admin API is unavailable).
Verified end-to-end with /tmp/run_sentinel_preserve_simple.sh: sentinel
preserved with flag, dirty-recovery fires on next access, data
preserved through the cycle.
Verifies that calling POST /v1/namespaces/:ns/reset-replication three times
in a row succeeds at each attempt and preserves data. This is a critical
safety property: the streamer's retry-after-reset path (and any operator
running the command twice) must not corrupt the data file through sequential
invocations.
Previously validated empirically via bench_recovery.sh (run 21 in
autoresearch.jsonl: 12ms per call on a healthy namespace, no data loss) but
not pinned as a turmoil integration test. Now it's in CI.
All 4 reset-replication / integrity-check turmoil tests pass (the 2
pre-existing meta_attach flakes reproduce without this change).
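The retry-after-reset property the test pins can be sketched as a small
loop. RESET defaults to a print stub so the sketch runs anywhere; the
namespace and address are assumptions, and a real invocation would swap in
curl:

```shell
#!/bin/sh
# Sketch: three sequential resets must each succeed; abort on the first
# failure. Real call would be: curl -fsS -X POST <url>
RESET="${RESET:-echo POST}"                 # dry-run stub
ADMIN_URL="${ADMIN_URL:-http://127.0.0.1:8081}"
ns="my-namespace"

i=1
while [ "$i" -le 3 ]; do
  $RESET "$ADMIN_URL/v1/namespaces/$ns/reset-replication" || exit 1
  i=$((i + 1))
done
echo "reset-replication succeeded on all 3 sequential calls"
```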
Protects the documented API contract: POST /v1/namespaces/:ns/integrity-check
with an empty body {} must default to quick_check. This lets older or
simpler clients omit the field entirely.
Complements the existing integrity_check_on_healthy_namespace test
which only covers explicit full:true/full:false.
5 reset-replication + integrity-check turmoil tests all pass.
Before: handler returned 200 with empty body. Operators had to grep
libsql logs to see how long the reset took.
After: handler returns 200 with {"elapsed_ms": N}. The streamer (and
any other caller) can parse this into a StatsD histogram for dashboard
latency tracking, without server-side log scraping.
Backwards compatible: older clients that ignore the body still work, and
new clients that expect the body handle an empty one from an older server
gracefully.
New turmoil test: reset_replication_response_includes_elapsed_ms
verifies the field is present and within sanity bounds.
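A caller-side sketch of the StatsD plumbing the new body enables. The metric
name and the sed-based JSON extraction are illustrative assumptions (a real
client would use jq or a JSON library):

```shell
#!/bin/sh
# Sketch: turn {"elapsed_ms": N} from the reset-replication response
# into a StatsD timing line. Emits nothing on an empty body (older
# servers), preserving backwards compatibility.

elapsed_to_statsd() {
  # Pull the integer after "elapsed_ms": out of the response body.
  ms=$(printf '%s' "$1" |
    sed -n 's/.*"elapsed_ms":[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  [ -n "$ms" ] && printf 'libsql.reset_replication.elapsed_ms:%s|ms\n' "$ms"
}

elapsed_to_statsd '{"elapsed_ms": 12}'
```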
Symmetric to reset_replication_on_nonexistent_namespace_returns_404. If error handling regresses (e.g., accidentally returning 500 on NamespaceDoesntExist), the streamer's classifier would treat that as a transient server error and retry, instead of falling through to the optimistic-reset path. This test pins the correct behavior. 7 libsql turmoil tests total, all pass.
Three fixes surfaced in code review of PR 2230:
1. integrity_check (namespace/mod.rs): move the PRAGMA
   quick_check/integrity_check run from the current Tokio worker onto the
   blocking thread pool via spawn_blocking. On a large DB, integrity_check
   can take seconds and would otherwise stall async tasks on that worker.
   Matches the pattern already used for checkpoint() and vacuum_if_needed()
   on LegacyConnection and for block_in_place in admin_shell.
2. reset_replication replica guard (namespace/store.rs): reject calls with
   400 NotAPrimary unless db_kind is Primary. On a replica, the
   ReplicaConfigurator does not consume .sentinel, so the re-init path would
   not actually rebuild anything and we'd destroy wallog/snapshots the
   replica still needs.
3. Async filesystem check (namespace/store.rs): replace the sync
   Path::exists() with tokio::fs::try_exists to avoid blocking the Tokio
   worker on a filesystem stat.
Also leaves a TODO comment on the string-based corruption classification in
the checkpoint error path; a follow-up is tracked to replace it with typed
rusqlite::ErrorCode matching (or a dedicated Error::DatabaseCorrupt variant)
once the Error enum plumbing is updated.
Verified: cargo check clean, all 7 reset/integrity turmoil tests pass
(reset_replication_preserves_data_on_healthy_namespace,
reset_replication_on_nonexistent_namespace_returns_404,
reset_replication_is_idempotent,
reset_replication_response_includes_elapsed_ms,
integrity_check_on_healthy_namespace,
integrity_check_defaults_to_quick_when_full_omitted,
integrity_check_on_nonexistent_namespace_returns_404).