
Reset replication endpoint #24

Draft

sle-c wants to merge 11 commits into main from reset-replication-endpoint

Conversation


@sle-c sle-c commented Apr 22, 2026

No description provided.

sle-c added 11 commits April 22, 2026 01:03

Rebuilds a single namespace's replication log from its live DB file
without affecting other namespaces on the pod. Eliminates the 5-hour
bulk-import window by letting cloud-sync-streamer recover corrupt
wallog/snapshots in place rather than deleting and recreating the
whole namespace.

- Namespace::path() accessor (libsql-server/src/namespace/mod.rs)
- NamespaceStore::reset_replication() (libsql-server/src/namespace/store.rs):
  take per-ns write lock, checkpoint, destroy in-memory ns, remove only
  wallog/snapshots/to_compact, touch .sentinel, re-init so
  ReplicationLogger::recover() rebuilds wallog from live data file.
- POST /v1/namespaces/:ns/reset-replication admin route
- libsql-ffi/build.rs: filter .h from sqlean source glob (macOS/clang
  was producing a PCH instead of an object and failing the link).
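The selective teardown can be sketched with plain filesystem operations (a Python stand-in for the Rust code in store.rs; the flat layout of wallog/snapshots/to_compact/.sentinel inside the namespace directory is an assumption for illustration, not a spec):

```python
import os
import shutil
import tempfile

def reset_replication_files(ns_dir: str) -> None:
    """Remove only replication state; the live data file is never touched."""
    for name in ("wallog", "snapshots", "to_compact"):
        path = os.path.join(ns_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)   # snapshot dirs are rebuilt on re-init
        elif os.path.exists(path):
            os.remove(path)       # wallog may be a single file
    # Touching .sentinel marks the namespace dirty so the next init runs
    # ReplicationLogger::recover() against the live data file.
    open(os.path.join(ns_dir, ".sentinel"), "a").close()

# Demo on a throwaway layout
ns = tempfile.mkdtemp()
with open(os.path.join(ns, "data"), "w") as f:
    f.write("live rows")
os.mkdir(os.path.join(ns, "snapshots"))
open(os.path.join(ns, "wallog"), "w").close()
reset_replication_files(ns)
print(sorted(os.listdir(ns)))  # ['.sentinel', 'data']
```

The real endpoint additionally takes the per-namespace write lock and checkpoints before any of this runs.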

Measured end-to-end: 1.1s p95 recovery vs 41s baseline at 20k seed rows.
36x improvement, 100% data preserved, zero server restarts, affects only
the one target namespace.

Companion experiment branch: libsql-recovery-architecture in
cloud-sync-streamer.

If the pre-teardown checkpoint fails with a DatabaseCorrupt / malformed
error, the live data file itself is corrupt. Rebuilding the wallog from
corrupt data would just propagate the corruption AND leave the namespace
in a broken state (the destroy-then-make sequence fails halfway, leaving
NamespaceDoesntExist).

Now the endpoint returns 500 with an explicit error message pointing the
operator to a restore-from-backup path, without destroying the in-memory
namespace first. The namespace stays loaded and returns the underlying
corruption error to subsequent reads — a true observability signal.
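A minimal sketch of that ordering, with the stdlib sqlite3 module standing in for the Rust checkpoint path (the response shape and message are illustrative, not the server's exact output):

```python
import os
import sqlite3
import tempfile

def reset_replication(db_path: str) -> dict:
    """Probe the live file first; abort with nothing destroyed if corrupt."""
    try:
        con = sqlite3.connect(db_path)
        con.execute("PRAGMA quick_check").fetchall()
        con.close()
    except sqlite3.DatabaseError as exc:
        # Live data file is corrupt: return 500 and leave the namespace
        # loaded so subsequent reads keep surfacing the underlying error.
        return {"status": 500,
                "message": f"live data file corrupt, restore from backup: {exc}"}
    # ... wallog/snapshots teardown + re-init would happen here ...
    return {"status": 200}

bad = tempfile.mkstemp()[1]
with open(bad, "wb") as f:
    f.write(b"definitely not a sqlite file")
print(reset_replication(bad)["status"])  # 500
```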

Verified with /tmp/test_mode_b.sh: before fix, namespace went to 404
after reset; after fix, namespace stays loaded with 'malformed database
schema' error. Mode A happy path (wallog corruption, live data OK)
unchanged: 1135ms p95 over 3 reps, 100% data preserved.

Runs PRAGMA quick_check (or integrity_check if {full:true}) on a
namespace's live data file. Lets callers (cloud-sync-streamer,
operators) classify the failure mode BEFORE attempting recovery:

  ok:true  -> live DB is fine; corruption must be in wallog/snapshots
              (Mode A) -> use reset-replication (~1s recovery).
  ok:false -> live DB itself is corrupt (Mode B) -> reset-replication
              would propagate the corruption; fall back to delete +
              bulk-import.

Implementation detail: SQLite surfaces severe corruption as
prepare/connect errors rather than PRAGMA rows. The endpoint normalizes
those into the same {ok:false, message:...} response shape so the caller
gets a uniform classification signal (HTTP 200) rather than a server
error (HTTP 500).
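Sketched in Python with the stdlib sqlite3 module (the real handler is Rust; the response field names follow the commit message, everything else is illustrative):

```python
import sqlite3
import tempfile

def integrity_check(db_path: str, full: bool = False) -> dict:
    """Always returns the same {ok, message, check} shape (HTTP 200)."""
    pragma = "integrity_check" if full else "quick_check"
    check = "full" if full else "quick"
    try:
        con = sqlite3.connect(db_path)
        rows = [r[0] for r in con.execute(f"PRAGMA {pragma}").fetchall()]
        con.close()
    except sqlite3.DatabaseError as exc:
        # Severe corruption surfaces as a prepare/connect error, not as
        # PRAGMA rows; normalize it into the same response shape.
        return {"ok": False, "message": str(exc), "check": check}
    return {"ok": rows == ["ok"], "message": "; ".join(rows), "check": check}

db = tempfile.mkstemp()[1]
con = sqlite3.connect(db)
con.execute("CREATE TABLE t(x)")
con.commit()
con.close()
print(integrity_check(db))  # {'ok': True, 'message': 'ok', 'check': 'quick'}
```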

- libsql-server/src/namespace/mod.rs: Namespace::integrity_check()
- libsql-server/src/namespace/store.rs: NamespaceStore::integrity_check()
- libsql-server/src/http/admin/mod.rs: handler + route

Verified with /tmp/test_integrity_check.sh against a healthy namespace
(ok), a namespace with a poisoned data file (Mode B detected), and a
non-existent namespace (proper 404).

Three turmoil-based integration tests:

- integrity_check_on_healthy_namespace: asserts POST /v1/namespaces/
  :ns/integrity-check returns {ok:true, message:'ok', check:'quick'}
  for a healthy DB, and works with {full:true} as well.

- reset_replication_preserves_data_on_healthy_namespace: creates a
  namespace, seeds 100 rows, calls reset-replication, confirms data
  is still there + writes still work + integrity-check reports ok.
  This is the idempotency guarantee for the endpoint.

- reset_replication_on_nonexistent_namespace_returns_404: verifies
  the endpoint returns 404 (not 500) when the namespace doesn't
  exist.

All 3 pass. Complements the ad-hoc /tmp/test_* bash scripts used during
development.

Paired 'reset_replication: starting for namespace X' and 'rebuilt
replication log for namespace X in Yms' log lines. Lets ops correlate
recovery events to client-side metrics and measure tail latency in
prod. No performance impact (info! tracing is near-zero cost).

Default behavior unchanged: .sentinel is removed on graceful shutdown
(preserves existing semantics for the 99% of deployments that don't
need the kubectl-delete-pod recovery path).

With LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN=1 set, the sentinel survives
graceful shutdown. This re-enables the documented operator recovery
procedure:

  1. kubectl exec <pod> -- touch /data/dbs/<ns>/.sentinel
  2. kubectl delete pod <pod>      # SIGTERM → graceful shutdown
  3. Kubernetes recreates pod
  4. Next namespace access triggers dirty-recovery on the preserved
     .sentinel, rebuilding wallog/snapshots from the live data file

Without this flag, step 2's graceful shutdown removes the sentinel
BEFORE the pod stops, so step 4 doesn't find a sentinel and skips
the dirty-recovery path.
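The flag gate reduces to a few lines (Python sketch of the Rust shutdown path; only the env var name and the .sentinel filename come from this change, the function shape is hypothetical):

```python
import os
import tempfile

def on_graceful_shutdown(ns_dir: str) -> None:
    if os.environ.get("LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN") == "1":
        return  # sentinel survives -> next access runs dirty-recovery
    sentinel = os.path.join(ns_dir, ".sentinel")
    if os.path.exists(sentinel):
        os.remove(sentinel)  # default: clean shutdown needs no recovery

ns = tempfile.mkdtemp()
open(os.path.join(ns, ".sentinel"), "w").close()
os.environ.pop("LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN", None)
on_graceful_shutdown(ns)
print(os.path.exists(os.path.join(ns, ".sentinel")))  # False

os.environ["LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN"] = "1"
open(os.path.join(ns, ".sentinel"), "w").close()
on_graceful_shutdown(ns)
print(os.path.exists(os.path.join(ns, ".sentinel")))  # True
```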

Now that POST /v1/namespaces/:ns/reset-replication is the primary
recovery primitive, this flag is a low-priority belt-and-suspenders
for emergency ops workflows (e.g. when the admin API is unavailable).

Verified end-to-end with /tmp/run_sentinel_preserve_simple.sh: sentinel
preserved with flag, dirty-recovery fires on next access, data
preserved through the cycle.

Verifies that calling POST /v1/namespaces/:ns/reset-replication three
times in a row succeeds at each attempt and preserves data.

This is a critical safety property: the streamer's retry-after-reset
path (and any operator running the command twice) must not corrupt
the data file through sequential invocations.

Previously validated empirically via bench_recovery.sh (run 21 in
autoresearch.jsonl: 12ms per call on healthy ns, no data loss) but
was not pinned as a turmoil integration test. Now it's in CI.

All 4 reset-replication / integrity-check turmoil tests pass (the 2
pre-existing meta_attach flakes reproduce without this change).

Protects the documented API contract: POST /v1/namespaces/:ns/integrity-check
with an empty body {} must default to quick_check. This lets older or
simpler clients omit the field entirely.

Complements the existing integrity_check_on_healthy_namespace test
which only covers explicit full:true/full:false.

5 reset-replication + integrity-check turmoil tests all pass.

Before: handler returned 200 with empty body. Operators had to grep
libsql logs to see how long the reset took.

After: handler returns 200 with {"elapsed_ms": N}. The streamer (and
any other caller) can parse this into a StatsD histogram for dashboard
latency tracking, without server-side log scraping.

Backwards compatible: older clients that ignore the body still work.
New clients that expect the body handle an empty body gracefully.
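Both sides of that contract can be sketched as (Python; the handler internals are stubbed, and only the elapsed_ms field name comes from this change):

```python
import json
import time

def reset_replication_handler() -> str:
    """Time the reset and return the duration in the response body."""
    start = time.monotonic()
    # ... checkpoint + teardown + re-init would happen here ...
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return json.dumps({"elapsed_ms": elapsed_ms})

def parse_elapsed_ms(body: str):
    """Client side: tolerate the old empty-body response."""
    if not body.strip():
        return None  # pre-upgrade server
    return json.loads(body).get("elapsed_ms")

print(parse_elapsed_ms(""))  # None
print(parse_elapsed_ms(reset_replication_handler()) is not None)  # True
```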

New turmoil test: reset_replication_response_includes_elapsed_ms
verifies the field is present and within sanity bounds.

Symmetric to reset_replication_on_nonexistent_namespace_returns_404.

If error handling regresses (e.g., accidentally returning 500 on
NamespaceDoesntExist), the streamer's classifier would treat that
as a transient server error and retry, instead of falling through
to the optimistic-reset path. This test pins the correct behavior.

7 libsql turmoil tests total, all pass.

Three fixes surfaced in code review of PR 2230:

1. integrity_check (namespace/mod.rs):
   Move the PRAGMA quick_check/integrity_check run from the
   current Tokio worker onto the blocking thread pool via
   spawn_blocking. On a large DB, integrity_check can take seconds
   and would otherwise stall async tasks on that worker. Matches the
   pattern already used for checkpoint() and vacuum_if_needed() on
   LegacyConnection and for block_in_place in admin_shell.

2. reset_replication replica guard (namespace/store.rs):
   Reject calls with 400 NotAPrimary unless db_kind is Primary. On a
   replica, the ReplicaConfigurator does not consume .sentinel, so
   the re-init path would not actually rebuild anything and we'd
   destroy wallog/snapshots the replica still needs.

3. Async filesystem check (namespace/store.rs):
   Replace sync Path::exists() with tokio::fs::try_exists to avoid
   blocking the Tokio worker on a filesystem stat.
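Fix 1's pattern has a direct asyncio analogue: hand the potentially seconds-long PRAGMA to a worker thread so the event loop keeps serving other tasks (Python sketch; asyncio.to_thread plays the role of tokio's spawn_blocking):

```python
import asyncio
import sqlite3
import tempfile

def quick_check_blocking(db_path: str) -> list:
    """Runs on a blocking thread; may take seconds on a large DB."""
    con = sqlite3.connect(db_path)
    try:
        return [r[0] for r in con.execute("PRAGMA quick_check").fetchall()]
    finally:
        con.close()

async def integrity_check(db_path: str) -> list:
    # Like spawn_blocking: offload to the thread pool instead of
    # stalling the async worker that other tasks are scheduled on.
    return await asyncio.to_thread(quick_check_blocking, db_path)

db = tempfile.mkstemp()[1]
con = sqlite3.connect(db)
con.execute("CREATE TABLE t(x)")
con.commit()
con.close()
print(asyncio.run(integrity_check(db)))  # ['ok']
```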

Also leaves a TODO comment on the string-based corruption
classification in the checkpoint error path — follow-up tracked to
replace it with typed rusqlite::ErrorCode matching (or a dedicated
Error::DatabaseCorrupt variant) once the Error enum plumbing is
updated.

Verified: cargo check clean, all 7 reset/integrity turmoil tests
pass (reset_replication_preserves_data_on_healthy_namespace,
reset_replication_on_nonexistent_namespace_returns_404,
reset_replication_is_idempotent,
reset_replication_response_includes_elapsed_ms,
integrity_check_on_healthy_namespace,
integrity_check_defaults_to_quick_when_full_omitted,
integrity_check_on_nonexistent_namespace_returns_404).