
Reset replication endpoint #24

Draft

sle-c wants to merge 11 commits into main from reset-replication-endpoint

Conversation


@sle-c sle-c commented Apr 22, 2026

No description provided.

sle-c added 11 commits April 22, 2026 01:03

Rebuilds a single namespace's replication log from its live DB file
without affecting other namespaces on the pod. Eliminates the 5-hour
bulk-import window by letting cloud-sync-streamer recover corrupt
wallog/snapshots in place rather than deleting and recreating the
whole namespace.

- Namespace::path() accessor (libsql-server/src/namespace/mod.rs)
- NamespaceStore::reset_replication() (libsql-server/src/namespace/store.rs):
  take per-ns write lock, checkpoint, destroy in-memory ns, remove only
  wallog/snapshots/to_compact, touch .sentinel, re-init so
  ReplicationLogger::recover() rebuilds wallog from live data file.
- POST /v1/namespaces/:ns/reset-replication admin route
- libsql-ffi/build.rs: filter .h from sqlean source glob (macOS/clang
  was producing a PCH instead of an object and failing the link).
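The selective teardown can be sketched with plain filesystem operations (a Python stand-in for the Rust code in store.rs; the flat layout of wallog/snapshots/to_compact/.sentinel inside the namespace directory is an assumption for illustration, not a spec):

```python
import os
import shutil
import tempfile

def reset_replication_files(ns_dir: str) -> None:
    """Remove only replication state; the live data file is never touched."""
    for name in ("wallog", "snapshots", "to_compact"):
        path = os.path.join(ns_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)   # snapshot dirs are rebuilt on re-init
        elif os.path.exists(path):
            os.remove(path)       # wallog may be a single file
    # Touching .sentinel marks the namespace dirty so the next init runs
    # ReplicationLogger::recover() against the live data file.
    open(os.path.join(ns_dir, ".sentinel"), "a").close()

# Demo on a throwaway layout
ns = tempfile.mkdtemp()
with open(os.path.join(ns, "data"), "w") as f:
    f.write("live rows")
os.mkdir(os.path.join(ns, "snapshots"))
open(os.path.join(ns, "wallog"), "w").close()
reset_replication_files(ns)
print(sorted(os.listdir(ns)))  # ['.sentinel', 'data']
```

The real endpoint additionally takes the per-namespace write lock and checkpoints before any of this runs.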

Measured end-to-end: 1.1s p95 recovery vs 41s baseline at 20k seed rows.
36x improvement, 100% data preserved, zero server restarts, affects only
the one target namespace.

Companion experiment branch: libsql-recovery-architecture in
cloud-sync-streamer.

If the pre-teardown checkpoint fails with a DatabaseCorrupt / malformed
error, the live data file itself is corrupt. Rebuilding the wallog from
corrupt data would just propagate the corruption AND leave the namespace
in a broken state (the destroy-then-make sequence fails halfway, leaving
NamespaceDoesntExist).

Now the endpoint returns 500 with an explicit error message pointing the
operator to a restore-from-backup path, without destroying the in-memory
namespace first. The namespace stays loaded and returns the underlying
corruption error to subsequent reads — a true observability signal.
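A minimal sketch of that ordering, with the stdlib sqlite3 module standing in for the Rust checkpoint path (the response shape and message are illustrative, not the server's exact output):

```python
import os
import sqlite3
import tempfile

def reset_replication(db_path: str) -> dict:
    """Probe the live file first; abort with nothing destroyed if corrupt."""
    try:
        con = sqlite3.connect(db_path)
        con.execute("PRAGMA quick_check").fetchall()
        con.close()
    except sqlite3.DatabaseError as exc:
        # Live data file is corrupt: return 500 and leave the namespace
        # loaded so subsequent reads keep surfacing the underlying error.
        return {"status": 500,
                "message": f"live data file corrupt, restore from backup: {exc}"}
    # ... wallog/snapshots teardown + re-init would happen here ...
    return {"status": 200}

bad = tempfile.mkstemp()[1]
with open(bad, "wb") as f:
    f.write(b"definitely not a sqlite file")
print(reset_replication(bad)["status"])  # 500
```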

Verified with /tmp/test_mode_b.sh: before fix, namespace went to 404
after reset; after fix, namespace stays loaded with 'malformed database
schema' error. Mode A happy path (wallog corruption, live data OK)
unchanged: 1135ms p95 over 3 reps, 100% data preserved.

Runs PRAGMA quick_check (or integrity_check if {full:true}) on a
namespace's live data file. Lets callers (cloud-sync-streamer,
operators) classify the failure mode BEFORE attempting recovery:

  ok:true  -> live DB is fine; corruption must be in wallog/snapshots
              (Mode A) -> use reset-replication (~1s recovery).
  ok:false -> live DB itself is corrupt (Mode B) -> reset-replication
              would propagate the corruption; fall back to delete +
              bulk-import.

Implementation detail: SQLite surfaces severe corruption as
prepare/connect errors rather than PRAGMA rows. The endpoint normalizes
those into the same {ok:false, message:...} response shape so the caller
gets a uniform classification signal (HTTP 200) rather than a server
error (HTTP 500).
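Sketched in Python with the stdlib sqlite3 module (the real handler is Rust; the response field names follow the commit message, everything else is illustrative):

```python
import sqlite3
import tempfile

def integrity_check(db_path: str, full: bool = False) -> dict:
    """Always returns the same {ok, message, check} shape (HTTP 200)."""
    pragma = "integrity_check" if full else "quick_check"
    check = "full" if full else "quick"
    try:
        con = sqlite3.connect(db_path)
        rows = [r[0] for r in con.execute(f"PRAGMA {pragma}").fetchall()]
        con.close()
    except sqlite3.DatabaseError as exc:
        # Severe corruption surfaces as a prepare/connect error, not as
        # PRAGMA rows; normalize it into the same response shape.
        return {"ok": False, "message": str(exc), "check": check}
    return {"ok": rows == ["ok"], "message": "; ".join(rows), "check": check}

db = tempfile.mkstemp()[1]
con = sqlite3.connect(db)
con.execute("CREATE TABLE t(x)")
con.commit()
con.close()
print(integrity_check(db))  # {'ok': True, 'message': 'ok', 'check': 'quick'}
```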

- libsql-server/src/namespace/mod.rs: Namespace::integrity_check()
- libsql-server/src/namespace/store.rs: NamespaceStore::integrity_check()
- libsql-server/src/http/admin/mod.rs: handler + route

Verified with /tmp/test_integrity_check.sh against a healthy namespace
(ok), a namespace with a poisoned data file (Mode B detected), and a
non-existent namespace (proper 404).

Three turmoil-based integration tests:

- integrity_check_on_healthy_namespace: asserts POST /v1/namespaces/
  :ns/integrity-check returns {ok:true, message:'ok', check:'quick'}
  for a healthy DB, and works with {full:true} as well.

- reset_replication_preserves_data_on_healthy_namespace: creates a
  namespace, seeds 100 rows, calls reset-replication, confirms data
  is still there + writes still work + integrity-check reports ok.
  This is the idempotency guarantee for the endpoint.

- reset_replication_on_nonexistent_namespace_returns_404: verifies
  the endpoint returns 404 (not 500) when the namespace doesn't
  exist.

All 3 pass. Complements the ad-hoc /tmp/test_* bash scripts used during
development.

Paired 'reset_replication: starting for namespace X' and 'rebuilt
replication log for namespace X in Yms' log lines. Lets ops correlate
recovery events to client-side metrics and measure tail latency in
prod. No performance impact (info! tracing is near-zero cost).

Default behavior unchanged: .sentinel is removed on graceful shutdown
(preserves existing semantics for the 99% of deployments that don't
need the kubectl-delete-pod recovery path).

With LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN=1 set, the sentinel survives
graceful shutdown. This re-enables the documented operator recovery
procedure:

  1. kubectl exec <pod> -- touch /data/dbs/<ns>/.sentinel
  2. kubectl delete pod <pod>      # SIGTERM → graceful shutdown
  3. Kubernetes recreates pod
  4. Next namespace access triggers dirty-recovery on the preserved
     .sentinel, rebuilding wallog/snapshots from the live data file

Without this flag, step 2's graceful shutdown removes the sentinel
BEFORE the pod stops, so step 4 doesn't find a sentinel and skips
the dirty-recovery path.
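The flag gate reduces to a few lines (Python sketch of the Rust shutdown path; only the env var name and the .sentinel filename come from this change, the function shape is hypothetical):

```python
import os
import tempfile

def on_graceful_shutdown(ns_dir: str) -> None:
    if os.environ.get("LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN") == "1":
        return  # sentinel survives -> next access runs dirty-recovery
    sentinel = os.path.join(ns_dir, ".sentinel")
    if os.path.exists(sentinel):
        os.remove(sentinel)  # default: clean shutdown needs no recovery

ns = tempfile.mkdtemp()
open(os.path.join(ns, ".sentinel"), "w").close()
os.environ.pop("LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN", None)
on_graceful_shutdown(ns)
print(os.path.exists(os.path.join(ns, ".sentinel")))  # False

os.environ["LIBSQL_PRESERVE_SENTINEL_ON_SHUTDOWN"] = "1"
open(os.path.join(ns, ".sentinel"), "w").close()
on_graceful_shutdown(ns)
print(os.path.exists(os.path.join(ns, ".sentinel")))  # True
```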

Now that POST /v1/namespaces/:ns/reset-replication is the primary
recovery primitive, this flag is a low-priority belt-and-suspenders
for emergency ops workflows (e.g. when the admin API is unavailable).

Verified end-to-end with /tmp/run_sentinel_preserve_simple.sh: sentinel
preserved with flag, dirty-recovery fires on next access, data
preserved through the cycle.

Verifies that calling POST /v1/namespaces/:ns/reset-replication three
times in a row succeeds at each attempt and preserves data.

This is a critical safety property: the streamer's retry-after-reset
path (and any operator running the command twice) must not corrupt
the data file through sequential invocations.

Previously validated empirically via bench_recovery.sh (run 21 in
autoresearch.jsonl: 12ms per call on healthy ns, no data loss) but
was not pinned as a turmoil integration test. Now it's in CI.

All 4 reset-replication / integrity-check turmoil tests pass (the 2
pre-existing meta_attach flakes reproduce without this change).

Protects the documented API contract: POST /v1/namespaces/:ns/integrity-check
with an empty body {} must default to quick_check. This lets older or
simpler clients omit the field entirely.

Complements the existing integrity_check_on_healthy_namespace test
which only covers explicit full:true/full:false.

5 reset-replication + integrity-check turmoil tests all pass.

Before: handler returned 200 with empty body. Operators had to grep
libsql logs to see how long the reset took.

After: handler returns 200 with {"elapsed_ms": N}. The streamer (and
any other caller) can parse this into a StatsD histogram for dashboard
latency tracking, without server-side log scraping.

Backwards compatible: older clients that ignore the body still work.
New clients that expect the body handle an empty body gracefully.
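Both sides of that contract can be sketched as (Python; the handler internals are stubbed, and only the elapsed_ms field name comes from this change):

```python
import json
import time

def reset_replication_handler() -> str:
    """Time the reset and return the duration in the response body."""
    start = time.monotonic()
    # ... checkpoint + teardown + re-init would happen here ...
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return json.dumps({"elapsed_ms": elapsed_ms})

def parse_elapsed_ms(body: str):
    """Client side: tolerate the old empty-body response."""
    if not body.strip():
        return None  # pre-upgrade server
    return json.loads(body).get("elapsed_ms")

print(parse_elapsed_ms(""))  # None
print(parse_elapsed_ms(reset_replication_handler()) is not None)  # True
```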

New turmoil test: reset_replication_response_includes_elapsed_ms
verifies the field is present and within sanity bounds.

Symmetric to reset_replication_on_nonexistent_namespace_returns_404.

If error handling regresses (e.g., accidentally returning 500 on
NamespaceDoesntExist), the streamer's classifier would treat that
as a transient server error and retry, instead of falling through
to the optimistic-reset path. This test pins the correct behavior.

7 libsql turmoil tests total, all pass.

Three fixes surfaced in code review of PR 2230:

1. integrity_check (namespace/mod.rs):
   Move the PRAGMA quick_check/integrity_check run from the
   current Tokio worker onto the blocking thread pool via
   spawn_blocking. On a large DB, integrity_check can take seconds
   and would otherwise stall async tasks on that worker. Matches the
   pattern already used for checkpoint() and vacuum_if_needed() on
   LegacyConnection and for block_in_place in admin_shell.

2. reset_replication replica guard (namespace/store.rs):
   Reject calls with 400 NotAPrimary unless db_kind is Primary. On a
   replica, the ReplicaConfigurator does not consume .sentinel, so
   the re-init path would not actually rebuild anything and we'd
   destroy wallog/snapshots the replica still needs.

3. Async filesystem check (namespace/store.rs):
   Replace sync Path::exists() with tokio::fs::try_exists to avoid
   blocking the Tokio worker on a filesystem stat.
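Fix 1's pattern has a direct asyncio analogue: hand the potentially seconds-long PRAGMA to a worker thread so the event loop keeps serving other tasks (Python sketch; asyncio.to_thread plays the role of tokio's spawn_blocking):

```python
import asyncio
import sqlite3
import tempfile

def quick_check_blocking(db_path: str) -> list:
    """Runs on a blocking thread; may take seconds on a large DB."""
    con = sqlite3.connect(db_path)
    try:
        return [r[0] for r in con.execute("PRAGMA quick_check").fetchall()]
    finally:
        con.close()

async def integrity_check(db_path: str) -> list:
    # Like spawn_blocking: offload to the thread pool instead of
    # stalling the async worker that other tasks are scheduled on.
    return await asyncio.to_thread(quick_check_blocking, db_path)

db = tempfile.mkstemp()[1]
con = sqlite3.connect(db)
con.execute("CREATE TABLE t(x)")
con.commit()
con.close()
print(asyncio.run(integrity_check(db)))  # ['ok']
```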

Also leaves a TODO comment on the string-based corruption
classification in the checkpoint error path — follow-up tracked to
replace it with typed rusqlite::ErrorCode matching (or a dedicated
Error::DatabaseCorrupt variant) once the Error enum plumbing is
updated.

Verified: cargo check clean, all 7 reset/integrity turmoil tests
pass (reset_replication_preserves_data_on_healthy_namespace,
reset_replication_on_nonexistent_namespace_returns_404,
reset_replication_is_idempotent,
reset_replication_response_includes_elapsed_ms,
integrity_check_on_healthy_namespace,
integrity_check_defaults_to_quick_when_full_omitted,
integrity_check_on_nonexistent_namespace_returns_404).