
Fix/sqlite delete journal mode lustre #367

Open

rhaegar325 wants to merge 4 commits into main from fix/sqlite-delete-journal-mode-lustre

Conversation


@rhaegar325 rhaegar325 commented May 8, 2026

Solves #274

Fix: SQLite Concurrent Write Crashes on Lustre (Gadi)

Problem

Batch CMORisation on Gadi failed with two classes of SQLite errors when multiple PBS jobs
accessed the shared tracking database concurrently.

1. SIGBUS (Bus error, signal 7) in walIndexReadHdr()

```
[gadi-cpu-clx-1361:2245122] Caught signal 7 (Bus error: nonexistent physical address)
 2 walIndexReadHdr()         sqlite3.c:0
 3 walTryBeginRead()         sqlite3.c:0
 4 sqlite3PagerSharedLock()  sqlite3.c:0
```

WAL mode creates a .db-shm file accessed via mmap() to share the WAL index across
processes. On Lustre (/scratch, /g/data), mmap() cache coherency is not guaranteed
across compute nodes, causing the mapped memory to reference a nonexistent physical address.
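
As a minimal illustration (hypothetical `/tmp` path, not code from this PR), the side files appear as soon as a WAL-mode database takes its first write:

```python
import sqlite3
from pathlib import Path

db = Path("/tmp/demo.db")  # on Gadi this would sit on Lustre, e.g. /scratch
conn = sqlite3.connect(str(db))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# demo.db-shm holds the WAL index and is shared between processes via
# mmap(); Lustre does not guarantee mmap coherency across compute nodes.
print(sorted(p.name for p in db.parent.glob(db.name + "*")))
# ['demo.db', 'demo.db-shm', 'demo.db-wal']
```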

2. disk I/O error (SQLITE_IOERR)

```
Error processing Amon.tas: disk I/O error
```

With 71 PBS jobs starting simultaneously, Lustre's Metadata Server (MDS) intermittently
returns EIO under high concurrent load. fsync() calls on journal files as well as
direct read()/write() syscalls to the database file can fail transiently.

Changes

src/access_moppy/tracking.py

SQLite configuration (_init_db)

| Setting | Before | After | Reason |
| --- | --- | --- | --- |
| `journal_mode` | WAL | DELETE | Eliminates the `.db-shm` mmap; safe on Lustre |
| `synchronous` | NORMAL | OFF | No `fsync()` calls; removes the primary EIO source |
| `busy_timeout` | not set | 30000 ms (set first) | Covers lock contention for all subsequent PRAGMAs |
| connect `timeout` | not set | 30 s | Python-level connection timeout |
| `wal_checkpoint(TRUNCATE)` | not run | run before the mode switch | Flushes any pre-existing WAL when migrating old databases |

Rationale for DELETE + synchronous=OFF: DELETE mode writes its rollback journal through the OS page cache (no fsync()), and the journal survives a process crash, so SQLite can recover automatically. synchronous=OFF eliminates all fsync() calls; pwrite() to the journal goes through the kernel page cache and does not trigger EIO, only fsync() does.
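
A sketch of the resulting configuration, written as a standalone `_init_db` helper; names and ordering follow the table above, not the exact source of `tracking.py`:

```python
import sqlite3

def _init_db(db_path: str) -> sqlite3.Connection:
    # Python-level timeout: sqlite3 waits up to 30 s for file locks.
    conn = sqlite3.connect(db_path, timeout=30)

    # busy_timeout first, so every later PRAGMA is covered if another
    # PBS job currently holds a lock on the database.
    conn.execute("PRAGMA busy_timeout = 30000")

    # Flush any WAL left behind by a database created under the old
    # configuration, before switching the journal mode.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")

    # DELETE journal mode: no .db-shm mmap, safe on Lustre.
    conn.execute("PRAGMA journal_mode = DELETE")

    # No fsync() calls: removes the primary Lustre MDS EIO source.
    conn.execute("PRAGMA synchronous = OFF")
    return conn
```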

Retry logic (_execute_with_retry)

Extended the retry condition from "database is locked" only to also cover
"disk I/O error". Lustre MDS EIO errors are transient; exponential backoff retries
(1 s, 2 s, 4 s, 8 s, 16 s) succeed once the metadata server recovers.
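
A hedged sketch of the extended retry condition, written as a free function (the real version is a method on the tracker class):

```python
import sqlite3
import time

TRANSIENT_ERRORS = ("database is locked", "disk I/O error")

def _execute_with_retry(conn, sql, params=(), max_retries=5):
    delay = 1  # backoff: 1 s, 2 s, 4 s, 8 s, 16 s
    for attempt in range(max_retries + 1):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            # Both errors surface as OperationalError in Python's sqlite3.
            if attempt == max_retries or not any(
                msg in str(exc) for msg in TRANSIENT_ERRORS
            ):
                raise
            time.sleep(delay)
            delay *= 2
```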

All read and write methods now route through _execute_with_retry

Previously, get_status() and is_done() called cursor.execute() directly with no
retry protection. These are the first DB calls made by each PBS job at startup — the
highest-risk window for concurrent EIO failures. Routing them through _execute_with_retry
closes this gap. is_done() is simplified to delegate to get_status(), removing
duplicate query logic.
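
The read path then looks roughly like this (class scaffolding, table and column names are illustrative assumptions, not the actual schema):

```python
class FileTracker:  # hypothetical name; scaffolding for illustration only
    def get_status(self, variable):
        # Reads now go through the same retry wrapper as writes.
        row = self._execute_with_retry(
            "SELECT status FROM tracking WHERE variable = ?", (variable,)
        ).fetchone()
        return row[0] if row else None

    def is_done(self, variable):
        # Delegates to get_status(); no duplicate query logic.
        return self.get_status(variable) == "done"
```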

tests/unit/test_tracking.py

  • Added test_no_shm_or_wal_files_on_disk: asserts that no .db-shm or .db-wal files
    are created after writes (WAL mode absent).
  • Added test_journal_mode_is_delete: asserts PRAGMA journal_mode returns delete.
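
A rough reconstruction of the two tests, using raw sqlite3 and pytest's tmp_path fixture in place of the real tracker setup:

```python
import sqlite3

def _write_once(db_path):
    conn = sqlite3.connect(str(db_path))
    conn.execute("PRAGMA journal_mode = DELETE")
    conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    return conn

def test_no_shm_or_wal_files_on_disk(tmp_path):
    _write_once(tmp_path / "tracking.db")
    assert not (tmp_path / "tracking.db-shm").exists()
    assert not (tmp_path / "tracking.db-wal").exists()

def test_journal_mode_is_delete(tmp_path):
    conn = _write_once(tmp_path / "tracking.db")
    (mode,) = conn.execute("PRAGMA journal_mode").fetchone()
    assert mode == "delete"
```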

Test Results

Five rounds of 71 concurrent PBS jobs on Gadi:

| Run | Configuration | disk I/O errors | SIGBUS risk |
| --- | --- | --- | --- |
| test_1 | WAL + NORMAL (original) | 6 / 71 | present |
| test_2 | DELETE + NORMAL | 8 / 71 | eliminated |
| test_3 | DELETE + OFF | 3 / 71 | eliminated |
| test_4 | DELETE + OFF + write retry (reads unprotected) | 2 / 71 | eliminated |
| test_5 | DELETE + OFF + full retry (all reads and writes) | 0 / 71 | eliminated |

Trade-offs

  • synchronous=OFF: the OS page cache is not forced to disk after each commit. If a
    PBS job is killed mid-write (e.g. walltime exceeded during the <10 ms write window),
    the tracking database may be inconsistent. The tracking DB is non-critical status
    metadata; it can be deleted and repopulated without affecting CMORised output files.
  • DELETE vs WAL read concurrency: DELETE mode requires readers to wait for exclusive
    write locks, unlike WAL's concurrent-reader model. In MOPPy's workload each job writes
    twice in under 10 ms across hours of processing, so this has no measurable impact on
    throughput.


codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.0%. Comparing base (1b35d75) to head (2d7eb31).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/access_moppy/tracking.py | 93.8% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##            main    #367     +/-   ##
=======================================
+ Coverage   70.7%   71.0%   +0.3%
=======================================
  Files         28      28
  Lines       4686    4704     +18
  Branches     849     853      +4
=======================================
+ Hits        3312    3340     +28
+ Misses      1169    1158     -11
- Partials     205     206      +1
```
| Flag | Coverage Δ |
| --- | --- |
| unit | 71.0% <93.8%> (+0.3%) ⬆️ |

@rhaegar325 rhaegar325 requested a review from rbeucher May 8, 2026 05:06
