
Fix/sqlite delete journal mode lustre #367

Open

rhaegar325 wants to merge 4 commits into main from fix/sqlite-delete-journal-mode-lustre

Conversation


@rhaegar325 rhaegar325 commented May 8, 2026

Solves #274

Fix: SQLite Concurrent Write Crashes on Lustre (Gadi)

Problem

Batch CMORisation on Gadi failed with two classes of SQLite errors when multiple PBS jobs
accessed the shared tracking database concurrently.

1. SIGBUS (Bus error, signal 7) in walIndexReadHdr()

```
[gadi-cpu-clx-1361:2245122] Caught signal 7 (Bus error: nonexistent physical address)
 2 walIndexReadHdr()         sqlite3.c:0
 3 walTryBeginRead()         sqlite3.c:0
 4 sqlite3PagerSharedLock()  sqlite3.c:0
```

WAL mode creates a .db-shm file accessed via mmap() to share the WAL index across
processes. On Lustre (/scratch, /g/data), mmap() cache coherency is not guaranteed
across compute nodes, causing the mapped memory to reference a nonexistent physical address.
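
As a minimal illustration (hypothetical `/tmp` path, not code from this PR), the side files appear as soon as a WAL-mode database takes its first write:

```python
import sqlite3
from pathlib import Path

db = Path("/tmp/demo.db")  # on Gadi this would sit on Lustre, e.g. /scratch
conn = sqlite3.connect(str(db))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# demo.db-shm holds the WAL index and is shared between processes via
# mmap(); Lustre does not guarantee mmap coherency across compute nodes.
print(sorted(p.name for p in db.parent.glob(db.name + "*")))
# ['demo.db', 'demo.db-shm', 'demo.db-wal']
```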

2. disk I/O error (SQLITE_IOERR)

```
Error processing Amon.tas: disk I/O error
```

With 71 PBS jobs starting simultaneously, Lustre's Metadata Server (MDS) intermittently
returns EIO under high concurrent load. fsync() calls on journal files as well as
direct read()/write() syscalls to the database file can fail transiently.

Changes

src/access_moppy/tracking.py

SQLite configuration (_init_db)

| Setting | Before | After | Reason |
| --- | --- | --- | --- |
| `journal_mode` | WAL | DELETE | Eliminates the `.db-shm` mmap; safe on Lustre |
| `synchronous` | NORMAL | OFF | No `fsync()` calls; removes the primary EIO source |
| `busy_timeout` | not set | 30000 ms (set first) | Covers lock contention for all subsequent PRAGMAs |
| connect `timeout` | not set | 30 s | Python-level connection timeout |
| `wal_checkpoint(TRUNCATE)` | not run | run before the mode switch | Flushes any pre-existing WAL when migrating old databases |

Rationale for DELETE + synchronous=OFF: DELETE mode writes its rollback journal through the OS page cache (no fsync()), and the journal survives a process crash, so SQLite can recover automatically. synchronous=OFF eliminates all fsync() calls; pwrite() to the journal goes through the kernel page cache and does not trigger EIO, only fsync() does.
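
A sketch of the resulting configuration, written as a standalone `_init_db` helper; names and ordering follow the table above, not the exact source of `tracking.py`:

```python
import sqlite3

def _init_db(db_path: str) -> sqlite3.Connection:
    # Python-level timeout: sqlite3 waits up to 30 s for file locks.
    conn = sqlite3.connect(db_path, timeout=30)

    # busy_timeout first, so every later PRAGMA is covered if another
    # PBS job currently holds a lock on the database.
    conn.execute("PRAGMA busy_timeout = 30000")

    # Flush any WAL left behind by a database created under the old
    # configuration, before switching the journal mode.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")

    # DELETE journal mode: no .db-shm mmap, safe on Lustre.
    conn.execute("PRAGMA journal_mode = DELETE")

    # No fsync() calls: removes the primary Lustre MDS EIO source.
    conn.execute("PRAGMA synchronous = OFF")
    return conn
```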

Retry logic (_execute_with_retry)

Extended the retry condition from "database is locked" only to also cover
"disk I/O error". Lustre MDS EIO errors are transient; exponential backoff retries
(1 s, 2 s, 4 s, 8 s, 16 s) succeed once the metadata server recovers.
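
A hedged sketch of the extended retry condition, written as a free function (the real version is a method on the tracker class):

```python
import sqlite3
import time

TRANSIENT_ERRORS = ("database is locked", "disk I/O error")

def _execute_with_retry(conn, sql, params=(), max_retries=5):
    delay = 1  # backoff: 1 s, 2 s, 4 s, 8 s, 16 s
    for attempt in range(max_retries + 1):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            # Both errors surface as OperationalError in Python's sqlite3.
            if attempt == max_retries or not any(
                msg in str(exc) for msg in TRANSIENT_ERRORS
            ):
                raise
            time.sleep(delay)
            delay *= 2
```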

All read and write methods now route through _execute_with_retry

Previously, get_status() and is_done() called cursor.execute() directly with no
retry protection. These are the first DB calls made by each PBS job at startup — the
highest-risk window for concurrent EIO failures. Routing them through _execute_with_retry
closes this gap. is_done() is simplified to delegate to get_status(), removing
duplicate query logic.
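
The read path then looks roughly like this (class scaffolding, table and column names are illustrative assumptions, not the actual schema):

```python
class FileTracker:  # hypothetical name; scaffolding for illustration only
    def get_status(self, variable):
        # Reads now go through the same retry wrapper as writes.
        row = self._execute_with_retry(
            "SELECT status FROM tracking WHERE variable = ?", (variable,)
        ).fetchone()
        return row[0] if row else None

    def is_done(self, variable):
        # Delegates to get_status(); no duplicate query logic.
        return self.get_status(variable) == "done"
```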

tests/unit/test_tracking.py

  • Added test_no_shm_or_wal_files_on_disk: asserts that no .db-shm or .db-wal files
    are created after writes (WAL mode absent).
  • Added test_journal_mode_is_delete: asserts PRAGMA journal_mode returns delete.
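
A rough reconstruction of the two tests, using raw sqlite3 and pytest's tmp_path fixture in place of the real tracker setup:

```python
import sqlite3

def _write_once(db_path):
    conn = sqlite3.connect(str(db_path))
    conn.execute("PRAGMA journal_mode = DELETE")
    conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    return conn

def test_no_shm_or_wal_files_on_disk(tmp_path):
    _write_once(tmp_path / "tracking.db")
    assert not (tmp_path / "tracking.db-shm").exists()
    assert not (tmp_path / "tracking.db-wal").exists()

def test_journal_mode_is_delete(tmp_path):
    conn = _write_once(tmp_path / "tracking.db")
    (mode,) = conn.execute("PRAGMA journal_mode").fetchone()
    assert mode == "delete"
```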

Test Results

Five rounds of 71 concurrent PBS jobs on Gadi:

| Run | Configuration | disk I/O errors | SIGBUS risk |
| --- | --- | --- | --- |
| test_1 | WAL + NORMAL (original) | 6 / 71 | present |
| test_2 | DELETE + NORMAL | 8 / 71 | eliminated |
| test_3 | DELETE + OFF | 3 / 71 | eliminated |
| test_4 | DELETE + OFF + write retry (reads unprotected) | 2 / 71 | eliminated |
| test_5 | DELETE + OFF + full retry (all reads and writes) | 0 / 71 | eliminated |

Trade-offs

  • synchronous=OFF: the OS page cache is not forced to disk after each commit. If a
    PBS job is killed mid-write (e.g. walltime exceeded during the <10 ms write window),
    the tracking database may be inconsistent. The tracking DB is non-critical status
    metadata; it can be deleted and repopulated without affecting CMORised output files.
  • DELETE vs WAL read concurrency: DELETE mode requires readers to wait for exclusive
    write locks, unlike WAL's concurrent-reader model. In MOPPy's workload each job writes
    twice in under 10 ms across hours of processing, so this has no measurable impact on
    throughput.


codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.0%. Comparing base (1b35d75) to head (2d7eb31).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/access_moppy/tracking.py | 93.8% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##            main    #367     +/-   ##
=======================================
+ Coverage   70.7%   71.0%   +0.3%
=======================================
  Files         28      28
  Lines       4686    4704     +18
  Branches     849     853      +4
=======================================
+ Hits        3312    3340     +28
+ Misses      1169    1158     -11
- Partials     205     206      +1
```
| Flag | Coverage Δ |
| --- | --- |
| unit | 71.0% <93.8%> (+0.3%) ⬆️ |

@rhaegar325 rhaegar325 requested a review from rbeucher May 8, 2026 05:06
