Fix spurious logout on fly.io caused by SQLITE_BUSY#53
Conversation
Concurrent PATCH /feeds/:id requests (e.g. reorder fan-out) on the Fly.io volume hit SQLITE_BUSY because the DB was opened without WAL or a busy timeout. The auth middleware then mapped any GetSession error to 401, so the frontend cleared the token and bounced the user to /login. - Open SQLite with WAL, busy_timeout=5000, synchronous=NORMAL. - Auth middleware: only 401 on sql.ErrNoRows; other DB errors return 500. - fly.toml: min_machines_running=1 so cold-start replays don't compound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses spurious logout behavior during SQLite contention by distinguishing invalid sessions from transient DB failures and configuring SQLite/Fly.io for more resilient operation.
Changes:
- Auth middleware now returns 401 only for missing sessions and 500 for other session lookup errors.
- SQLite is opened with WAL, busy timeout, synchronous NORMAL, and foreign keys enabled.
- Fly.io config keeps one machine running to reduce cold-start replay effects.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
internal/infra/http/api/middleware/middleware.go |
Changes session lookup error handling in auth middleware. |
internal/app.go |
Adds SQLite DSN pragmas for WAL and contention handling. |
fly.toml |
Keeps one Fly machine running. |
| // WAL + a generous busy_timeout are required when the DB lives on | ||
| // slower storage (e.g. Fly.io volumes); without them, concurrent | ||
| // requests hit SQLITE_BUSY and surface to callers as generic errors. | ||
| dsn := a.cfg.DB.Path + "?_pragma=journal_mode(WAL)&_pragma=busy_timeout(5000)&_pragma=synchronous(NORMAL)&_pragma=foreign_keys(ON)" |
There was a problem hiding this comment.
Good catch — fixed in 19c365a. The DSN now sets busy_timeout(5000) before journal_mode(WAL), so the WAL PRAGMA on each newly opened pooled connection also waits on a held lock instead of failing with SQLITE_BUSY.
| if err != nil { | ||
| // Only treat "row not found" as an auth failure; transient | ||
| // DB errors (e.g. SQLITE_BUSY on slow volumes) must not | ||
| // log the user out. | ||
| if errors.Is(err, sql.ErrNoRows) { | ||
| writeUnauthorizedResponse(w, "Unauthorized - Invalid token") |
There was a problem hiding this comment.
Added in 19c365a — middleware_test.go now covers both branches: sql.ErrNoRows and a nil session return 401, while a transient DB error (e.g. SQLITE_BUSY) returns 500 without logging the user out. Also covers public-path skipping and userID context propagation.
- Set busy_timeout before journal_mode(WAL) in the DSN so the WAL PRAGMA on each pooled connection also waits on a held lock. - Add middleware tests covering the 401 (sql.ErrNoRows / nil session) vs 500 (transient DB error) split and public-path skipping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract repeated "Bearer tok" string to a constant (goconst) and use http.NoBody instead of nil request bodies (gocritic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
busy_timeout=5000+synchronous=NORMALso concurrent requests on Fly volumes don't hitSQLITE_BUSY.sql.ErrNoRows; transient DB errors return 500 instead of logging the user out.fly.toml:min_machines_running = 1so cold-start request replays don't compound the issue.Why
Moving a feed to a folder on fly.io kicked the user to the login page. Reorder fan-out fires many parallel
PATCH /feeds/:idrequests; on the network volume those collide on the sessions table and SQLite returns a busy error. The middleware treated any error fromGetSessionas "invalid token" → 401 → frontend clears token → redirect to /login. Localhost on fast SSD never tripped the contention window.Test plan
go build ./...go test ./...🤖 Generated with Claude Code