fix(store): [OCISDEV-834] survive permanently-closed NATS connections #628

Draft
kobergj wants to merge 1 commit into main from fix/OCISDEV-834-natsjs-reconnect-hardening

@kobergj (Collaborator) commented Jun 11, 2026

Proactive hardening for the NATS dead-connection class of failures tracked in OCISDEV-834. Complements the go-micro nats-js-kv plugin hasConn() fix (already shipping in oCIS via owncloud/ocis#12401 / #12402 / #12404) — this side prevents the NATS client from giving up in the first place and makes the failure visible.

Problem

The nats-js and nats-js-kv store backends used the NATS client defaults: MaxReconnect = 60, ReconnectWait = 2s. Any NATS outage longer than ~2 minutes (60 attempts × 2s) exhausts the reconnect budget, the client closes permanently, and every subsequent cache operation fails with nats: connection closed until the process is restarted. None of the connection state transitions (disconnect, reconnect, close) were logged, so the failure was silent in Loki.
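As a rough illustration of where the ~2 minute figure comes from, the reconnect budget can be approximated as attempts × wait. This is a minimal sketch, not code from this PR: `reconnectBudget` is a hypothetical helper, and it ignores dial timeouts, jitter, and multi-server setups, so the real window may differ somewhat.

```go
package main

import (
	"fmt"
	"time"
)

// reconnectBudget approximates how long a client keeps retrying before it
// gives up: maxReconnect * wait. A negative maxReconnect is treated as
// "retry forever" and reported as bounded == false.
// Hypothetical helper for illustration only.
func reconnectBudget(maxReconnect int, wait time.Duration) (budget time.Duration, bounded bool) {
	if maxReconnect < 0 {
		return 0, false
	}
	return time.Duration(maxReconnect) * wait, true
}

func main() {
	// Previous defaults described above: 60 attempts, 2s apart.
	old, _ := reconnectBudget(60, 2*time.Second)
	fmt.Println("old budget:", old) // old budget: 2m0s

	// With MaxReconnect = -1 there is no budget to exhaust.
	_, bounded := reconnectBudget(-1, 5*time.Second)
	fmt.Println("bounded after change:", bounded) // bounded after change: false
}
```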

Changes

pkg/store/store.go — factor the shared nats option setup into defaultNatsOptions() (used by both the nats-js and nats-js-kv branches) and:

  • MaxReconnect = -1 — reconnect forever
  • ReconnectWait = 5s
  • set Name = "reva-store" (was the literal "TODO")
  • wire DisconnectedErrCB / ReconnectedCB / ClosedCB to log the previously-silent state transitions
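The option values listed above could be sketched roughly as follows. To keep the example self-contained it does not import the NATS client: `storeOptions` is a local stand-in that only borrows the field names mentioned in this description, the callback signatures are simplified, and `defaultStoreOptions` and the log messages are hypothetical. The actual `defaultNatsOptions()` in pkg/store/store.go may look different.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// storeOptions is a local stand-in (assumption) for the subset of NATS client
// options named in the description. Callback signatures are simplified.
type storeOptions struct {
	Name              string
	MaxReconnect      int
	ReconnectWait     time.Duration
	DisconnectedErrCB func(err error)
	ReconnectedCB     func()
	ClosedCB          func()
}

// defaultStoreOptions returns the shared settings for both store backends.
// logf is injected so the state transitions end up in the service log.
func defaultStoreOptions(logf func(format string, args ...any)) storeOptions {
	return storeOptions{
		Name:          "reva-store",
		MaxReconnect:  -1,              // reconnect forever
		ReconnectWait: 5 * time.Second, // pause between attempts
		DisconnectedErrCB: func(err error) {
			logf("nats store: disconnected: %v", err)
		},
		ReconnectedCB: func() {
			logf("nats store: reconnected")
		},
		ClosedCB: func() {
			logf("nats store: connection closed permanently")
		},
	}
}

func main() {
	opts := defaultStoreOptions(log.Printf)
	fmt.Println(opts.Name, opts.MaxReconnect, opts.ReconnectWait)

	// Simulate the transitions that were previously silent.
	opts.DisconnectedErrCB(errors.New("simulated outage"))
	opts.ReconnectedCB()
}
```

Building the options in one function means both backend branches pick up the same reconnect policy and logging.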

pkg/storage/utils/decomposedfs/metadata/messagepack_backend.go: loadAttributes no longer fails a successful disk read when the cache write-back (PushToCache) fails. The error is logged and the data read from disk is returned. The previous behavior is what made all spaces invisible in storage-users when the cache connection was dead (SE-1220).
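The read-path behavior described above amounts to a best-effort cache write-back. The sketch below shows the pattern only and is not the code in messagepack_backend.go: the `cache` interface, `deadCache`, the `readDisk` callback, the function signature, and the attribute key are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// cache is a hypothetical stand-in for the metadata cache.
type cache interface {
	PushToCache(key string, attrs map[string][]byte) error
}

// deadCache simulates a cache whose NATS connection is permanently closed.
type deadCache struct{}

func (deadCache) PushToCache(string, map[string][]byte) error {
	return errors.New("nats: connection closed")
}

// loadAttributes reads attributes from disk (modelled as a callback here) and
// then tries to write them back to the cache. A failed write-back is logged
// but does not fail the read.
func loadAttributes(key string, readDisk func(string) (map[string][]byte, error), c cache) (map[string][]byte, error) {
	attrs, err := readDisk(key)
	if err != nil {
		return nil, err // a failed disk read is still a real error
	}
	if err := c.PushToCache(key, attrs); err != nil {
		// Best effort only: the disk data is authoritative.
		log.Printf("cache write-back failed for %q: %v", key, err)
	}
	return attrs, nil
}

func main() {
	disk := func(string) (map[string][]byte, error) {
		// Illustrative attribute name, not a real oCIS key.
		return map[string][]byte{"user.example.name": []byte("space-1")}, nil
	}
	attrs, err := loadAttributes("node-1", disk, deadCache{})
	fmt.Println(len(attrs), err) // 1 <nil>
}
```

The same stub-cache shape is roughly what a regression test needs: a cache that always errors, plus an assertion that the disk data still comes back.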

Test — white-box regression test (messagepack_backend_internal_test.go) confirming a disk read succeeds despite a failing cache write-back.

Notes for reviewer

  • The loadAttributes fix is intentionally scoped to the read path. saveAttributes (the write path) also returns its PushToCache error — arguably the same class of issue, but left out of scope here. Happy to fold it in if you'd prefer.
  • Draft until reviewed.

Refs OCISDEV-834.

The nats-js and nats-js-kv store backends used the NATS client defaults,
which give up reconnecting after 60 attempts (~2 min). Any NATS outage
longer than that left the client permanently closed, surfacing as
'nats: connection closed' on every subsequent cache operation until the
process was restarted.

Reconnect forever (MaxReconnect = -1) with a 5s backoff, set a proper
client name, and log the previously-silent connection state transitions.

Additionally, loadAttributes in the decomposedfs messagepack metadata
backend no longer fails a successful disk read when the cache write-back
fails: the error is logged and the disk data is returned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Julian Koberg <julian.koberg@kiteworks.com>

@kw-security

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Code Security 0 0 0 0 0 issues