feat(snowflake): self-heal on auth-token expiry via connection_factory by vigneshnarayanaswamy · Pull Request #24 · block/model-ledger

vigneshnarayanaswamy · 2026-06-15T05:08:28Z

Problem

A long-lived SnowflakeLedgerBackend holds a single Snowflake session. When that session's auth token idle-expires, every subsequent statement fails with ProgrammingError errno 390114 ("Authentication token has expired") and stays broken until the process is restarted.

The driver's client_session_keep_alive heartbeat reduces this but cannot eliminate it: network blips, very long idle periods, and a stalled heartbeat thread all still let the token expire. There needs to be a backstop that lets the backend recover without a restart.

What this does

Adds an optional connection_factory: Callable[[], Connection] | None to SnowflakeLedgerBackend.__init__. The existing connection= parameter keeps working unchanged.
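
A minimal sketch of how the two constructor arguments could coexist. `BackendSketch` is a hypothetical stand-in, not the real `SnowflakeLedgerBackend`. The `ValueError` and the choice to obtain the first session from the factory when no `connection=` is passed are assumptions; the PR only states that constructing with neither argument raises.

```python
from typing import Any, Callable, Optional

Connection = Any  # stand-in for the driver's connection type


class BackendSketch:
    """Hypothetical stand-in for the constructor logic described above."""

    def __init__(
        self,
        connection: Optional[Connection] = None,
        connection_factory: Optional[Callable[[], Connection]] = None,
    ) -> None:
        if connection is None and connection_factory is None:
            # Assumption: the exact exception type is not given in the PR.
            raise ValueError("provide connection= or connection_factory=")
        self._factory = connection_factory
        if connection is not None:
            # Existing callers: the supplied connection is used unchanged.
            self._session = connection
        else:
            # Assumption: with only a factory, the first session comes from it.
            self._session = connection_factory()
```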

On a detected auth-expiry error, the central execute path transparently:

1. calls connection_factory() for a fresh connection,
2. swaps it in, and
3. retries the same statement exactly once.

A second consecutive auth-expiry — or any other error — propagates. There is no retry loop.
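
The reconnect-and-retry-once flow can be sketched roughly as follows. `AuthExpired` and `RetryOnceSketch` are hypothetical names used for illustration; the real backend detects the error by errno and message rather than by a dedicated exception class, and this sketch leaves out the locking.

```python
from typing import Any, Callable, Optional


class AuthExpired(Exception):
    """Hypothetical stand-in for a detected auth-expiry error."""


class RetryOnceSketch:
    def __init__(self, session: Any, factory: Optional[Callable[[], Any]] = None) -> None:
        self._session = session
        self._factory = factory

    def execute(self, sql: str) -> Any:
        try:
            return self._session.cursor().execute(sql)
        except AuthExpired:
            if self._factory is None:
                raise  # no factory: the error propagates as before
            self._session = self._factory()  # swap in a fresh connection
            # Single retry with no surrounding loop: a second auth-expiry
            # (or any other error) raised here propagates to the caller.
            return self._session.cursor().execute(sql)
```

Because the retry is a plain second call rather than a loop, the "exactly once" bound holds by construction.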

Precise error matching

Detection is deliberately narrow so unrelated ProgrammingErrors (bad SQL, missing table, permission denied) are never retried:

- match on errno 390114 (the authoritative signal); or
- when the driver leaves errno unset, require both the code 390114 and the canonical phrase "authentication token has expired" in the message.

It never matches on the exception type alone.
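
One way to express this matching rule as a predicate is sketched below. The function name `is_auth_expiry` and the case-insensitive comparison are assumptions; the sketch reads `errno` with `getattr` so it does not need to import the driver.

```python
AUTH_EXPIRED_ERRNO = 390114
AUTH_EXPIRED_PHRASE = "authentication token has expired"


def is_auth_expiry(exc: BaseException) -> bool:
    """Return True only for the narrow auth-expiry signal, never by type alone."""
    errno = getattr(exc, "errno", None)
    if errno is not None:
        # errno is the authoritative signal when the driver sets it.
        return errno == AUTH_EXPIRED_ERRNO
    # errno unset: require BOTH the numeric code and the canonical phrase.
    message = str(exc).lower()
    return str(AUTH_EXPIRED_ERRNO) in message and AUTH_EXPIRED_PHRASE in message
```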

Thread-safety

The connection swap is guarded by a lock. Before swapping, the reconnect re-checks the failing connection against the current session, so when several threads hit the same expired session concurrently, the factory is invoked exactly once and all threads continue on the same fresh connection.
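
The lock plus re-check could look roughly like this. `SwapGuardSketch` and `reconnect` are hypothetical names; the identity comparison against the session the caller observed failing is the part taken from the description above.

```python
import threading
from typing import Any, Callable


class SwapGuardSketch:
    def __init__(self, session: Any, factory: Callable[[], Any]) -> None:
        self._session = session
        self._factory = factory
        self._lock = threading.Lock()

    def reconnect(self, failed_session: Any) -> Any:
        """Replace the session only if `failed_session` is still the current one."""
        with self._lock:
            if self._session is failed_session:
                # First thread to arrive: build the fresh connection.
                self._session = self._factory()
            # Later threads find the session already replaced and reuse it
            # instead of calling the factory again.
            return self._session
```

Each caller passes in the session it saw fail, so a thread that lost the race simply picks up the connection created by the winner.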

Composes with keep-alive

This does not replace client_session_keep_alive — it complements it. Heartbeats shrink the window for idle expiry; this path is the backstop for the residual cases the heartbeat can't cover.

No-factory = no behavior change

If only connection= is given (no factory), behavior is identical to before: an auth-expiry error propagates with no reconnect. No regression for existing callers.

Factory contract

connection_factory() must return a ready-to-use connection — same account/user/auth and, where relevant, warehouse, role, and current database as the original. The backend issues no session-setup (USE) statements and fully-qualifies every object as {schema}.OBJECT, so the factory owns all session configuration.
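
As an illustration of a factory that owns all session configuration, the sketch below builds a zero-argument factory from a connect callable and a fixed set of settings. `make_connection_factory` is a hypothetical helper, not part of this PR. The connect function is injected so the sketch does not import the optional driver.

```python
from typing import Any, Callable, Dict


def make_connection_factory(
    connect: Callable[..., Any], params: Dict[str, Any]
) -> Callable[[], Any]:
    """Return a zero-argument factory that opens identically configured connections."""
    settings = dict(params)  # copy so later mutation by the caller has no effect

    def factory() -> Any:
        # Every call opens a new connection with the same account, user, auth,
        # warehouse, role, and database, as the contract requires.
        return connect(**settings)

    return factory
```

With the real driver, `connect` would presumably be `snowflake.connector.connect`, and `params` would carry the account, user, authentication settings, warehouse, role, and database, plus `client_session_keep_alive=True` so the heartbeat and the reconnect path work together.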

Tests

Extends the backend suite with TestReconnectOnAuthExpiry using a fake cursor-style connection and factory (no live Snowflake, no optional-driver import):

- factory not given → no reconnect, error propagates (old behavior);
- auth-expiry once → reconnect + retry → success;
- non-auth ProgrammingError → not retried, propagates, no reconnect;
- second consecutive auth-expiry → propagates (retried exactly once);
- errno-unset / message-only auth-expiry → still detected;
- concurrency guard → two threads, exactly one reconnect, both end on the fresh session;
- the non-result (write/DDL) path self-heals too;
- constructing with neither connection nor factory raises.
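
A fake of the kind described here might look like the following sketch. All three class names are hypothetical and are not the fixtures used in the actual suite. The fake raises a queue of pre-set errors and then succeeds, which is enough to script the "expired once" and "expired twice" cases without a live Snowflake session.

```python
from typing import Any, List, Sequence


class FakeProgrammingError(Exception):
    """Hypothetical stand-in for the driver's ProgrammingError."""

    def __init__(self, msg: str, errno: Any = None) -> None:
        super().__init__(msg)
        self.errno = errno


class FakeConnection:
    """Cursor-style fake: raises queued errors in order, then succeeds."""

    def __init__(self, errors: Sequence[Exception] = ()) -> None:
        self._errors: List[Exception] = list(errors)
        self.statements: List[str] = []

    def cursor(self) -> "FakeConnection":
        return self  # the fake doubles as its own cursor

    def execute(self, sql: str) -> "FakeConnection":
        self.statements.append(sql)
        if self._errors:
            raise self._errors.pop(0)
        return self


class CountingFactory:
    """Hands out pre-built fake connections and counts how often it is called."""

    def __init__(self, connections: Sequence[FakeConnection]) -> None:
        self._connections = list(connections)
        self.calls = 0

    def __call__(self) -> FakeConnection:
        self.calls += 1
        return self._connections.pop(0)
```

A test can then assert on `factory.calls` to check that a reconnect happened exactly once, and on `statements` to check which session actually ran the retried SQL.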

Full suite, ruff (lint + format), mypy, and the coverage gate are green.

🤖 Generated with Claude Code

A long-lived SnowflakeLedgerBackend holds one session. When that session's auth token idle-expires, every subsequent statement fails with ProgrammingError errno 390114 ("Authentication token has expired") until the process is restarted. client_session_keep_alive heartbeats reduce but cannot eliminate this (network blips, very long idle, a stalled heartbeat thread).

Add an optional connection_factory: Callable[[], Connection] | None. When given, the central execute path (_exec / _exec_no_result) detects an auth-expiry error precisely (errno 390114, or the code + canonical message when errno is unset; never the exception type alone), calls the factory for a fresh connection, swaps it in, and retries the same statement exactly once. A second consecutive auth-expiry, or any non-auth error, propagates. The swap is guarded by a lock with a re-check against the observed stale session, so concurrent callers trigger only one reconnect.

If only connection= is given (no factory), behavior is unchanged: the error propagates with no reconnect, no regression.

Factory contract: connection_factory() returns a ready-to-use connection (same account/user/auth and, where relevant, warehouse/role/current database). The backend issues no USE statements and fully-qualifies every object as {schema}.OBJECT, so the factory owns all session configuration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0cbedd5fdc

chatgpt-codex-connector · 2026-06-15T05:11:36Z

        write_pandas(conn, df, "MODELS_STAGING", **wp_kwargs)  # type: ignore[arg-type]
-        _exec_no_result(
-            self._session,
+        self._exec_no_result(


Retry the whole pandas bulk write after reconnect

When the pandas bulk path is active, the staging table and write_pandas upload are bound to the Snowflake session. If this MERGE sees errno 390114, _exec_no_result now swaps self._session and retries only the MERGE; the fresh session has no temporary MODELS_STAGING data, so the retry fails with object-not-found instead of self-healing. The same pattern exists in _flush_snapshots_pandas, and an expiry on the preceding CREATE also leaves the already-captured conn pointing at the stale session for write_pandas; the reconnect needs to restart/rebind the entire create/upload/merge sequence.
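
One possible shape for the fix suggested here is to treat the create/upload/merge sequence as a single session-bound unit and rerun the whole unit on the fresh session. This is a sketch, not code from the PR: `run_session_bound` and its parameters are hypothetical. It assumes the steps inside `unit` are safe to repeat from the start and do not themselves trigger the per-statement self-heal.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def run_session_bound(
    get_session: Callable[[], Any],
    reconnect: Callable[[Any], Any],
    is_auth_expiry: Callable[[BaseException], bool],
    unit: Callable[[Any], T],
) -> T:
    """Run `unit(session)`; on auth expiry, reconnect and rerun the whole unit once."""
    session = get_session()
    try:
        return unit(session)
    except Exception as exc:
        if not is_auth_expiry(exc):
            raise  # unrelated errors are never retried
        fresh = reconnect(session)
        # Restart from the first step so the staging table and the upload
        # are recreated on the same session that later runs the MERGE.
        return unit(fresh)
```

Passing the session into `unit` explicitly also avoids the stale captured `conn` problem mentioned above, since every step in a given attempt uses the same session object.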

chatgpt-codex-connector Bot reviewed Jun 15, 2026

vigneshnarayanaswamy wants to merge 1 commit into main from vigneshn/reconnect-on-auth-error
