feat(snowflake): self-heal on auth-token expiry via connection_factory #24
vigneshnarayanaswamy wants to merge 1 commit into
A long-lived SnowflakeLedgerBackend holds one session. When that session's
auth token idle-expires, every subsequent statement fails with
ProgrammingError errno 390114 ("Authentication token has expired") until the
process is restarted. client_session_keep_alive heartbeats reduce but cannot
eliminate this (network blips, very long idle, a stalled heartbeat thread).
Add an optional connection_factory: Callable[[], Connection] | None. When
given, the central execute path (_exec / _exec_no_result) detects an
auth-expiry error precisely (errno 390114, or the code + canonical message
when errno is unset — never the exception type alone), calls the factory for
a fresh connection, swaps it in, and retries the same statement exactly once.
A second consecutive auth-expiry, or any non-auth error, propagates.
The swap is guarded by a lock with a re-check against the observed stale
session, so concurrent callers trigger only one reconnect. If only connection=
is given (no factory), behavior is unchanged: the error propagates with no
reconnect, no regression.
Factory contract: connection_factory() returns a ready-to-use connection
(same account/user/auth and, where relevant, warehouse/role/current database).
The backend issues no USE statements and fully-qualifies every object as
{schema}.OBJECT, so the factory owns all session configuration.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0cbedd5fdc
write_pandas(conn, df, "MODELS_STAGING", **wp_kwargs)  # type: ignore[arg-type]
_exec_no_result(
self._session,
self._exec_no_result(
Retry the whole pandas bulk write after reconnect
When the pandas bulk path is active, the staging table and write_pandas upload are bound to the Snowflake session. If this MERGE sees errno 390114, _exec_no_result now swaps self._session and retries only the MERGE; the fresh session has no temporary MODELS_STAGING data, so the retry fails with object-not-found instead of self-healing. The same pattern exists in _flush_snapshots_pandas, and an expiry on the preceding CREATE also leaves the already-captured conn pointing at the stale session for write_pandas; the reconnect needs to restart/rebind the entire create/upload/merge sequence.
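One possible shape for the fix, offered as a sketch rather than the PR's actual code: treat create/upload/merge as a single unit that re-reads the current session at the start of each attempt, so a reconnect restarts every session-bound step. The names run_session_bound_sequence, is_auth_expired, get_session, reconnect, and steps are hypothetical, and the detector here is simplified to an errno check.

```python
from typing import Any, Callable, Sequence

AUTH_EXPIRED_ERRNO = 390114  # errno quoted in the PR description


def is_auth_expired(exc: BaseException) -> bool:
    # Simplified stand-in for the PR's detector: errno only (assumption).
    return getattr(exc, "errno", None) == AUTH_EXPIRED_ERRNO


def run_session_bound_sequence(
    get_session: Callable[[], Any],
    reconnect: Callable[[Any], None],
    steps: Sequence[Callable[[Any], None]],
) -> None:
    """Run every step on one session; on auth expiry, reconnect and redo all steps once."""
    for attempt in range(2):
        # Re-read the session on each attempt so no step keeps a stale handle.
        session = get_session()
        try:
            for step in steps:
                step(session)
            return
        except Exception as exc:
            # Second failure, or any non-auth error, propagates unchanged.
            if attempt == 1 or not is_auth_expired(exc):
                raise
            reconnect(session)
```

With this shape, the staging-table CREATE, the write_pandas upload, and the MERGE would each be passed as one step taking the session, so the temporary table is always recreated on the same session that runs the MERGE.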
Problem
A long-lived SnowflakeLedgerBackend holds a single Snowflake session. When that session's auth token idle-expires, every subsequent statement fails with ProgrammingError errno 390114 ("Authentication token has expired") and stays broken until the process is restarted.

The driver's client_session_keep_alive heartbeat reduces this but cannot eliminate it: network blips, very long idle periods, and a stalled heartbeat thread all still let the token expire. There needs to be a backstop that lets the backend recover without a restart.

What this does
Adds an optional connection_factory: Callable[[], Connection] | None to SnowflakeLedgerBackend.__init__. The existing connection= parameter keeps working unchanged.

On a detected auth-expiry error, the central execute path transparently:
- calls connection_factory() for a fresh connection,
- swaps it in, and
- retries the same statement exactly once.

A second consecutive auth-expiry, or any other error, propagates. There is no retry loop.
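The retry-once flow described above might look roughly like the following single-threaded sketch. ReconnectingExecutor and its methods are illustrative names, not the backend's real classes; the detector is reduced to an errno check and locking is left out for brevity.

```python
from typing import Any, Callable, Optional

AUTH_EXPIRED_ERRNO = 390114  # errno quoted in the PR description


def _is_auth_expired(exc: BaseException) -> bool:
    # Simplified detector (assumption): errno only.
    return getattr(exc, "errno", None) == AUTH_EXPIRED_ERRNO


class ReconnectingExecutor:
    """Hypothetical stand-in for the backend's central execute path."""

    def __init__(
        self,
        connection: Any,
        connection_factory: Optional[Callable[[], Any]] = None,
    ) -> None:
        self._session = connection
        self._factory = connection_factory

    def execute(self, sql: str, params: Any = None) -> Any:
        try:
            return self._run(self._session, sql, params)
        except Exception as exc:
            # No factory, or not an auth expiry: behave exactly as before.
            if self._factory is None or not _is_auth_expired(exc):
                raise
        # Swap in a fresh connection and retry once; a second failure propagates.
        self._session = self._factory()
        return self._run(self._session, sql, params)

    @staticmethod
    def _run(session: Any, sql: str, params: Any) -> Any:
        cursor = session.cursor()
        try:
            cursor.execute(sql, params)
            return cursor.fetchall()
        finally:
            cursor.close()
```

Because the retry sits outside the except block and runs only once, there is no loop to bound: a second auth-expiry raised by the retry simply surfaces to the caller.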
Precise error matching
Detection is deliberately narrow so unrelated ProgrammingErrors (bad SQL, missing table, permission denied) are never retried. An error counts as auth-expiry only when:
- its errno is 390114 (the authoritative signal); or
- errno is unset and the message contains both the code 390114 and the canonical phrase "authentication token has expired".

It never matches on the exception type alone.
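A minimal sketch of such a predicate, assuming the exception exposes an errno attribute and that "unset" means the attribute is missing or None. If the driver uses a sentinel value for an unset errno, the check would need to treat that sentinel the same way. The function and constant names are illustrative.

```python
AUTH_EXPIRED_ERRNO = 390114
AUTH_EXPIRED_PHRASE = "authentication token has expired"


def is_auth_expired(exc: BaseException) -> bool:
    """Return True only for the narrow auth-expiry signal, never by type alone."""
    errno = getattr(exc, "errno", None)
    if errno is not None:
        # errno is authoritative whenever it is present.
        return errno == AUTH_EXPIRED_ERRNO
    # Fallback when errno is unset (assumed to mean None): require BOTH
    # the numeric code and the canonical phrase in the message text.
    message = str(exc).lower()
    return str(AUTH_EXPIRED_ERRNO) in message and AUTH_EXPIRED_PHRASE in message
```

Requiring both the code and the phrase in the fallback keeps an unrelated message that merely mentions expiry from triggering a reconnect.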
Thread-safety
The connection swap is guarded by a lock. Before swapping, the reconnect re-checks the failing connection against the current session, so when several threads hit the same expired session concurrently, the factory is invoked exactly once and all threads continue on the same fresh connection.
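The lock-plus-re-check pattern can be sketched as below. SessionHolder and reconnect are hypothetical names; the key point is that each caller passes in the session it saw fail, and the swap happens only if that session is still the current one.

```python
import threading
from typing import Any, Callable


class SessionHolder:
    """Hypothetical holder showing a guarded swap with an identity re-check."""

    def __init__(self, connection: Any, connection_factory: Callable[[], Any]) -> None:
        self._session = connection
        self._factory = connection_factory
        self._lock = threading.Lock()

    @property
    def session(self) -> Any:
        return self._session

    def reconnect(self, stale: Any) -> Any:
        """Replace `stale` with a fresh session unless another thread already did."""
        with self._lock:
            # Re-check under the lock: if the current session is no longer the
            # one the caller saw fail, a reconnect has already happened.
            if self._session is stale:
                self._session = self._factory()
            return self._session
```

Threads that lose the race skip the factory call and pick up the connection the winner installed, which is why the factory runs once per expired session rather than once per failing caller.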
Composes with keep-alive
This complements client_session_keep_alive rather than replacing it. Heartbeats shrink the window for idle expiry; this path is the backstop for the residual cases the heartbeat can't cover.

No-factory = no behavior change
If only connection= is given (no factory), behavior is identical to before: an auth-expiry error propagates with no reconnect. No regression for existing callers.

Factory contract
connection_factory() must return a ready-to-use connection: same account/user/auth and, where relevant, warehouse, role, and current database as the original. The backend issues no session-setup (USE) statements and fully-qualifies every object as {schema}.OBJECT, so the factory owns all session configuration.

Tests
Extends the backend suite with TestReconnectOnAuthExpiry, using a fake cursor-style connection and factory (no live Snowflake, no optional-driver import):
- an unrelated ProgrammingError → not retried, propagates, no reconnect;

Full suite, ruff (lint + format), mypy, and the coverage gate are green.
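For readers curious what a fake cursor-style connection and factory could look like, here is a small sketch that needs neither Snowflake nor the driver. Every class name below is made up for illustration and is not taken from the PR's test suite.

```python
from typing import Any, Iterable, List, Optional


class FakeProgrammingError(Exception):
    """Stand-in for the driver's ProgrammingError; carries an errno attribute."""

    def __init__(self, msg: str, errno: Optional[int] = None) -> None:
        super().__init__(msg)
        self.errno = errno


class FakeCursor:
    def __init__(self, conn: "FakeConnection") -> None:
        self._conn = conn

    def execute(self, sql: str, params: Any = None) -> "FakeCursor":
        # Record every attempt, then raise the next queued error, if any.
        self._conn.statements.append(sql)
        if self._conn.errors:
            raise self._conn.errors.pop(0)
        return self

    def fetchall(self) -> List[Any]:
        return []

    def close(self) -> None:
        pass


class FakeConnection:
    """Records statements and raises queued errors, one per execute call."""

    def __init__(self, errors: Iterable[Exception] = ()) -> None:
        self.errors: List[Exception] = list(errors)
        self.statements: List[str] = []

    def cursor(self) -> FakeCursor:
        return FakeCursor(self)


class CountingFactory:
    """Factory that counts invocations so tests can assert reconnect behavior."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> FakeConnection:
        self.calls += 1
        return FakeConnection()
```

A test for the non-retry case could build the backend from FakeConnection([FakeProgrammingError("bad sql", errno=1003)]) plus a CountingFactory, then assert that the error propagates and that factory.calls stays at 0.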
🤖 Generated with Claude Code