feat(snowflake): self-heal on auth-token expiry via connection_factory #24

Open
vigneshnarayanaswamy wants to merge 1 commit into main from vigneshn/reconnect-on-auth-error
@vigneshnarayanaswamy

Collaborator

Problem

A long-lived SnowflakeLedgerBackend holds a single Snowflake session. When that session's auth token idle-expires, every subsequent statement fails with ProgrammingError errno 390114 ("Authentication token has expired") and stays broken until the process is restarted.

The driver's client_session_keep_alive heartbeat reduces this but cannot eliminate it: network blips, very long idle periods, and a stalled heartbeat thread all still let the token expire. There needs to be a backstop that lets the backend recover without a restart.

What this does

Adds an optional connection_factory: Callable[[], Connection] | None to SnowflakeLedgerBackend.__init__. The existing connection= parameter keeps working unchanged.

On a detected auth-expiry error, the central execute path transparently:

  1. calls connection_factory() for a fresh connection,
  2. swaps it in, and
  3. retries the same statement exactly once.

A second consecutive auth-expiry — or any other error — propagates. There is no retry loop.
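
The reconnect-and-retry-once flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ReconnectingBackend` is a hypothetical stand-in for SnowflakeLedgerBackend, the connection is assumed to expose a simple `execute(sql)` method (the real driver goes through cursors), and taking the first session from the factory when no connection is given is an assumption.

```python
from typing import Any, Callable, Optional

AUTH_EXPIRED_ERRNO = 390114


def looks_like_auth_expiry(exc: BaseException) -> bool:
    # Simplified check for this sketch: errno only.
    return getattr(exc, "errno", None) == AUTH_EXPIRED_ERRNO


class ReconnectingBackend:
    """Hypothetical stand-in for SnowflakeLedgerBackend."""

    def __init__(
        self,
        connection: Optional[Any] = None,
        connection_factory: Optional[Callable[[], Any]] = None,
    ) -> None:
        if connection is None and connection_factory is None:
            raise ValueError("pass connection= or connection_factory=")
        self._factory = connection_factory
        # Assumption: with only a factory, the first session comes from it.
        self._session = connection if connection is not None else connection_factory()

    def _exec(self, sql: str) -> Any:
        try:
            return self._session.execute(sql)
        except Exception as exc:
            if self._factory is None or not looks_like_auth_expiry(exc):
                raise  # no factory, or not an auth expiry: unchanged behavior
        # One reconnect, one retry. Any failure here propagates to the caller.
        self._session = self._factory()
        return self._session.execute(sql)
```

Because the retry sits outside the `try`, a second auth-expiry on the fresh connection surfaces directly instead of triggering another reconnect.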

Precise error matching

Detection is deliberately narrow so unrelated ProgrammingErrors (bad SQL, missing table, permission denied) are never retried:

  • match on errno 390114 (the authoritative signal); or
  • when the driver leaves errno unset, require both the code 390114 and the canonical phrase "authentication token has expired" in the message.

It never matches on the exception type alone.
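
A predicate along these lines would implement the two matching rules. It is a sketch under assumptions: the function name is hypothetical, "errno unset" is treated as `None` (the driver may use a different sentinel), and the exception-type check that a real implementation would presumably also perform is omitted to keep the example driver-free.

```python
AUTH_EXPIRED_ERRNO = 390114
AUTH_EXPIRED_PHRASE = "authentication token has expired"


def is_auth_expiry(exc: BaseException) -> bool:
    """Return True only for the auth-token-expired error, never for other failures."""
    errno = getattr(exc, "errno", None)
    if errno == AUTH_EXPIRED_ERRNO:
        return True  # authoritative signal
    if errno is not None:
        return False  # a different, explicit errno: not an auth expiry
    # errno unset (assumed None here): require BOTH the code and the phrase.
    message = str(exc).lower()
    return str(AUTH_EXPIRED_ERRNO) in message and AUTH_EXPIRED_PHRASE in message
```

Requiring both the code and the phrase in the fallback keeps an unrelated message that merely contains one of them from triggering a reconnect.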

Thread-safety

The connection swap is guarded by a lock. Before swapping, the reconnect re-checks the failing connection against the current session, so when several threads hit the same expired session concurrently, the factory is invoked exactly once and all threads continue on the same fresh connection.
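
The lock-plus-re-check pattern can be illustrated as below. `SessionHolder` and `reconnect` are hypothetical names for this sketch; the point is the identity comparison against the stale session the caller observed, which makes the second thread a no-op.

```python
import threading
from typing import Any, Callable


class SessionHolder:
    """Hypothetical holder showing a guarded connection swap."""

    def __init__(self, connection: Any, connection_factory: Callable[[], Any]) -> None:
        self._session = connection
        self._factory = connection_factory
        self._lock = threading.Lock()

    def reconnect(self, stale: Any) -> Any:
        """Replace `stale` with a fresh session unless another thread already did."""
        with self._lock:
            # Re-check: only swap if the failing session is still the current one.
            if self._session is stale:
                self._session = self._factory()
            return self._session
```

A thread that loses the race finds that `self._session` is no longer the session it saw fail, skips the factory call, and retries on the connection the winning thread installed.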

Composes with keep-alive

This does not replace client_session_keep_alive — it complements it. Heartbeats shrink the window for idle expiry; this path is the backstop for the residual cases the heartbeat can't cover.

No-factory = no behavior change

If only connection= is given (no factory), behavior is identical to before: an auth-expiry error propagates with no reconnect. No regression for existing callers.

Factory contract

connection_factory() must return a ready-to-use connection — same account/user/auth and, where relevant, warehouse, role, and current database as the original. The backend issues no session-setup (USE) statements and fully-qualifies every object as {schema}.OBJECT, so the factory owns all session configuration.
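
A factory that satisfies this contract might look like the sketch below. The environment variable names and the use of password authentication are assumptions for illustration; the keyword arguments are those of the driver's `snowflake.connector.connect`.

```python
import os
from typing import Any


def make_connection() -> Any:
    """Hypothetical connection_factory: returns a fully configured session."""
    # Imported lazily so this module loads even without the optional driver.
    import snowflake.connector

    return snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],  # env var names are assumptions
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
        role=os.environ["SNOWFLAKE_ROLE"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        # Keep the heartbeat on; the reconnect path is only the backstop.
        client_session_keep_alive=True,
    )
```

It would be passed as `connection_factory=make_connection`, alongside whatever other arguments the backend constructor requires. Since every call builds the session from the same settings, a reconnected session matches the original without the backend issuing any USE statements.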

Tests

Extends the backend suite with TestReconnectOnAuthExpiry using a fake cursor-style connection and factory (no live Snowflake, no optional-driver import):

  • factory not given → no reconnect, error propagates (old behavior);
  • auth-expiry once → reconnect + retry → success;
  • non-auth ProgrammingError → not retried, propagates, no reconnect;
  • second consecutive auth-expiry → propagates (retried exactly once);
  • errno-unset / message-only auth-expiry → still detected;
  • concurrency guard → two threads, exactly one reconnect, both end on the fresh session;
  • the non-result (write/DDL) path self-heals too;
  • constructing with neither connection nor factory raises.

Full suite, ruff (lint + format), mypy, and the coverage gate are green.

🤖 Generated with Claude Code

A long-lived SnowflakeLedgerBackend holds one session. When that session's
auth token idle-expires, every subsequent statement fails with
ProgrammingError errno 390114 ("Authentication token has expired") until the
process is restarted. client_session_keep_alive heartbeats reduce but cannot
eliminate this (network blips, very long idle, a stalled heartbeat thread).

Add an optional connection_factory: Callable[[], Connection] | None. When
given, the central execute path (_exec / _exec_no_result) detects an
auth-expiry error precisely (errno 390114, or the code + canonical message
when errno is unset — never the exception type alone), calls the factory for
a fresh connection, swaps it in, and retries the same statement exactly once.
A second consecutive auth-expiry, or any non-auth error, propagates.

The swap is guarded by a lock with a re-check against the observed stale
session, so concurrent callers trigger only one reconnect. If only connection=
is given (no factory), behavior is unchanged: the error propagates with no
reconnect, no regression.

Factory contract: connection_factory() returns a ready-to-use connection
(same account/user/auth and, where relevant, warehouse/role/current database).
The backend issues no USE statements and fully-qualifies every object as
{schema}.OBJECT, so the factory owns all session configuration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0cbedd5fdc

write_pandas(conn, df, "MODELS_STAGING", **wp_kwargs) # type: ignore[arg-type]
_exec_no_result(
self._session,
self._exec_no_result(

P1: Retry the whole pandas bulk write after reconnect

When the pandas bulk path is active, the temporary staging table and the write_pandas upload are both bound to the Snowflake session that created them. If this MERGE sees errno 390114, _exec_no_result now swaps self._session and retries only the MERGE. The fresh session has no temporary MODELS_STAGING data, so the retry fails with object-not-found instead of self-healing. The same pattern exists in _flush_snapshots_pandas. An expiry on the preceding CREATE is also a problem: the already-captured conn still points at the stale session when write_pandas runs. The reconnect needs to restart and rebind the entire create/upload/merge sequence.
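
One possible shape for that fix, sketched with hypothetical names: treat create/upload/merge as a single session-bound unit, and on an auth-expiry anywhere in it, reconnect and rerun every step against the fresh session. The helper and its parameters below are illustrative assumptions, not code from this PR.

```python
from typing import Any, Callable, Sequence

Step = Callable[[Any], None]  # each step receives the session it must run on


def run_session_bound_sequence(
    current_session: Callable[[], Any],
    reconnect: Callable[[Any], Any],
    is_auth_expiry: Callable[[BaseException], bool],
    steps: Sequence[Step],
) -> None:
    """Run all steps on one session; on auth-expiry, rerun all of them once."""
    session = current_session()
    try:
        for step in steps:
            step(session)
        return
    except Exception as exc:
        if not is_auth_expiry(exc):
            raise
    # Temporary objects died with the stale session, so start from the first step.
    fresh = reconnect(session)
    for step in steps:
        step(fresh)
```

Passing the session into each step avoids the captured-`conn` problem: the upload step always receives the session the sequence is currently running on. The rerun is only safe if the steps tolerate being repeated, which is an assumption about the staging statements.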
