Fix LoginWithFailover missing parser state check #4140

Open
paulmedynski wants to merge 4 commits into main from
dev/paul/4139-failover-parser-state-check
@paulmedynski
Contributor

@paulmedynski paulmedynski commented Apr 6, 2026

Fixes #4139

Description

LoginWithFailover() in SqlConnectionInternal.cs was missing the _parser?.State check that LoginNoFailover() already has. This caused transient errors (40613, 42108, 42109) to trigger failover alternation instead of being thrown immediately for the outer ConnectRetryCount loop to handle.

The code comment in LoginWithFailover() even notes: "The logic in this method is paralleled by the logic in LoginNoFailover. Changes to either one should be examined to see if they need to be reflected in the other."

Changes

Added the same _parser?.State is not TdsParserState.Closed check to the catch block in LoginWithFailover(), consistent with LoginNoFailover():

  • Login-phase errors (transient errors, explicit server rejections where parser is still open): throw immediately, handled by outer retry loop
  • Network errors (parser is closed): continue failover alternation as before
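
The gate described above can be sketched in isolation. This is a minimal illustration, not the driver's code: ParserState, Parser, and FailoverGate are hypothetical stand-ins for TdsParserState and the _parser field, and only the pattern expression mirrors the check quoted in this PR. Note that a null parser also counts as "not Closed", so it rethrows as well.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the driver's TdsParserState; only Closed matters here.
enum ParserState { Closed, OpenNotLoggedIn, OpenLoggedIn, Broken }

// Hypothetical stand-in for the driver's parser object.
sealed class Parser
{
    public ParserState State { get; set; }
}

static class FailoverGate
{
    // True: rethrow so the outer connect-retry loop handles the error.
    // False: the parser is closed (network failure), so keep alternating.
    public static bool ShouldRethrow(Parser? parser) =>
        parser?.State is not ParserState.Closed;
}

static class Program
{
    static void Main()
    {
        // Login-phase error: parser still open, so rethrow.
        Console.WriteLine(FailoverGate.ShouldRethrow(new Parser { State = ParserState.OpenNotLoggedIn })); // True
        // Network error: parser closed, so continue failover alternation.
        Console.WriteLine(FailoverGate.ShouldRethrow(new Parser { State = ParserState.Closed }));          // False
        // No parser at all: a null state is "not Closed", so rethrow.
        Console.WriteLine(FailoverGate.ShouldRethrow(null));                                               // True
    }
}
```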

Behavioral change

When using Failover Partner, login-phase SQL failures that arrive while the parser remains open are now treated as login failures (outer ConnectRetryCount path) instead of network-level failover signals.

Before this fix:

  • Certain transient login failures in LoginWithFailover() could enter failover alternation.

After this fix:

  • Login-phase SQL failures are rethrown and retried via the outer connect-retry loop.
  • Failover alternation remains reserved for network/connectivity failures where parser state is closed.
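
To make the before/after difference concrete, here is a toy model of the two loops. Everything in it is an assumption for illustration: Outcome, FailoverModel, the host names, and the fixed attempt cap (standing in for the login timeout) are invented, and the real outer loop also filters on transient error numbers and waits between retries, which this sketch omits. The scenario is a primary that returns one login-phase error and then recovers, with a partner that would accept the login.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical outcomes: LoginError leaves the parser open, NetworkError closes it.
enum Outcome { Success, LoginError, NetworkError }

static class FailoverModel
{
    sealed class LoginFailed : Exception { }

    // Inner loop: alternate primary/partner, capped at maxAttempts.
    static string LoginWithFailover(Func<string, Outcome> tryLogin, bool gate,
                                    List<string> trace, int maxAttempts = 4)
    {
        string[] hosts = { "primary", "partner" };
        for (int i = 0; i < maxAttempts; i++)
        {
            string host = hosts[i % 2];
            trace.Add(host);
            Outcome outcome = tryLogin(host);
            if (outcome == Outcome.Success) return host;
            // With the gate, a login-phase error leaves the alternation at once.
            if (gate && outcome == Outcome.LoginError) throw new LoginFailed();
        }
        throw new LoginFailed();
    }

    // Outer loop: one initial attempt plus retryCount retries.
    public static List<string> Open(Func<string, Outcome> tryLogin, bool gate, int retryCount)
    {
        var trace = new List<string>();
        for (int attempt = 0; attempt <= retryCount; attempt++)
        {
            try { LoginWithFailover(tryLogin, gate, trace); break; }
            catch (LoginFailed) when (attempt < retryCount) { /* fall through to retry */ }
        }
        return trace;
    }
}

static class Program
{
    // Primary fails its first login with a login-phase error, then succeeds.
    static Func<string, Outcome> TransientOnce()
    {
        int primaryCalls = 0;
        return host => host == "primary" && primaryCalls++ == 0
            ? Outcome.LoginError
            : Outcome.Success;
    }

    static void Main()
    {
        // Without the gate: the error moves the login to the partner.
        Console.WriteLine(string.Join(",", FailoverModel.Open(TransientOnce(), gate: false, retryCount: 1))); // primary,partner
        // With the gate: the error is rethrown and the primary is retried.
        Console.WriteLine(string.Join(",", FailoverModel.Open(TransientOnce(), gate: true, retryCount: 1)));  // primary,primary
    }
}
```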

Public docs analysis

Reviewed public Microsoft docs related to failover and login behavior:

  • SqlConnectionStringBuilder.FailoverPartner API docs explain high-level failover partner usage and mirroring prerequisites.
  • Connection pooling docs describe login error blocking period and fatal-error pool clearing behavior.
  • Connection string syntax docs cover keywords but not token-level failover decision rules.

Current docs do not explicitly define this internal boundary:

  • when a login-phase SQL error should be handled as a retry-on-primary login failure,
  • versus when a failure should be treated as network-level and trigger failover alternation.

This change aligns implementation with intended internal behavior and with LoginNoFailover() logic.

Testing

  • Added targeted simulated-server regression coverage for parser-state-gated failover behavior
  • Added async parity tests for login-phase transient fault behavior
  • Added pooling and retry-disabled variants for user-provided partner scenarios
  • Added simulator support for configurable login error severity to isolate non-fatal login-token behavior
  • Build validated for the unit tests project on net9.0

@paulmedynski paulmedynski added this to the 7.1.0-preview1 milestone Apr 6, 2026
@paulmedynski paulmedynski requested a review from a team as a code owner April 6, 2026 13:09
@github-project-automation github-project-automation Bot moved this to To triage in SqlClient Board Apr 6, 2026
Copilot AI review requested due to automatic review settings April 6, 2026 13:09
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a failover login behavior discrepancy in SqlConnectionInternal.LoginWithFailover() by aligning its error-handling logic with LoginNoFailover(), ensuring login-phase transient/server errors are thrown for the outer ConnectRetryCount retry loop instead of triggering failover alternation.

Changes:

  • Added a _parser?.State is not TdsParserState.Closed check in LoginWithFailover()’s SqlException catch block.
  • Documented the rationale in-code to distinguish login-phase errors from network-level failures during failover alternation.

@paulmedynski paulmedynski moved this from To triage to In review in SqlClient Board Apr 6, 2026
@paulmedynski paulmedynski enabled auto-merge (squash) April 24, 2026 11:35
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.44%. Comparing base (be95ca2) to head (c585c25).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4140      +/-   ##
==========================================
- Coverage   65.96%   64.44%   -1.53%     
==========================================
  Files         275      270       -5     
  Lines       42993    65779   +22786     
==========================================
+ Hits        28361    42391   +14030     
- Misses      14632    23388    +8756     
Flag                   Coverage Δ
CI-SqlClient           ?
PR-SqlClient-Project   64.44% <100.00%> (?)

@paulmedynski paulmedynski disabled auto-merge April 28, 2026 16:56
@paulmedynski paulmedynski enabled auto-merge (squash) May 5, 2026 11:56
Contributor

@mdaigle mdaigle left a comment

On second read, I would expect to see some tests requiring modification due to this change. Can you add test coverage if it's missing to prove the behavior is what we expect?

@github-project-automation github-project-automation Bot moved this from In review to In progress in SqlClient Board May 6, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 13:46
@paulmedynski paulmedynski force-pushed the dev/paul/4139-failover-parser-state-check branch from 371647f to 24d5a77 Compare May 7, 2026 13:46
IsEnabledTransientError = true,
Number = 40613,
// Use non-fatal severity so break/doom logic does not short-circuit the path.
ErrorClass = 16,
Contributor Author

This was the key to observing the old incorrect behaviour, where a non-doomed connection failed over to the secondary due to a transient non-fatal login error.

Contributor

interesting!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings May 7, 2026 14:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

@paulmedynski paulmedynski moved this from In progress to In review in SqlClient Board May 7, 2026
Contributor

@mdaigle mdaigle left a comment

This seems like the correct behavior. My final thought is that it may make sense to add an app context switch to allow customers to revert to the old behavior. I could easily imagine someone taking advantage of the old behavior to get auto-switching to a backup when they perform maintenance on the primary.

Alternatively, we should call this change out specifically in our release notes and make sure to update this documentation to be clearer about network vs. TDS error handling and ensure that other drivers are in alignment: https://learn.microsoft.com/en-us/sql/database-engine/database-mirroring/connect-clients-to-a-database-mirroring-session-sql-server?view=sql-server-ver17
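
For reference, the app context switch idea could look roughly like the sketch below. The switch name and the LocalSwitches class are hypothetical, invented for this illustration; only AppContext.TryGetSwitch and AppContext.SetSwitch are existing .NET APIs. A switch that was never set reads as false here, so the new behavior stays the default and the old one is opt-in.

```csharp
using System;

static class LocalSwitches
{
    // Hypothetical switch name, not one the driver defines.
    public const string LegacyFailoverName = "Switch.Example.UseLegacyLoginFailoverBehavior";

    // An unset switch reads as false, so the new behavior is the default.
    public static bool UseLegacyLoginFailover =>
        AppContext.TryGetSwitch(LegacyFailoverName, out bool enabled) && enabled;
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(LocalSwitches.UseLegacyLoginFailover); // False

        // An application would opt in once at startup, before opening connections.
        AppContext.SetSwitch(LocalSwitches.LegacyFailoverName, true);
        Console.WriteLine(LocalSwitches.UseLegacyLoginFailover); // True
    }
}
```

In this sketch the gate in the catch block would be skipped only when the switch reads true, which keeps the reverted path explicit and easy to remove later.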

Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

LoginWithFailover missing parser state check causes transient errors to trigger failover instead of retry

5 participants