Fix connect attempt retries by masaori335 · Pull Request #13102 · apache/trafficserver

masaori335 · 2026-04-20T04:54:22Z

Summary

We have three configs of connect attempt retry. However, prior to the change, none of them are implemented correctly. This PR makes the retry path honor the HostDBInfo state across the board.

connect_attempts_max_retries
connect_attempts_max_retries_down_server
connect_attempts_rr_retries

Changes

New helper HttpTransact::origin_server_connect_attempts_max_retries(State*) returns the retry limit based on the active HostDBInfo::State :
- UP → connect_attempts_max_retries
- SUSPECT → connect_attempts_max_retries_down_server
- DOWN → 0
HttpTransact::handle_response_from_server is refactored and retry logic is fixed.
Fix round-robin logic in ResolveInfo::select_next_rr(now, fail_window)
HttpSM::mark_host_failure now uses origin_server_connect_attempts_max_retries(s) + 1 as the DOWN threshold (was incorrectly using connect_attempts_rr_retries).
HttpSM::do_hostdb_update_if_necessary records failure time via ts_clock::now() instead of client_request_time so multiple connect attempts in one transaction get distinct timestamps.

Tests

New connect_attempts_single_max_retries.replay.yaml exercises connect_attempts_max_retries and connect_attempts_max_retries_down_server for single DNS Record (no round-robin case)
Updated connect_attempts_rr_retries.replay.yaml and gold files to reflect the corrected RR + SUSPECT behavior.
Updated connect_attempts_rr_max_retries gold to match the new state-aware fail-count threshold.

Dependency

This PR is in draft until below PRs are merged.

~~Cleanup serving stale while origin server down #13083~~ (update: this dependency is gone)
Clarify HostDBInfo state #13092

masaori335 · 2026-04-23T01:20:59Z

Now this PR is ready. I removed dependency of #13083. We can do it independently.

Copilot

Pull request overview

This PR fixes origin connect-attempt retry behavior so it consistently honors the active HostDBInfo health state (UP / SUSPECT / DOWN), and updates AuTest replay coverage to validate the corrected single-host and round-robin behavior.

Changes:

Add HttpTransact::origin_server_connect_attempts_max_retries(State*) and refactor server connect failure handling to use state-aware retry limits.
Fix round-robin selection to skip DOWN targets while allowing SUSPECT targets (ResolveInfo::select_next_rr(now, fail_window)), and update HostDB failure timestamping.
Add/adjust gold tests to cover single-record and RR retry behavior and updated failure-count thresholds.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/proxy/http/HttpTransact.cc`	Adds state-aware retry helper; refactors retry path and RR switching logic.
`include/proxy/http/HttpTransact.h`	Declares new helper and updates retry helper signature.
`src/proxy/http/HttpSM.cc`	Updates HostDB failure timestamping (`ts_clock::now()`), adjusts DOWN threshold logic.
`src/proxy/http/HttpConfig.cc`	Adds a config Warning related to RR + down-server retry configuration.
`src/iocore/hostdb/HostDB.cc`	Updates RR selection to be time/window aware and skip DOWN targets.
`include/iocore/hostdb/HostDBProcessor.h`	Updates `ResolveInfo::select_next_rr` signature to accept `(now, fail_window)`.
`tests/gold_tests/dns/replay/connect_attempts_single_max_retries.replay.yaml`	New replay to cover single-DNS-record retry/down-server behavior.
`tests/gold_tests/dns/replay/connect_attempts_rr_retries.replay.yaml`	Updates RR retry scenario to match corrected behavior.
`tests/gold_tests/dns/gold/connect_attempts_single_max_retries_error_log.gold`	New gold output for the single-record replay.
`tests/gold_tests/dns/gold/connect_attempts_rr_retries_error_log.gold`	Updates gold output to reflect corrected RR behavior.
`tests/gold_tests/dns/gold/connect_attempts_rr_max_retries_error_log.gold`	Updates gold output for new fail-count threshold.
`tests/gold_tests/dns/connect_attempts.test.py`	Registers the new replay test.
`tests/gold_tests/autest-site/verifier_client.test.ext`	Adds `poll_timeout` support to verifier-client process config for replays.

Copilot · 2026-04-27T22:33:02Z

+    unsigned max_connect_retries = s->txn_conf->connect_attempts_max_retries;
    TxnDbg(dbg_ctl_http_trans, "max_connect_retries: %d s->current.retry_attempts: %d", max_connect_retries,
           s->current.retry_attempts.get());

-    if (is_request_retryable(s) && s->current.retry_attempts.get() < max_connect_retries &&
-        !HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
-      // If this is a round robin DNS entry & we're tried configured
-      //    number of times, we should try another node
-      if (ResolveInfo::OS_Addr::TRY_CLIENT == s->dns_info.os_addr_style) {
-        // attempt was based on client supplied server address. Try again using HostDB.
-        // Allow DNS attempt
-        s->dns_info.resolved_p = false;
-        // See if we can get data from HostDB for this.
-        s->dns_info.os_addr_style = ResolveInfo::OS_Addr::TRY_HOSTDB;
-        // Force host resolution to have the same family as the client.
-        // Because this is a transparent connection, we can't switch address
-        // families - that is locked in by the client source address.
-        ats_force_order_by_family(s->current.server->dst_addr.family(), s->my_txn_conf().host_res_data.order);
-        return CallOSDNSLookup(s);
-      } else if (ResolveInfo::OS_Addr::USE_API == s->dns_info.os_addr_style && !s->api_server_addr_set_retried) {
-        // Plugin set the server address via TSHttpTxnServerAddrSet(). Clear resolution
-        // state to allow the OS_DNS hook to be called again, giving the plugin a chance
-        // to set a different server address for retry (issue #12611).
-        // Only retry once to avoid infinite loops if the plugin keeps setting failing addresses.
-        s->api_server_addr_set_retried = true;
-        s->dns_info.resolved_p         = false;
-        s->dns_info.os_addr_style      = ResolveInfo::OS_Addr::TRY_DEFAULT;
-        // Clear the server request so it can be rebuilt for the new destination
-        s->hdr_info.server_request.destroy();
-        TxnDbg(dbg_ctl_http_trans, "Retrying with plugin-set address, returning to OS_DNS hook");
-        return CallOSDNSLookup(s);
-      } else {
-        if ((s->txn_conf->connect_attempts_rr_retries > 0) &&
-            ((s->current.retry_attempts.get() + 1) % s->txn_conf->connect_attempts_rr_retries == 0)) {
-          s->dns_info.select_next_rr();
-        }
-        retry_server_connection_not_open(s, s->current.state, max_connect_retries);
-        TxnDbg(dbg_ctl_http_trans, "Error. Retrying...");
-        s->next_action = how_to_open_connection(s);
-      }
-    } else {
+    // Bail out if the request is not retryable, the global retry cap is reached, or we already have a usable response.
+    if (!is_request_retryable(s) || s->current.retry_attempts.get() >= max_connect_retries ||
+        HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
+      TxnDbg(dbg_ctl_http_trans, "Error. No more retries. %d/%d", s->current.retry_attempts.get(), max_connect_retries);


The retry cap here is initialized from connect_attempts_max_retries unconditionally, but the PR description says retry limits should be state-aware (UP/SUSPECT/DOWN). Using the UP config for the early bailout check can allow extra retries (or plugin/client-address re-resolution retries) beyond the intended per-host cap when the active HostDBInfo is SUSPECT/DOWN. Consider initializing max_connect_retries using origin_server_connect_attempts_max_retries(s) (with a deliberate fallback for non-HostDB targets, if needed).

Here, we need to use connect_attempts_max_retries unconditionally to give a chance for round-robin case which can be new HostDBInfo in UP state. The state-aware limit is set by line 4077 eventually.

// The active target (HostDB) may be SUSPECT state, so re-evaluate the retry limit. max_connect_retries = origin_server_connect_attempts_max_retries(s);

Copilot · 2026-04-27T22:33:03Z

+  uint8_t    max_connect_retries = HttpTransact::origin_server_connect_attempts_max_retries(&t_state);
+  ts_seconds fail_window         = t_state.txn_conf->down_server_timeout;
+
+  // Mark the host DOWN only after every attempt has failed. `max_connect_retries` counts only "retries", so the total attempt
+  // budget is `max_connect_retries + 1` (the initial connect plus each retry).
+  auto [down, fail_count] = info->active->increment_fail_count(time_down, max_connect_retries + 1, fail_window);
+


max_connect_retries is a uint8_t and is used as max_connect_retries + 1 for the attempt budget. If max_connect_retries is 255 (or has already wrapped due to narrowing), + 1 overflows back to 0, which would make the down-threshold computation incorrect. Consider computing the attempt budget in a wider integer type and saturating/clamping to UINT8_MAX before passing it to increment_fail_count().

I believe we addressed this by clamping the configuration to 0-255:
a1d98be

But maybe we need to use 0-254? If the user puts in 255 and we increment here, won't that overflow the uint8_t? (Or use a wider type.)

masaori335 · 2026-04-30T05:42:14Z

Copilot comments are addressed.

bneradt

Do we have to worry about "incompatible" here or do we consider these changes just fixers to behaviors making them what users should be expecting already?

Regardles, it would probably be good to update our release docs explaining that these settings are fixed.

bneradt · 2026-05-04T14:50:45Z

+      unsigned limit = active - rr_info.data();
+      size_t   idx   = (limit + 1) % rr_info.count();
+      for (; idx != limit; idx = (idx + 1) % rr_info.count()) {
+        if (!rr_info[idx].is_down(now, fail_window)) {
+          active = &rr_info[idx];
+          break;
+        }
+      }


Good loop refactor. Looks like limit can be const.

bneradt · 2026-05-04T15:23:58Z

+  uint8_t    max_connect_retries = HttpTransact::origin_server_connect_attempts_max_retries(&t_state);
+  ts_seconds fail_window         = t_state.txn_conf->down_server_timeout;
+
+  // Mark the host DOWN only after every attempt has failed. `max_connect_retries` counts only "retries", so the total attempt
+  // budget is `max_connect_retries + 1` (the initial connect plus each retry).
+  auto [down, fail_count] = info->active->increment_fail_count(time_down, max_connect_retries + 1, fail_window);
+


I believe we addressed this by clamping the configuration to 0-255:
a1d98be

But maybe we need to use 0-254? If the user puts in 255 and we increment here, won't that overflow the uint8_t? (Or use a wider type.)

bneradt · 2026-05-04T15:30:28Z

+uint8_t
+HttpTransact::origin_server_connect_attempts_max_retries(State *s)
+{
+  HostDBInfo *active = s->dns_info.active;
+  if (active == nullptr) {
+    return 0;
+  }
+


Codex says:

This re-evaluates the retry budget through origin_server_connect_attempts_max_retries(), which returns 0 when dns_info.active is null. That is a normal supported path for USE_CLIENT after transparent CTA fallback, and can also happen with plugin-supplied addresses after the one OS_DNS retry. In those cases the next failed connect immediately trips max_connect_retries <= retry_attempts and skips the configured connect_attempts_max_retries. I think the call site should keep the existing/global retry budget when there is no active HostDBInfo.
src/proxy/http/HttpTransact.cc:4074-4075

masaori335 added this to the 11.0.0 milestone Apr 20, 2026

masaori335 self-assigned this Apr 20, 2026

masaori335 added HTTP HostDB labels Apr 20, 2026

masaori335 added this to ATS v10.2.x Apr 20, 2026

masaori335 mentioned this pull request Apr 20, 2026

Add AuTest for connect_attempts_max_retries_down_server #12922

Closed

masaori335 removed this from ATS v10.2.x Apr 20, 2026

masaori335 force-pushed the asf-master-fix-connect-attempt-retry branch 2 times, most recently from d9fcfa6 to c033826 Compare April 22, 2026 04:48

Fix connect attempt retries

ab4427d

masaori335 force-pushed the asf-master-fix-connect-attempt-retry branch from c033826 to ab4427d Compare April 23, 2026 01:19

masaori335 marked this pull request as ready for review April 23, 2026 01:20

masaori335 added the Bug label Apr 23, 2026

bryancall requested review from bneradt and Copilot April 27, 2026 22:27

Copilot AI reviewed Apr 27, 2026

View reviewed changes

masaori335 added 4 commits April 30, 2026 14:41

Fix typo

e2cec75

Fix warning message

f2e831a

Fix a bug when round-robin is failed

c4d6e40

Clamp range of connect_attempts_max_retries

a1d98be

bneradt reviewed May 4, 2026

View reviewed changes

Conversation

masaori335 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Tests

Dependency

Uh oh!

masaori335 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

masaori335 Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

bneradt May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

masaori335 commented Apr 30, 2026

Uh oh!

bneradt left a comment

Choose a reason for hiding this comment

Uh oh!

bneradt May 4, 2026

Choose a reason for hiding this comment

Uh oh!

bneradt May 4, 2026

Choose a reason for hiding this comment

Uh oh!

bneradt May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

masaori335 commented Apr 20, 2026 •

edited

Loading

masaori335 Apr 30, 2026 •

edited

Loading