Skip to content

fix(sdk): ban rate-limited node for Envoy-advertised reset window#3951

Open
Claudius-Maginificent wants to merge 8 commits into
v3.1-devfrom
fix/rs-dapi-client-rate-limit-rotate
Open

fix(sdk): ban rate-limited node for Envoy-advertised reset window#3951
Claudius-Maginificent wants to merge 8 commits into
v3.1-devfrom
fix/rs-dapi-client-rate-limit-rotate

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

  • Problem: rs-dapi-client treats a gRPC ResourceExhausted (per-IP rate-limit / backpressure) the same as a node-down failure — it applies the 60s × e^ban_count health ban to the address. Banning a healthy-but-throttled node doesn't shed load, it relocates it onto the remaining nodes.
  • What breaks without it: Under a sustained per-IP rate limit, one high-concurrency client cascades to total failure: node A returns ResourceExhausted → banned 60s → its traffic shifts to B/C → they cross the limit → banned → … → NoAvailableAddressesToRetry, with zero server faults.
  • Blocking relationship: none. Independent fix off v3.1-dev.

Redesign note: an earlier revision of this PR rotated to a different node on ResourceExhausted (plus exclusion/backoff bookkeeping). Review feedback flagged that as over-engineered. This revision replaces it with a single, server-driven ban whose duration is dictated by the rate limiter itself. The intermediate commit history reflects that evolution; the net diff is the redesign only.

What was done?

A single rate-limit ban mechanism, driven by the duration Envoy already advertises:

  • CanRetry::rate_limit_ban_duration(&self) -> Option<Duration> (default None), implemented for tonic::Status (packages/rs-dapi-client/src/transport/grpc.rs): returns None unless the code is ResourceExhausted; otherwise parses the ratelimit-reset response-metadata header (whole seconds), filters out 0/non-numeric, and clamps to [1s, 600s]. Delegated unchanged through TransportError → DapiClientError / ExecutionError.
  • update_address_ban_status dispatch (dapi_client.rs): Some(period)AddressList::ban_for(address, period); None → the existing ban_with_reason exponential health-ban ladder (the fallback is the normal ladder, not a hardcoded default).
  • AddressList::ban_for / AddressStatus::ban_for (address_list.rs): sets banned_until = now + period and raises ban_count to a floor of 1 (so is_banned() stays consistent with banned_until). The rate-limit path is flat — it never inflates the exponential ladder. Reinstatement is the existing banned_until-expiry path.
  • dashmate (packages/dashmate/docker-compose.rate_limiter.yml): LIMIT_RESPONSE_HEADERS_ENABLED=true on the Lyft RLS container, so Envoy emits RateLimit-Reset (surfaced to the client as ratelimit-reset gRPC metadata). This is the operator-tunable knob requested in review.

Net effect: a throttled node is banned for exactly the window the server says it needs — no more, no less — instead of a fixed exponential health ban; genuine node ill-health still bans exactly as before.

How Has This Been Tested?

  • cargo test -p rs-dapi-client: 119 pass (106 unit + 5 rate_limit_ban integration + 3 failover + 5 doc-tests), 0 failures. cargo clippy -p rs-dapi-client -p dash-sdk -- -D warnings clean; cargo fmt clean.
  • Coverage: header → ban_for; clamp edges 1→1s / 600→600s / 601→600s; 0 / garbage / empty / missing header → ladder fallback (assertions bound banned_until to the ≈60s first ladder rung, so they fail if a bad header is wrongly routed to ban_for); full delegation chain; the ban_for ladder-floor side-effect (fresh node ban_count 0→1); and a banned address re-entering rotation after its window expires.
  • Operational dependency: the per-duration ban only engages where the rate limiter advertises RateLimit-Reset; the dashmate change above enables it. Where the header is absent, behaviour falls back to the unchanged exponential ban ladder.

Breaking Changes

None. Adds a CanRetry method with a default implementation; the genuine-failure ban path is unchanged.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes (n/a — no breaking changes)
  • I have made corresponding changes to the documentation if needed

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

Attribution

🤖 Co-authored by Claudius the Magnificent AI Agent

…limits

ResourceExhausted is a congestion/backpressure signal, not endpoint
ill-health. Banning a rate-limited (but healthy) node relocates its
load onto survivors and cascades to NoAvailableAddressesToRetry. Now
RE is classified as rate-limited: the node is not banned, the retry
rotates to a different node via explicit exclusion, and the existing
bounded retry count bounds an over-limit client. Genuine-failure ban
path (60s x e^n) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new is_rate_limited() method to the CanRetry trait and implements it from gRPC tonic::Status up through TransportError, DapiClientError, and ExecutionError. The update_address_ban_status function skips banning nodes on rate-limit signals, and the DapiClient retry loop tracks tried addresses using a new get_live_address_excluding API to rotate across distinct nodes. Rate-limited retries use capped exponential backoff with full jitter, while genuine failures maintain flat delays.

Changes

Rate-limit-aware retry rotation

Layer / File(s) Summary
CanRetry trait extension and gRPC classification
src/lib.rs, src/transport/grpc.rs, src/transport.rs, src/executor.rs
Adds is_rate_limited() -> bool default method to CanRetry; implements it for tonic::Status (ResourceExhausted only), TransportError (delegates to gRPC status), and ExecutionError (delegates to inner error); includes transport-layer unit tests validating the classification.
Exclusion-aware live address selection
src/address_list.rs
Adds get_live_address_excluding(&[Address]) filtering both ban-expired and explicitly excluded addresses; refactors get_live_address() to delegate to it; adds tests covering exclusion, ban interaction, and fallback rotation pattern across multiple attempts.
Rate-limit aware retry delay and execution loop
src/dapi_client.rs
Introduces rate-limit specific retry timing with capped exponential backoff and full jitter for rate-limited errors vs. flat delay for genuine failures; DapiClientError::is_rate_limited() delegates to TransportError; update_address_ban_status skips banning for rate-limited errors; DapiClient::execute tracks tried addresses and rotates using get_live_address_excluding with fallback, recording each tried address after transport-client creation failures and after transport execution failures.
Unit test invariants
src/dapi_client.rs
Tests verify is_rate_limited() delegation, that genuine failures still increment ban_count exponentially, and that repeated ResourceExhausted errors never set banned_until or reduce the live pool.
Integration tests: rotation and sustained congestion
tests/rate_limit_rotation.rs
Fake transport returns RESOURCE_EXHAUSTED for configurable attempts; asserts rotation to a different healthy node and that system-wide congestion surfaces a retryable error without banning any address.
SDK error layer rate-limit detection
src/error.rs
SDK Error::is_rate_limited() delegates to wrapped DapiClientError when applicable, enabling upper-layer retry logic to recognize rate-limited responses.

Sequence Diagram(s)

sequenceDiagram
  participant App
  participant DapiClient
  participant AddressList
  participant TransportClient
  participant Node

  App->>DapiClient: execute(request)
  DapiClient->>AddressList: get_live_address_excluding(tried)
  AddressList-->>DapiClient: address_A
  DapiClient->>TransportClient: create(address_A)
  TransportClient->>Node: send request
  Node-->>TransportClient: ResourceExhausted
  TransportClient-->>DapiClient: TransportError::Grpc(ResourceExhausted)
  DapiClient->>DapiClient: is_rate_limited() == true → skip ban, push address_A to tried
  DapiClient->>DapiClient: compute delay with jitter, sleep
  DapiClient->>AddressList: get_live_address_excluding([address_A])
  AddressList-->>DapiClient: address_B
  DapiClient->>TransportClient: create(address_B)
  TransportClient->>Node: send request
  Node-->>TransportClient: Ok(response)
  TransportClient-->>DapiClient: Ok(response)
  DapiClient-->>App: ExecutionResponse { inner, retries: 1 }
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Possibly related PRs

  • dashpay/platform#3890: Both PRs modify update_address_ban_status in dapi_client.rs and the address banning logic in address_list.rs; one adds ban-reason threading while this one adds rate-limit bypass in the same function.

Suggested reviewers

  • shumkov
  • lklimek
  • llbartekll
  • ZocoLini

Poem

🐇 Hop, hop, skip the ban!
When throttled nodes cry "ResourceExhausted," friend,
I twirl my ears and rotate 'round the plan —
No banning healthy bunnies 'til the end.
The tried list grows, the pool stays full and grand,
Congestion fades — more carrots close at hand! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title claims to 'ban rate-limited nodes,' but the PR objectives explicitly state rate-limited errors should NOT be banned; they should trigger rotation/retry instead. The title describes the opposite of what the PR implements. Revise the title to accurately reflect the fix: e.g., 'fix(sdk): rotate to different node on rate-limit instead of banning' or 'fix(sdk): distinguish rate-limited errors from node failures to prevent cascading bans.'
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rs-dapi-client-rate-limit-rotate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@lklimek lklimek requested a review from Copilot June 22, 2026 10:51
@lklimek lklimek marked this pull request as ready for review June 22, 2026 10:51
@lklimek lklimek requested a review from QuantumExplorer as a code owner June 22, 2026 10:51
@thepastaclaw

thepastaclaw commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

✅ Review complete (commit dade457)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts rs-dapi-client retry/failover behavior so gRPC ResourceExhausted (rate-limiting / backpressure) no longer triggers the exponential “health ban” logic, preventing ban-cascades under sustained rate-limits while keeping genuine node-failure banning unchanged.

Changes:

  • Introduces a new CanRetry::is_rate_limited() classification (default false), implemented for gRPC Status::ResourceExhausted and delegated through TransportError, DapiClientError, and ExecutionError.
  • Updates update_address_ban_status to not ban rate-limited addresses (log + rotate instead), preserving the existing ban ladder for genuine failures.
  • Adds address-rotation across retries via AddressList::get_live_address_excluding(&[Address]), plus new unit/integration tests to assert both invariants and rotation behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
packages/rs-dapi-client/tests/rate_limit_rotation.rs New integration tests driving the real executor retry loop with a fake transport to validate “rotate, don’t ban” for ResourceExhausted.
packages/rs-dapi-client/src/transport/grpc.rs Implements CanRetry::is_rate_limited() for gRPC Status (ResourceExhausted only).
packages/rs-dapi-client/src/transport.rs Delegates is_rate_limited() for TransportError and adds unit tests pinning the “ResourceExhausted only” classification.
packages/rs-dapi-client/src/lib.rs Extends the public CanRetry trait with is_rate_limited() (default false) and documents intended semantics.
packages/rs-dapi-client/src/executor.rs Delegates is_rate_limited() through ExecutionError<E>.
packages/rs-dapi-client/src/dapi_client.rs Skips banning for rate-limited errors and rotates retries away from already-tried addresses. Adds invariants-focused tests for ban ladder vs rate-limit behavior.
packages/rs-dapi-client/src/address_list.rs Adds get_live_address_excluding and tests ensuring exclusion/rotation semantics and graceful fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +470 to +474
let info = address_list.ban_info();
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same brittle test-lookup as the parallel thread — acknowledged, left as-is in this pass (test-only, no production impact). Leaving open.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Narrow, well-tested fix that adds an is_rate_limited axis to CanRetry, suppresses banning for gRPC ResourceExhausted, and rotates the next retry off the throttled node via get_live_address_excluding plus a per-execution tried set. The default trait impl preserves prior behavior and the genuine-failure ban ladder is pinned by tests. Only minor docs/test-robustness nits — nothing blocking.

🟡 1 suggestion(s) | 💬 4 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/lib.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/lib.rs:107-109: Pin `is_rate_limited ⇒ can_retry` with a test or debug_assert
  The fix is load-bearing on the implicit contract that any rate-limited error is also retryable: `update_address_ban_status` only consults `is_rate_limited()` from inside the `if error.can_retry()` branch. If a future change ever marks `ResourceExhausted` non-retryable (e.g. a tweak to `Status::can_retry`), rate-limit handling silently stops firing and the error falls through to the no-op tail — neither ban nor rotation happens. Lock the invariant with either a unit test asserting `is_rate_limited() ⇒ can_retry()` across the gRPC code matrix, or a `debug_assert!(!self.is_rate_limited() || self.can_retry())` on the hot path, so the two predicates can't drift unnoticed.

Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment on lines +105 to +106
/// Instead the retry should rotate to a *different* node and leave the
/// throttled one in the pool. See `update_address_ban_status`.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: Doc overstates the rotation guarantee

The doc states the retry "should rotate to a different node and leave the throttled one in the pool." The executor's documented fallback in dapi_client.rs:761 explicitly re-selects the throttled node when get_live_address_excluding(&tried) returns None (single-node pool, or every live node already in tried). Rotation is best-effort, not guaranteed. A one-line caveat keeps the trait contract honest, e.g. "... and prefer rotating to a different node when one is available."

source: ['claude']

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially addressed: M2 makes the retry genuinely rotate to a different live node when one exists (396865f). The doc wording at this site still does not spell out the single-node-pool fallback (where the sole live address is reused with backoff) — that doc caveat was intentionally not added in this pass. Flagging as a known doc nit; leaving open.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in this update — Doc overstates the rotation guarantee no longer present.

Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.

Comment on lines +471 to +474
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: Match the ban-info test entry by exact URI

The invariant test locates the address under test with i.uri.contains("3000"). The current fixture only inserts one address, so this is correct today, but it ties the assertion to a magic substring inside mock_address(). A future fixture adding another URI containing 3000, or a rename of mock_address, would silently match the wrong entry. Match the exact URI from addr so the test is tied to the value it actually inserted.

Suggested change
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");
let info = address_list.ban_info();
let addr_uri = addr.to_string();
let entry = info
.iter()
.find(|i| i.uri == addr_uri)
.expect("address present in ban info");

source: ['codex']

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — the i.uri.contains("3000") lookup is brittle. It is test-only and unaffected by the production rate-limit changes, so it was left as-is in this pass. Leaving open.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 396865fMatch the ban-info test entry by exact URI no longer present.

Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.

Comment on lines +758 to +762
let Some(address) = self
.address_list
.get_live_address_excluding(&tried)
.or_else(|| self.address_list.get_live_address())
else {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: Fallback re-picks the throttled node when exclusion empties the live pool

When get_live_address_excluding(&tried) returns None, the or_else(|| self.address_list.get_live_address()) fallback can hand back an address already present in tried — i.e. the just-rate-limited node — and tried.push(address.clone()) then appends a duplicate. With a fixed 10ms retry_delay, the remaining retry budget collapses onto the throttled node within tens of ms, which adds no value against a per-minute Envoy rate-limit. The behavior is intentionally documented and bounded by max_retries, so it is non-blocking. Two cheap improvements worth considering: (a) when the exclusion empties the pool, surface the raw ResourceExhausted immediately rather than spinning on the same node; or (b) switch tried to a HashSet<Address> so the contract "already attempted" stays exact (also makes the intent self-documenting).

source: ['claude']

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 396865f/b8a8c329: the rate-limit retry path now uses a 3-tier selection that rotates off the just-throttled node whenever any alternative live address exists, plus capped exponential backoff (10 ms → 500 ms) with full jitter. tried is intentionally kept as a Vec — it is bounded by max_retries and the adversarial review agreed a HashSet is not warranted here. Leaving this thread open for maintainer confirmation.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 396865fFallback re-picks the throttled node when exclusion empties the live pool no longer present.

Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.

Comment on lines +461 to +498
let before = chrono::Utc::now();
let result: ExecutionResult<i32, DapiClientError> = Err(ExecutionError {
inner: DapiClientError::Transport(TransportError::Grpc(status)),
retries: 0,
address: Some(addr.clone()),
});
update_address_ban_status(&address_list, &result, &make_applied_settings(true));
let after = chrono::Utc::now();

let info = address_list.ban_info();
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");

assert!(entry.banned, "genuine failure must ban the node");
assert_eq!(
entry.ban_count, expected_count,
"ban_count must increment on each genuine failure"
);

// Window must equal base × e^(ban_count - 1). Since
// banned_until = t_call + period with before <= t_call <= after:
// (banned_until - before) >= period and
// (banned_until - after) <= period.
let period = base_secs * ((expected_count - 1) as f64).exp();
let banned_until = entry.banned_until.expect("banned_until is set");
let lower = (banned_until - before).num_milliseconds() as f64 / 1000.0;
let upper = (banned_until - after).num_milliseconds() as f64 / 1000.0;
assert!(
lower >= period - 0.05,
"ban #{expected_count} window lower bound {lower}s < expected {period}s",
);
assert!(
upper <= period + 0.05,
"ban #{expected_count} window upper bound {upper}s > expected {period}s",
);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: Wall-clock tolerance on ban-window assertion is thin for CI

The invariant test snapshots chrono::Utc::now() before/after update_address_ban_status and asserts the ban window is within ±50 ms of base × e^(n-1). On a loaded shared CI runner a 50 ms scheduling stall between the before snapshot and the ban write is plausible and would flip the lower-bound check. The assertion itself is exact, but the wall-clock dependence is what flakes. Consider widening the tolerance (e.g. to a few hundred ms) or recomputing the period from a directly observable banned_at field if one is exposed. Not blocking — current headroom is empirically fine — but worth a follow-up if these ever flake.

source: ['claude']

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left unchanged in this pass — the ±50 ms wall-clock tolerance is a known timing tradeoff in this test. Not modified by the rate-limit fix; leaving open for the maintainer's call.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 396865fWall-clock tolerance on ban-window assertion is thin for CI no longer present.

Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.

@lklimek lklimek changed the title fix(rs-dapi-client): rotate instead of ban on ResourceExhausted rate-limits fix(sdk): rotate instead of ban on ResourceExhausted rate-limits Jun 22, 2026
lklimek and others added 2 commits June 22, 2026 16:55
…limit retry, lock can_retry invariant

Three review fixes for the rotate-instead-of-ban rate-limit handling.

M1 (invariant lock): the rotate-don't-ban behavior depends on
`is_rate_limited() => can_retry()`. Add a debug_assert! at the rotation
decision in `execute` and an exhaustive property test
(`test_rate_limit_implies_retryable_invariant`) covering every gRPC code
across tonic::Status, TransportError, DapiClientError and ExecutionError,
so a future widening of `is_rate_limited` (or narrowing of `can_retry`)
fails loudly instead of silently killing rotation.

M2 (rotate off the throttled node + backoff): on a rate-limited retry the
selection no longer re-picks the just-throttled node in small pools — the
fallback now excludes the just-tried address and only reuses it when it is
the single live node. The rate-limit retry path also replaces the flat
10ms delay with capped exponential backoff (base 10ms, x2 per attempt,
capped at 500ms) plus full jitter, so a congested fleet de-correlates
instead of retrying in lockstep. Genuine-failure delays are unchanged.
Server-side RetryInfo is left as a documented TODO (needs metadata
plumbing).

M3 (rs-sdk propagation): override `is_rate_limited` on the SDK `Error` to
delegate to the wrapped DapiClientError, so the rotate-don't-ban semantics
survive at the SDK layer independently of how `can_retry` is defined.
Pure future-proofing (today the can_retry guard short-circuits first);
locked with a regression test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iant assert

Two self-introduced nits from the verify pass.

1. `rate_limit_backoff_window`'s `checked_shl` guards only the shift AMOUNT,
   not value bit-loss: a large `attempt` (e.g. 63) shifts the set bits out of
   `10 << attempt`, collapsing the window to 0ms — no backoff at all.
   Pathological (needs retries >= 64; default is 5) but it's a footgun the
   backoff math created. Clamp the shift amount at 16 (already saturates the
   500ms cap, so nothing changes in the valid range) and extend the unit test
   with the attempt=63/64 boundary asserting the window stays at the cap.

2. Add the symmetric `debug_assert!(!is_rate_limited() || can_retry())` to the
   transport-client-creation error block for parity with the transport-error
   block. Defense-in-depth; the property test already covers the invariant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek self-requested a review as a code owner June 22, 2026 15:10
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Review fixes (adversarial review pass)

Addressed reviewer findings (CodeRabbit / thepastaclaw / internal review). All MEDIUM; two LOW left open by design.

M1 — is_rate_limited ⇒ can_retry invariant locked (thepastaclaw)

  • Exhaustive property test over 17 gRPC codes × 4 CanRetry impls (non-vacuous) + debug_assert at the rotation decision in the typed execute() loop.

M2 — rate-limit retry hardening

  • Retry now rotates to a different node (3-tier selection; never re-picks the just-throttled node while any alternative exists) — closes a gap where small/single-node pools re-hit the throttled node.
  • Flat 10 ms retry replaced with capped exponential backoff (10 ms → 500 ms) + full jitter on the rate-limit path only; genuine-failure path unchanged. Shift amount clamped to avoid pathological window collapse. Server RetryInfo honoring left as a documented TODO (needs grpc-status-details decoding).

M3 — rs-sdk rate-limit classification propagated

  • Error::is_rate_limited() now delegates to the inner DapiClientError, future-proofing against a latent SDK-layer ban regression. Zero behavior change today (no consumer reads it outside if can_retry()).

Left open (LOW, by design): ban-evasion blind spot for a node that always returns ResourceExhausted; .read().unwrap() panic on a poisoned routing lock (TODO breadcrumb added).

Commits: 396865fc, b8a8c329. Tests: clippy -D warnings clean; 110 lib + 7 integration + 5 doc-tests pass.

🤖 Co-authored by Claudius the Magnificent AI Agent

@lklimek lklimek added the ready for final review Ready for the final review. If AI was involved in producing this PR, it has already had a reviewer. label Jun 22, 2026

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Cumulative review at b8a8c32. The latest delta hardens the rate-limit handling: capped exponential backoff with full jitter, three-tier address selection (excluding tried → just-tried → fallback), debug_assert! invariants on both retry arms, and a property test pinning is_rate_limited() ⇒ can_retry() across all CanRetry implementors. Prior findings #1 and #4 are resolved; #2, #3, #5 are still valid nitpicks. One new latest-delta finding: the new AddressStatus ban-ladder test inherits the same thin 50ms wall-clock tolerance. No blocking issues.

💬 3 nitpick(s)

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment on lines +101 to +106
/// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but,
/// unlike genuine node ill-health, the node must NOT be banned: it is
/// healthy, just throttled. Banning it would shift its load onto the
/// remaining nodes and cascade into `NoAvailableAddressesToRetry`.
/// Instead the retry should rotate to a *different* node and leave the
/// throttled one in the pool. See `update_address_ban_status`.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: CanRetry::is_rate_limited doc asserts unconditional rotation

The doc says Instead the retry should rotate to a *different* node and leave the throttled one in the pool. The executor at dapi_client.rs:980-990 has a three-tier fallback whose final tier (get_live_address()) intentionally returns the just-tried address when no alternative is live (single-node pool, or every live node already in tried). Rotation is best-effort, not guaranteed. Soften the contract so future maintainers don't read it as a hard invariant.

Suggested change
/// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but,
/// unlike genuine node ill-health, the node must NOT be banned: it is
/// healthy, just throttled. Banning it would shift its load onto the
/// remaining nodes and cascade into `NoAvailableAddressesToRetry`.
/// Instead the retry should rotate to a *different* node and leave the
/// throttled one in the pool. See `update_address_ban_status`.
/// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but,
/// unlike genuine node ill-health, the node must NOT be banned: it is
/// healthy, just throttled. Banning it would shift its load onto the
/// remaining nodes and cascade into `NoAvailableAddressesToRetry`.
/// Instead the retry path should prefer a different live node while leaving
/// the throttled one in the pool; if no alternative exists, bounded retries
/// may reuse it after backoff. See `update_address_ban_status`.

source: ['claude', 'codex']

Comment on lines +686 to +689
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: Ban-info invariant test matches by URI substring instead of the inserted address

info.iter().find(|i| i.uri.contains("3000")).expect("address present in ban info") ties the assertion to a magic substring inside mock_address(). The test owns the addr it inserted; matching by exact URI is both safer and self-documenting. A future fixture adding another URI that happens to contain 3000, or a rename of mock_address, would silently match the wrong entry.

Suggested change
let entry = info
.iter()
.find(|i| i.uri.contains("3000"))
.expect("address present in ban info");
let entry = info
.iter()
.find(|i| i.uri == addr.to_string())
.expect("address present in ban info");

source: ['claude', 'codex']

Comment on lines +1125 to +1136
match execution_result {
Ok(response) => return Ok(response),
Err(error) => {
// Invariant lock (load-bearing): a rate-limited error MUST
// be retryable, otherwise the rotation below never fires
// and the rotate-don't-ban behavior silently dies. See the
// `CanRetry` docs and the `test_rate_limit_implies_retryable_invariant`
// property test.
debug_assert!(
!error.is_rate_limited() || error.can_retry(),
"is_rate_limited() must imply can_retry(); rate-limit rotation depends on it"
);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Nitpick: debug_assert! placement is inconsistent between the two retry arms

The transport-create-error arm (line 1049) places the debug_assert!(!cloned_error.is_rate_limited() || cloned_error.can_retry(), ...) before update_address_ban_status, while the transport-call-error arm (line 1133) places the equivalent assert after update_address_ban_status. Functionally equivalent (the assert reads nothing the ban-status update mutates), but the two arms otherwise mirror each other. Aligning placement removes a small future-maintainer puzzle about whether the ordering was intentional.

source: ['claude']

@thepastaclaw

Copy link
Copy Markdown
Collaborator

Opened #3956 against this branch to address the Copilot note by matching the ban-info entry on the exact URI instead of a substring search.

lklimek and others added 3 commits June 23, 2026 14:30
…an-for-duration

Replace the "rotate-don't-ban" approach with "ban for server-dictated window":

- Add `rate_limit.rs`: decode `google.rpc.RetryInfo` from gRPC status details
  via minimal prost structs; `DAPI_RATE_LIMIT_BAN_MS` env-var fallback (default
  60 s) when no RetryInfo is present; WASM-compatible guard on env-var read.
- Add `AddressStatus::ban_for_duration()` + `AddressList::ban_for_duration()`:
  set `banned_until` without touching `ban_count` so the health-ban exponential
  ladder is never disturbed by rate-limit events.
- Add `CanRetry::rate_limit_ban_duration()` (default `None`) and wire it
  through `tonic::Status`, `TransportError`, `ExecutionError`, `DapiClientError`
  and `dash_sdk::Error`.
- Update `update_address_ban_status`: `ResourceExhausted` now calls
  `ban_for_duration(server_hint.unwrap_or(fallback))` instead of skipping.
- Remove rotation machinery: `tried: Vec<Address>`, 3-tier
  `get_live_address_excluding` selection, jitter backoff (`retry_delay`,
  `rate_limit_backoff_window`, `RATE_LIMIT_MAX_DELAY_MS`), and rand import.
  The retry loop now always calls `get_live_address()` — rate-limited nodes are
  naturally absent from the live pool until their ban window expires.
- Fix `AddressBanInfo.banned`: now consistent with `get_live_address` filtering
  (checks only `banned_until`, not `ban_count > 0`), so rate-limit bans are
  visible in diagnostics.
- Delete `tests/rate_limit_rotation.rs`; update/add invariant tests.

All 419 unit/integration/doc tests pass. Zero clippy warnings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r; drop hand-rolled rotation

Replace prost/RetryInfo-based approach with direct `ratelimit-reset` HTTP
header parsing.  No hand-rolled rotation machinery — banning a node already
achieves rotation because `get_live_address()` skips banned nodes.

Changes:
- Delete `rate_limit.rs` (prost/RetryInfo decoder) and remove prost dep
- Replace `is_rate_limited() -> bool` on `CanRetry` trait with
  `rate_limit_ban_duration() -> Option<Duration>` (default `None`)
- `tonic::Status` impl: if `ResourceExhausted` + `ratelimit-reset` header
  is a parseable u64 > 0, return `Some(clamped [1s, 600s])`; else `None`
- Add `AddressStatus::ban_for(period, reason)` + `AddressList::ban_for`
  that sets `banned_until = now + period`, `ban_count = max(ban_count, 1)`;
  does NOT inflate the exponential health-ban ladder
- `update_address_ban_status`: dispatch to `ban_for` on `Some(period)`,
  fall back to existing `ban_with_reason` on `None`
- Revert execute loop to base: flat 10ms retry delay, plain `get_live_address()`
- Revert `rs-sdk/src/error.rs` to base (banning stays inside rs-dapi-client)
- Add `LIMIT_RESPONSE_HEADERS_ENABLED=true` to dashmate rate-limiter compose
- Replace deleted `tests/rate_limit_rotation.rs` with
  `tests/rate_limit_ban.rs` (5 focused tests for the header-parse path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fixes

QA-001: Correct `AddressStatus::ban_for` / `AddressList::ban_for` docstrings —
  the "no ladder inflation" claim was false for fresh nodes (ban_count 0 → 1
  raises the next genuine health ban from ~60 s to ~163 s). Docs now describe
  the floor/side-effect accurately. New test pins the 0→1 ladder-floor behaviour.

QA-002: Fix `MIN_RATE_LIMIT_BAN_SECS` comment — the `> 0` filter before the
  clamp already rejects 0, so MIN=1 is a documentation/intent marker, not an
  active lower clamp.

QA-003: De-tautologize "bad/zero/missing header → ladder" tests. Previous
  assertions (ban_count == 1) passed for BOTH ban_for and ban_with_reason.
  Now also assert the banned_until window is ~60 s (first ladder rung), which
  would fail if a bad header were wrongly routed to ban_for.

QA-004: Add clamp-edge coverage: 1 → 1 s, 600 → 600 s, 601 → 600 s (was only
  45 and 9999).

QA-005: Add test that a ban_for'd address re-enters rotation via
  get_live_address after its window expires (zero-duration re-ban simulates
  an expired window).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Claudius-Maginificent Claudius-Maginificent changed the title fix(sdk): rotate instead of ban on ResourceExhausted rate-limits fix(rs-dapi-client): ban rate-limited node for Envoy-advertised reset window Jun 23, 2026
@lklimek

lklimek commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@thepastaclaw review most recent commit,

@lklimek lklimek changed the title fix(rs-dapi-client): ban rate-limited node for Envoy-advertised reset window fix(dapi-client): ban rate-limited node for Envoy-advertised reset window Jun 23, 2026
@lklimek lklimek changed the title fix(dapi-client): ban rate-limited node for Envoy-advertised reset window fix(sdk): ban rate-limited node for Envoy-advertised reset window Jun 23, 2026

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

New latest-delta verification found one in-scope blocking state-management bug: the new exact-window rate-limit ban can overwrite a longer active ban. Two non-blocking latest-delta issues remain around WASM/browser header visibility and misleading logging, and the carried-forward wall-clock tolerance test flake is still present. No CodeRabbit inline findings or comment IDs were supplied.

🔴 1 blocking | 🟡 2 suggestion(s)

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/address_list.rs`:
- [BLOCKING] packages/rs-dapi-client/src/address_list.rs:116-119: Rate-limit ban can shorten an active longer ban
  `ban_for` always replaces `banned_until` with `now + period`. Because request completions can arrive out of order for the same shared `AddressList`, a later `ResourceExhausted` response with a short `ratelimit-reset` value can shorten a longer health-ban or longer prior rate-limit ban that was already applied. That reintroduces the address before the existing ban window expires, so genuine failure bans are no longer preserved under overlapping requests.

In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose rate-limit reset headers to browser clients
  The PR enables the rate limiter to emit `RateLimit-Reset`, and the Rust client parses it from tonic metadata, but browser/WASM clients go through fetch/CORS. Since Envoy's CORS `expose_headers` list does not include `ratelimit-reset` or the other `ratelimit-*` headers, browser clients cannot read the advertised reset window and will fall back to the exponential health-ban path even when Envoy sent the exact duration.

In `packages/rs-dapi-client/src/dapi_client.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/dapi_client.rs:213-221: Rate-limit ban log is emitted before the ban succeeds
  The `Some(period)` branch logs that it is banning the address before calling `address_list.ban_for`. If the address was removed by an SML update, `ban_for` returns `false` and the code then logs that it was unable to ban the same address. Move the rate-limit-specific debug log behind the `banned` check so logs describe the state transition that actually happened.

Comment on lines +116 to +119
pub fn ban_for(&mut self, period: Duration, reason: Option<String>) {
self.banned_until = Some(chrono::Utc::now() + period);
self.ban_count = self.ban_count.max(1);
self.ban_reason = reason;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking: Rate-limit ban can shorten an active longer ban

ban_for always replaces banned_until with now + period. Because request completions can arrive out of order for the same shared AddressList, a later ResourceExhausted response with a short ratelimit-reset value can shorten a longer health-ban or longer prior rate-limit ban that was already applied. That reintroduces the address before the existing ban window expires, so genuine failure bans are no longer preserved under overlapping requests.

Suggested change
pub fn ban_for(&mut self, period: Duration, reason: Option<String>) {
self.banned_until = Some(chrono::Utc::now() + period);
self.ban_count = self.ban_count.max(1);
self.ban_reason = reason;
pub fn ban_for(&mut self, period: Duration, reason: Option<String>) {
let advertised_until = chrono::Utc::now() + period;
if self
.banned_until
.map(|current_until| current_until < advertised_until)
.unwrap_or(true)
{
self.banned_until = Some(advertised_until);
self.ban_reason = reason;
}
self.ban_count = self.ban_count.max(1);
}

source: ['codex']

Comment on lines +213 to +221
Some(period) => {
tracing::debug!(
?address,
ban_secs = period.as_secs(),
"rate-limited (ResourceExhausted): banning {address} \
for {}s (from RateLimit-Reset header)",
period.as_secs()
);
address_list.ban_for(address, period, reason)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Rate-limit ban log is emitted before the ban succeeds

The Some(period) branch logs that it is banning the address before calling address_list.ban_for. If the address was removed by an SML update, ban_for returns false and the code then logs that it was unable to ban the same address. Move the rate-limit-specific debug log behind the banned check so logs describe the state transition that actually happened.

source: ['claude']

# Emit RateLimit-Limit / RateLimit-Remaining / RateLimit-Reset response
# headers so rs-dapi-client can read the exact reset window and ban the
# node for that duration instead of the exponential health-ban ladder.
- LIMIT_RESPONSE_HEADERS_ENABLED=true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I suggested the wrong direction. I mean, the ban system is correct, but exposing headers that are disabled by default and enabling them via env is risky. Could you please investigate what consequences and why it's disabled by default?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready for final review Ready for the final review. If AI was involved in producing this PR, it has already had a reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants