Skip to content

fix(client): retry 429 responses and honour Retry-After#566

Open
RapidPoseidon wants to merge 1 commit intomainfrom
poseidon/retry-429-rate-limit
Open

fix(client): retry 429 responses and honour Retry-After#566
RapidPoseidon wants to merge 1 commit intomainfrom
poseidon/retry-429-rate-limit

Conversation

@RapidPoseidon
Copy link
Copy Markdown
Contributor

Summary

_RETRYABLE_STATUS_CODES = {502, 503, 504} excluded 429. Any rate-limit response from the backend bubbled up as a hard error even though the SDK already has an exponential-backoff loop perfectly suited to it.

Fix

  • Add 429 to _RETRYABLE_STATUS_CODES.
  • When the response carries a Retry-After header (common for 429), prefer that value over the exponential delay. Bounded by 60 s so a hostile or broken server can't extend the retry loop beyond a reasonable ceiling.
  • Integer-seconds form of Retry-After only; HTTP-date form falls back to the existing exponential backoff schedule.
  • Mirrored into openapi/templates/rest.mustache.

Test plan

  • uv run pyright src/rapidata/rapidata_client → 0 errors
  • Walked the branches: Retry-After: 5 → 5s, Retry-After: 999 → 60s cap, Retry-After: "Wed, ..." → exponential fallback, no header → exponential.

🔗 Session: https://session-bc38cc85.poseidon.rapidata.internal/

`_RETRYABLE_STATUS_CODES` previously only included 502/503/504, so a
legitimate 429 (rate limit) from the backend bubbled up as a hard error
instead of being retried. The exponential-backoff loop is already built
for transient failures — 429 fits the same shape.

Also: when the server sends a `Retry-After` header (common for 429),
prefer that value over the exponential delay, bounded by a 60s cap to
keep a hostile/broken server from extending the retry loop
indefinitely. Integer-seconds form only; HTTP-date form falls back to
exponential backoff.

Mirrored into `openapi/templates/rest.mustache`.

Session: https://session-bc38cc85.poseidon.rapidata.internal/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Code Review

Overview

This PR adds 429 to the retryable status codes and introduces Retry-After header support with a 60 s safety cap. The change is correctly mirrored in both src/rapidata/api_client/rest.py and openapi/templates/rest.mustache. The fix is well-scoped and addresses a real gap.


Issues

Comment says "five" but there are four status codes

openapi/templates/rest.mustache line 20 / rest.py line 30:

# 429 = rate limit, 502/503/504 = transient upstream / gateway errors.
# All five are safe to retry with exponential backoff.
_RETRYABLE_STATUS_CODES = {429, 502, 503, 504}

The set has four entries, not five. Should read "All four".


Suggestions

Only 3 attempts is likely too few for rate limiting

_TRANSIENT_RETRY_MAX_ATTEMPTS = 3 was sized for transient infra blips (502/503/504), where retrying 3 times in a couple of seconds is appropriate. For 429, the server might send Retry-After: 60 three times in a row and the call still fails. Consider either:

  • Increasing the attempt count when the status is 429, or
  • Treating 429 separately with its own _RATE_LIMIT_MAX_ATTEMPTS constant.

No jitter on the Retry-After path

When multiple SDK clients hit a rate limit simultaneously they all receive the same Retry-After value and will retry in lockstep — classic thundering-herd. Adding a small random jitter (e.g. +uniform(0, 1)) on the Retry-After-derived delay would spread the load.

No automated tests

The PR description mentions walking branches manually. The _retry_after_from_response static method is pure and easy to unit-test — covering the four cases (integer, capped integer, negative, HTTP-date string) would prevent regressions without much effort.


What's Good

  • The 60 s cap on Retry-After is a solid defence against a misbehaving server holding the client indefinitely.
  • Negative-value guard is correct.
  • HTTP-date fallback to exponential backoff is the right choice given the parsing complexity.
  • Both the generated file and the mustache source are kept in sync.
  • Pyright check passes (0 errors per PR description).

Summary

The core logic is correct and the approach is sound. The main actionable items before merging are: fix the "five"→"four" comment, consider whether 3 attempts is sufficient for rate-limit retries, and ideally add at least a basic unit test for _retry_after_from_response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant