
feat(api): make API calls non-blocking#6733

Open
xxo1shine wants to merge 1 commit into tronprotocol:develop from xxo1shine:feat/api-rate-limiter-opt

Conversation


@xxo1shine xxo1shine commented Apr 29, 2026

What does this PR do?

Convert the HTTP and gRPC rate-limiter paths from blocking acquire() to non-blocking tryAcquire(), and fix the related issues that surfaced along the way.

  1. Core API switched to non-blocking

    • GlobalRateLimiter.acquire() → GlobalRateLimiter.tryAcquire() (returns boolean)
    • All IRateLimiter.acquire() implementations (strategies and adapters) updated to tryAcquire
    • Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue
  2. Reordered limit checks: per-endpoint first

    • Both RateLimiterServlet.service and RateLimiterInterceptor.interceptCall now check the per-endpoint limit first, and only consult the global limiter on
      success
    • acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(...)
  3. Fixed semaphore leaks in IPreemptibleRateLimiter

    • HTTP path: the per-endpoint permit must be released even when the global limiter rejects (the finally block now keys off perEndpointAcquired instead of
      acquireResource)
    • gRPC path: explicit release() added on the early-return branch (global reject after per-endpoint acquired) and in the catch block when next.startCall()
      throws
  4. Hardened cache-load failure handling in GlobalRateLimiter / IPQpsStrategy

    • catch (ExecutionException) broadened to catch (Exception) (covers UncheckedExecutionException, etc.)
    • Added logger.warn with the offending IP and e.getMessage()
    • Fail-closed on exception (return false) — the previous code silently fail-opened by creating a fresh, non-cached RateLimiter per call, which effectively
      bypassed rate limiting for that IP after the first failure
  5. Cleanup + test coverage

    • GlobalPreemptibleStrategy: removed the unused @Slf4j and an obsolete timeout constant
    • Deleted AdaptorThread.java (the 20-thread timing-based assertion); replaced by deterministic sequential tests
    • Added RateLimiterServletTest and RateLimiterInterceptorTest; expanded GlobalRateLimiterTest
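The check ordering and the leak fix (items 1–3) boil down to one pattern. The sketch below is a hedged illustration, not the actual java-tron code: the class name, constructor, and the use of plain JDK Semaphores in place of the project's limiter strategies are all simplified stand-ins.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the per-endpoint-first, non-blocking check order.
// Plain Semaphores stand in for the real per-endpoint and global limiters.
public class RateLimiterSketch {
  private final Semaphore endpointPermits; // per-endpoint, preemptible (released in finally)
  private final Semaphore globalTokens;    // global budget; the real limiter refills over time

  RateLimiterSketch(int endpointLimit, int globalBudget) {
    this.endpointPermits = new Semaphore(endpointLimit);
    this.globalTokens = new Semaphore(globalBudget);
  }

  /** Returns the HTTP status a request would see: 200 if admitted, 429 if over limit. */
  int service() {
    // Per-endpoint limit first, so a throttled endpoint cannot drain global quota.
    boolean perEndpointAcquired = endpointPermits.tryAcquire();
    // The global limiter is consulted only when the per-endpoint check succeeded.
    boolean acquireResource = perEndpointAcquired && globalTokens.tryAcquire();
    try {
      return acquireResource ? 200 : 429; // fail fast; never park the worker thread
    } finally {
      // Key the release off perEndpointAcquired, not acquireResource, so the
      // per-endpoint permit is returned even when the global limiter rejects.
      if (perEndpointAcquired) {
        endpointPermits.release();
      }
    }
  }
}
```

With a global budget of zero, repeated calls keep returning 429 without leaking the per-endpoint permit; keying the release off acquireResource instead would exhaust endpointPermits after the first global rejection.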

Why are these changes required?

  1. Blocking acquire() is a tail-latency hazard. Under load, request threads pile up waiting for permits. Once the bounded Netty/Jetty worker pool is exhausted
    by such waiters, the node stops responding to healthy traffic too — and the impact persists past the original burst. Non-blocking tryAcquire returns immediately,
    so threads remain available for legitimate requests.

  2. The old ordering wasted global quota. A request was charged a global token before the per-endpoint check ran. A hot endpoint that was itself throttled could
    still drain the global budget, letting attackers exhaust shared capacity through a single endpoint they were already capped on. Per-endpoint-first closes this gap.

  3. Permit leaks in IPreemptibleRateLimiter are a slow-motion outage. The previous code skipped release() in two paths:

    • the per-endpoint permit on global rejection;
    • the per-endpoint permit when next.startCall() threw in the gRPC path.

    The semaphore drifts monotonically downward; once it reaches zero, every subsequent gRPC call blocks indefinitely. This is a latent availability bug that
    accumulates with time.

  4. Silent fail-open is an anti-abuse anti-pattern. When cache.get(loader) threw, the original code created a fresh RateLimiter without inserting it into the
    cache, so every subsequent call from the failing path got a full burst budget — effectively no rate limit, and no operational signal. Fail-closed plus a warn log
    makes the failure both safe and observable.

  5. The old AdaptorTest was flaky. Twenty concurrent threads plus assertTrue(t1 - t0 > 4000) is highly sensitive to CI scheduler jitter and produced
    sporadic failures. It was replaced with sequential tryAcquire calls that assert the deterministic burst behavior of Guava's SmoothBursty, so CI is stable.
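The fail-closed handling in point 4 amounts to the following pattern. This is a hedged sketch: the class name, the Limiter interface, and the ConcurrentHashMap are hypothetical stand-ins for the per-IP Guava LoadingCache actually used in IPQpsStrategy.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of fail-closed cache-load handling for per-IP limiters.
public class IpQpsSketch {
  /** Minimal stand-in for a per-IP rate limiter. */
  interface Limiter { boolean tryAcquire(); }

  private final ConcurrentHashMap<String, Limiter> cache = new ConcurrentHashMap<>();
  private final Function<String, Limiter> loader;

  IpQpsSketch(Function<String, Limiter> loader) { this.loader = loader; }

  boolean tryAcquire(String ip) {
    Limiter limiter;
    try {
      limiter = cache.computeIfAbsent(ip, loader);
    } catch (Exception e) { // broad: loader failures may surface wrapped, not only as ExecutionException
      // Log the offending IP and fail closed: rejecting is safer than handing
      // out a fresh, uncached limiter with a full burst budget.
      System.err.println("limiter load failed for ip " + ip + ": " + e.getMessage());
      return false;
    }
    return limiter.tryAcquire();
  }
}
```

The key design point is that the exception path returns false rather than constructing a throwaway limiter, so a persistently failing loader throttles the IP instead of silently exempting it.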

This PR has been tested by:

  • Unit Tests
  • Manual Testing
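For context, the deterministic test style that replaced the timing assertion looks roughly like this. It is only a sketch: a plain Semaphore stands in for Guava's SmoothBursty burst bucket, whose real behavior the new tests exercise.

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of the deterministic style that replaced the 20-thread
// timing assertion. A fixed permit pool stands in for the burst bucket, so
// the assertions need no sleeps, threads, or wall clock.
public class DeterministicBurstSketch {
  static void checkBurst(int burst) {
    Semaphore limiter = new Semaphore(burst);
    // The first `burst` sequential tryAcquire calls must all succeed...
    for (int i = 0; i < burst; i++) {
      if (!limiter.tryAcquire()) {
        throw new AssertionError("call " + i + " should be admitted");
      }
    }
    // ...and the very next one must be rejected, deterministically.
    if (limiter.tryAcquire()) {
      throw new AssertionError("over-burst call should be rejected");
    }
  }

  public static void main(String[] args) {
    checkBurst(5);
    System.out.println("deterministic burst assertions passed");
  }
}
```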

Follow up

Extra details

Fixes #6363


@waynercheung waynercheung left a comment


LGTM


Successfully merging this pull request may close these issues.

HTTP calls experience delay due to rate limiting
