
outbound/urltest: fix gateway freeze when relay stops forwarding #4256

Open
HouMinXi wants to merge 1 commit into SagerNet:testing from HouMinXi:fix/urltest-batch-timeout
Summary

Fix transparent gateway freeze caused by URLTestGroup.urlTest() blocking indefinitely when a relay server accepts TCP but stops forwarding data.

Problem

When a relay becomes unresponsive at the application layer while TCP keepalive keeps the connection alive, the following chain of failures locks the gateway:

  1. bufio.CopyConn goroutines block in Read() forever (no idle timeout on relay connections)
  2. URL test probe goroutines also block in Read() waiting for HTTP response from the dead relay
  3. batch.Wait() requires ALL probes to complete, so one stuck probe blocks the entire health check
  4. checking atomic flag stays true, silencing all future timer-triggered health checks
  5. selectedOutbound is never updated, routing all new connections to the dead relay
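Items 3 and 4 can be illustrated with a minimal sketch. It assumes the guard is an atomic boolean that is set on entry and cleared on return; the `group` type and the `urlTest(work)` signature here are hypothetical stand-ins, not the actual sing-box code. While one check is stuck, every later check is skipped.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// group is a hypothetical stand-in for the URL test group.
type group struct {
	checking atomic.Bool
}

// urlTest runs work unless another check is already in progress.
// It reports whether work was actually run.
func (g *group) urlTest(work func()) bool {
	if g.checking.Swap(true) {
		return false // a check is already running: skip silently
	}
	defer g.checking.Store(false)
	work() // if this never returns, the flag is never cleared
	return true
}

// demoCheckingFlag shows that a blocked check suppresses later ones,
// and that checks resume once the blocked one returns.
func demoCheckingFlag() (skippedWhileStuck, ranAfterRelease bool) {
	g := &group{}
	started := make(chan struct{})
	release := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		g.urlTest(func() {
			close(started)
			<-release // simulates batch.Wait() blocked on a stuck probe
		})
		close(finished)
	}()

	<-started
	skippedWhileStuck = !g.urlTest(func() {}) // timer-triggered check
	close(release)
	<-finished
	ranAfterRelease = g.urlTest(func() {})
	return
}

func main() {
	skipped, resumed := demoCheckingFlag()
	fmt.Println("skipped while stuck:", skipped)
	fmt.Println("ran after release:", resumed)
}
```

In the stall described above the blocked call never returns, so the second state (checks resuming) is never reached.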

Confirmed by SIGQUIT goroutine dump during a live stall: 205 goroutines total, 168 blocked in CopyConn IO wait, 1 in batch.Wait() semacquire for 3 minutes. Full dump available in #4255.

Fix

common/urltest/urltest.go: Call SetReadDeadline on the dialed connection. Context cancellation does not interrupt net.Conn.Read() when the connection comes from a custom DialContext. The fix uses a relative timeout to avoid issues with time already consumed by the dial phase.
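The effect of a read deadline can be sketched with the standard library alone. This is an illustration, not the patched code: `startSilentServer`, `readWithDeadline`, and `demoTimedOut` are hypothetical helpers, and "relative" is assumed to mean the deadline is computed at the moment it is set, after the dial has returned.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// startSilentServer accepts TCP connections and holds them open without
// ever writing, imitating a relay that stops forwarding data.
func startSilentServer() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	done := make(chan struct{})
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				<-done
				c.Close()
			}()
		}
	}()
	stop := func() {
		close(done)
		ln.Close()
	}
	return ln.Addr().String(), stop, nil
}

// readWithDeadline dials addr and waits at most timeout for the first byte.
func readWithDeadline(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	defer conn.Close()
	// Relative deadline: the clock starts now, after the dial has finished.
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	buf := make([]byte, 1)
	_, err = conn.Read(buf) // would block forever without the deadline
	return err
}

// demoTimedOut reports whether the read against the silent server
// ended with a deadline error instead of hanging.
func demoTimedOut() bool {
	addr, stop, err := startSilentServer()
	if err != nil {
		panic(err)
	}
	defer stop()
	err = readWithDeadline(addr, 200*time.Millisecond)
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("read timed out:", demoTimedOut())
}
```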

protocol/group/urltest.go:

  • Derive the per-probe testCtx from the batch context (previously g.ctx), so batch cancellation propagates to each probe
  • Wrap batch.Wait() with a hard timer (2*TCPTimeout). On timeout, proceed with available results
  • Delete stale history for incomplete probes to prevent performUpdateCheck from selecting a stuck outbound
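The three changes listed above can be sketched together. This is a simplified model under stated assumptions: a sync.WaitGroup stands in for the batch helper, and `probeFunc`, `runBatch`, the `history` map, and `tcpTimeout` are hypothetical names rather than the real sing-box types. The stuck probe deliberately ignores its context, like a Read with no deadline.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// probeFunc is a hypothetical stand-in for one URL test against an outbound.
type probeFunc func(ctx context.Context) (time.Duration, error)

// runBatch runs all probes, waits at most 2*tcpTimeout, then updates history.
// Tags whose probe failed or did not finish in time are removed from history.
func runBatch(parent context.Context, probes map[string]probeFunc,
	history map[string]time.Duration, tcpTimeout time.Duration) {
	batchCtx, cancel := context.WithCancel(parent)
	defer cancel()

	var (
		mu      sync.Mutex
		closed  bool // set once the batch stops accepting results
		results = make(map[string]time.Duration)
		wg      sync.WaitGroup
	)
	for tag, probe := range probes {
		wg.Add(1)
		go func(tag string, probe probeFunc) {
			defer wg.Done()
			// Derived from batchCtx, so cancelling the batch reaches the probe.
			testCtx, cancelTest := context.WithTimeout(batchCtx, tcpTimeout)
			defer cancelTest()
			delay, err := probe(testCtx)
			mu.Lock()
			defer mu.Unlock()
			if err == nil && !closed {
				results[tag] = delay
			}
		}(tag, probe)
	}

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(2 * tcpTimeout) // hard limit on the whole batch
	defer timer.Stop()
	select {
	case <-done:
	case <-timer.C:
		cancel() // give up on stragglers, keep what we have
	}

	mu.Lock()
	defer mu.Unlock()
	closed = true
	for tag := range probes {
		if delay, ok := results[tag]; ok {
			history[tag] = delay
		} else {
			delete(history, tag) // stale entry must not win selection
		}
	}
}

// demoBatch runs one healthy and one permanently stuck probe.
func demoBatch() map[string]time.Duration {
	history := map[string]time.Duration{
		"healthy": 50 * time.Millisecond,
		"stuck":   40 * time.Millisecond, // stale result from an earlier round
	}
	never := make(chan struct{})
	probes := map[string]probeFunc{
		"healthy": func(ctx context.Context) (time.Duration, error) {
			return 10 * time.Millisecond, nil
		},
		"stuck": func(ctx context.Context) (time.Duration, error) {
			<-never // ignores ctx, like a Read with no deadline
			return 0, nil
		},
	}
	runBatch(context.Background(), probes, history, 50*time.Millisecond)
	return history
}

func main() {
	fmt.Println(demoBatch())
}
```

Without the cleanup step, the stale 40 ms entry for the stuck outbound would still look faster than the healthy one in this example. The stuck goroutine itself is only abandoned here; releasing it requires the read deadline on the connection.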

Testing

Deployed patched binary on the affected gateway (N100 iStoreOS, tproxy + urltest with 6 exit nodes). Before patch: 8 stalls in 2 hours with intervals shrinking to 57s. After patch: monitoring for stability.

Ref: #4255 (goroutine dump and full analysis), #4144 (same symptom on different deployments), #1620 (same symptom on N100 tproxy, closed as stale)

When a relay server accepts TCP but stops forwarding application data,
URL test probe goroutines block in Read() indefinitely. batch.Wait()
then blocks forever, keeping the checking flag true and suppressing all
future health checks. selectedOutbound is never updated, so new
connections keep routing to the dead relay. This creates a
self-locking loop that makes the gateway completely unresponsive.

Fix four issues:

1. Call SetReadDeadline on the URL test connection. Context cancellation
   does not interrupt net.Conn.Read() when the connection was obtained
   through a custom DialContext. Use a relative timeout to avoid issues
   with clock time already consumed by the dial phase.

2. Add a hard timeout (2*TCPTimeout) around batch.Wait(). When the
   timeout fires, proceed with whatever results are available rather
   than blocking indefinitely.

3. Propagate batch context to individual probes by deriving testCtx
   from batchCtx instead of g.ctx, so batch cancellation reaches stuck
   probes.

4. Clean up stale history entries for probes that did not complete
   within the timeout, preventing performUpdateCheck from selecting
   a stuck outbound based on outdated results.

Root cause confirmed by SIGQUIT goroutine dump during a live stall
event on a tproxy gateway (205 goroutines, 168 blocked in CopyConn,
1 semacquire in batch.Wait for 3 minutes).

Fixes SagerNet#4255
Ref: SagerNet#4144 SagerNet#1620

Signed-off-by: Minxi Hou <houminxi@gmail.com>