DO NOT MERGE: make connection loop more aggressive, use hickory DNS client#5328
DO NOT MERGE: make connection loop more aggressive, use hickory DNS client#5328graydon wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: faeed5a5a7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Falls back to address-based dials for peers we haven't connected to yet. | ||
| // This is a fallback — targeted reconnection on disconnect handles the fast path. | ||
| let mut reconnect_interval = tokio::time::interval(Duration::from_secs(30)); | ||
| let mut reconnect_interval = tokio::time::interval(CONFIGURED_PEER_RECONNECT_INTERVAL); |
There was a problem hiding this comment.
Avoid overlapping raw reconnect dials
When a configured peer has not reached PeerConnected yet (startup, DNS change, or a slow QUIC handshake), it remains in unknown_addrs, and the reconnect branch raw-dials it from a spawned task on every timer tick. With this interval now at 1s, any resolution/handshake taking longer than a tick can create concurrent address-only dials to the same peer; the code below even notes those raw dials cannot be deduplicated by libp2p before the PeerId is known, so this can inflate pending connections and duplicate handshakes rather than just retry faster. Please gate raw dials on an in-flight/resolved set or keep the fast timer to PeerId-based retries only.
Useful? React with 👍 / 👎.
this is an attempt at making overlay v2 a little more aggressive about trying to connect to its peers, as well as switching to a custom DNS resolver (so we can clear any cached failed lookups) and an improved sync of metrics when the /info endpoint of core gets hit (which is what supercluster listens to). there's a corresponding change in supercluster to help with DNS publishing.
I don't know which if any of these will help, but the goal here is to try to improve the behavior of supercluster runs that just stall out trying to get to their target connection count.