DO NOT MERGE: make connection loop more aggressive, use hickory DNS client #5328

Open
graydon wants to merge 1 commit into stellar:overlay-v2-shared from graydon:aggressive-reconnect


@graydon (Contributor) commented Jun 17, 2026

This is an attempt at making overlay v2 a little more aggressive about connecting to its peers. It also switches to a custom DNS resolver (so we can clear any cached failed lookups) and improves the sync of metrics when the /info endpoint of core gets hit (which is what supercluster listens to). There's a corresponding change in supercluster to help with DNS publishing.

I don't know which of these, if any, will help, but the goal is to improve the behavior of supercluster runs that stall out while trying to reach their target connection count.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: faeed5a5a7


Comment thread overlay/src/main.rs
// Falls back to address-based dials for peers we haven't connected to yet.
// This is a fallback — targeted reconnection on disconnect handles the fast path.
- let mut reconnect_interval = tokio::time::interval(Duration::from_secs(30));
+ let mut reconnect_interval = tokio::time::interval(CONFIGURED_PEER_RECONNECT_INTERVAL);


P2: Avoid overlapping raw reconnect dials

When a configured peer has not reached PeerConnected yet (startup, DNS change, or a slow QUIC handshake), it remains in unknown_addrs, and the reconnect branch raw-dials it from a spawned task on every timer tick. With this interval now at 1s, any resolution/handshake taking longer than a tick can create concurrent address-only dials to the same peer; the code below even notes those raw dials cannot be deduplicated by libp2p before the PeerId is known, so this can inflate pending connections and duplicate handshakes rather than just retry faster. Please gate raw dials on an in-flight/resolved set or keep the fast timer to PeerId-based retries only.
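One way to read the "gate raw dials on an in-flight set" suggestion is sketched below. This is a minimal illustration and not code from this PR: the names `InFlightDials`, `DialPermit`, and `plan_tick` are hypothetical, addresses are plain strings rather than multiaddrs, and the sketch uses only the standard library (no tokio or libp2p). The idea is that each timer tick asks the gate for a permit per unknown address, a dial is spawned only when a permit is granted, and the permit is moved into the dial task so the address is released when the attempt finishes or fails.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical gate that records which address-only ("raw") dials are in flight.
#[derive(Clone, Default)]
pub struct InFlightDials {
    inner: Arc<Mutex<HashSet<String>>>,
}

/// Hypothetical permit handed out by `try_begin`. Dropping it removes the
/// address from the in-flight set, whether the dial succeeded or failed.
pub struct DialPermit {
    addr: String,
    set: Arc<Mutex<HashSet<String>>>,
}

impl InFlightDials {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns a permit if no dial to `addr` is currently in flight,
    /// or `None` if an earlier dial to the same address has not finished.
    pub fn try_begin(&self, addr: &str) -> Option<DialPermit> {
        let mut set = self.inner.lock().unwrap();
        if set.insert(addr.to_string()) {
            Some(DialPermit {
                addr: addr.to_string(),
                set: Arc::clone(&self.inner),
            })
        } else {
            None
        }
    }

    /// Number of raw dials currently in flight.
    pub fn in_flight(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}

impl Drop for DialPermit {
    fn drop(&mut self) {
        // Skip cleanup if the lock is poisoned rather than panic inside drop.
        if let Ok(mut set) = self.set.lock() {
            set.remove(&self.addr);
        }
    }
}

/// One reconnect tick: returns only the addresses that may be dialed now,
/// each paired with the permit that should live as long as the dial attempt.
pub fn plan_tick(gate: &InFlightDials, unknown_addrs: &[&str]) -> Vec<(String, DialPermit)> {
    unknown_addrs
        .iter()
        .filter_map(|a| gate.try_begin(a).map(|p| (a.to_string(), p)))
        .collect()
}
```

In a loop like the one under review, the permit would presumably be moved into the spawned dial task, so a tick that fires while a slow resolution or handshake is still running skips that address instead of starting a second dial. Whether to also cap how long a permit may be held (for example with a dial timeout) is a separate design choice this sketch does not address.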
