Skip to content

fix: apply fixes from bohdan/finish-dkg#361

Merged
varex83 merged 9 commits intomainfrom
bohdan/pre-dkg-bug-fix
May 4, 2026
Merged

fix: apply fixes from bohdan/finish-dkg#361
varex83 merged 9 commits intomainfrom
bohdan/pre-dkg-bug-fix

Conversation

@varex83
Copy link
Copy Markdown
Collaborator

@varex83 varex83 commented Apr 29, 2026

Summary

This PR fixes several bugs discovered during DKG integration testing, primarily around relay connection lifecycle management in the P2P layer.

Relay reconnection on disconnect / dial failure (relay.rs)

Previously, if a relay connection dropped or a dial attempt failed, the node would not attempt to re-establish the connection, leaving it permanently unable to route traffic through that relay.

  • Added relay_peers: HashMap<PeerId, Peer> to MutableRelayReservation to remember all known relay peers so they can be re-dialed when a connection is lost.
  • Added connected_relays: HashSet<PeerId> to skip redundant dials when a connection is already established or in-flight.
  • on_swarm_event now handles ConnectionClosed (last connection dropped) and DialFailure to trigger a re-dial, clearing stale pending/connected state first.
  • Switched relay dials from DialOpts::unknown_peer_id() to DialOpts::peer_id(...).condition(DisconnectedAndNotDialing) so libp2p can deduplicate concurrent dial attempts.

Relay-ready settling delay (relay.rs)

RelayRouter was immediately attempting to route peers through a relay the moment the connection was established, before the relay reservation handshake could complete. This caused circuit-dial attempts to fail silently.

  • Added connected_relays: HashMap<PeerId, Instant> to RelayRouter to record when each relay became connected.
  • Added relay_ready() which gates routing behind a 2-second RELAY_READY_DELAY so the reservation handshake has time to finish.
  • RelayRouter::run_relay_router now iterates over connected_relays (only relays with an active connection) instead of all configured relays, and skips any that haven't settled yet.
  • Peer dials via relay also use DisconnectedAndNotDialing to avoid redundant dial attempts.

Fix malformed circuit multiaddresses (relay.rs, utils.rs)

Relay peer addresses stored in the Peer struct sometimes already included a trailing /p2p/<peer-id> component. Appending another /p2p/<relay-id>/p2p-circuit (or /p2p-circuit/p2p/<target-id>) on top produced invalid multiaddresses that libp2p silently rejected.

  • Both queue_relay_dial and multi_addrs_via_relay now strip any existing /p2p/... protocol components from the base address before constructing circuit or direct-dial addresses.

Fix encode_0x_hex for empty input (serde_utils.rs)

encode_0x_hex(&[]) previously returned "0x" instead of "", which caused downstream deserialization failures for optional byte fields encoded as empty strings.

Comment thread crates/p2p/src/relay.rs Outdated
&& let Some(peer) = self.relay_peers.get(&peer_id).cloned()
{
self.pending_relays.remove(&peer_id);
Self::queue_relay_dial(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread crates/ssz/src/serde_utils.rs Outdated
}

/// Encodes bytes as lowercase `0x`-prefixed hex.
/// In case of empty bytes, returns an empty string.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will affect the eth2util serialization.
From here: https://github.com/NethermindEth/pluto/blob/main/crates/eth2api/src/spec/bellatrix.rs#L121

The bellatrix extra_data use the Hex0x which uses encode_0x_hex.

In go-eth2-client, this field expect "0x" when the extra_data is empty: https://github.com/attestantio/go-eth2-client/blob/master/spec/bellatrix/executionpayload.go#L91

varex83agent and others added 2 commits May 3, 2026 21:49
Re-dial attempts on ConnectionClosed and DialFailure now wait with
exponential backoff (base=1s, multiplier=1.6, max=120s) matching
Charon's DefaultConfig. A pinned Sleep future in the behaviour
registers the waker so the swarm is woken at the right time.

Backoff state is reset on successful ConnectionEstablished.

Co-Authored-By: Bohdan Ohorodnii <35969035+varex83@users.noreply.github.com>
encode_0x_hex always returns "0x{hex}" (including "0x" for empty),
matching the go-eth2-client expectation for fields like extra_data.

A new encode_hex_or_empty function and HexBytes serde adapter return ""
for empty bytes, matching Charon's to0xHex convention used throughout
cluster/definition JSON for optional fields like unsigned signatures.

All cluster crates switch from Hex0x to HexBytes. Regression tests
cover both serialization behaviors and deserialization of "" vs "0x".

Co-Authored-By: Bohdan Ohorodnii <35969035+varex83@users.noreply.github.com>
@varex83
Copy link
Copy Markdown
Collaborator Author

varex83 commented May 4, 2026

@claude review it

@claude
Copy link
Copy Markdown

claude Bot commented May 4, 2026

Claude finished @varex83's task in 3m 12s —— View job


Code Review

  • Read changed files and gather context
  • Review relay.rs changes (reconnection, settling delay, multiaddr fixes)
  • Review serde_utils.rs changes (encode_0x_hex fix)
  • Review utils.rs changes
  • Review cluster crate changes
  • Analyze existing review comments
  • Post full review

Overall Assessment

The PR correctly addresses several real bugs (relay reconnection, malformed multiaddrs, empty-hex serialisation). The structure is sound, but there are two functional issues worth fixing before merge.


crates/p2p/src/relay.rs

Bug: RelayRouter::poll() strands events for up to 60 seconds

After run_relay_router() fills self.events, the function falls through to Poll::Pending without returning any of those events. Since no immediate waker is armed after the interval consumes its tick, the swarm won't re-poll RelayRouter until the next interval tick 60 seconds later — meaning every batch of relay-circuit dials is delayed by a full interval.

// relay.rs ~529-540
fn poll(&mut self, cx: &mut Context<'_>) -> Poll<...> {
    if let Some(event) = self.events.pop_front() {
        return Poll::Ready(event);
    }
    if self.interval.poll_tick(cx).is_ready() {
        self.run_relay_router();   // fills self.events…
    }
    Poll::Pending                  // …but nothing is returned here
}

Fix: check the queue after the interval fills it:

fn poll(&mut self, cx: &mut Context<'_>) -> Poll<...> {
    if self.interval.poll_tick(cx).is_ready() {
        self.run_relay_router();
    }
    if let Some(event) = self.events.pop_front() {
        return Poll::Ready(event);
    }
    Poll::Pending
}

Fix this →


Missing jitter in backoff_delay

The comment says it mirrors Charon's DefaultConfig, but Charon's expbackoff.Backoff applies a ±20% jitter (backoff *= 1 + 0.2*(rand*2-1)). Without jitter, all nodes that disconnect from the relay at the same time will retry simultaneously — a classic thundering-herd problem against the relay server.

// relay.rs ~195-205
fn backoff_delay(retry_count: u32) -> Duration {
    // ...no jitter applied...
}

Charon reference: expbackoff/expbackoff.go:146-169

Fix this →


DialFailure handler doesn't clear connected_relays

The handler removes from pending_relays but not from connected_relays. This shouldn't be reachable in normal flow (a dial failure implies the peer isn't connected), but if there's ever a race where both events fire for the same peer, connected_relays will contain a stale entry blocking future re-dials via queue_relay_dial's early return.

FromSwarm::DialFailure(ev) => {
    if let Some(peer_id) = ev.peer_id
        && let Some(peer) = self.relay_peers.get(&peer_id).cloned()
    {
        self.pending_relays.remove(&peer_id);
        // self.connected_relays.remove(&peer_id);  ← missing
        self.schedule_retry(peer);
    }
}

Minor: indentation inconsistency in poll()

The block inside if retry_due { ... } (lines ~356-391) is indented with 16 spaces while the surrounding code uses 8. cargo fmt will fix this.


RELAY_READY_DELAY is a magic number

2 seconds is hardcoded with no Charon equivalent and no rationale for why 2s is appropriate. Since this directly affects whether circuit dials succeed, it's worth either linking a spec reference or adding a brief justification comment.


crates/ssz/src/serde_utils.rs

The Hex0x / HexBytes split is the right approach and directly addresses the concern raised in the existing review comment (keeping "0x" for extra_data in bellatrix.rs while emitting "" for cluster fields). The tests cover both adapters well.

One note: HexBytes::deserialize_as is identical to Hex0x::deserialize_as — both call decode_0x_hex which accepts "", "0x", and any valid hex. This is intentional and correct since Charon may produce either form.


crates/p2p/src/utils.rs

The /p2p/... stripping in multi_addrs_via_relay is correct. The filter + collect pattern over Multiaddr::iter() produces a clean base address before appending relay/circuit components.


crates/cluster/

Mechanical rename of Hex0xHexBytes across all cluster types. Correct — these fields follow Charon's to0xHex convention where empty bytes serialise as "".


Addressing existing review comments

  • @iamquang95 re: backoff delay — addressed by the new backoff_delay / schedule_retry implementation.
  • @iamquang95 re: encode_0x_hex affecting eth2util — addressed: eth2api fields still use Hex0x (which retains "0x" for empty); cluster fields now use HexBytes. The bellatrix.rs extra_data field is unaffected.

Summary: Two functional issues to fix (events stranded in RelayRouter::poll, missing backoff jitter), one defensive improvement (DialFailure cleanup), and minor formatting. The core logic is sound.

@varex83
Copy link
Copy Markdown
Collaborator Author

varex83 commented May 4, 2026

crates/p2p/src/relay.rs
Bug: RelayRouter::poll() strands events for up to 60 seconds
After run_relay_router() fills self.events, the function falls through to Poll::Pending without returning any of those events. Since no immediate waker is armed after the interval consumes its tick, the swarm won't re-poll RelayRouter until the next interval tick 60 seconds later — meaning every batch of relay-circuit dials is delayed by a full interval.

I think that's how it's done in ping implementation in libp2p, so we should be okay with that, wdyt @iamquang95 ?

@varex83 varex83 requested a review from iamquang95 May 4, 2026 09:32
@iamquang95
Copy link
Copy Markdown
Collaborator

@varex83 I would implememnt by this

if let Some(event) = self.events.pop_front() {
    return Poll::Ready(event);
}

if self.interval.poll_tick(cx).is_ready() {
    self.run_relay_router();
}

if let Some(event) = self.events.pop_front() {
    return Poll::Ready(event);
}

Poll::Pending

the rust-libp2p push to event on on_connection_handler_event then poll() and drain the event https://github.com/libp2p/rust-libp2p/blob/22fb4c784fc55ad8b15d05fdc9f98d663107d4cb/protocols/ping/src/lib.rs#L143

But the relay-router add Dial events inside poll() after interval.poll_tick() returns ready, then returns Pending without draining them.

varex83agent and others added 2 commits May 4, 2026 12:43
Apply ±20% jitter to relay reconnect backoff, mirroring Charon's
expbackoff.DefaultConfig (Jitter=0.2) so that nodes don't retry in
lockstep after a relay outage.

Co-Authored-By: varex83 <varex83@gmail.com>
@varex83
Copy link
Copy Markdown
Collaborator Author

varex83 commented May 4, 2026

@claude please review

@claude
Copy link
Copy Markdown

claude Bot commented May 4, 2026

Claude finished @varex83's task in 1m 41s —— View job


Code Review

  • Read changed files and gather context
  • Review relay.rs changes (reconnection, settling delay, multiaddr fixes, poll() issue)
  • Review serde_utils.rs changes
  • Review utils.rs changes
  • Review cluster crate changes
  • Post full review

Overall Assessment

Since the last review, two of the functional issues have been addressed and the PR is in good shape. One defensive fix is still missing, and there's a minor nit on RELAY_READY_DELAY. Everything else looks correct.


Previously flagged — now fixed ✅

RelayRouter::poll() stranding events — fixed. The current code (lines 559–573) correctly checks the interval, calls run_relay_router(), then drains the queue before returning Pending. This matches the pattern iamquang95 suggested and is consistent with how libp2p behaviours should work.

Jitter in backoff_delay — fixed. RELAY_BACKOFF_JITTER = 0.2 is defined and applied correctly (lines 46–47, 216–217), matching Charon's DefaultConfig.


Still present

DialFailure handler doesn't clear connected_relaysrelay.rs:387

FromSwarm::DialFailure(ev) => {
    if let Some(peer_id) = ev.peer_id
        && let Some(peer) = self.relay_peers.get(&peer_id).cloned()
    {
        self.pending_relays.remove(&peer_id);
        // self.connected_relays.remove(&peer_id);  ← still missing
        self.schedule_retry(peer);
    }
}

A dial failure implies no connection exists, but if connected_relays contains a stale entry for that peer (e.g. from a race between ConnectionClosed and DialFailure events), queue_relay_dial's early-return guard at line 153 will silently skip the scheduled re-dial forever. The ConnectionClosed handler at line 376 does clear connected_relays correctly — the DialFailure handler should do the same.

Fix this →

RELAY_READY_DELAY magic number — relay.rs:39

The 2-second constant has no Charon equivalent and no rationale in a comment. Since this directly gates whether circuit dials succeed, a brief justification (// enough time for the relay reservation handshake) or a link to an empirical basis would help future readers calibrate or tune this value.


crates/ssz/src/serde_utils.rs — looks good

The Hex0x / HexBytes split is clean and the test coverage is thorough: separate round-trip tests, empty-string regression tests, and cross-compatibility (both adapters accept "" and "0x" on deserialization). The bellatrix.rs extra_data field uses Hex0x and is unaffected.

crates/p2p/src/utils.rs — looks good

The /p2p/... stripping via filter(!matches!(p, MaProtocol::P2p(_))) in multi_addrs_via_relay correctly handles addresses that already include a trailing peer-id component before the circuit suffix is appended.

crates/cluster/ — looks good

Mechanical Hex0xHexBytes rename across all cluster types. Correct — these fields follow Charon's to0xHex convention where empty bytes serialise as "".


Summary: One defensive fix still needed (DialFailure should also clear connected_relays). The two functional bugs from the last review are resolved. Ready to merge once that one-liner is added.

@varex83 varex83 merged commit c6043dd into main May 4, 2026
9 checks passed
@varex83 varex83 deleted the bohdan/pre-dkg-bug-fix branch May 4, 2026 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants