feat(core): multi-link routing #123
|
Hi Yifan,

Thanks for the PR, I appreciate the enthusiasm.

Nylon already supports multi-endpoint probing. It only sends data through the active/best link, but it continually probes (sends control packets) over all configured endpoints. My suggestion is to look at the code under … Can you double check?

Also, can you elaborate on "Lack of Interface Awareness"? Nylon currently does not support sending packets directly over a specified interface, but that should be a relatively small change that does not need a large refactor.

Thanks

P.S.: This is a very big change. If possible, split it into a set of smaller PRs so it is easier for me to review.
|
Thank you very much for the comment and suggestions. I re-checked the current probing logic in …

What I meant to describe is the lack of local egress/source/interface awareness. For example, if nodes A and B each have three interfaces, A1/A2/A3 and B1/B2/B3, and A1/B1 are the default egress paths, today A will probe B1/B2/B3 mostly from A1, while B will probe A1/A2/A3 mostly from B1. So Nylon observes only part of the possible interface-pair combinations. With explicit local binds, it can probe the full …

So this PR is not intended to replace Nylon's existing multi-endpoint probing. It reuses polyamide's multi-endpoint support and tries to add local egress as part of the link identity and metric model. Most of the larger changes come from carrying that link identity as a first-class citizen through probes, control packets, routing state, forwarding, and status output.

I agree the current PR is too large. I'll try to restructure it into smaller, easier-to-review pieces, and see whether the local bind/source selection part can be extracted first with a smaller router change. If you have guidance on what the smallest acceptable version should look like, I would really appreciate it.

Thanks again for creating Nylon. It's been a pleasure to work with the codebase and learn from its design.
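To make the A1/A2/A3 and B1/B2/B3 example above concrete, here is a minimal Go sketch that enumerates the interface pairs a node could probe. The `ProbePair` type and `probePairs` function are hypothetical names for illustration only; they are not taken from the nylon codebase.

```go
package main

import "fmt"

// ProbePair is a hypothetical name for one (local egress, remote endpoint)
// combination; it is not a type from the nylon codebase.
type ProbePair struct {
	Local  string // local interface or source label, e.g. "A1"
	Remote string // remote endpoint label, e.g. "B2"
}

// probePairs returns every local x remote combination, in input order.
func probePairs(locals, remotes []string) []ProbePair {
	pairs := make([]ProbePair, 0, len(locals)*len(remotes))
	for _, l := range locals {
		for _, r := range remotes {
			pairs = append(pairs, ProbePair{Local: l, Remote: r})
		}
	}
	return pairs
}

func main() {
	remotes := []string{"B1", "B2", "B3"}

	// Without local binds, A mostly probes from its default egress A1.
	defaultOnly := probePairs([]string{"A1"}, remotes)

	// With explicit local binds, A can probe from every local interface.
	all := probePairs([]string{"A1", "A2", "A3"}, remotes)

	fmt.Printf("default egress only: %d pairs\n", len(defaultOnly))
	fmt.Printf("explicit local binds: %d pairs\n", len(all))
}
```

With three interfaces on each side, the default-egress case covers 3 of the 9 possible pairs from A's point of view.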
ebe46ad to ea5f42f
|
I have force-pushed a rewritten history with smaller, buildable commits that might make the dependency chain easier to review. If you have a particular smallest acceptable version in mind for this PR, I would be very grateful for your guidance.
19f079d to 6bef547
|
Hi Yifan,

Thanks for the quick response and for tidying up the commit history! Regarding your changes, I looked over the diff as well as your design doc. Here are some comments:
Let's discuss this before making more changes to the code. We also need a less clunky API for specifying the interface. One trivial way would be to just produce I*E links from I interfaces and E endpoints per peering (but this can also lead to a mess). I'd love to hear your POV.
|
I completely agree with your separation of the link and routing models. I think I had fallen into the mindset of FRRouting-style designs, where multiple interfaces are explicitly exposed to Babel. In retrospect, Nylon's design is much more elegant: it finds a very nice optimal substructure by keeping all multi-link complexity confined to the peer-to-peer layer, which significantly simplifies the overall architecture.

Your last comments actually inspired me to think about a different possible design. Using semantics similar to the Timestamp Sub-TLV from RFC 9616, it may be possible to improve asymmetric routing behavior relatively easily within the current endpoints model.

Consider an extreme example: nodes A and B have two paths, A1 <-> B1 and A2 <-> B2, with the following one-way latencies:
In theory, if asymmetric routing is allowed, the best RTT would be:
rather than 101 ms. In practice, asymmetric routing is quite common on the Internet. Under the current endpoints model, exploiting this property at the single-hop level is relatively straightforward. However, it would likely require changing the Ping/Pong packet format, replacing the current random-token + PingBuf RTT measurement mechanism with a Timestamp Sub-TLV-based measurement model. It would also be more efficient.

My understanding is that this would be a fairly self-contained optimization. Would you prefer implementing something like this together with the interface-awareness work, or opening a separate issue for discussion and potentially addressing it in a later PR?

Regarding the configuration schema, I initially considered a design that would automatically discover interfaces and addresses instead of requiring the current manual
It is also worth noting that, for multi-homing to work correctly, it is usually necessary to either explicitly bind sockets using SO_BINDTODEVICE and ensure that a corresponding default route exists on that interface, or manually configure policy routing, such as:

For those reasons, I was thinking of starting with a purely manual configuration model for this feature. In practice, since the nylon cluster deployment is handled by Ansible + an AI agent, the configuration burden is not actually too high. On the contrary, purely manual specification makes the expected behavior clearer and more predictable.

That said, perhaps we can find a middle ground between simplicity and completeness. For example, we could provide a built-in heuristic interface-name filter (such as a regular-expression-based rule) and automatically use all addresses on matching interfaces as source addresses. At the same time, users could override the interface or address filtering rules when necessary. This would allow most nodes to work with the default configuration, while only a small number of special cases would require manual configuration.

We could also defer dynamic interface/address change handling for now, to avoid introducing too much uncertainty in the initial implementation.

Do you have any preference regarding which direction would make the most sense?
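As a rough sketch of the "middle ground" idea above, a built-in interface-name filter with a user override could look something like the following. The default pattern and the function name are assumptions made for this example; nothing here reflects an existing nylon configuration option.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultIfacePattern is a made-up heuristic for this sketch: names that look
// like wired or wireless NICs on Linux. It is not taken from nylon.
var defaultIfacePattern = regexp.MustCompile(`^(eth|en|wl)`)

// selectInterfaces returns the names accepted by the override pattern when one
// is given, and by the built-in default pattern otherwise.
func selectInterfaces(names []string, override *regexp.Regexp) []string {
	pattern := defaultIfacePattern
	if override != nil {
		pattern = override
	}
	var selected []string
	for _, name := range names {
		if pattern.MatchString(name) {
			selected = append(selected, name)
		}
	}
	return selected
}

func main() {
	// In a real program the names would come from net.Interfaces().
	names := []string{"lo", "eth0", "docker0", "wlan0", "enp3s0", "wg0"}

	// Default heuristic: most nodes need no extra configuration.
	fmt.Println(selectInterfaces(names, nil))

	// Per-node override for a special case.
	fmt.Println(selectInterfaces(names, regexp.MustCompile(`^eth0$`)))
}
```

The same shape would extend to address filtering: once the interfaces are chosen, every address on them becomes a candidate source address unless an address rule excludes it.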
|
Hmm, regarding asymmetric routing: I actually added an experimental implementation over a year ago, but have since removed it.
Regarding interfaces: I also think it's a good starting point to just specify interfaces when desired. Since nylon runs as root in most deployments anyway, I think it's fine to do … I think when you do need to specify an interface, that interface typically does not change much. Your "middle ground" approach makes sense to me.
Let's not worry about dynamically changing interfaces yet!
|
As far as I know, there are roughly three approaches to routing with asymmetric paths without GPS or atomic clocks:

1. NTP-synchronized clocks and one-way delay estimation

This is the most straightforward approach. If clocks are synchronized, we can compare absolute timestamps and estimate one-way latency directly. Routing decisions can then be made based on the estimated one-way delays. The downside is that the error can be quite large. For asymmetric paths, NTP synchronization error is often on the same order of magnitude as the network latency being measured, making the estimates rather noisy.

2. Only measure cycle latency, without solving for one-way delay

Instead of trying to estimate one-way delays, we can work entirely with cycle latency. Cycle latency can be measured accurately using a mechanism similar to Babel's Timestamp Sub-TLV. The key observation is that clock offsets cancel out when measuring a cycle. As a result, time synchronization errors do not affect the final measurement. Reference: https://gemini.google.com/share/d470d773636d

3. Estimate one-way delays using measurements from many nodes

This approach first estimates one-way delays across the network and then applies the first method for routing decisions. See the paper: https://ieeexplore.ieee.org/document/1638554 or my old notes below:

Several years ago I ran some simulation experiments. For a 100-node cluster with roughly 50% symmetric routes, the average one-way-delay estimation error could be kept below 2 ms. Running the solver on a CPU, a single optimization round in PyTorch took on the order of a few seconds. The larger the absolute number of symmetric links, the stronger the constraints become, and the more accurate the one-way-delay estimates are across the entire cluster.
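As a rough illustration of method 2, the sketch below simulates one probe exchange between two nodes whose clocks disagree, and shows that the clock offset cancels out of the cycle measurement while it pollutes a naive one-way estimate. The four-timestamp layout follows the general idea of Babel's timestamp-based RTT measurement; the type and function names and all numeric values are assumptions for this example and do not describe nylon's Ping/Pong format.

```go
package main

import (
	"fmt"
	"time"
)

// Sample holds the four timestamps of one probe exchange. Each timestamp is
// read from the local clock of the node that records it.
type Sample struct {
	ASend time.Duration // A's clock when A sends the probe
	BRecv time.Duration // B's clock when B receives it
	BSend time.Duration // B's clock when B sends the reply
	ARecv time.Duration // A's clock when A receives the reply
}

// CycleLatency returns the A->B->A network time. Each difference is taken on
// a single clock, so the offset between the two clocks cancels out, and B's
// processing time is subtracted.
func (s Sample) CycleLatency() time.Duration {
	return (s.ARecv - s.ASend) - (s.BSend - s.BRecv)
}

// NaiveForward mixes both clocks, so it equals the true forward delay plus
// the clock offset.
func (s Sample) NaiveForward() time.Duration {
	return s.BRecv - s.ASend
}

// simulate produces the timestamps for given true one-way delays, B's
// processing time, and B's clock offset relative to A.
func simulate(fwd, rev, processing, offsetB time.Duration) Sample {
	aSend := 10 * time.Second // arbitrary starting point on A's clock
	bRecv := aSend + fwd + offsetB
	bSend := bRecv + processing
	aRecv := (bSend - offsetB) + rev
	return Sample{ASend: aSend, BRecv: bRecv, BSend: bSend, ARecv: aRecv}
}

func main() {
	// Hypothetical values: 5 ms forward, 20 ms reverse, 3 ms processing,
	// and B's clock running 40 ms ahead of A's.
	s := simulate(5*time.Millisecond, 20*time.Millisecond, 3*time.Millisecond, 40*time.Millisecond)
	fmt.Println("cycle latency:", s.CycleLatency())
	fmt.Println("naive forward estimate:", s.NaiveForward())
}
```

Changing the offset passed to `simulate` changes the naive forward estimate but leaves the cycle latency at the sum of the two true one-way delays.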
For an arbitrary strongly connected directed graph, it can be shown mathematically that, ignoring the small amount of drift inherent to hardware clocks, the cycle-latency approach (method 2) is equivalent, from a routing perspective, to routing based on perfectly accurate one-way-delay measurements. Intuitively, the best route from A to B must ultimately be part of some minimum cycle containing both A and B. Since method 2 can already compute the exact latency of arbitrary cycles, it provides equivalent routing information.

From that perspective, method 3 is probably not particularly useful for routing itself. However, since we were discussing one-way delay estimation, I thought it was an interesting idea worth sharing. :)

For the asymmetric-path scenario we discussed earlier between two peers, this is actually just the special case of a two-node graph, and can obviously be solved using method 2 as well.

Regarding the "middle-ground" configuration approach, I'd like to make the proposal a bit more concrete. If we don't have any major disagreements, I plan to implement something along these lines in the near future:
Please let me know if there's anything I've overlooked or misunderstood. And don't hesitate to share any additional thoughts or concerns.

Update:
|
|
Method 3 sounds interesting, could be an interesting research topic!

It sounds like Method 2 is similar to what we implement, with the difference of accounting for processing time. Could be worth implementing, but right now I don't notice too much processing time overhead (since probes are handled in-dataplane, without dispatch). I think this would be good in a separate PR :)

As for the interface bind, I think we can flesh out the interface filtering semantics later; what you have right now is OK, I think. One part I'm not super clear on right now is how you intend to bind to multiple interfaces and send/recv. Looks like we'd need to create multiple binds: https://github.com/encodeous/nylon/blob/main/polyamide/device/device.go#L509

And be able to send via some bind (so that the kernel can just use the default src addr):

Maybe you can try it out and see :P
|
I pushed a new version that works by attaching auxiliary (control-message) information to each sendmsg call. This approach does not require binding the socket to a specific interface or local IP. While it might not be as efficient as socket-level binding, the implementation is much simpler. Maybe this is a suitable choice for the initial implementation.

Please note that the configuration-related code has not been completed yet. This version is a PoC for the multi-interface sending mechanism. Additionally, the multi-interface sending mechanism for macOS has not yet been implemented.
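For readers unfamiliar with per-packet egress selection, here is a minimal, Linux-only sketch of the general technique using only the Go standard library: an `IP_PKTINFO` control message is built by hand and passed as out-of-band data to `WriteMsgUDP`. The helper names `pktinfoOOB` and `pktinfoIfindex` are invented for this example, and this is not the code from the PR; it only covers IPv4 (IPv6 would use `IPV6_PKTINFO`).

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"unsafe"
)

// pktinfoOOB builds one IP_PKTINFO control message for an IPv4 sendmsg call.
// ifindex selects the egress interface (0 lets the kernel choose) and src,
// when it is an IPv4 address, sets the source address.
func pktinfoOOB(ifindex int, src net.IP) []byte {
	oob := make([]byte, syscall.CmsgSpace(syscall.SizeofInet4Pktinfo))
	hdr := (*syscall.Cmsghdr)(unsafe.Pointer(&oob[0]))
	hdr.Level = syscall.IPPROTO_IP
	hdr.Type = syscall.IP_PKTINFO
	hdr.SetLen(syscall.CmsgLen(syscall.SizeofInet4Pktinfo))
	// The payload starts right after the aligned header.
	info := (*syscall.Inet4Pktinfo)(unsafe.Pointer(&oob[syscall.CmsgLen(0)]))
	info.Ifindex = int32(ifindex)
	if v4 := src.To4(); v4 != nil {
		copy(info.Spec_dst[:], v4)
	}
	return oob
}

// pktinfoIfindex parses oob back and returns the interface index it carries.
// It exists only to check that the message is well formed.
func pktinfoIfindex(oob []byte) (int, error) {
	msgs, err := syscall.ParseSocketControlMessage(oob)
	if err != nil {
		return 0, err
	}
	for _, m := range msgs {
		if m.Header.Level == syscall.IPPROTO_IP && m.Header.Type == syscall.IP_PKTINFO &&
			len(m.Data) >= syscall.SizeofInet4Pktinfo {
			info := (*syscall.Inet4Pktinfo)(unsafe.Pointer(&m.Data[0]))
			return int(info.Ifindex), nil
		}
	}
	return 0, fmt.Errorf("no IP_PKTINFO control message found")
}

func main() {
	lo, err := net.InterfaceByName("lo")
	if err != nil {
		fmt.Println("no loopback interface:", err)
		return
	}
	recv, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer recv.Close()
	// One wildcard socket; the egress is chosen per packet, not per socket.
	send, err := net.ListenUDP("udp4", &net.UDPAddr{})
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer send.Close()

	oob := pktinfoOOB(lo.Index, net.IPv4(127, 0, 0, 1))
	idx, _ := pktinfoIfindex(oob)
	_, _, err = send.WriteMsgUDP([]byte("probe"), oob, recv.LocalAddr().(*net.UDPAddr))
	fmt.Println("ifindex in control message:", idx, "send error:", err)
}
```

The trade-off matches the comment above: a single socket stays unbound and each packet carries its own egress hint, at the cost of building a control message per send.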
6bef547 to f0ab00d
Background
Currently, nylon employs a "one neighbor, many candidate endpoints, one best endpoint" model. In this implementation, a neighbor is keyed strictly by its NodeId. While multiple remote endpoints can be configured for a single peer, only the single best-performing endpoint is active for routing at any given time.
This approach has several limitations:
To address these constraints, this PR implements a Multi-Link Routing design. The core change shifts the routing adjacency from a "node-level" view to a "link-level" view. By introducing LocalBind (local interface/source selection), each unique (Peer, LocalBind, RemoteEndpoint) tuple is now treated as an independent routing link. This allows the router to independently track metrics for multiple paths between the same two nodes and select the optimal link for traffic based on real-time performance and local policy.
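A minimal sketch of what "link-level" keying means in practice is shown below. The PR introduces its own identifiers for this; the `linkKey` and `linkStats` types, their field layout, and the addresses and RTT values here are assumptions made for illustration and should not be read as the PR's actual definitions.

```go
package main

import (
	"fmt"
	"net/netip"
)

// linkKey identifies one link as a (peer, local bind, remote endpoint) tuple.
// All fields are comparable, so the struct can be used directly as a map key.
type linkKey struct {
	Peer      string         // node ID of the neighbor
	LocalBind string         // local interface/source label (hypothetical form)
	Remote    netip.AddrPort // remote endpoint address and port
}

// linkStats holds a per-link metric; a single smoothed RTT is assumed here.
type linkStats struct {
	RTTMillis float64
}

// bestLink returns the lowest-RTT link to the given peer, if any exists.
func bestLink(links map[linkKey]linkStats, peer string) (linkKey, bool) {
	var best linkKey
	found := false
	for k, s := range links {
		if k.Peer != peer {
			continue
		}
		if !found || s.RTTMillis < links[best].RTTMillis {
			best, found = k, true
		}
	}
	return best, found
}

// exampleLinks builds two links to the same peer and the same remote address
// that differ only in the local bind, so they are tracked independently.
func exampleLinks() map[linkKey]linkStats {
	remote := netip.MustParseAddrPort("203.0.113.10:40000") // documentation address, arbitrary port
	return map[linkKey]linkStats{
		{Peer: "B", LocalBind: "eth0", Remote: remote}:  {RTTMillis: 42},
		{Peer: "B", LocalBind: "wlan0", Remote: remote}: {RTTMillis: 17},
	}
}

func main() {
	links := exampleLinks()
	best, ok := bestLink(links, "B")
	fmt.Println("links tracked:", len(links))
	fmt.Println("best local bind:", best.LocalBind, "found:", ok)
}
```

Under a node-level key the two entries above would collapse into one neighbor; with the tuple as the key, each keeps its own metric and either can be selected.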
Full design: docs/reference/multi-link-routing.mdx.

What changed

- … LocalBindID, RemoteEndpointID, LinkID, and Link; store links in RouterState; key selected routes by next-hop link (SelRoute.NhLink).
- … IP_PKTINFO) so the same remote address on different binds is a distinct link.
- … local bind × remote endpoint product, track probes by link, dedupe duplicate transport tuples, and skip bind/endpoint address-family mismatches.
- … TCElement.ToEp from the selected link endpoint so data follows the selected link rather than the peer default endpoint.

Supporting commits

- fix(conn): StdNetBind.Send reused a pooled net.UDPAddr whose IP slice could have been shrunk to 4 bytes by a prior IPv4 send, truncating the next IPv6 destination (e.g. 2001:db8::1 → 2001:db8::). The link then never collected RTT samples and its metric stayed at INF. Resize the slice before copying.
- perf(core): batch a received bundle's control packets into a single dispatch that recomputes routes at most once, and coalesce pong-driven recomputation behind a pending flag, to avoid saturating the dispatch queue on multi-link meshes.

Testing
The feature has so far been tested on 12 nodes deployed in different geographic regions over the public Internet, running continuously for more than 24 hours. The test coverage includes: