diff --git a/_posts/2021-06-26-v0-4-prerelease-benchmarks.md b/_posts/2021-06-26-v0-4-prerelease-benchmarks.md index 41e9a9e..53998ab 100644 --- a/_posts/2021-06-26-v0-4-prerelease-benchmarks.md +++ b/_posts/2021-06-26-v0-4-prerelease-benchmarks.md @@ -7,11 +7,13 @@ author: Arceliar ### The Problem with v0.3 -In the current stable release of Yggdrasil, `v0.3.16`, routing works basically the same way that it has always worked since release. Traffic is forwarded by greedy routing in a metric space. In essence, each node has a distance label (`coords` in the code), and given the distance label of any two nodes, you can calculate the distance of some path between them. Traffic is forwarded to whichever peer minimizes that distance to the destination. This has been discussed in an [earlier blog post](2018-07-17-world-tree.md), so lets not worry about the details of how it works right now. We'll just focus on what happens when it *doesn't* work. +In the current stable release of Yggdrasil, `v0.3.16`, routing works basically the same way that it has always worked since release. Traffic is forwarded by greedy routing in a metric space. In essence, each node has a distance label (`coords` in the code), and given the distance label of any two nodes, you can calculate the distance of some path between them. Traffic is forwarded to whichever peer minimizes that distance to the destination. This has been discussed in an [earlier blog post](2018-07-17-world-tree.md), so lets not worry about the details of how it works for now. Instead, we'll focus on what happens when it *doesn't* work. -To be able to send traffic to a destination `D`, the sender `S` must look up the node's distance label and key in the DHT. This happens just before session setup, where ephemeral keys are exchanged. You can think of it a bit like a DNS lookup: it maps some static information (the node's Yggdrasil IPv6 address) onto some dynamic information (the node's distance label). If anything happens to the network that causes the destination node `D`'s distance label to change, then all traffic to `D` will drop until the `S` can look up `D`'s new distance label. However, that lookup depends on the DHT, and the DHT *also* uses distance labels for communication, so DHT lookups for `D` will fail for some amount of time until that completes. While that's happening, `S` cannot communicate with `D`, even if the path between `S` and `D` is unaffected. Further exacerbating the problem, the DHT search is an iterative process that requires round trip communication with multiple nodes. These nodes essentially random, meaning most of them are likely to be near the edge of the network, where connections are comparatively unreliable and costly to use. If any part of the lookup fails, then this delays search progress (if it doesn't cause the search to fail entirely). +To be able to send traffic to a destination `D`, the sender `S` must look up the node's distance label and key in the DHT. This happens just before session setup, where ephemeral keys are exchanged. You can think of it a bit like a DNS lookup: it maps some known static information (the node's Yggdrasil IPv6 address) onto some unknown or dynamic information (the node's static key and dynamic distance label). If anything happens to the network that causes the destination node `D`'s distance label to change, then all traffic to `D` will drop until the `S` can look up `D`'s new distance label. However, that lookup depends on the DHT, and the DHT *also* uses distance labels for communication, so DHT lookups for `D` will fail for some amount of time, until the out-of-date information about `D` times out or is replaced. While that's happening, `S` cannot communicate with `D`, even if the path between `S` and `D` is unaffected. Further exacerbating the problem, the DHT search is an iterative process, which requires round trip communication with multiple nodes. These nodes are, for the most part, randomly distributed across the physical network, meaning most of them are likely to be near the edge of the network, where connections are comparatively unreliable and costly to use. If any part of the lookup fails, then this delays search progress (if it doesn't cause the search to fail entirely). -The network tries to combat these problems by having `D` refresh itself in the DHT and send a notification to `S` when `D`'s distance label changes. However, there is no guarantee that `D` knows every node which is tracking it in the DHT, and these notifications will hit a dead and and be dropped if the distance labels of the recipients have also changed. This often happens if `S` and `D` share a common ancestor in the tree. Basically, if `S` and `D` are in a LAN with gateway `G`, and `G`'s connection to the outside world dies, this disrupts the connection between `S` and `D` (and leaves the DHT in a broken state, where they can't look each other up, until things time out). +The network tries to combat these problems by having `D` refresh itself in the DHT and send a notification to `S` when `D`'s distance label changes. However, there is no guarantee that `D` knows every node which is tracking it in the DHT, and these notifications will hit a dead and and be dropped if the distance labels of the recipients have also changed. This often happens if `S` and `D` share a common ancestor in the tree. + +To give a concrete example, if `S` and `D` are in a LAN with gateway `G`, and `G`'s connection to the outside world dies, then this disrupts the traffic flow between `S` and `D`. That happens even when the path between them in their own network is unaffected. It also causes various issues in the DHT, which hurt performance for the network in general, and prevents `S` and `D` in particular from being able to resume communication. ### Improvements in v0.4 @@ -26,7 +28,7 @@ Since it may take a while to see how this affects performance in a live network, All of the results shown here are from [meshnet-lab](https://github.com/mwarning/meshnet-lab). You should probably just read the documentation if you want to know more, but to summarize: meshnet-lab simulates mesh networks using network namespace on linux. Each node is give a network namespace, which can be linked to other namespaces to simulate an arbitrary topology. Links are added and removed as needed to e.g. simulate movement in a mobile adhoc network. -Although meshnet-lab supports many other mesh networking protocols, this post will focus on comparing Yggdrasil `v0.3.16` (the latest stable release) with `v0.4rc3` (the most recent release candidate). The point of this post is to highlight what kind of performance changes we expect to see in the new Yggdrasil release, not to compare Yggdrasil to other mesh routers. +Although meshnet-lab supports many other mesh networking protocols, this post will focus on comparing Yggdrasil `v0.3.16` (the latest stable release) with `v0.4rc3` (the most recent release candidate). Comparisons with other mesh routers would be interesting, but it would be best if those were done by an unbiased 3rd party (and using a stable `v0.4.X` release instead of a release candidate). Instead, this post will try to highlight (qualitatively) what sort of performance changes we expect to see in the new release. #### Mobility1 @@ -59,7 +61,7 @@ The `scalability1` test set involves running the network over line, tree, or squ ![scalability1-rtree](/assets/images/2021-06-26/scalability1-rtree.svg) ![scalability1-grid](/assets/images/2021-06-26/scalability1-grid4.svg) -There's not a whole lot to say here, `v0.4rc3` is just an improvement across the board. Note that it's a little surprising how the bandwidth use *decreases* as the network grows. This may be an artifact of how the test works. Each network measures reliability by pinging a fixed number of paths (200). The bandwidth used by these pings counts towards the test results. In the line network, increasing the network size also increases the path length at an equal rate, so the bandwidth use per node stays about the same. In the grid and rtree networks, path length doesn't increases as rapidly as the number of nodes, so the bandwith from the 200 test pings is increasing slower than the network size, which results in decreased average bandwidth use per node. In the future, it may be interesting to run some variation of this test without the pings, to get a better measurement of how the idle protocol traffic scales. +There's not a whole lot to say here, `v0.4rc3` is just an improvement across the board. Note that it's a little surprising how the bandwidth use *decreases* as the network grows. This may be an artifact of how the test works, since a fixed number of pings may represent proportionally more traffic in small network, but that's speculation. ### Conclusion