diff --git a/_posts/2018-07-13-about-mtu.md b/_posts/2018-07-13-about-mtu.md new file mode 100644 index 0000000..1d3cd49 --- /dev/null +++ b/_posts/2018-07-13-about-mtu.md @@ -0,0 +1,134 @@ +--- +layout: post +title: "About MTUs in Yggdrasil" +date: 2018-07-13 10:24:00 +0100 +author: Neil Alexander +--- + +### That's a very big MTU. + +You might have noticed that Yggdrasil doesn't conform to the standard MTUs you +might have seen elsewhere. Most Ethernet networks sit somewhere around the 1500 +byte mark. Less if VPN tunnels are involved due to the overheads of those VPN +protocols. So you might have been surprised to see an interface MTU as high as +65535 bytes in your `yggdrasil.conf`! + +In fact, this is not a mistake. It is very deliberate. + +The MTU is a configurable option which determines how big a packet should be +before you should break out into a new packet. With every packet, the IP headers +and other possible headers (i.e. from TCP or UDP) have to be repeated. + +In addition to that, the operating system maintains a link MTU setting for each +network adapter - effectively a value that says "packets smaller than this +number of bytes are safe to send on this link in one piece". Any larger than +that and the operating system will have to fragment the packet down into smaller +ones before sending out onto the wire. + +With a smaller MTU, you will be forced to re-send the IP headers (and possibly +other headers) along with every single packet. Those are not only wasted bytes +on the wire, but every packet likely requires a new set of system calls to +handle. In the case of Yggdrasil, we rely on system calls not just for socket +operations but also for TUN/TAP. Each system call requires a context switch, +which is a slow operation. On embedded platforms, this can be a real killer for +performance - in fact, on an EdgeRouter X, context switching for the TUN/TAP +adapter is a greater bottleneck than the cryptographic algorithms themselves! + +### Selecting TCP instead of UDP + +Therefore it is reasonable to surmise that less context switches and less packet +headers are better. Using an MTU of 65535 over 1500 packs in almost 43 times +more data before the next set of headers. That's almost 840 bytes saved from IP +headers alone over 65535 bytes! But sending a 65535 byte Yggdrasil packet over a +link with an MTU of 1500 would require the packet to be fragmented 43 times, +right? Well, it would if we were using a stateless protocol like UDP. + +Instead, Yggdrasil uses TCP connections for peerings. This not only allows us to +take advantage of SOCKS proxies in a way that we cannot with UDP, but it also +gives us stream properties on our connections instead of being manually forced +to chunk UDP packets. In fact, we did this in the past, and it was ugly, and +actually worse than TCP performance-wise in many cases. + +TCP will adjust the window size to match the lower link - in this case a +probable 1500 MTU - and will stream the large packet in chunks until it arrives +at the remote side in a single piece. For this reason, Yggdrasil never has to +care about the MTU of the real link between you and your peers. It just receives +a packet up to 65535 bytes, which it then processes and hands off to the TUN/TAP +adapter. + +### But... TCP over TCP? + +But that means that you are tunnelling TCP over TCP, I hear you cry. That's +crazy talk, surely? The big problem with TCP-over-TCP is that in the event of +congestion or packet loss, TCP will attempt to retransmit the failed packets, +but if TCP control packets from the inner connection are retransmitted, +reordered etc. by the encapsulating TCP connection, this results in a +substantial performance drop whilst the operating system throttles down that +inner TCP connection to cope with what it believes to be congestion. + +However, by using large MTUs, and therefore larger window sizes on the inner TCP +connection, we send far less TCP control messages over the wire - as many as 43 +times less - therefore there are far less control messages from the inner +connection which may be affected or amplified by retransmission. This helps to +stabilise performance on the inner TCP connections. + +There are also some other clever things taking place at the Yggdrasil TCP layer +- in particular, LIFO queues are used for session traffic, which results in +reordering at almost the first sign of any congestion on the encapsulating link. +TCP very eagerly backs off in this instance, which helps to deal with real +congestion on the link in a sane manner. + +### Agreeing on an MTU + +A big problem we came across was that different operating systems have different +maximum MTUs. Linux, macOS and Windows all allow maximum MTUs of 65535 bytes, +but FreeBSD only allows half that at 32767 bytes, OpenBSD lower again at 16384 +bytes and NetBSD only allowing up to a paltry 9000 bytes! Not to mention that +there might also be good reasons for adjusting the MTU by hand, hence the +`IfMTU` option in `yggdrasil.conf`. + +We need a way for any two Yggdrasil nodes to agree on a suitable MTU that works +for both parties. To get around this, each node sends its own MTU to the remote +side when establishing an encrypted traffic session. Knowing both your own MTU +and the MTU of the remote side allows you to select the lower of the two MTUs, +as the lower of the two MTUs is guaranteed to be safe for both parties. In fact, +you can even see what the agreed MTU of a session is by using `yggdrasilctl +getSessions` or by sending a `getSessions` request to the admin socket. + +### Using ICMPv6 to signal userspace + +So what happens if you establish a session with someone who has a smaller MTU +configured than yourself? The session negotiation and MTU exchange will allow +Yggdrasil to select a lower MTU for the session, but your TUN/TAP adapter on +your own side still has the larger of the two MTUs. How do we tell userspace +about the new MTU? + +IPv6 implements a `Packet Too Big` packet type which is designed to facilitate +Path MTU Discovery on standard IPv6 networks. If you send a large packet, a node +en-route (like a router) can send back an ICMPv6 `Packet Too Big` message to the +sender, signifying that the packet that you sent exceeds the largest supported +packet size (effectively an upstream link MTU). The sending node can then +fragment the packet and resend it in smaller chunks so that it matches that +lower MTU on the path. + +Yggdrasil takes advantage of this behaviour by generating `Packet Too Big` +control messages and sending them back to the TUN/TAP adapter when it detects +that you have tried to send a packet across an Yggdrasil session that is larger +than the agreed MTU. + +This causes the operating system to fragment the packets down to the MTU size +supported by the session and resend them. In addition, the operating system +caches this newly discovered MTU for the future so that it fragments correctly +for that destination from that point forward. In effect, Yggdrasil "tricks" the +operating system into thinking that Path MTU Discovery is *really* taking place +on the link. + +### Conclusion + +You will see that Yggdrasil performs a number of tricks to allow flexible MTUs +between nodes, and takes advantage of the stream nature of TCP to provide the +greatest level of performance and flexibility. There are a number of moving +parts to this - the session negotiation, ICMPv6 message generation and the +choice of TCP over UDP for peering connections. However, I hope this demystifies +some of our choices and why we have encouraged the use of larger MTUs on our +traffic sessions. diff --git a/_posts/2018-07-13-test.md b/_posts/2018-07-13-test.md deleted file mode 100644 index 5a7b647..0000000 --- a/_posts/2018-07-13-test.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -layout: post -title: "This is a test post!" -date: 2018-07-13 09:13:00 +1000 -categories: test -author: Neil Alexander ---- - -This is a test post. - -**Here is some Markdown.** - -Excellent!