---
layout: post
title: "About huge MTUs in Yggdrasil"
date: 2018-07-13 10:24:00 +0100
author: Neil Alexander
---
### That's a very big MTU!
You might have noticed that Yggdrasil doesn't conform to the standard MTUs you
will have seen elsewhere. Most Ethernet networks sit somewhere around the
1500-byte mark - less if VPN tunnels are involved, due to the overheads of
those VPN protocols. So you might have been surprised to see an interface MTU
as high as 65535 bytes in your `yggdrasil.conf`!

In fact, this is not a mistake. It is very deliberate.

The MTU is a configurable option that determines how big a packet can get
before the data must be split into a new packet. In addition, the
operating system maintains a link MTU setting for each network adapter -
effectively a value that says "packets equal to or smaller than this number of
bytes are safe to send on this link in one piece". Any larger than that and the
operating system will have to fragment the packet down into smaller ones before
sending out onto the wire.

With a smaller MTU, you will be forced to re-send the IP headers (and possibly
other headers) far more often with every single packet. Those are not only
wasted bytes on the wire, but every packet likely requires a new set of system
calls to handle. In the case of Yggdrasil, we rely on system calls not just for
socket operations but also for TUN/TAP. Each system call requires a context
switch, which is a slow operation. On embedded platforms, this can be a real
killer for performance - in fact, on an EdgeRouter X, context switching for the
TUN/TAP adapter is a greater bottleneck than the cryptographic algorithms
themselves!

### Selecting TCP instead of UDP
Therefore it is reasonable to surmise that fewer context switches and fewer
packet headers are better. Using an MTU of 65535 instead of 1500 packs in
almost 43 times more data before the next set of headers - that's almost 1680
bytes saved from IP headers alone over 65535 bytes! But sending a 65535-byte
Yggdrasil packet over a link with an MTU of 1500 would require the packet to be
fragmented into around 43 pieces, right?
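
As a back-of-envelope check of those figures, here is a tiny sketch (a
hypothetical helper, not Yggdrasil code; it counts only IPv6 headers and
ignores the small amount of payload each extra header displaces):

```go
package main

import "fmt"

// packetsNeeded counts how many packets carry `data` bytes at a given
// MTU (integer ceiling division; per-packet header overhead is ignored
// to keep the arithmetic as rough as the figures in the text).
func packetsNeeded(data, mtu int) int {
	return (data + mtu - 1) / mtu
}

func main() {
	const data = 65535    // bytes of payload to move
	const ipv6Header = 40 // bytes per IPv6 header

	small := packetsNeeded(data, 1500)  // 44 packets at a 1500-byte MTU
	large := packetsNeeded(data, 65535) // 1 packet at a 65535-byte MTU

	fmt.Println("packets:", small, "vs", large)
	fmt.Println("IPv6 header bytes saved:", (small-large)*ipv6Header)
}
```

With 44 packets instead of 1, that is 43 extra 40-byte IPv6 headers, or 1720
bytes - in the same ballpark as the rough figure quoted above.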

Instead, Yggdrasil uses TCP connections for peerings. This not only allows us to
take advantage of SOCKS proxies (and Tor) in a way that we cannot with UDP, but
it also gives us stream properties on our connections, rather than forcing us
to manually chunk packets over UDP. In fact, we did this in the past, and it
was ugly - and in many cases it actually performed worse than TCP.

TCP will adjust its segment size to match the lower link MTU - in this case
probably 1500 bytes - and will stream the large packet in chunks until it
arrives at the remote side in one piece. For this reason, Yggdrasil never has
to care about the MTU of the real link between you and your peers. It just
receives a packet of up to 65535 bytes, which it then processes and hands off
to the TUN/TAP adapter.

### But... TCP over TCP?
But that means that you are tunnelling TCP over TCP, I hear you cry. That's
crazy talk, surely? The big problem with TCP-over-TCP is that in the event of
congestion or packet loss, TCP will attempt to retransmit the failed packets,
and if TCP control packets from the inner connection are retransmitted,
reordered and so on by the encapsulating TCP connection, the result is a
substantial performance drop whilst the operating system throttles down that
inner TCP connection to cope with what it believes to be congestion.

However, by using large MTUs, and therefore larger window sizes on the inner
TCP connection, we send far fewer TCP control messages over the wire - possibly
as many as 43 times fewer - so there are far fewer opportunities for control
messages on the inner connection to be affected or amplified by retransmission.
This helps to stabilise performance on the inner TCP connections.

There are also some other clever things taking place at the Yggdrasil TCP layer.
In particular, LIFO queues are used for session traffic, which results in
reordering at almost the first sign of any congestion on the encapsulating link.
TCP very eagerly backs off in this instance, which helps to deal with real
congestion on the link in a sane manner.

### Agreeing on an MTU
A big problem we came across was that different operating systems have
different maximum MTUs. Linux, macOS and Windows all allow maximum MTUs of
65535 bytes, but FreeBSD only allows half that at 32767 bytes, OpenBSD is
lower again at 16384 bytes, and NetBSD only allows up to a paltry 9000 bytes!
Not to mention that there might also be good reasons for adjusting the MTU by
hand, hence the `IfMTU` option in `yggdrasil.conf`.

We need a way for any two Yggdrasil nodes to agree on a suitable MTU that works
for both parties. To achieve this, each node sends its own MTU to the remote
side when establishing an encrypted traffic session. Knowing both your own MTU
and the MTU of the remote side allows you to select the lower of the two, as
the lower value is guaranteed to be safe for both parties. In fact, you can
even see the agreed MTU of a session by using `yggdrasilctl getSessions` or by
sending a `getSessions` request to the admin socket.
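
The negotiation itself boils down to taking the minimum of the two advertised
values. A minimal sketch (the function name is illustrative, not Yggdrasil's
actual API):

```go
package main

import "fmt"

// negotiateMTU returns the MTU a session should use: each side
// advertises its own interface MTU during session setup, and the
// lower of the two values is guaranteed to be safe for both parties.
func negotiateMTU(local, remote uint16) uint16 {
	if remote < local {
		return remote
	}
	return local
}

func main() {
	// e.g. a Linux node (max 65535) peering with an OpenBSD node (max 16384)
	fmt.Println(negotiateMTU(65535, 16384)) // prints 16384
}
```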

### Using ICMPv6 to signal userspace
So what happens if you establish a session with someone who has a smaller MTU
configured than your own? The session negotiation and MTU exchange will allow
Yggdrasil to select a lower MTU for the session, but the TUN/TAP adapter on
your own side still has the larger of the two MTUs. How do we tell userspace
about the new MTU?

IPv6 implements a `Packet Too Big` message type which is designed to facilitate
Path MTU Discovery on standard IPv6 networks. If you send a large packet, a
node en route (like a router) can send back an ICMPv6 `Packet Too Big` message
to the sender, signalling that the packet you sent exceeds the largest
supported packet size (effectively an upstream link MTU). The sending node can
then fragment the packet and resend it in smaller chunks so that it fits within
that lower MTU on the path. Yggdrasil takes advantage of this behaviour by
generating `Packet Too Big` control messages and sending them back to the
TUN/TAP adapter when it detects that you have tried to send a packet across a
Yggdrasil session that is larger than the agreed MTU.

This causes the operating system to fragment the packets down to the MTU size
supported by the session and resend them. In addition, the operating system
caches this newly discovered MTU for the future so that it fragments correctly
for that destination from that point forward. In effect, Yggdrasil "tricks" the
operating system into thinking that Path MTU Discovery is *really* taking place
on the link.

### Conclusion
As you can see, Yggdrasil performs a number of tricks to allow flexible MTUs
between nodes, and takes advantage of the stream nature of TCP to provide the
greatest level of performance and flexibility. There are several moving parts
to this - the session negotiation, the ICMPv6 message generation and the choice
of TCP over UDP for peering connections - but I hope this demystifies some of
our choices and why we have encouraged the use of larger MTUs on our traffic
sessions.