5
0
mirror of https://github.com/cwinfo/yggdrasil-network.github.io.git synced 2024-11-10 03:20:25 +00:00
yggdrasil-network.github.io/_posts/2018-07-13-about-mtu.md

135 lines
7.2 KiB
Markdown
Raw Normal View History

2018-07-13 09:25:12 +00:00
---
layout: post
title: "About MTUs in Yggdrasil"
date: 2018-07-13 10:24:00 +0100
author: Neil Alexander
---
### That's a very big MTU.
You might have noticed that Yggdrasil doesn't conform to the standard MTUs you
might have seen elsewhere. Most Ethernet networks sit somewhere around the 1500
byte mark. Less if VPN tunnels are involved due to the overheads of those VPN
protocols. So you might have been surprised to see an interface MTU as high as
65535 bytes in your `yggdrasil.conf`!
In fact, this is not a mistake. It is very deliberate.
The MTU is a configurable option which determines how big a packet should be
before you should break out into a new packet. With every packet, the IP headers
and other possible headers (i.e. from TCP or UDP) have to be repeated.
In addition to that, the operating system maintains a link MTU setting for each
network adapter - effectively a value that says "packets smaller than this
number of bytes are safe to send on this link in one piece". Any larger than
that and the operating system will have to fragment the packet down into smaller
ones before sending out onto the wire.
With a smaller MTU, you will be forced to re-send the IP headers (and possibly
other headers) along with every single packet. Those are not only wasted bytes
on the wire, but every packet likely requires a new set of system calls to
handle. In the case of Yggdrasil, we rely on system calls not just for socket
operations but also for TUN/TAP. Each system call requires a context switch,
which is a slow operation. On embedded platforms, this can be a real killer for
performance - in fact, on an EdgeRouter X, context switching for the TUN/TAP
adapter is a greater bottleneck than the cryptographic algorithms themselves!
### Selecting TCP instead of UDP
Therefore it is reasonable to surmise that less context switches and less packet
headers are better. Using an MTU of 65535 over 1500 packs in almost 43 times
more data before the next set of headers. That's almost 840 bytes saved from IP
headers alone over 65535 bytes! But sending a 65535 byte Yggdrasil packet over a
link with an MTU of 1500 would require the packet to be fragmented 43 times,
right? Well, it would if we were using a stateless protocol like UDP.
Instead, Yggdrasil uses TCP connections for peerings. This not only allows us to
take advantage of SOCKS proxies in a way that we cannot with UDP, but it also
gives us stream properties on our connections instead of being manually forced
to chunk UDP packets. In fact, we did this in the past, and it was ugly, and
actually worse than TCP performance-wise in many cases.
TCP will adjust the window size to match the lower link - in this case a
probable 1500 MTU - and will stream the large packet in chunks until it arrives
at the remote side in a single piece. For this reason, Yggdrasil never has to
care about the MTU of the real link between you and your peers. It just receives
a packet up to 65535 bytes, which it then processes and hands off to the TUN/TAP
adapter.
### But... TCP over TCP?
But that means that you are tunnelling TCP over TCP, I hear you cry. That's
crazy talk, surely? The big problem with TCP-over-TCP is that in the event of
congestion or packet loss, TCP will attempt to retransmit the failed packets,
but if TCP control packets from the inner connection are retransmitted,
reordered etc. by the encapsulating TCP connection, this results in a
substantial performance drop whilst the operating system throttles down that
inner TCP connection to cope with what it believes to be congestion.
However, by using large MTUs, and therefore larger window sizes on the inner TCP
connection, we send far less TCP control messages over the wire - as many as 43
times less - therefore there are far less control messages from the inner
connection which may be affected or amplified by retransmission. This helps to
stabilise performance on the inner TCP connections.
There are also some other clever things taking place at the Yggdrasil TCP layer
- in particular, LIFO queues are used for session traffic, which results in
reordering at almost the first sign of any congestion on the encapsulating link.
TCP very eagerly backs off in this instance, which helps to deal with real
congestion on the link in a sane manner.
### Agreeing on an MTU
A big problem we came across was that different operating systems have different
maximum MTUs. Linux, macOS and Windows all allow maximum MTUs of 65535 bytes,
but FreeBSD only allows half that at 32767 bytes, OpenBSD lower again at 16384
bytes and NetBSD only allowing up to a paltry 9000 bytes! Not to mention that
there might also be good reasons for adjusting the MTU by hand, hence the
`IfMTU` option in `yggdrasil.conf`.
We need a way for any two Yggdrasil nodes to agree on a suitable MTU that works
for both parties. To get around this, each node sends its own MTU to the remote
side when establishing an encrypted traffic session. Knowing both your own MTU
and the MTU of the remote side allows you to select the lower of the two MTUs,
as the lower of the two MTUs is guaranteed to be safe for both parties. In fact,
you can even see what the agreed MTU of a session is by using `yggdrasilctl
getSessions` or by sending a `getSessions` request to the admin socket.
### Using ICMPv6 to signal userspace
So what happens if you establish a session with someone who has a smaller MTU
configured than yourself? The session negotiation and MTU exchange will allow
Yggdrasil to select a lower MTU for the session, but your TUN/TAP adapter on
your own side still has the larger of the two MTUs. How do we tell userspace
about the new MTU?
IPv6 implements a `Packet Too Big` packet type which is designed to facilitate
Path MTU Discovery on standard IPv6 networks. If you send a large packet, a node
en-route (like a router) can send back an ICMPv6 `Packet Too Big` message to the
sender, signifying that the packet that you sent exceeds the largest supported
packet size (effectively an upstream link MTU). The sending node can then
fragment the packet and resend it in smaller chunks so that it matches that
lower MTU on the path.
Yggdrasil takes advantage of this behaviour by generating `Packet Too Big`
control messages and sending them back to the TUN/TAP adapter when it detects
that you have tried to send a packet across an Yggdrasil session that is larger
than the agreed MTU.
This causes the operating system to fragment the packets down to the MTU size
supported by the session and resend them. In addition, the operating system
caches this newly discovered MTU for the future so that it fragments correctly
for that destination from that point forward. In effect, Yggdrasil "tricks" the
operating system into thinking that Path MTU Discovery is *really* taking place
on the link.
### Conclusion
You will see that Yggdrasil performs a number of tricks to allow flexible MTUs
between nodes, and takes advantage of the stream nature of TCP to provide the
greatest level of performance and flexibility. There are a number of moving
parts to this - the session negotiation, ICMPv6 message generation and the
choice of TCP over UDP for peering connections. However, I hope this demystifies
some of our choices and why we have encouraged the use of larger MTUs on our
traffic sessions.