mirror of
https://github.com/cwinfo/yggdrasil-network.github.io.git
synced 2024-11-14 20:50:27 +00:00
133 lines
7.0 KiB
Markdown
133 lines
7.0 KiB
Markdown
---
|
|
layout: post
|
|
title: "About huge MTUs in Yggdrasil"
|
|
date: 2018-07-13 10:24:00 +0100
|
|
author: Neil Alexander
|
|
---
|
|
|
|
### That's a very big MTU!
|
|
|
|
You might have noticed that Yggdrasil doesn't conform to the standard MTUs you
|
|
might have seen elsewhere. Most Ethernet networks sit somewhere around the 1500
|
|
byte mark. Less if VPN tunnels are involved due to the overheads of those VPN
|
|
protocols. So you might have been surprised to see an interface MTU as high as
|
|
65535 bytes in your `yggdrasil.conf`!
|
|
|
|
In fact, this is not a mistake. It is very deliberate.
|
|
|
|
The MTU is a configurable option which determines how big a packet should be
|
|
before you should break out into a new packet. In addition to that, the
|
|
operating system maintains a link MTU setting for each network adapter -
|
|
effectively a value that says "packets equal to or smaller than this number of
|
|
bytes are safe to send on this link in one piece". Any larger than that and the
|
|
operating system will have to fragment the packet down into smaller ones before
|
|
sending out onto the wire.
|
|
|
|
With a smaller MTU, you will be forced to re-send the IP headers (and possibly
|
|
other headers) far more often with every single packet. Those are not only
|
|
wasted bytes on the wire, but every packet likely requires a new set of system
|
|
calls to handle. In the case of Yggdrasil, we rely on system calls not just for
|
|
socket operations but also for TUN/TAP. Each system call requires a context
|
|
switch, which is a slow operation. On embedded platforms, this can be a real
|
|
killer for performance - in fact, on an EdgeRouter X, context switching for the
|
|
TUN/TAP adapter is a greater bottleneck than the cryptographic algorithms
|
|
themselves!
|
|
|
|
### Selecting TCP instead of UDP
|
|
|
|
Therefore it is reasonable to surmise that less context switches and less packet
|
|
headers are better. Using an MTU of 65535 over 1500 packs in almost 43 times
|
|
more data before the next set of headers. That's almost 1680 bytes saved from IP
|
|
headers alone over 65535 bytes! But sending a 65535 byte Yggdrasil packet over a
|
|
link with an MTU of 1500 would require the packet to be fragmented 43 times,
|
|
right?
|
|
|
|
Instead, Yggdrasil uses TCP connections for peerings. This not only allows us to
|
|
take advantage of SOCKS proxies (and Tor) in a way that we cannot with UDP, but
|
|
it also gives us stream properties on our connections instead of being manually
|
|
forced to chunk UDP packets. In fact, we did this in the past, and it was ugly,
|
|
and actually worse than TCP performance-wise in many cases.
|
|
|
|
TCP will adjust the window size to match the lower link - in this case a
|
|
probable 1500 MTU - and will stream the large packet in chunks until it arrives
|
|
at the remote side in a single piece. For this reason, Yggdrasil never has to
|
|
care about the MTU of the real link between you and your peers. It just receives
|
|
a packet up to 65535 bytes, which it then processes and hands off to the TUN/TAP
|
|
adapter.
|
|
|
|
### But... TCP over TCP?
|
|
|
|
But that means that you are tunnelling TCP over TCP, I hear you cry. That's
|
|
crazy talk, surely? The big problem with TCP-over-TCP is that in the event of
|
|
congestion or packet loss, TCP will attempt to retransmit the failed packets,
|
|
but if TCP control packets from the inner connection are retransmitted,
|
|
reordered etc. by the encapsulating TCP connection, this results in a
|
|
substantial performance drop whilst the operating system throttles down that
|
|
inner TCP connection to cope with what it believes to be congestion.
|
|
|
|
However, by using large MTUs, and therefore larger window sizes on the inner TCP
|
|
connection, we send far less TCP control messages over the wire - possibly up to
|
|
as many as 43 times less - therefore there are far less opportunities for
|
|
control messages on the inner connection to be affected or amplified by
|
|
retransmission. This helps to stabilise performance on the inner TCP
|
|
connections.
|
|
|
|
There are also some other clever things taking place at the Yggdrasil TCP layer.
|
|
In particular, LIFO queues are used for session traffic, which results in
|
|
reordering at almost the first sign of any congestion on the encapsulating link.
|
|
TCP very eagerly backs off in this instance, which helps to deal with real
|
|
congestion on the link in a sane manner.
|
|
|
|
### Agreeing on an MTU
|
|
|
|
A big problem we came across was that different operating systems have different
|
|
maximum MTUs. Linux, macOS and Windows all allow maximum MTUs of 65535 bytes,
|
|
but FreeBSD only allows half that at 32767 bytes, OpenBSD lower again at 16384
|
|
bytes and NetBSD only allowing up to a paltry 9000 bytes! Not to mention that
|
|
there might also be good reasons for adjusting the MTU by hand, hence the
|
|
`IfMTU` option in `yggdrasil.conf`.
|
|
|
|
We need a way for any two Yggdrasil nodes to agree on a suitable MTU that works
|
|
for both parties. To get around this, each node sends its own MTU to the remote
|
|
side when establishing an encrypted traffic session. Knowing both your own MTU
|
|
and the MTU of the remote side allows you to select the lower of the two MTUs,
|
|
as the lower of the two MTUs is guaranteed to be safe for both parties. In fact,
|
|
you can even see what the agreed MTU of a session is by using `yggdrasilctl
|
|
getSessions` or by sending a `getSessions` request to the admin socket.
|
|
|
|
### Using ICMPv6 to signal userspace
|
|
|
|
So what happens if you establish a session with someone who has a smaller MTU
|
|
configured than yourself? The session negotiation and MTU exchange will allow
|
|
Yggdrasil to select a lower MTU for the session, but your TUN/TAP adapter on
|
|
your own side still has the larger of the two MTUs. How do we tell userspace
|
|
about the new MTU?
|
|
|
|
IPv6 implements a `Packet Too Big` packet type which is designed to facilitate
|
|
Path MTU Discovery on standard IPv6 networks. If you send a large packet, a node
|
|
en-route (like a router) can send back an ICMPv6 `Packet Too Big` message to the
|
|
sender, signifying that the packet that you sent exceeds the largest supported
|
|
packet size (effectively an upstream link MTU). The sending node can then
|
|
fragment the packet and resend it in smaller chunks so that it matches that
|
|
lower MTU on the path. Yggdrasil takes advantage of this behaviour by generating
|
|
`Packet Too Big` control messages and sending them back to the TUN/TAP adapter
|
|
when it detects that you have tried to send a packet across an Yggdrasil session
|
|
that is larger than the agreed MTU.
|
|
|
|
This causes the operating system to fragment the packets down to the MTU size
|
|
supported by the session and resend them. In addition, the operating system
|
|
caches this newly discovered MTU for the future so that it fragments correctly
|
|
for that destination from that point forward. In effect, Yggdrasil "tricks" the
|
|
operating system into thinking that Path MTU Discovery is *really* taking place
|
|
on the link.
|
|
|
|
### Conclusion
|
|
|
|
You will see that Yggdrasil performs a number of tricks to allow flexible MTUs
|
|
between nodes, and takes advantage of the stream nature of TCP to provide the
|
|
greatest level of performance and flexibility. There are a number of moving
|
|
parts to this - the session negotiation, ICMPv6 message generation and the
|
|
choice of TCP over UDP for peering connections. However, I hope this demystifies
|
|
some of our choices and why we have encouraged the use of larger MTUs on our
|
|
traffic sessions.
|