Path MTU Discovery

Sunday, April 6, 2014

Path MTU Discovery is a technique used to dynamically discover the path MTU from the source to the destination using the DF (Don’t Fragment) bit in the IP header. The path MTU is the smallest MTU of any link along the path, where the path is defined by the IP source, the IP destination and possibly the TOS of the packets. The basic idea is that the source initially assumes the path MTU is equal to the known first-hop MTU and sends IP packets of that size with the DF bit set. If a router along the path has a next-hop link with a smaller MTU, it drops the IP datagram and sends back an ICMP Destination Unreachable message with the code Fragmentation needed and DF set. After receiving this ICMP message, the host reduces its path MTU estimate for that particular path.
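The exchange described above can be sketched as a small simulation. This is only an illustration, not a real network stack: the function name discover_path_mtu and the list of link MTUs are assumptions made for the example, and every router is assumed to report its next-hop MTU in the ICMP error.

```python
def discover_path_mtu(link_mtus):
    """Simulate PMTUD over a path given as a list of link MTUs (hypothetical helper).

    link_mtus[0] is the first-hop MTU known to the source.
    Returns (path_mtu, number_of_icmp_errors_received).
    """
    if not link_mtus:
        raise ValueError("the path needs at least one link")
    pmtu = link_mtus[0]  # initial estimate: the first-hop MTU
    icmp_errors = 0
    while True:
        # Send a datagram of size pmtu with DF set and find the first link
        # whose MTU is too small to carry it.
        bottleneck = next((mtu for mtu in link_mtus if mtu < pmtu), None)
        if bottleneck is None:
            return pmtu, icmp_errors  # the datagram reached the destination
        # The router in front of that link drops the datagram and reports
        # the next-hop MTU (ICMP Type 3, Code 4); the source lowers its estimate.
        icmp_errors += 1
        pmtu = bottleneck


path_mtu, errors = discover_path_mtu([1500, 1400, 1492, 1280])
print(path_mtu, errors)
```

In this sketch the source needs one round trip per successively smaller link, which is why a path with several decreasing MTUs converges in several steps rather than one.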

There are several ways to implement path MTU discovery, as stated in RFC 1191; the main differences are between the router and the host implementations:
 The router must include the MTU of the next hop in the low-order 16 bits of the unused field of the ICMP Destination Unreachable – Fragmentation needed and DF set message (Datagram too big, as the RFC calls it). This is the most used implementation, but it has its flaws.
 The host, on the other hand, may either reduce the path MTU to the next-hop value received in the ICMP Datagram too big message or clear the DF bit in the IP header.
 When the ICMP Datagram too big message does not contain the MTU of the next hop, things get complicated and several different algorithms may be implemented by the hosts.

1. Router specification

The difference between the MTU of the link and the MSS is described in IP-MTU-vs-MSS; as a summary, see the formula below:

MSS = MTU – IP Header (+ Options if present) – TCP Header (+ Options if present)
IP Header = 20 / 60 bytes without / with Options
TCP Header = 20 / 60 bytes without / with Options
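
As a worked example of the formula above, the following sketch computes the MSS for a given MTU. The function name tcp_mss and its parameters are assumptions made for illustration; the option lengths are limited to 40 bytes because each header is at most 60 bytes.

```python
IP_HEADER_MIN = 20   # bytes, IP header without options
TCP_HEADER_MIN = 20  # bytes, TCP header without options
MAX_OPTIONS = 40     # bytes, so that each header is at most 60 bytes


def tcp_mss(mtu, ip_options=0, tcp_options=0):
    """Apply MSS = MTU - IP header (+ options) - TCP header (+ options)."""
    for length in (ip_options, tcp_options):
        if not 0 <= length <= MAX_OPTIONS:
            raise ValueError("option length must be between 0 and 40 bytes")
    return mtu - (IP_HEADER_MIN + ip_options) - (TCP_HEADER_MIN + tcp_options)


print(tcp_mss(1500))                  # Ethernet MTU, no options
print(tcp_mss(1500, tcp_options=12))  # for example, with 12 bytes of TCP options
```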

Normally, when Path MTU Discovery is not implemented or not used and a router needs to send an IP datagram that is bigger than the IP MTU configured on the outgoing interface (this is the Maximum Transmission Unit, not a receive limit), the router fragments the IP datagram, as described in IP-Fragmentation. The same applies to MPLS packets, as described in MPLS-Fragmentation.

If Path MTU Discovery is enabled, meaning the DF bit is set on the IP datagram, and the MTU of the next hop is smaller than the IP datagram, then the router is required to return a Destination Unreachable – Fragmentation needed and DF set ICMP message (Type 3, Code 4) to the source. The router MUST include the MTU of the next hop in the low-order 16 bits of the ICMP header field that was previously unused; the high-order 16 bits remain unused and must be set to 0 (zero). The modified ICMP header therefore consists of Type (3), Code (4), Checksum, 16 unused bits set to zero and the 16-bit Next-Hop MTU, followed by the IP header and the first 64 bits of the original datagram.
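
To make the field layout concrete, here is a minimal sketch of how a receiver could extract the Next-Hop MTU from the first 8 bytes of such an ICMP message. The function name parse_frag_needed is an assumption made for this example, and the checksum is deliberately not verified.

```python
import struct


def parse_frag_needed(icmp_bytes):
    """Return the Next-Hop MTU of an ICMP Type 3 / Code 4 message.

    Returns None for any other ICMP message. A result of 0 means the
    router did not fill in the field (old-style message).
    The checksum is not verified in this sketch.
    """
    if len(icmp_bytes) < 8:
        raise ValueError("an ICMP header is at least 8 bytes long")
    # Type (8 bits), Code (8 bits), Checksum (16), unused (16), Next-Hop MTU (16),
    # all in network byte order.
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type != 3 or code != 4:
        return None
    return next_hop_mtu


# Hand-built sample header: Type 3, Code 4, zero checksum, Next-Hop MTU 1400.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))
```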


The value carried in the Next-Hop MTU field has the following minimum and maximum sizes, in bytes:
 Minimum: as specified in RFC 791, every internet module must be able to forward a datagram of 68 octets without further fragmentation, while every destination must be able to receive a datagram of 576 octets. Thus the minimum value of this field should be 68 bytes.
 Maximum: the largest datagram size that can be forwarded by this router without fragmentation; the size includes the IP header and data, but does not include any lower-level headers, such as the Ethernet header.

2. Host specification

When a host receives a Datagram too big message, it must reduce its estimate of the path MTU based on the Next-Hop MTU value from the ICMP message. The host must force the PMTU Discovery process to converge using one of the following:
– Reducing the PMTU, that is, the size of the datagrams it sends along the path.
– Clearing the DF bit in the IP datagram.

Because the PMTU might change over time, the host uses timers to decide when to send larger datagrams in order to check whether the path MTU has increased, even though in theory the MTU will not change frequently:
– If the path MTU has decreased, the host must react as fast as possible, without waiting for any timer to expire.
– The host may probe for an increased path MTU by sending a datagram larger than the current estimate, but no sooner than 5 minutes after a Datagram too big message has been received and no sooner than 1 minute after a previous successful increase. The RFC recommends setting these timers to twice these minimum values, meaning 10 and 2 minutes respectively.
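
The timer rules above can be sketched as a simple check. This is an illustration under stated assumptions: the function name may_probe_larger and the constant names are hypothetical, times are plain seconds, and the hold-off values are the RFC-recommended 10 and 2 minutes.

```python
TOO_BIG_HOLDOFF = 10 * 60   # seconds to wait after a Datagram too big message
INCREASE_HOLDOFF = 2 * 60   # seconds to wait after a successful increase


def may_probe_larger(now, last_too_big=None, last_increase=None):
    """Return True if the host may try a datagram larger than its PMTU estimate.

    now, last_too_big and last_increase are timestamps in seconds;
    None means the event has never happened for this path.
    """
    if last_too_big is not None and now - last_too_big < TOO_BIG_HOLDOFF:
        return False
    if last_increase is not None and now - last_increase < INCREASE_HOLDOFF:
        return False
    return True


print(may_probe_larger(now=100, last_too_big=0))   # too soon after the ICMP error
print(may_probe_larger(now=600, last_too_big=0))   # 10 minutes have passed
```

Note that only increases are rate-limited; a decrease signalled by a Datagram too big message is applied immediately and does not go through this check.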

3. Other possible host implementation mechanisms

The methods a host uses to deal with an old-style ICMP Datagram too big message, in which the Next-Hop MTU field is left unmodified (zero), are not standardized by this RFC; only some guidelines are described. The host should again have a mechanism that makes PMTU Discovery converge, using one of the following:
– Reducing the MTU to 576 bytes. This may use the links inefficiently and may sometimes still lead to fragmentation, but it is the fastest method.
– Clearing the DF bit in the IP datagrams. This might take a while to take effect on all link segments of the path, and some Datagram too big messages could still arrive for a while.
– Implementing an algorithm that searches for the proper MTU. Any search strategy must keep track of the MTU values already tested.

These search algorithms continue to send IP datagrams with the DF bit set, but with different sizes. Two approaches are not recommended because of their slow convergence: multiplying the estimated MTU by a constant (for example 0.75), or doing a binary search (which also requires a complex host PMTU Discovery implementation). An algorithm that may be faster assumes that only a few standard MTU values are actually in use and searches among them, in a table of MTU plateaus that also includes values close to powers of 2. If the accurate MTU value is not present in the plateau table, the algorithm will not underestimate it by more than a factor of 2.
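
A plateau search can be sketched as follows. The function name next_lower_plateau is an assumption made for this example, and the plateau values are the ones listed in the plateau table of RFC 1191; check them against the RFC before relying on them.

```python
# Plateau values as listed in RFC 1191 (largest first); verify against the RFC.
PLATEAUS = (65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68)


def next_lower_plateau(failed_size):
    """Return the greatest plateau strictly smaller than the size that failed.

    failed_size is the total length of the datagram that triggered the
    Datagram too big message. The result never goes below 68 bytes.
    """
    for plateau in PLATEAUS:
        if plateau < failed_size:
            return plateau
    return PLATEAUS[-1]


# Walk down the table from an Ethernet-sized datagram.
size = 1500
for _ in range(3):
    size = next_lower_plateau(size)
    print(size)
```

Each Datagram too big message without a Next-Hop MTU moves the estimate one step down the table, so the number of round trips is bounded by the table length.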


4. Disabling path MTU discovery on specific operating systems

Even though it is generally not a good idea to disable path MTU discovery on hosts or routers, you can find below some of the methods to disable it, grouped by operating system.

4.1 Solaris 10

It is enabled by default and, in older versions, uses more aggressive timers than the RFC recommends: Solaris tries to rediscover the path MTU every 30 seconds. Since Solaris 2.5 the timer is set to 60 seconds.
ndd -set /dev/ip ip_path_mtu_discovery 0 – Disable the TCP path MTU Discovery
ip_ire_pathmtu_interval – Configure the path MTU refresh interval

4.2 HP-UX

By default, Path MTU Discovery is enabled for TCP sockets and disabled for UDP sockets. The ndd command can control three MTU related variables:
ip_ire_pathmtu_interval - Controls the probe interval for PMTU
ip_pmtu_strategy - Controls the Path MTU Discovery strategy
tcp_ignore_path_mtu - Disable setting MSS from ICMP 'Frag Needed'
nettune -s tcp_pmtu 0 – Disable the TCP path MTU Discovery
ndd -h ip_pmtu_strategy 0 – Disable the TCP path MTU Discovery
nettune -s udp_pmtu 0 – Disable the UDP path MTU Discovery

4.3 IBM AIX Unix

By default, the tcp_pmtu_discover and udp_pmtu_discover options are disabled on AIX® 4.2.1 through AIX 4.3.1, and enabled on AIX 4.3.2 and later.
no -o tcp_pmtu_discover=0 – Disable the TCP path MTU Discovery
no -o udp_pmtu_discover=0 – Disable the UDP path MTU Discovery
no -o pmtu_default_age=5 – Configure the ageing time for the path MTU value to 5 minutes (default 10 minutes)
no -o pmtu_rediscover_interval=5 – Configure the rediscover interval for the path MTU value to 5 minutes (default 10 minutes)


4.4 Linux

By default, Path MTU Discovery is enabled for both TCP and UDP. Linux can be configured to handle Path MTU Discovery in the following ways:
– IP_PMTUDISC_DONT – Don’t send IP packets with DF set, and therefore do not use Path MTU Discovery.
– IP_PMTUDISC_DO – Set the DF flag in the header of the packets generated on the local node (not forwarded ones), in an attempt to find the best PMTU for every transmission.
– IP_PMTUDISC_WANT – Decide whether to use Path MTU Discovery on a per-route basis; this is the default.
– IP_PMTUDISC_PROBE – Set the DF flag, but ignore the Path MTU.
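
These modes are selected per socket through the IP_MTU_DISCOVER socket option. The sketch below shows one way an application might do this on Linux; the helper name set_pmtu_mode is an assumption made for this example, and the numeric constants are copied from the Linux header linux/in.h because Python's socket module may not export them.

```python
import socket
import sys

# Values from <linux/in.h>; defined here in case the socket module lacks them.
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DONT = 0
IP_PMTUDISC_WANT = 1
IP_PMTUDISC_DO = 2
IP_PMTUDISC_PROBE = 3


def set_pmtu_mode(sock, mode):
    """Set the Path MTU Discovery mode on a socket and return the mode now in effect."""
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, mode)
    return sock.getsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER)


# The option is Linux-specific, so the demo only runs there.
if sys.platform.startswith("linux"):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp_sock:
        print(set_pmtu_mode(udp_sock, IP_PMTUDISC_DO))
```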

You can disable it using the following command:
echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc
/proc/sys/net/ipv4/route/min_pmtu – Configure the minimum MTU (default 552)
/proc/sys/net/ipv4/route/mtu_expires – Configure the ageing time for the path MTU value (default 600 seconds)

4.5 Windows 95/98/ME
By default, Path MTU Discovery is enabled; you can disable it using the following registry entry:
PMTUDiscovery = 0
Data Type: DWORD

4.6 Windows 2000/XP
By default, Path MTU Discovery is enabled; you can disable it using the following registry entry:
PMTU Discovery:  0
Data Type:  DWORD

4.7 Cisco router

You can disable TCP path MTU discovery using the command below; it is disabled by default, and the default age timer is 10 minutes.
no ip tcp path-mtu-discovery [age-timer {minutes | infinite}]

You can enable it using the command below:
ip tcp path-mtu-discovery [age-timer {minutes | infinite}]

BGP Path MTU Discovery is enabled by default on Cisco routers for all BGP sessions; it can be enabled or disabled globally using the following commands:
bgp transport path-mtu-discovery
no bgp transport path-mtu-discovery

It can be disabled per neighbor using the following command (omit the leading no to enable it):
no neighbor {ip-address | peer-group-name} transport {connection-mode | path-mtu-discovery}

4.8 Juniper router

Path MTU discovery for outgoing TCP connections is enabled by default; it can be disabled at the following hierarchy level:
[edit system internet-options]

In Junos OS, TCP path MTU discovery is disabled by default for all BGP neighbor sessions. It can be enabled per neighbor, per group or routing-instance using the following command in the specific configuration view:

4.9 Huawei router

No public documentation available.

If you need to change the default MSS size, please check IP-MTU-vs-MSS, where you can find more information.

5. Dealing with broken path MTU

Among the most common causes of path MTU discovery failures, including black holes, are the following:
– The router does not implement path MTU discovery and either does not include the next-hop MTU in the ICMP error or does not send the ICMP error at all (while still dropping the oversized packet).
– The host (the IP source) has an implementation or configuration problem and ignores the ICMP error messages.
– A router or a firewall on the way back from the router to the source discards the ICMP error messages before they can reach the IP source.

All of them can be solved by disabling path MTU discovery, as explained in the previous chapter; even though this is not the best-practice solution, it does avoid black-holed traffic.

The last one is the most common problem and most of the time is caused by a configuration error. It can be solved as shown below. The examples apply only to Cisco devices; on Juniper, the DF bit can be cleared only on some tunnel interfaces, while the other solutions can be implemented with the platform-specific commands:
– Modify the packet-filtering ACL to accept the most important ICMP messages instead of denying all ICMP:
access-list 101 permit icmp any any unreachable
access-list 101 permit icmp any any time-exceeded
access-list 101 deny icmp any any
access-list 101 permit ip any any

– Clear the DF bit on the router and allow fragmentation anyway:
interface gi x/x
 ip policy route-map clear-df-bit
route-map clear-df-bit permit 10
 match ip address 111
 set ip df 0
access-list 111 permit tcp any any

– Manipulate the value of the TCP MSS option:
interface gi x/x
 ip tcp adjust-mss 1460

Path MTU Discovery is a good mechanism to have enabled in the network, as long as there are no overzealous network, server or firewall administrators who disable ICMP on the interfaces; in particular, the ICMP Destination Unreachable – Fragmentation needed and DF set message must remain permitted.

      by Mihaela Paraschivu
