Recently, we described how to configure your server for high load and DDoS protection. Today, we will talk about trouble with the TIME_WAIT queue. Those who develop services that work intensively with the network can run into a peculiarity of the TCP protocol: many (or all) free ports move into the TIME_WAIT state. The Internet offers a lot of superficial information on this topic, and a lot that is not quite correct. We will look at when these situations arise and what the possible ways out of them are.
TCP protocol: connection close
The following is a typical TCP connection life cycle diagram:
TCP lifetime scheme
We will not consider it as a whole but will focus on the part most important for us: closing the connection. The party that initiates the closure is called “active”, and the other party is called “passive”. It does not matter which of them originally opened the connection.
On the “passive” side, everything is simple. Having received the FIN packet, the system must respond with the appropriate ACK packet but has the right to continue sending data. From the moment the FIN packet is received, the connection on the passive side is in the CLOSE_WAIT state. When the passive side is ready, it sends its own FIN packet and waits for the ACK to it. Upon receipt of that ACK, the connection is closed for the passive side.
From the point of view of the “active” side, everything is somewhat more complicated. After sending the FIN packet, the active side enters FIN_WAIT_1. Further, three situations are possible:
- The ACK to the FIN packet is received. The connection moves to FIN_WAIT_2; the active side can still receive data, after which it expects the FIN packet from the other side, responds to it with an ACK, and puts the connection into the TIME_WAIT state.
- If the passive side is ready to close the session, then the response FIN can be received with a simultaneous ACK to the original FIN packet. In this case, the active side responds with an ACK and transfers the connection to TIME_WAIT, bypassing FIN_WAIT_2.
- Both parties may initiate the closure simultaneously. In this case, both sides are “active”, and on both sides the connection goes into the TIME_WAIT state.
As can be seen from the diagram and the description, the active side sends the last packet in the session (the ACK to the FIN of the passive side). Since it cannot find out whether this packet was received, the connection enters the TIME_WAIT state. The connection must stay in this state for 2 * MSL (maximum segment lifetime): the delivery time of the packet to the passive side plus the delivery time of a possible response packet back. In practice, the TIME_WAIT timer is currently set to 1 to 2 minutes. After this timer expires, the connection is considered closed.
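The close sequence described above can be sketched as a small transition table. This is only an illustration: the event names are made up for the example, and the intermediate states CLOSING and LAST_ACK are taken from the standard TCP state diagram rather than from the text.

```python
# Close-related TCP states. Event names are hypothetical labels.
TRANSITIONS = {
    ("ESTABLISHED", "close"): "FIN_WAIT_1",       # we send FIN first: active side
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",    # peer sent FIN first: passive side
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",     # our FIN was acknowledged
    ("FIN_WAIT_1", "recv_fin_ack"): "TIME_WAIT",  # FIN and ACK arrived together
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",        # simultaneous close
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",      # peer finished, we send the last ACK
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",      # 2 * MSL timer expired
    ("CLOSE_WAIT", "close"): "LAST_ACK",          # passive side sends its FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
}


def run(events, state="ESTABLISHED"):
    """Apply a sequence of events and return the resulting state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state
```

All three variants of an active close pass through TIME_WAIT, while the passive side reaches CLOSED directly. For example, `run(["close", "recv_ack", "recv_fin"])` follows the first scenario from the list.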
TIME_WAIT troubles for outbound connections
A connection in the operating system is identified by four parameters: local IP, local port, remote IP, and remote port. Suppose we have a client that keeps connecting to and disconnecting from a remote service. Since the local IP, the remote IP, and the remote port remain unchanged, a new local port is allocated for each new connection. If the client was the active side when the TCP session was closed, that port stays blocked for a while in the TIME_WAIT state. If connections are established faster than ports leave this quarantine, the next connection attempt will fail with the EADDRNOTAVAIL error (errno = 99).
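To get a feeling for when this limit is reached, the arithmetic can be written down as a short sketch. The numbers used here are assumptions, not values from this article: 32768 to 60999 is a common default for the local port range on recent Linux kernels, and 60 seconds is the usual Linux TIME_WAIT duration. Check your own system before relying on them.

```python
def max_sustained_rate(port_low, port_high, time_wait_seconds):
    """Upper bound on new connections per second to ONE remote IP:port,
    assuming the client closes every connection itself, so each local
    port is quarantined in TIME_WAIT for time_wait_seconds."""
    ports = port_high - port_low + 1
    return ports / time_wait_seconds


# Assumed values: common Linux default port range and a 60 s TIME_WAIT.
rate = max_sustained_rate(32768, 60999, 60)
```

With these assumed values the bound is roughly 470 connections per second to a single remote service; beyond that rate the client eventually runs out of free local ports.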
Even if applications access different services and the error does not occur, the TIME_WAIT queue will grow and take up system resources. Connections in the TIME_WAIT state can be seen with netstat; for summary information, the ss utility (with the -s option) is convenient.
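If you want the same numbers inside a monitoring script, one possible approach on Linux is to parse /proc/net/tcp, where the fourth column holds the connection state as a hexadecimal code. The sketch below is a minimal illustration; the function name is hypothetical, and it covers IPv4 only (IPv6 sockets are listed in /proc/net/tcp6).

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp (hexadecimal).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state in the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1  # unknown codes are kept as-is
    return counts
```

A typical call would be `count_states(open("/proc/net/tcp").read())["TIME_WAIT"]`.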
What can be done:
- The TIME_WAIT interval in Linux cannot be changed without recompiling the kernel. On the Internet, you can find references to the net.ipv4.tcp_fin_timeout parameter with the wording “in some systems, it affects TIME_WAIT”. However, it is unclear which systems these are. According to the documentation, the parameter determines the maximum time to wait for the response FIN packet, i.e. it limits the time the connection spends in FIN_WAIT_2, not in TIME_WAIT.
- Open fewer connections. The error is most often observed during network interaction within the cluster. In this case, using Keep-Alive would be a wise decision.
- When designing a service, it might make sense to shift TIME_WAIT to the other side, for which we should refrain from initiating the closure of TCP connections if possible.
- If it is difficult to reduce the number of connections, then it makes sense to start the remote service on several ports and access them in turn.
- The kernel parameter net.ipv4.ip_local_port_range sets the range of ports used for outgoing connections. The wider the range, the more connections are available to a single remote service.
- A harsh and extremely dangerous way: reduce net.ipv4.tcp_max_tw_buckets to a value below the number of ports in ip_local_port_range. This parameter sets the maximum size of the TIME_WAIT queue and is meant as protection against DoS attacks. This “trick” can be used temporarily until a correct solution is developed.
- Enable the net.ipv4.tcp_tw_reuse parameter. It allows connections in the TIME_WAIT state to be reused for new outgoing connections.
- Enable the net.ipv4.tcp_tw_recycle parameter.
- Use SO_LINGER mode (set via setsockopt) with a zero timeout. In this case, the TCP session is not closed by an exchange of FIN packets but is reset: the party wishing to drop the connection sends an RST packet, and upon receipt of this packet the connection is considered terminated. However, according to the protocol, an RST packet should be sent only in case of an error (for example, on receiving data that clearly does not belong to this connection).
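The SO_LINGER option from the last item could look roughly like this in Python. It is a sketch under assumptions: the function name is hypothetical, and the packed layout (two C ints, l_onoff and l_linger) matches struct linger on Linux and other Unix-like systems but not on Windows. Linger enabled with a zero timeout is what makes close() reset the connection instead of going through the FIN exchange.

```python
import socket
import struct


def enable_abortive_close(sock):
    """Make a later sock.close() reset the connection (RST) instead of
    performing the normal FIN handshake, so no TIME_WAIT entry is left."""
    # struct linger { int l_onoff; int l_linger; } on Unix-like systems:
    # l_onoff=1 turns lingering on, l_linger=0 sets a zero timeout.
    linger = struct.pack("ii", 1, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
```

The caller would then use it as `enable_abortive_close(conn); conn.close()`. Any data still unsent at that moment is discarded, which is one more reason to reserve this for error paths.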
TIME_WAIT on servers
The main danger of a growing TIME_WAIT queue on a server is running out of resources.
Nevertheless, there can be unpleasant incidents when working with NAT clients (when a large number of clients reach the server from behind a single IP). If the port quarantine time on the firewall is short, the server is likely to receive a connection request from a port whose previous connection is not yet closed (it is still in TIME_WAIT). In this case, three scenarios are possible:
- The client happens to guess the SEQ number, which is highly unlikely. In this case, the behavior is undefined.
- The client sends a packet whose SEQ number is incorrect from the server’s point of view. The server responds with its last ACK packet, which the client no longer understands. The client usually answers this ACK with an RST and waits a few seconds before a new connection attempt. If the net.ipv4.tcp_rfc1337 parameter is disabled on the server (it is off by default), the new attempt will succeed. However, performance will drop, mainly because of the timeout.
- If, in the situation described in the second scenario, the net.ipv4.tcp_rfc1337 parameter is enabled, the server will ignore the client’s RST packet. Repeated attempts to connect to the server from the same port will fail. For the client, the service will become unavailable.
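The three scenarios can be condensed into a tiny decision function. It only restates the list above; the function name, its arguments, and the returned strings are hypothetical labels for illustration and do not correspond to any kernel interface.

```python
def reconnect_outcome(seq_guessed, rfc1337_enabled):
    """Outcome of a new connection attempt that hits a server-side
    TIME_WAIT entry for the same client IP and port."""
    if seq_guessed:
        # Highly unlikely case: the SEQ number happens to fit.
        return "undefined behavior"
    if rfc1337_enabled:
        # The server ignores the client's RST, TIME_WAIT stays in place.
        return "attempts from this port fail"
    # The client's RST clears the old entry; the retry works after a delay.
    return "succeeds after the client retry delay"
```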
What can be done on the server side:
- Try to shift the initiation of closing the connection to the client. In doing so, reasonable timeouts must be set.
- Be careful with the net.ipv4.tcp_max_tw_buckets parameter. Setting it too large will make the server vulnerable to a DoS attack.
- Use SO_LINGER for obviously incorrect requests. If the client connects and sends “nonsense”, it is probably an attack, and it is better to spend the minimum amount of resources on it.
- Enable net.ipv4.tcp_tw_recycle if you are sure that clients do not go through NAT. It is important to note that net.ipv4.tcp_tw_reuse does not affect the processing of incoming connections.
- In some cases, it makes sense not to “fight” the queue, but to correctly distribute it. In particular, the following recipes can help:
- When using an L7 balancer, all packets come from the same IP, which provokes “hits” on connections in TIME_WAIT; in this case, however, you can safely enable tcp_tw_recycle.
- When using the L3 balancer, the server sees the source IP addresses. IP-HASH balancing, at the same time, will forward all connections for one NAT to one server, which also increases the likelihood of a collision. Round-Robin is more reliable in this regard.
- Avoid using NAT inside the network where possible. If it is necessary, prefer 1-to-1 translation.
- You can increase the number of available connections by hosting the service on several ports. For example, the load for a web server can be balanced not to port 80 alone but across a pool of ports.
- If the problem is caused by NAT inside the network, you can solve it by reconfiguring the translation on the network device: the port “quarantine” time on NAT must be longer than TIME_WAIT. In this case, however, the risk of running out of ports on the NAT device increases (one option is to translate not to a single IP but to a pool of addresses).
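For the recipe of hosting the service on several ports, the client side (or a balancer) needs some way to take the ports in turn. A minimal sketch of such rotation follows; the function name, the address 10.0.0.5, and the port numbers are placeholders chosen for the example.

```python
from itertools import cycle


def endpoint_cycle(host, ports):
    """Return an endless round-robin iterator over (host, port) pairs."""
    return cycle([(host, port) for port in ports])


# Placeholder address and ports, for illustration only.
endpoints = endpoint_cycle("10.0.0.5", [8080, 8081, 8082])
first_four = [next(endpoints) for _ in range(4)]
```

Each call to `next(endpoints)` yields the target for the next connection. Because the remote port is part of the four parameters that identify a connection, spreading connections over several remote ports gives each port its own set of usable local ports.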
Kernel parameters net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle
There are two parameters in the Linux kernel that allow you to violate the requirements of the TCP protocol by releasing connections from TIME_WAIT ahead of schedule. Both options rely on the TCP timestamps extension (marking packets with relative timestamps).
net.ipv4.tcp_tw_reuse allows a connection in TIME_WAIT to be used for a new outgoing connection. For this, the TCP timestamp of the new connection must be larger than the last value seen in the previous session. The server can then distinguish a “late” packet of the previous connection from packets of the current one. Using this parameter is safe in most cases. Problems can arise if there is a connection-tracking firewall along the path that decides not to let through a packet on a connection which, in its view, should still be in TIME_WAIT.
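How a “late” packet is told apart by its timestamp can be illustrated with a short sketch. TCP timestamps are 32-bit counters that wrap around, so the comparison has to be done modulo 2^32. The function names below are hypothetical, and the real kernel logic contains additional conditions that are left out here.

```python
MASK32 = 0xFFFFFFFF


def ts_before(a, b):
    """True if 32-bit timestamp a is older than b, allowing for wraparound
    (equivalent to a signed 32-bit comparison of a - b with zero)."""
    return ((a - b) & MASK32) > 0x7FFFFFFF


def is_late_segment(seg_tsval, ts_recent):
    """A segment counts as 'late' if its timestamp is older than the
    newest timestamp already seen on this connection."""
    return ts_before(seg_tsval, ts_recent)
```

For example, `is_late_segment(100, 200)` is true, while a timestamp of 3 that follows 0xFFFFFFF0 is treated as newer because the counter has wrapped.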
net.ipv4.tcp_tw_recycle reduces the time a connection spends in the TIME_WAIT queue to the RTO (Re-Transmission Time-Out) value, which is calculated from the Round-Trip Time (RTT) and its variance. At the same time, the kernel remembers the last TCP timestamp value seen from each remote host, and packets with a lower value are simply discarded. This makes the service unavailable to clients behind NAT if the clients’ TCP timestamps pass through the translation unchanged, because different machines behind one address have unrelated timestamp clocks (if NAT removes the timestamps or replaces them with its own, there will be no problems). Since the settings of external devices cannot be predicted, enabling this option on servers accessible from the Internet is strongly discouraged. On “internal” servers where there is no NAT (or 1-to-1 translation is used), the option is safe. Note that the parameter was removed from the kernel in Linux 4.12.
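Before changing any of these parameters, it is worth checking their current values. On Linux, a sysctl name maps to a file under /proc/sys with the dots replaced by slashes, so a small helper can read it. The function names here are hypothetical, and the helper returns None when the file cannot be read, which is what you would see for tcp_tw_recycle on kernels that no longer have it.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as net.ipv4.tcp_tw_reuse to its /proc/sys file."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it is unavailable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None  # parameter absent on this kernel, or not readable
```

For instance, `read_sysctl("net.ipv4.tcp_tw_reuse")` returns the current setting as text, and `read_sysctl("net.ipv4.ip_local_port_range")` returns the two bounds of the local port range.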