4.1 TCP Three-Way Handshake and Four-Way Teardown

Basic Understanding of TCP#

What are the TCP Header Formats?#

The fields marked in color are closely related to this article:

Sequence Number: When a connection is established, the computer generates a random number as the initial value and passes it to the receiving host in the SYN packet. Each time data is sent, the sequence number increases by the number of data bytes sent. It is used to solve the problem of out-of-order network packets.
Acknowledgment Number: Refers to the sequence number of the next "expected" data to be received. The sender can consider that all data before this sequence number has been received normally after receiving this acknowledgment. Used to solve the problem of packet loss.
Control Bits:
- ACK: When this bit is 1, the "Acknowledgment" field becomes valid. TCP specifies that this bit must be set to 1 except for the initial connection establishment SYN packet.
- RST: When this bit is 1, it indicates that an exception has occurred in the TCP connection and the connection must be forcibly closed.
- SYN: When this bit is 1, it indicates a desire to establish a connection and sets the initial value of the sequence number in its "Sequence Number" field.
- FIN: When this bit is 1, it indicates that no more data will be sent in the future and wishes to close the connection. When the communication ends and the connection is to be closed, the hosts on both sides can exchange TCP segments with the FIN bit set to 1.
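
The fields above can be read directly out of the fixed 20-byte part of a TCP header. The following is a minimal sketch, not a full parser: it ignores options and the checksum, and the names `parse_tcp_header` and `FLAG_BITS`, as well as the sample port and sequence values, are hypothetical and chosen only for illustration.

```python
import struct

# Control-bit masks inside the 16-bit "data offset + flags" word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "ACK": 0x10}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are ignored)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack,
     off_flags, window, _checksum, _urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (off_flags >> 12) * 4,  # data offset counts 32-bit words
        "flags": {name for name, bit in FLAG_BITS.items() if off_flags & bit},
        "window": window,
    }


# Hypothetical SYN segment: port 50000 -> 80, initial sequence number 1000.
syn = struct.pack("!HHIIHHHH", 50000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
parsed = parse_tcp_header(syn)
print(parsed["seq"], sorted(parsed["flags"]), parsed["header_len"])
```

Because only the SYN bit is set in the sample segment, the acknowledgment number in it carries no meaning, which matches the rule that the ACK bit is 1 on every packet except the initial SYN.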

Why is the TCP Protocol Needed? At Which Layer Does TCP Operate?#

To ensure the reliability of network data packets (delivery, ordered delivery, data integrity), the TCP protocol at the transport layer is required.

Because TCP is a reliable data transmission service that operates at the transport layer, it ensures that the network packets received by the receiving end are undamaged, contiguous, non-redundant, and in order.

What is TCP?#

TCP is a connection-oriented, reliable, byte-stream transport layer communication protocol.

Connection-oriented: It must be "one-to-one" to connect; it cannot send messages from one host to multiple hosts simultaneously, meaning one-to-many is not possible.
Reliable: Regardless of changes in the network link, TCP ensures that a message reaches the receiving end, as long as the path between the two hosts is not permanently broken.
Byte-stream: When user messages are transmitted via the TCP protocol, the messages may be "packaged" into multiple TCP packets by the operating system. If the receiving program does not know the "boundaries of the message," it cannot read a valid user message. Moreover, TCP packets are "ordered"; if the "previous" TCP packet has not been received, even if the later TCP packets are received first, they cannot be handed over to the application layer for processing, and "duplicate" TCP packets will be automatically discarded.
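
Because a byte stream has no message boundaries, the application has to define them itself. One common approach is a length prefix; the sketch below assumes a 4-byte big-endian length in front of every message. The names `frame` and `Deframer` are hypothetical, and the "stream" is simulated in memory by cutting it into 3-byte chunks rather than using a real socket.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(message)) + message


class Deframer:
    """Rebuild whole messages from arbitrarily sized chunks of a byte stream."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every message that is now complete."""
        self._buffer += chunk
        messages = []
        while len(self._buffer) >= 4:
            (length,) = struct.unpack("!I", self._buffer[:4])
            if len(self._buffer) < 4 + length:
                break  # the rest of this message has not arrived yet
            messages.append(self._buffer[4:4 + length])
            self._buffer = self._buffer[4 + length:]
        return messages


# Two user messages sent back to back, delivered in 3-byte pieces.
stream = frame(b"hello") + frame(b"world!")
deframer = Deframer()
received = []
for i in range(0, len(stream), 3):
    received.extend(deframer.feed(stream[i:i + 3]))
print(received)
```

The receiver recovers the two original messages even though no chunk lines up with a message boundary.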

What is a TCP Connection?#

RFC 793 defines "connection" as follows:

Connections: The reliability and flow control mechanisms described above require that TCPs initialize and maintain certain status information for each data stream. The combination of this information, including sockets, sequence numbers, and window sizes, is called a connection.

In other words, a connection is the combination of state information that both sides maintain to provide reliability and flow control, including the Socket, sequence number, and window size.

Therefore, establishing a TCP connection requires the client and server to reach a consensus on the above three pieces of information:

Socket: Composed of IP address and port number
Sequence Number: Solves the problem of packet disorder
Window Size: Used for flow control

How to Uniquely Identify a TCP Connection?#

The TCP four-tuple can uniquely identify a connection:

Source Address
Source Port
Destination Address
Destination Port

The fields of source address and destination address (32 bits) are in the IP header, which serves to send packets to the other host via the IP protocol.

The fields of source port and destination port (16 bits) are in the TCP header, which serves to inform the TCP protocol which process the packet should be sent to.

Info

Q: If a server with a single IP address is listening on one port, what is its maximum number of TCP connections?
A: The server usually listens on a fixed local port, waiting for client connection requests, so only the client IP and client port vary. The theoretical maximum is therefore: maximum number of TCP connections = number of client IPs × number of client ports.

For IPv4, the maximum number of client IPs is 2 to the power of 32, and the maximum number of client ports is 2 to the power of 16, meaning the maximum TCP connection number for a single server is approximately 2 to the power of 48.

Of course, the maximum concurrent TCP connections on the server cannot reach the theoretical limit and will be affected by the following factors:

File descriptor limits: Each TCP connection is a file, and if the file descriptors are full, a "Too many open files" error will occur. Linux has three types of limits on the number of open file descriptors:
- System level: The maximum number of files that can be opened by the current system, viewable via cat /proc/sys/fs/file-max;
- User level: The maximum number of files that a specified user can open, viewable via cat /etc/security/limits.conf;
- Process level: The maximum number of files that a single process can open, viewable via cat /proc/sys/fs/nr_open;
Memory limits: Each TCP connection occupies a certain amount of memory, and the operating system's memory is limited. If memory resources are exhausted, an OOM (Out of Memory) error will occur.
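
The descriptor limits above can also be read from a program. This is a minimal sketch that assumes a Linux host with /proc mounted; `read_int` and `fd_limits` are hypothetical helper names. It additionally reports `RLIMIT_NOFILE`, the soft and hard descriptor limits currently applied to the running process, which is an addition not mentioned in the text.

```python
import resource
from pathlib import Path


def read_int(path: str):
    """Return the first integer stored in a /proc file, or None if unavailable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def fd_limits() -> dict:
    """Collect the file descriptor limits that cap concurrent TCP connections."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "system_file_max": read_int("/proc/sys/fs/file-max"),
        "process_nr_open": read_int("/proc/sys/fs/nr_open"),
        "this_process_soft": soft,   # limit applied to this process right now
        "this_process_hard": hard,   # ceiling the soft limit may be raised to
    }


limits = fd_limits()
for name, value in limits.items():
    print(f"{name}: {value}")
```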

What are the Differences Between UDP and TCP? What are Their Application Scenarios?#

UDP does not provide complex control mechanisms and uses IP to provide "connectionless" communication services.

The UDP protocol is very simple, with a header of only 8 bytes (64 bits). The header format of UDP is as follows:

Destination and Source Ports: Mainly inform the UDP protocol which process the packet should be sent to.
Packet Length: This field stores the sum of the length of the UDP header and the length of the data.
Checksum: The checksum covers the UDP header and data so that the receiver can detect and discard UDP packets damaged during network transmission.

Differences Between TCP and UDP:

Connection
1. TCP is a connection-oriented transport layer protocol that must establish a connection before transmitting data.
2. UDP does not require a connection and transmits data immediately.
Service Object
1. TCP is a one-to-one two-point service, meaning a connection has only two endpoints.
2. UDP supports one-to-one, one-to-many, and many-to-many interactive communication.
Reliability
1. TCP reliably delivers data, ensuring data is error-free, not lost, not duplicated, and arrives in order.
2. UDP makes a best-effort delivery without guaranteeing reliable data delivery. However, we can implement a reliable transport protocol based on the UDP transport protocol, such as the QUIC protocol. For details, refer to this article: How to Implement Reliable Transmission Based on UDP Protocol?
Congestion Control and Flow Control
1. TCP has congestion control and flow control mechanisms to ensure the safety of data transmission.
2. UDP does not have these mechanisms, and even if the network is congested, it will not affect the sending rate of UDP.
Header Overhead
1. TCP has a longer header length, resulting in some overhead. The header is 20 bytes if the "options" field is not used; it will be longer if the "options" field is used.
2. UDP has a fixed header of only 8 bytes, resulting in less overhead.
Transmission Method
1. TCP is stream-oriented, with no boundaries but guarantees order and reliability.
2. UDP sends packets one by one, with boundaries, but may lose packets and be out of order.
Fragmentation Differences
1. If the size of TCP data exceeds the MSS (Maximum Segment Size), it will be split into segments at the transport layer. The target host will also reassemble the TCP segments at the transport layer after receiving them. If a segment is lost during transmission, only the lost segment needs to be retransmitted.
2. If the size of UDP data exceeds the MTU (Maximum Transmission Unit), it will be fragmented at the IP layer. The target host will reassemble the data at the IP layer before passing it to the transport layer.

Application Scenarios for TCP and UDP:
Since TCP is connection-oriented and can ensure reliable data delivery, it is often used for:

FTP file transfer
HTTP / HTTPS

Since UDP is connectionless and can send data at any time, coupled with its simple and efficient processing, it is often used for:

Communication with a small number of packets, such as DNS, SNMP, etc.;
Multimedia communication such as video and audio;
Broadcast communication;

Info

Q: Why does the UDP header have a "packet length" field, while the TCP header does not?
A: First, let's discuss how TCP calculates the length of the payload data:

The IP total length and the IP header length are both known from the IP header, and the TCP header length is known from the TCP header, so the length of the TCP data can be calculated as: TCP data length = IP total length - IP header length - TCP header length.
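
As a worked example of this calculation, the sketch below assumes the two header lengths are given the way they appear on the wire, as counts of 32-bit words (the IPv4 IHL field and the TCP data offset field). The function name `tcp_payload_length` is hypothetical.

```python
def tcp_payload_length(ip_total_length: int, ip_ihl_words: int,
                       tcp_data_offset_words: int) -> int:
    """TCP data length = IP total length - IP header length - TCP header length."""
    ip_header_len = ip_ihl_words * 4            # IHL counts 32-bit words
    tcp_header_len = tcp_data_offset_words * 4  # data offset counts 32-bit words
    payload = ip_total_length - ip_header_len - tcp_header_len
    if payload < 0:
        raise ValueError("header lengths exceed the IP total length")
    return payload


# A full-size Ethernet packet with 20-byte IP and TCP headers.
full_size = tcp_payload_length(1500, 5, 5)
# A 60-byte packet whose TCP header carries 20 bytes of options and no data.
no_data = tcp_payload_length(60, 5, 10)
print(full_size, no_data)
```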

This raises a question: "UDP is also based on the IP layer, so the length of UDP data can also be calculated using this formula. Why is there still a need for a 'packet length' field?"

There are two plausible explanations:

The first explanation: For the convenience of network device hardware design and processing, the header length needs to be a multiple of 4 bytes. If the "packet length" field is removed from UDP, then the UDP header length would not be a multiple of 4 bytes. Therefore, this may have been added to ensure that the UDP header length is a multiple of 4 bytes.
The second explanation: UDP runs on top of IP today, but it may originally have been designed to run over other network layer protocols that do not provide a packet length or header length of their own. In that case, the UDP header needs its own length field to calculate the data length.

Can TCP and UDP Use the Same Port?#

Yes.

At the data link layer, hosts in the local area network are identified by MAC addresses.
At the network layer, hosts or routers interconnected in the network are identified by IP addresses.
At the transport layer, addressing is done through ports to identify different applications communicating simultaneously on the same computer.

Thus, the role of the transport layer's "port number" is to distinguish the data packets of different applications on the same host.

The transport layer has two transport protocols, TCP and UDP, which are two completely independent software modules in the kernel.

When a host receives a packet, it reads the "protocol number" field in the IP header to determine whether the packet is TCP or UDP and hands it to the corresponding module. The TCP or UDP module then uses the "port number" to decide which application the packet should be delivered to.

Therefore, the port numbers of TCP and UDP are independent of each other. For example, if TCP has a port number 80, UDP can also have a port number 80, and the two do not conflict.
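
This independence is easy to check on a local machine. The sketch below assumes the loopback address 127.0.0.1 is available; the function name `bind_tcp_and_udp_same_port` and the retry count are hypothetical. It lets the OS pick a free TCP port, then binds a UDP socket to the same number.

```python
import socket


def bind_tcp_and_udp_same_port(host: str = "127.0.0.1", attempts: int = 5) -> int:
    """Bind one TCP and one UDP socket to the same port number; return the port."""
    last_error = None
    for _ in range(attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp:
            tcp.bind((host, 0))              # let the OS pick a free TCP port
            port = tcp.getsockname()[1]
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
                try:
                    udp.bind((host, port))   # same number, different protocol
                except OSError as exc:       # another process owns that UDP port
                    last_error = exc
                    continue
                return port                  # both sockets were bound at once
    raise RuntimeError(f"no shared port found: {last_error}")


shared_port = bind_tcp_and_udp_same_port()
print(f"TCP and UDP were both bound to port {shared_port}")
```

The second bind succeeds because the kernel keeps separate port tables for the two protocols; the retry only covers the rare case where an unrelated process already holds the UDP port.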

There are many knowledge points about ports that can be discussed, such as:

Can multiple TCP service processes bind to the same port simultaneously?
Why does the error message "Address already in use" occur when restarting the TCP service process? How can it be avoided?
Can client ports be reused?
Will too many client TCP connections in the TIME_WAIT state exhaust port resources and prevent new connections from being established?

For these questions, you can refer to this article: Can TCP and UDP Use the Same Port?

Establishing a TCP Connection#

What is the Process of the TCP Three-Way Handshake?#

TCP is a connection-oriented protocol, so a connection must be established before using TCP, and the connection is established through a three-way handshake. The process of the three-way handshake is shown in the following diagram:

Initially, both the client and server are in the CLOSED state. The server first listens on a certain port and enters the LISTEN state.
The client randomly initializes a sequence number (client_isn), places this sequence number in the TCP header's [Sequence Number] field, and sets the SYN flag to 1, indicating a SYN packet. The first SYN packet is then sent to the server, indicating a connection request. This packet does not contain application layer data, and the client then enters the SYN-SENT state.
The server receives the client's SYN packet, first randomly initializes its own sequence number (server_isn), fills this sequence number into the TCP header's [Sequence Number] field, and fills the TCP header's [Acknowledgment Number] field with client_isn + 1. It then sets both the SYN and ACK flags to 1. Finally, it sends this packet to the client, which does not contain application layer data, and the server enters the SYN-RCVD state.
The client receives the server's packet and must respond with the final acknowledgment packet. First, this acknowledgment packet sets the TCP header's ACK flag to 1, then fills the [Acknowledgment Number] field with server_isn + 1, and finally sends the packet to the server. This time, the packet can carry data from the client to the server, and the client then enters the ESTABLISHED state.
The server receives the client's acknowledgment packet and also enters the ESTABLISHED state.

From the above process, it can be seen that the third handshake can carry data, while the first two handshakes cannot carry data. Once the three-way handshake is completed, both parties are in the ESTABLISHED state, and the connection is established, allowing the client and server to send data to each other.
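
The sequence and acknowledgment numbers exchanged in the three steps can be sketched as follows. This models only the header fields, not a real network exchange; `Segment` and `three_way_handshake` are hypothetical names, and the ISNs 100 and 300 are arbitrary sample values.

```python
from dataclasses import dataclass

SEQ_SPACE = 2**32  # sequence numbers are 32 bits wide and wrap around


@dataclass
class Segment:
    sender: str
    flags: frozenset
    seq: int
    ack: int  # only meaningful when the ACK flag is set


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three segments of a handshake with their seq/ack values."""
    syn = Segment("client", frozenset({"SYN"}), seq=client_isn, ack=0)
    syn_ack = Segment("server", frozenset({"SYN", "ACK"}),
                      seq=server_isn, ack=(syn.seq + 1) % SEQ_SPACE)
    # The SYN consumed one sequence number, so the client continues at ISN + 1.
    ack = Segment("client", frozenset({"ACK"}),
                  seq=(client_isn + 1) % SEQ_SPACE,
                  ack=(syn_ack.seq + 1) % SEQ_SPACE)
    return [syn, syn_ack, ack]


segments = three_way_handshake(client_isn=100, server_isn=300)
for s in segments:
    print(s.sender, sorted(s.flags), "seq =", s.seq, "ack =", s.ack)
```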

How to View TCP Status in Linux?#

In Linux, you can view the TCP connection status using netstat -napt:

Why a Three-Way Handshake? Why Not Two or Four?#

A common superficial answer is: “Three-way handshake ensures that both parties have the ability to send and receive.” This does not explain the main reason.
The previous introduction discussed what a TCP connection is: certain status information is used to ensure reliability and flow control maintenance. This information includes Socket, sequence number, and window size, which are collectively referred to as a connection.
Therefore, the important question is why three-way handshake can initialize the Socket, sequence number, and window size and establish a TCP connection.
Next, we will analyze the reasons for the three-way handshake from three aspects:

The three-way handshake can prevent the initialization of duplicate historical connections (the main reason).
The three-way handshake can synchronize the initial sequence numbers of both parties.
The three-way handshake can avoid resource waste.

Reason One: Avoiding Historical Connections#

RFC 793 points out that the main reason for the three-way handshake is:

The principle reason for the three-way handshake is to prevent old duplicate connection initiations from causing confusion.

In simple terms, the primary reason for the three-way handshake is to prevent confusion caused by the initialization of old duplicate connections.
Consider a scenario where the client first sends a SYN (seq = 90) packet, then the client crashes, and this SYN packet is blocked in the network, so the server does not receive it. After that, the client restarts and tries to establish a connection with the server again, sending a SYN (seq = 100) packet (Note! This is not a retransmission of SYN; the retransmitted SYN has the same sequence number).
Let's see how the three-way handshake prevents historical connections:

The client continuously sends multiple SYN packets (all with the same four-tuple) to establish a connection. In the case of network congestion:

An "old SYN packet" arrives at the server before the "latest SYN" packet, and at this point, the server will reply with a SYN + ACK packet to the client, with an acknowledgment number of 91 (90+1).
The client receives it and finds that the expected acknowledgment number should be 100 + 1, not 90 + 1, so it will reply with a RST packet.
The server receives the RST packet and releases the connection.
After the latest SYN arrives at the server, the client and server can complete the three-way handshake normally.
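
The client-side check in this scenario can be sketched in a few lines. The function name `client_reply` is hypothetical, and the sketch looks only at the acknowledgment number, leaving out the other validity checks a real TCP stack performs.

```python
def client_reply(current_isn: int, syn_ack_ack_number: int) -> str:
    """Decide how a client waiting in SYN-SENT answers an incoming SYN + ACK."""
    expected = (current_isn + 1) % 2**32
    return "ACK" if syn_ack_ack_number == expected else "RST"


old_isn, new_isn = 90, 100

# The server saw the stale SYN first, so it acknowledges 90 + 1 = 91.
reply_to_stale = client_reply(new_isn, old_isn + 1)
# Later the server answers the latest SYN and acknowledges 100 + 1 = 101.
reply_to_fresh = client_reply(new_isn, new_isn + 1)
print(reply_to_stale, reply_to_fresh)
```

The stale SYN + ACK is answered with RST and the fresh one with ACK, so the server never reaches ESTABLISHED for the old request.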

The "old SYN packet" mentioned above is called a historical connection, and the main reason TCP uses a three-way handshake to establish a connection is to prevent the initialization of "historical connections."

Info

Q: Why can't a two-way handshake prevent historical connections?
A: The main reason is that with two handshakes the server has no intermediate state in which it can reject a historical connection before the connection is established, so a historical connection may be established and waste resources.

In the case of two handshakes, when the server receives the SYN packet, it enters the ESTABLISHED state, meaning it can send data to the other party. However, the client has not yet entered the ESTABLISHED state. If this is a historical connection, the client will determine that this connection is a historical connection and will reply with a RST packet to disconnect. However, the server has already entered the ESTABLISHED state during the first handshake, so it can send data, but it does not know that this is a historical connection. It will only disconnect after receiving the RST packet.

It can be seen that if a two-handshake scenario is used to establish a TCP connection, the server does not prevent the establishment of a historical connection, leading to the server establishing a historical connection and unnecessarily sending data, wasting the server's resources.
Therefore, to solve this phenomenon, it is best to prevent historical connections before the server sends data, that is, before establishing a connection, and to achieve this function, a three-way handshake is required.
Thus, the main reason TCP uses a three-way handshake to establish a connection is to prevent the initialization of "historical connections."

Reason Two: Synchronizing the Initial Sequence Numbers of Both Parties#

Both parties in the TCP protocol must maintain a "sequence number," which is a key factor for reliable transmission. Its functions include:

The receiver can discard duplicate data;
The receiver can receive data packets in order based on the sequence number;
It can identify which packets sent out have been received by the other party (known through the sequence number in the ACK packet);

It can be seen that the sequence number plays a very important role in the TCP connection, so when the client sends a SYN packet carrying the "initial sequence number," the server must reply with an ACK acknowledgment packet to indicate that the client's SYN packet has been successfully received. When the server sends the "initial sequence number" to the client, it must also receive a response from the client. This back-and-forth ensures that both parties' initial sequence numbers can be reliably synchronized.

A four-way handshake can also reliably synchronize both parties' initial sequence numbers, but since the second and third steps can be optimized into one step, it results in a "three-way handshake."

A two-way handshake only ensures that one party's initial sequence number can be successfully received by the other party, but it cannot guarantee that both parties' initial sequence numbers can be confirmed as received.

Reason Three: Avoiding Resource Waste#

If there are only "two handshakes," when the client's SYN packet is blocked in the network and the client does not receive the ACK packet, it will resend the SYN. Due to the lack of a third handshake, the server does not know whether the client has received its ACK packet, so it can only actively establish a connection every time it receives a SYN. What situation might this cause?

If the client's SYN packet is blocked in the network and it resends multiple SYN packets, then when the server receives the requests, it will establish multiple redundant, invalid connections, causing unnecessary resource waste.

In other words, with two handshakes, delayed SYN packets lingering in the network cause the server to repeatedly accept useless connection requests and allocate resources for each of them.

Summary#

When establishing a TCP connection, the three-way handshake can prevent the establishment of historical connections, reduce unnecessary resource overhead for both parties, and help synchronize the initial sequence numbers. The sequence number ensures that data packets are not duplicated, not lost, and are delivered in order.

The reasons for not using "two handshakes" and "four handshakes" are:

"Two handshakes": Cannot prevent the establishment of historical connections, leading to resource waste for both parties, and cannot reliably synchronize the sequence numbers of both parties;
"Four handshakes": The three-way handshake theoretically meets the minimum number of times required to establish a reliable connection, so there is no need to use more communication times.

Why Must the Initial Sequence Number Be Different Each Time a TCP Connection is Established?#

There are two main reasons:

To prevent historical packets from being received by the next connection with the same four-tuple (the main reason);
For security, to prevent forged TCP packets with the same sequence number from being received by the other party;

Next, let's discuss the first point in detail.
Assume that each time a connection is established, both the client and server's initial sequence numbers start from 0:

The process is as follows:

The client and server establish a TCP connection, and the client's data packet is blocked in the network, then times out and retransmits this data packet. At this point, the server's device loses power and the previously established connection with the client disappears. When the server receives the client's data packet, it sends an RST packet.
Next, the client establishes a new connection with the server using the same four-tuple.
After the new connection is established, the previously blocked data packet arrives at the server, and the sequence number of this data packet happens to be within the server's receiving window, so the server will normally receive this data packet. However, this data packet is left over from the previous connection, leading to data confusion.

It can be seen that if the initial sequence numbers are the same each time a connection is established, it is easy to encounter the problem of historical packets being received by the next connection with the same four-tuple.

If the initial sequence numbers of the client and server are not the same each time a connection is established, there is a high probability that the sequence number of the historical packet will "not be" in the receiving window of the other party, thus largely avoiding historical packets, as shown in the following diagram:

Conversely, if the initial sequence numbers of the client and server are the same each time a connection is established, there is a high probability that the sequence number of the historical packet will "just happen to be" within the receiving window of the other party, leading to the historical packet being successfully received by the new connection.

Thus, having different initial sequence numbers each time largely avoids historical packets being received by the next connection with the same four-tuple. Note that this does not completely avoid it (because of the sequence number wrap-around issue, a timestamp mechanism is needed to determine historical packets; see this article: How Does TCP Avoid Historical Packets?).

How is the Initial Sequence Number (ISN) Randomly Generated?#

The ISN is based on a clock that increments by 1 every 4 microseconds; RFC 793 gives the cycle time as approximately 4.55 hours.
RFC 1948 describes an algorithm for generating the initial sequence number: ISN = M + F(localhost, localport, remotehost, remoteport).

M is a timer that increments by 1 every 4 microseconds.
F is a hash algorithm that generates a random value based on the source IP, destination IP, source port, and destination port. The hash algorithm must not be easily predictable by external parties; using the MD5 algorithm is a good choice.

It can be seen that the random number is incremented based on the clock timer, making it virtually impossible to randomly generate the same initial sequence number.
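
The formula above can be sketched as follows. This is an illustration, not the kernel's implementation: the per-boot `SECRET` mixed into the hash, the way the four-tuple is serialized, and the function name `generate_isn` are all assumptions made for this example.

```python
import hashlib
import os
import struct
import time

SECRET = os.urandom(16)  # hypothetical per-boot secret that keeps F unpredictable


def generate_isn(local_ip: str, local_port: int, remote_ip: str, remote_port: int,
                 now_us=None, secret: bytes = SECRET) -> int:
    """ISN = M + F(localhost, localport, remotehost, remoteport), modulo 2**32."""
    if now_us is None:
        now_us = time.monotonic_ns() // 1000   # current time in microseconds
    m = now_us // 4                            # M ticks once every 4 microseconds
    material = f"{local_ip}:{local_port}-{remote_ip}:{remote_port}".encode() + secret
    # F: first 32 bits of an MD5 digest over the four-tuple and the secret.
    (f,) = struct.unpack("!I", hashlib.md5(material).digest()[:4])
    return (m + f) % 2**32


first = generate_isn("10.0.0.1", 50000, "10.0.0.2", 80, now_us=4_000_000, secret=b"k")
later = generate_isn("10.0.0.1", 50000, "10.0.0.2", 80, now_us=4_000_004, secret=b"k")
print(first, later)
```

For a fixed four-tuple and secret, F stays constant and only M advances, so two connection attempts made 4 microseconds apart get ISNs that differ by exactly 1 (modulo 2^32).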

Since the IP Layer Can Fragment, Why Does the TCP Layer Still Need MSS (Maximum Segment Size)?#

First, let's understand MTU (Maximum Transmission Unit) and MSS (Maximum Segment Size):

MTU: The maximum length of a network packet, generally 1500 bytes in Ethernet.
MSS: The maximum length of TCP data that can be accommodated in a network packet, excluding the IP and TCP headers.

Info

Q: What would happen if the entire TCP packet (header + data) were handed over to the IP layer for fragmentation?
A: When the IP layer has data (TCP header + TCP data) that exceeds the MTU size, it must fragment the data into several pieces, ensuring that each fragment is smaller than the MTU. After fragmenting an IP data packet, the target host's IP layer will reassemble the data before passing it to the upper TCP transport layer.

This seems orderly, but there is a hidden danger: if one IP fragment is lost, all fragments of the entire IP packet must be retransmitted.

Because the IP layer itself does not have a timeout retransmission mechanism; it is the TCP layer that is responsible for timeouts and retransmissions.

When an IP fragment is lost, the receiving IP layer cannot assemble a complete TCP packet (header + data), so it cannot deliver the data packet to the TCP layer. Therefore, the receiver will not respond with an ACK to the sender. Since the sender does not receive the ACK confirmation packet for a long time, it will trigger a timeout retransmission, leading to the retransmission of the "entire TCP packet (header + data)."

Thus, relying on the IP layer for fragmentation is highly inefficient.

To achieve optimal transmission efficiency, the TCP protocol typically negotiates the MSS value of both parties when establishing a connection. When the TCP layer detects that the data exceeds the MSS, it will fragment the data first, ensuring that the resulting IP packet's length does not exceed the MTU, thus avoiding IP fragmentation.

After TCP layer fragmentation, if a TCP fragment is lost, the retransmission will also be done based on the MSS, rather than retransmitting all fragments, greatly increasing the efficiency of retransmissions.
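
The relationship between MTU and MSS, and the splitting of data into MSS-sized pieces, can be sketched as follows. The sketch assumes 20-byte IP and TCP headers without options; `mss_for` and `split_into_segments` are hypothetical names.

```python
def mss_for(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS is what remains of the MTU after the IP and TCP headers."""
    return mtu - ip_header - tcp_header


def split_into_segments(data: bytes, mss: int) -> list:
    """Cut application data into pieces of at most `mss` bytes each."""
    return [data[i:i + mss] for i in range(0, len(data), mss)]


mss = mss_for(1500)                           # Ethernet MTU from the text
pieces = split_into_segments(b"x" * 4000, mss)
sizes = [len(p) for p in pieces]
print(mss, sizes)
```

Each piece plus its headers fits within one MTU, so no IP fragmentation is needed, and a lost piece can be retransmitted on its own.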

What Happens if the First Handshake is Lost?#

When the client wants to establish a TCP connection with the server, the first packet sent is the SYN packet, and it enters the SYN_SENT state.

After this, if the client does not receive the server's SYN-ACK packet (the second handshake) for a long time, it will trigger the "timeout retransmission" (RTO: Retransmission TimeOut) mechanism to retransmit the SYN packet, and the retransmitted SYN packet will have the same sequence number.

Different versions of the operating system may have different timeout durations; some are 1 second, others are 3 seconds. This timeout duration is hardcoded in the kernel, and changing it requires recompiling the kernel, which is quite cumbersome.

When the client does not receive the server's SYN-ACK packet after 1 second, it will retransmit the SYN packet. How many times will it retransmit?

In Linux, the maximum retransmission count for the client's SYN packet is controlled by the tcp_syn_retries kernel parameter, which can be customized, with a default value generally set to 5.

cat /proc/sys/net/ipv4/tcp_syn_retries

Typically, the first timeout retransmission occurs after 1 second, the second after 2 seconds, the third after 4 seconds, the fourth after 8 seconds, and the fifth after 16 seconds. Yes, each timeout duration is double that of the previous one.
After the fifth timeout retransmission, it will continue to wait for 32 seconds. If the server still does not respond with a SYN-ACK, the client will stop sending SYN packets and abandon the connection attempt.
Thus, the total time spent is 1+2+4+8+16+32=63 seconds, approximately 1 minute.
For example, if the tcp_syn_retries parameter value is 3, the following process occurs when the client's SYN packet is lost in the network:

Specific process: When the client times out and retransmits the SYN packet 3 times, since tcp_syn_retries is set to 3, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the server's second handshake (SYN-ACK packet), the client will disconnect.
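
The doubling timeouts described above can be computed for any value of tcp_syn_retries. This sketch assumes an initial timeout of 1 second, as in the text; the function name `syn_timeline` is hypothetical.

```python
def syn_timeline(tcp_syn_retries: int, initial_rto: float = 1.0):
    """Return the wait before each retransmission plus the final wait, and their sum."""
    # One wait after the original SYN and one after each retransmission.
    waits = [initial_rto * 2**i for i in range(tcp_syn_retries + 1)]
    return waits, sum(waits)


waits_5, total_5 = syn_timeline(5)
waits_3, total_3 = syn_timeline(3)
print(waits_5, total_5)
print(waits_3, total_3)
```

With tcp_syn_retries set to 5 the total is 63 seconds, matching the calculation above; with a value of 3 the client gives up after 1 + 2 + 4 + 8 = 15 seconds.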

What Happens if the Second Handshake is Lost?#

When the server receives the client's first handshake, it replies with a SYN-ACK packet, which is the second handshake. At this point, the server's connection enters the SYN_RCVD state.

Because the second handshake contains the ACK for the client's first handshake, a client that does not receive it for a long time assumes that its own SYN packet may have been lost, so the client will trigger the timeout retransmission mechanism and retransmit the SYN packet.

In addition, because the second handshake's SYN-ACK packet also contains the server's SYN, the client needs to send an ACK confirmation packet (the third handshake) after receiving it, and only then does the server know that its SYN has been received by the client. If the second handshake is lost, the server will not receive the third handshake, so the server will also trigger the timeout retransmission mechanism and retransmit the SYN-ACK packet.

In Linux, the maximum retransmission count for the SYN-ACK packet is controlled by the tcp_synack_retries kernel parameter, with a default value of 5.

cat /proc/sys/net/ipv4/tcp_synack_retries

Thus, when the second handshake is lost, both the client and server will retransmit:

The client will retransmit the SYN packet, which is the first handshake, with the maximum retransmission count controlled by the tcp_syn_retries kernel parameter.
The server will retransmit the SYN-ACK packet, which is the second handshake, with the maximum retransmission count controlled by the tcp_synack_retries kernel parameter.

For example, if the tcp_syn_retries parameter value is 1 and the tcp_synack_retries parameter value is 2, the following process occurs when the second handshake is continuously lost:

Specific process:

When the client times out and retransmits the SYN packet 1 time, since tcp_syn_retries is set to 1, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the server's second handshake (SYN-ACK packet), the client will disconnect.
When the server times out and retransmits the SYN-ACK packet 2 times, since tcp_synack_retries is set to 2, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the client's third handshake (ACK packet), the server will disconnect.

What Happens if the Third Handshake is Lost?#

When the client receives the server's SYN-ACK packet, it will send an ACK packet back to the server, which is the third handshake. At this point, the client's state enters the ESTABLISHED state.

Since this third handshake's ACK is a confirmation packet for the second handshake's SYN, if the third handshake is lost, the server will not receive the confirmation packet and will trigger the timeout retransmission mechanism to retransmit the SYN-ACK packet until it receives the third handshake or reaches the maximum retransmission count.

For example, if the tcp_synack_retries parameter value is 2, the following process occurs when the third handshake is continuously lost:

Specific process: When the server times out and retransmits the SYN-ACK packet 2 times, since tcp_synack_retries is set to 2, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the client's third handshake (ACK packet), the server will disconnect.

What is a SYN Attack? How to Prevent SYN Attacks?#

We know that establishing a TCP connection requires a three-way handshake. Suppose an attacker rapidly sends SYN packets forged from many different source IP addresses. Each time the server receives a SYN packet, it creates a connection in the SYN_RCVD state and replies with a SYN+ACK. Because the forged addresses never answer with an ACK, these half-open connections accumulate until they fill the server's half-connection queue, and the server can no longer serve normal users.

Let's first look at how the Linux kernel's SYN queue (half-connection queue) and Accept queue (full-connection queue) work:

Normal process:

When the server receives a SYN packet from the client, it creates a half-connection object and adds it to the kernel's "SYN queue";
Then the server sends a SYN + ACK to the client, waiting for the client to respond with an ACK packet;
When the server receives the ACK packet, it removes a half-connection object from the "SYN queue" and creates a new connection object to place in the "Accept queue";
The application retrieves the connection object from the "Accept queue" by calling the accept() socket interface.

Both the half-connection queue and full-connection queue have a maximum length limit; when this limit is exceeded, packets are generally discarded by default.

The most direct effect of a SYN attack is that it fills the TCP half-connection queue. Once that queue is full, subsequent SYN packets are discarded, so clients cannot establish connections with the server.
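
The interaction between the two queues can be sketched with a toy model. This is only an illustration of the steps listed above: the class `Listener`, its method names, and the tiny queue sizes are all hypothetical and do not correspond to kernel data structures.

```python
from collections import deque


class Listener:
    """Toy model of the SYN queue and accept queue (sizes are made up)."""

    def __init__(self, syn_queue_max=3, accept_queue_max=2):
        self.syn_queue = {}            # peer address -> half-open connection
        self.accept_queue = deque()    # fully established connections
        self.syn_queue_max = syn_queue_max
        self.accept_queue_max = accept_queue_max

    def on_syn(self, peer):
        if len(self.syn_queue) >= self.syn_queue_max:
            return False               # queue full: the SYN is dropped
        self.syn_queue[peer] = "SYN_RCVD"
        return True                    # a SYN+ACK would be sent here

    def on_ack(self, peer):
        if peer not in self.syn_queue or len(self.accept_queue) >= self.accept_queue_max:
            return False
        del self.syn_queue[peer]       # handshake done: leave the SYN queue
        self.accept_queue.append(peer)
        return True

    def accept(self):
        return self.accept_queue.popleft() if self.accept_queue else None


server = Listener()
# An attacker sends SYNs from forged addresses and never completes the handshake.
for i in range(3):
    server.on_syn(("10.0.0.%d" % i, 40000))
# The half-connection queue is now full, so a legitimate client is turned away.
legit_ok = server.on_syn(("192.0.2.7", 51000))
print(legit_ok)   # False
```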

How to Prevent SYN Attacks#

There are four methods to avoid SYN attacks:

Increase netdev_max_backlog;
Increase the TCP half-connection queue;
Enable tcp_syncookies;
Reduce the number of SYN+ACK retransmissions.

Method One: Increase `netdev_max_backlog`#

When the network card receives packets faster than the kernel can process them, there will be a queue to hold these packets. The maximum value of this queue is controlled by the following parameter, which defaults to 1000. We should appropriately increase this parameter's value, for example, set it to 10000:

net.core.netdev_max_backlog = 10000

Method Two: Increase the TCP Half-Connection Queue#

To increase the TCP half-connection queue, the following three parameters must be increased simultaneously:

Increase net.ipv4.tcp_max_syn_backlog
Increase the backlog in the listen() function
Increase net.core.somaxconn
For details on why these three parameters determine the size of the TCP half-connection queue, refer to this article: What Happens When the TCP Half-Connection Queue and Full-Connection Queue Are Full? How to Respond?

Method Three: Enable `net.ipv4.tcp_syncookies`#

Enabling the syncookies feature allows connections to be established successfully without using the SYN half-connection queue, effectively bypassing that queue.

Specific process:

When the "SYN queue" is full, subsequent SYN packets received by the server will not be discarded. Instead, it will calculate a cookie value based on an algorithm;
The cookie value is placed in the "sequence number" of the second handshake packet, and then the server sends the second handshake to the client;
When the server receives the client's acknowledgment packet, it checks the validity of this ACK packet. If valid, it places the connection object into the "Accept queue".
Finally, the application retrieves the connection from the "Accept queue" using the accept() interface.

It can be seen that when tcp_syncookies is enabled, even if a SYN attack causes the SYN queue to be full, normal connections can still be successfully established.
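
The idea behind the cookie can be sketched as follows. This is a simplified illustration and not the algorithm the Linux kernel actually uses: the HMAC construction, the `SECRET` key, the `time_slot` counter, and both function names are assumptions made for this example. It only shows how a server can validate the final ACK without having stored anything in the SYN queue.

```python
import hashlib
import hmac

SECRET = b"server-local-secret"   # hypothetical key known only to the server


def make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, time_slot):
    """Derive a 32-bit value from the connection identifiers.

    time_slot is a coarse counter (for example, minutes since boot) so
    that cookies eventually expire.
    """
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}|{client_isn}|{time_slot}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def ack_is_valid(ack_number, src_ip, src_port, dst_ip, dst_port, client_isn, time_slot):
    """The final ACK must acknowledge cookie + 1 (the SYN consumes one number)."""
    cookie = make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, time_slot)
    return ack_number == (cookie + 1) % 2**32


conn = ("198.51.100.4", 51000, "203.0.113.9", 80, 1000)
cookie = make_cookie(*conn, time_slot=7)          # sent as the SYN+ACK sequence number
print(ack_is_valid((cookie + 1) % 2**32, *conn, time_slot=7))   # True
print(ack_is_valid((cookie + 2) % 2**32, *conn, time_slot=7))   # False
```

Because the cookie is recomputed from the packet itself, the server needs no per-connection memory until the handshake completes.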

The net.ipv4.tcp_syncookies parameter has three main values:

0: Disable this feature;
1: Enable it only when the SYN half-connection queue cannot accommodate more connections;
2: Unconditionally enable the feature;

To respond to SYN attacks, it is sufficient to set it to 1.

echo 1 > /proc/sys/net/ipv4/tcp_syncookies

Method Four: Reduce the Number of SYN+ACK Retransmissions#

When the server is under SYN attack, many TCP connections will be in the SYN_RCVD state. Connections in this state retransmit SYN+ACK packets and are closed once the retransmissions exceed the maximum count.
To address SYN attack scenarios, we can reduce the number of SYN+ACK retransmissions so that connections in the SYN_RCVD state are dropped sooner.
The maximum retransmission count for SYN+ACK packets is controlled by the tcp_synack_retries kernel parameter (default value is 5). For example, to reduce it to 2:

echo 2 > /proc/sys/net/ipv4/tcp_synack_retries
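
The kernel parameters used in the four methods above are exposed as files under /proc/sys, as the cat and echo commands in this section show. The following sketch reads their current values before you change anything. The helper names `sysctl_path` and `read_sysctl` are made up for this example, and on a system without /proc/sys the values are simply reported as None.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it is unavailable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


for key in ("net.core.netdev_max_backlog",
            "net.ipv4.tcp_max_syn_backlog",
            "net.core.somaxconn",
            "net.ipv4.tcp_syncookies",
            "net.ipv4.tcp_synack_retries"):
    print(key, "=", read_sysctl(key))
```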

TCP Connection Termination#

What is the Process of the TCP Four-Way Handshake?#

All good things come to an end, and TCP connections are no exception. A TCP connection is closed through a four-way handshake.

Both parties can actively close the connection, and after the disconnection, the "resources" in the host will be released. The process of the four-way handshake is shown in the following diagram:

The process is as follows:

The client intends to close the connection and sends a TCP packet with the FIN flag set to 1, which is the FIN packet. The client then enters the FIN_WAIT_1 state.
The server receives this packet and sends an ACK acknowledgment packet to the client, then enters the CLOSE_WAIT state.
The client receives the server's ACK acknowledgment packet and then enters the FIN_WAIT_2 state.
Once the server has finished processing its data, it also sends a FIN packet to the client and then enters the LAST_ACK state.
The client receives the server's FIN packet and replies with an ACK acknowledgment packet, then enters the TIME_WAIT state.
The server receives the ACK acknowledgment packet and enters the CLOSED state. At this point, the server has completed the connection closure.
The client automatically enters the CLOSED state after a period of 2MSL, completing the connection closure.

It can be seen that each direction requires one FIN and one ACK, hence it is commonly referred to as a four-way handshake.

It is important to note that: Only the party actively closing the connection will have the TIME_WAIT state.
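
The state changes listed above can be written down as a small transition table. This is a sketch of the normal path only (no simultaneous close, no packet loss), and the event labels are informal names chosen for this example.

```python
# (current state, event) -> next state, following the steps listed above
TRANSITIONS = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    ("ESTABLISHED", "recv FIN, send ACK"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def walk(events, state="ESTABLISHED"):
    """Apply events in order and return every state visited."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path


active = walk(["send FIN", "recv ACK", "recv FIN, send ACK", "2MSL timeout"])
passive = walk(["recv FIN, send ACK", "send FIN", "recv ACK"])
print(" -> ".join(active))
print(" -> ".join(passive))
```

Only the path of the active closing party passes through TIME_WAIT, which matches the note above.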

Why Does the Handshake Require Four Steps?#

By reviewing the process of the four-way handshake where both parties send FIN packets, we can understand why four steps are needed.

When closing the connection, the client sends a FIN to the server, indicating that the client will no longer send data but can still receive data.
When the server receives the client's FIN packet, it first replies with an ACK acknowledgment packet. The server may still have data to process and send, and only after the server has finished sending data will it send a FIN packet to the client to indicate agreement to close the connection.

From the above process, it can be seen that the server usually needs to wait until it has finished sending and processing data, so the server's ACK and FIN are generally sent separately, thus requiring four steps.

However, in specific situations, the four-way handshake can be reduced to a three-way handshake. For specific situations, refer to this article: Can the TCP Four-Way Handshake Be Reduced to Three?

What Happens if the First Handshake is Lost?#

When the client (the active closing party) calls the close function, it sends a FIN packet to the server, attempting to disconnect. At this point, the client's connection enters the FIN_WAIT_1 state.

Under normal circumstances, if it can promptly receive the server's ACK, it will quickly transition to the FIN_WAIT_2 state.

If the first handshake is lost and the client does not receive the passive party's ACK in time, it triggers the timeout retransmission mechanism and retransmits the FIN packet. The retransmission count is controlled by the tcp_orphan_retries parameter.

Once the client has retransmitted the FIN packet tcp_orphan_retries times, it stops sending the FIN packet and waits for one more period (double the last timeout). If it still does not receive the second handshake, it enters the CLOSED state directly.

For example, if the tcp_orphan_retries parameter value is 3, the following process occurs when the first handshake is continuously lost:

Specific process: When the client times out and retransmits the FIN packet 3 times, since tcp_orphan_retries is set to 3, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the server's second handshake (ACK packet), the client will disconnect.

What Happens if the Second Handshake is Lost?#

When the server receives the client's first handshake, it will first reply with an ACK confirmation packet, at which point the server's connection enters the CLOSE_WAIT state.

As mentioned earlier, the ACK packet will not be retransmitted, so if the second handshake is lost, the client will trigger the timeout retransmission mechanism and retransmit the FIN packet until it receives the server's second handshake or reaches the maximum retransmission count.

For example, if the tcp_orphan_retries parameter value is 2, the following process occurs when the second handshake is continuously lost:

Specific process: When the client times out and retransmits the FIN packet 2 times, since tcp_orphan_retries is set to 2, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the server's second handshake (ACK packet), the client will disconnect.

Here, when the client receives the second handshake, which is the server's ACK packet, it will enter the FIN_WAIT_2 state. In this state, it needs to wait for the server to send the third handshake, which is the server's FIN packet.

For connections closed by calling the close function, the FIN_WAIT_2 state should not last too long, and the tcp_fin_timeout controls the duration of this state, with a default value of 60 seconds.

This means that for connections closed by calling close, if the FIN packet is not received within 60 seconds, the client's (active closing party) connection will directly close, as shown in the following diagram:

However, if the active closing party uses the shutdown() function to close the connection, specifying only to close the sending direction while not closing the receiving direction, it means that the active closing party can still receive data.

In this case, if the active closing party does not receive the third handshake for a long time, the active closing party's connection will remain in the FIN_WAIT_2 state indefinitely (the tcp_fin_timeout cannot control connections closed by shutdown). As shown in the following diagram:
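
The difference between close and shutdown can be observed at the socket API level. The sketch below uses `socket.socketpair()` as a local stand-in for the two ends of a connection, which is an assumption made to keep the example self-contained: it is not a real TCP connection, so no FIN_WAIT_2 state is involved, but the half-close semantics of `shutdown(SHUT_WR)` are the same.

```python
import socket

# socketpair() gives two already-connected stream sockets; they stand in
# for the two ends of a TCP connection in this local demonstration.
active, passive = socket.socketpair()

active.shutdown(socket.SHUT_WR)        # close only the sending direction
eof = passive.recv(1024)               # the peer sees end-of-stream: b""

passive.sendall(b"late data")          # the peer can still send...
late = active.recv(1024)               # ...and the half-closed side still receives

print(eof, late)

passive.close()
active.close()
```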

What Happens if the Third Handshake is Lost?#

When the server (the passive closing party) receives the client's (the active closing party) FIN packet, the kernel will automatically reply with an ACK, and the connection will enter the CLOSE_WAIT state. As the name suggests, it indicates that it is waiting for the application process to call the close function to close the connection.

At this point, the kernel does not have the authority to close the connection on behalf of the process; it must be actively closed by the process to trigger the server to send the FIN packet.

If the server is in the CLOSE_WAIT state and calls the close function, the kernel will send a FIN packet, and the connection will enter the LAST_ACK state, waiting for the client to return an ACK to confirm the connection closure.

If it does not receive this ACK for a long time, the server will retransmit the FIN packet, with the retransmission count still controlled by the tcp_orphan_retries parameter, which is the same as the retransmission count for the client's FIN packet.

For example, if tcp_orphan_retries = 3, the following process occurs when the third handshake is continuously lost:

Specific process:

When the server retransmits the third handshake packet 3 times, since tcp_orphan_retries is set to 3, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the client's fourth handshake (ACK packet), the server will disconnect.
The client closed the connection by calling the close function, so if it does not receive the server's third handshake (FIN packet) within the tcp_fin_timeout duration, it will also disconnect.

What Happens if the Fourth Handshake is Lost?#

When the client receives the server's third handshake FIN packet, it will send an ACK packet back, which is the fourth handshake. At this point, the client's connection enters the TIME_WAIT state.

In the Linux system, the TIME_WAIT state will last for 2MSL before entering the closed state.

Then, the server (the passive closing party) will remain in the LAST_ACK state until it receives the ACK packet.

If the fourth handshake's ACK packet does not reach the server, the server will retransmit the FIN packet, with the retransmission count still controlled by the tcp_orphan_retries parameter.

For example, if tcp_orphan_retries is set to 2, the following process occurs when the fourth handshake is continuously lost:

Specific process:

When the server retransmits the third handshake packet 2 times, since tcp_orphan_retries is set to 2, it has reached the maximum retransmission count. It will then wait for a period (the time being double that of the last timeout), and if it still does not receive the client's fourth handshake (ACK packet), the server will disconnect.
The client enters the TIME_WAIT state after receiving the third handshake, starting a timer for 2MSL. If it receives the server's retransmitted FIN packet during this time, it will reset the timer. After waiting for 2MSL, the client will disconnect.

Why is the TIME_WAIT Waiting Time 2MSL?#

MSL is Maximum Segment Lifetime, which is the maximum lifetime of a packet in the network. Once this time is exceeded, the packet will be discarded. TCP packets are based on the IP protocol, and the IP header contains a TTL field, which indicates the maximum number of hops a data packet can take through routers. Each time it passes through a router, this value decreases by 1. When this value reaches 0, the data packet will be discarded, and an ICMP packet will be sent to notify the source host.

The difference between MSL and TTL: MSL is measured in time, while TTL is measured in the number of hops. Therefore, MSL should be greater than or equal to the time it takes for TTL to reach 0 to ensure that the packet has naturally disappeared.

The TTL value is generally 64, and Linux sets MSL to 30 seconds, meaning Linux believes that a data packet passing through 64 routers will not exceed 30 seconds. If it does, it is assumed that the packet has disappeared from the network.

The TIME_WAIT state lasts for twice the MSL. A reasonable explanation is that a packet from one side may still be travelling through the network for up to one MSL, and once the other side processes it, the response it sends back may take up to another MSL, so a full round trip requires waiting 2 MSL.

For example, if the passive closing party does not receive the final ACK packet for the disconnection, it will trigger the timeout retransmission of the FIN packet. The other party will send back an ACK upon receiving the FIN, and this round trip takes at most 2 MSL.

It can be seen that the 2MSL duration allows for at least one packet loss. For instance, if the ACK is lost within one MSL, the FIN retransmitted by the passive party will arrive within the second MSL, and the TIME_WAIT state can handle this situation.

Why not wait for 4 or 8 MSL? You can imagine a scenario with a 1% packet loss rate; the probability of two consecutive losses is only 0.01%, which is extremely low. Ignoring it is more cost-effective than trying to solve it.

The 2MSL duration starts counting from the moment the client receives the FIN and sends the ACK. If during the TIME-WAIT period, the client's ACK does not reach the server and the client receives the server's retransmitted FIN packet, the 2MSL timer will reset.
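
The timer restart described above can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name `time_wait_end` is hypothetical, and the input is simply a list of times at which a FIN from the server reaches the client.

```python
def time_wait_end(fin_arrival_times, msl=30):
    """Return when the client leaves TIME_WAIT.

    fin_arrival_times: sorted times (seconds) at which a FIN from the
    server reaches the client; the first entry is the original FIN.
    msl defaults to 30 s, the Linux value quoted in this section.
    """
    deadline = fin_arrival_times[0] + 2 * msl
    for t in fin_arrival_times[1:]:
        if t < deadline:                 # retransmitted FIN during TIME_WAIT
            deadline = t + 2 * msl       # re-send the ACK and restart the timer
    return deadline


print(time_wait_end([0]))        # 60
print(time_wait_end([0, 3]))     # 63
print(time_wait_end([0, 3, 9]))  # 69
```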

In the Linux system, the default duration for 2MSL is 60 seconds, meaning that one MSL is 30 seconds. The time spent in the TIME_WAIT state in the Linux system is fixed at 60 seconds.

This is defined in the Linux kernel code as TCP_TIMEWAIT_LEN:

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds  */

If you want to modify the duration of TIME_WAIT, you can only change the value of TCP_TIMEWAIT_LEN in the Linux kernel code and recompile the Linux kernel.

Why is the TIME_WAIT State Needed?#

Only the party that actively initiates the closure of the connection will have the TIME_WAIT state.

The main reasons for needing the TIME_WAIT state are twofold:

To prevent historical connection data from being incorrectly received by subsequent connections with the same four-tuple.
To ensure that the "passive closing connection" party can be correctly closed.

Reason One: Preventing Historical Connection Data from Being Incorrectly Received by Subsequent Connections#

To better understand this reason, it is necessary to understand the sequence number (SEQ) and the initial sequence number (ISN).

Sequence Number is a TCP header field that identifies the byte stream of data from the TCP sender to the TCP receiver. Since TCP is a reliable protocol based on byte streams, it assigns a number to each byte in each direction to ensure the order and reliability of messages. The sequence number is a 32-bit unsigned number, so it wraps back to 0 after reaching 4G.
Initial Sequence Number (ISN) is the starting sequence number chosen when a TCP connection is established, generated randomly based on the clock. It can be viewed as a 32-bit counter that increments by 1 every 4 microseconds, which RFC 793 describes as cycling approximately every 4.55 hours.
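
The wrap-around times can be checked with simple arithmetic. Note that multiplying 2^32 ticks by 4 microseconds gives about 4.77 hours, slightly more than the approximate 4.55-hour figure quoted from RFC 793. The 1 Gbit/s send rate in the second part is an assumption chosen purely for illustration.

```python
SEQ_SPACE = 2 ** 32                      # 32-bit sequence numbers

# ISN clock: one tick every 4 microseconds
isn_cycle_hours = SEQ_SPACE * 4e-6 / 3600
print(round(isn_cycle_hours, 2))         # 4.77


def seq_wrap_seconds(bytes_per_second):
    """Time for the sequence number itself to wrap at a steady send rate."""
    return SEQ_SPACE / bytes_per_second


# Example rate (an assumption for illustration): 1 Gbit/s = 125,000,000 bytes/s
print(round(seq_wrap_seconds(125_000_000), 1))   # 34.4
```

On a fast link the sequence number can wrap in well under a minute, which is why a sequence number alone cannot distinguish old data from new.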

The Seq in the following diagram represents the sequence number, where the red boxes indicate the initial sequence numbers generated by the client and server.

From previous discussions, we know that the sequence number and initial sequence number do not increase indefinitely and will wrap around, meaning that it is impossible to tell new data from old data based on the sequence number alone.

Assuming that the TIME-WAIT state does not have a waiting time or is too short, what will happen when delayed packets arrive?

As shown in the diagram:

The SEQ = 301 packet sent by the server before closing the connection is delayed in the network.
Next, the server reopens a new connection with the same four-tuple, and the previously delayed SEQ = 301 packet arrives at the client. The sequence number of this packet happens to fall within the client's receive window, so the client accepts it as normal data. However, this packet is a remnant of the previous connection, leading to data confusion and other serious issues.

It can be seen that if the TIME-WAIT state does not wait long enough, the delayed packets can be received by the new connection, leading to confusion.
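
Whether a delayed segment is accepted comes down to a window check on its sequence number. The sketch below shows that check with wrap-around arithmetic. The function name `in_receive_window` and the concrete window values are hypothetical, chosen to match the SEQ = 301 example.

```python
MOD = 2 ** 32


def in_receive_window(seq, rcv_nxt, rcv_wnd):
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd) modulo 2**32."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


# Hypothetical new connection: the receiver expects byte 200 next and
# advertises a 1000-byte window, so a stale SEQ = 301 segment fits.
print(in_receive_window(301, rcv_nxt=200, rcv_wnd=1000))          # True
# Far outside the window: rejected.
print(in_receive_window(301, rcv_nxt=5_000_000, rcv_wnd=1000))    # False
# The check also works across the wrap-around point.
print(in_receive_window(10, rcv_nxt=MOD - 50, rcv_wnd=1000))      # True
```

Nothing in this check identifies which connection the segment came from, which is why the old packets must be given time to expire.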

To prevent historical connection data from being incorrectly received by subsequent connections with the same four-tuple, the TCP protocol is designed with the TIME-WAIT state, which lasts for 2MSL. This duration is sufficient for all data packets from the original connection to naturally disappear from the network, ensuring that any subsequent packets are generated by new connections.

Reason Two: Ensuring the Passive Closing Connection Can Be Correctly Closed#

In RFC 793, it is pointed out that another important function of TIME-WAIT is:

TIME-WAIT - represents waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request.

This means that the function of TIME-WAIT is to wait for enough time to ensure that the last ACK can be received by the passive closing party, thus helping it to close properly.

If the client (the active closing party) loses its last ACK packet (the fourth handshake) in the network, then according to TCP's reliability principle, the server (the passive closing party) will retransmit the FIN packet.

Assuming that the client does not have the TIME-WAIT state and directly enters the CLOSE state after sending the last ACK packet, if this ACK packet is lost, the server will retransmit the FIN packet. At this point, the client has already entered the closed state, and upon receiving the server's retransmitted FIN packet, it will reply with a RST packet.

The server interprets this RST as an error (Connection reset by peer), which is not an elegant termination method for a reliable protocol.

To prevent this situation, the client must wait a sufficient amount of time to ensure that the server can receive the ACK. If the server does not receive the ACK, it will trigger the TCP retransmission mechanism, and the server will resend a FIN packet. This back-and-forth takes at most two MSLs.

Each time the client receives a retransmitted FIN packet from the server while in the TIME-WAIT state, it restarts the 2MSL timer.

What Are the Dangers of Excessive TIME_WAIT States?#

Excessive TIME-WAIT states mainly pose two dangers:

The first is occupying system resources, such as file descriptors, memory resources, CPU resources, thread resources, etc.;
The second is occupying port resources. Port resources are also limited, and the generally available port range is 32768–61000, which can also be specified through the net.ipv4.ip_local_port_range parameter.

The impact of excessive TIME_WAIT states differs for clients and servers.

If the client (the active closing party) has too many TIME_WAIT states and they all target the same "destination IP + destination PORT," they can occupy every available local port, making it impossible to initiate further connections to that server. For details, refer to this article: Can Client Ports Be Reused?

However, even in this scenario, the ports can be reused for connections to other servers. The kernel locates a connection by its four-tuple (source IP, source port, destination IP, destination port), so two connections that share a client port do not conflict as long as the destination differs.
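
The four-tuple rule can be made concrete with a toy connection table. The class `ConnectionTable` and the two-port range are hypothetical. The point is only that a local port is unavailable per destination, not globally.

```python
class ConnectionTable:
    """Toy connection table keyed by the four-tuple."""

    def __init__(self, local_ip, ports):
        self.local_ip = local_ip
        self.ports = list(ports)
        self.in_use = set()      # includes connections sitting in TIME_WAIT

    def connect(self, dst_ip, dst_port):
        for port in self.ports:
            four_tuple = (self.local_ip, port, dst_ip, dst_port)
            if four_tuple not in self.in_use:
                self.in_use.add(four_tuple)
                return port
        return None              # every local port is taken for this destination


# Hypothetical client with only two local ports available.
table = ConnectionTable("192.0.2.10", ports=[40000, 40001])
print(table.connect("203.0.113.1", 80))   # 40000
print(table.connect("203.0.113.1", 80))   # 40001
print(table.connect("203.0.113.1", 80))   # None: ports exhausted for this server
print(table.connect("203.0.113.2", 80))   # 40000: same port, different server
```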

If the server (the active closing party) has too many TIME_WAIT states, port resources are not the limiting factor, because the server only listens on one port and each connection is distinguished by its unique four-tuple. The server can therefore hold a very large number of connections in theory, but each of them occupies system resources, such as file descriptors, memory resources, CPU resources, thread resources, etc.

How to Optimize TIME_WAIT?#

Here are several ways to optimize TIME-WAIT, each with its pros and cons:

Enable net.ipv4.tcp_tw_reuse and net.ipv4.tcp_timestamps options;
Adjust net.ipv4.tcp_max_tw_buckets;
Use SO_LINGER in the program to forcefully close connections with RST.

Method One: Enable `net.ipv4.tcp_tw_reuse` and `net.ipv4.tcp_timestamps` Options#

After enabling the following Linux kernel parameters, you can reuse sockets in the TIME_WAIT state for new connections.

One thing to note is that the tcp_tw_reuse feature can only be used by clients (the connection initiators). When this feature is enabled, the kernel will randomly select a connection in the TIME_WAIT state that has exceeded 1 second for reuse in the new connection.

net.ipv4.tcp_tw_reuse = 1

Using this option also requires enabling support for TCP timestamps, which is:

net.ipv4.tcp_timestamps=1 (default is already 1)

The timestamp field is in the TCP header's "options" and carries two 4-byte values, 8 bytes in total. The first 4-byte field stores the time the packet was sent, and the second 4-byte field echoes the most recent timestamp received from the other party.

Since timestamps are introduced, the previous 2MSL issue no longer exists because duplicate packets will be naturally discarded due to the expiration of the timestamp.
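
The way timestamps let a receiver discard stale packets can be sketched as a comparison against the most recent timestamp seen on the connection. This is a simplified illustration of the protection-against-wrapped-sequences check, not kernel code. The function and parameter names are chosen for this example.

```python
def timestamp_is_fresh(ts_val, ts_recent):
    """Accept a segment only if ts_val is not older than ts_recent.

    Timestamps are 32-bit and wrap, so the comparison is done on the
    signed 32-bit difference.
    """
    diff = (ts_val - ts_recent) & 0xFFFFFFFF
    if diff >= 0x80000000:       # interpret as a negative difference
        diff -= 0x100000000
    return diff >= 0


print(timestamp_is_fresh(ts_val=5000, ts_recent=4000))        # True
print(timestamp_is_fresh(ts_val=3000, ts_recent=4000))        # False: stale segment
print(timestamp_is_fresh(ts_val=5, ts_recent=0xFFFFFFF0))     # True: wrapped
```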

Method Two: Adjust `net.ipv4.tcp_max_tw_buckets`#

This value defaults to 18000. When the number of connections in the TIME_WAIT state exceeds this value, the system immediately destroys any further TIME_WAIT connections instead of letting them wait out the full period. This method is relatively aggressive.

Method Three: Use `SO_LINGER` in the Program#

You can set socket options to control the behavior of the close function.

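#include <sys/socket.h>  /* struct linger, setsockopt, SO_LINGER */

/* s is an already-connected TCP socket descriptor */
struct linger so_linger;
so_linger.l_onoff = 1;   /* turn SO_LINGER on */
so_linger.l_linger = 0;  /* zero linger time: close() sends RST instead of FIN */
setsockopt(s, SOL_SOCKET, SO_LINGER, &so_linger, sizeof(so_linger));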

If l_onoff is non-zero and l_linger is set to 0, calling close will send a RST flag to the other party, and the TCP connection will skip the four-way handshake and directly close.

This provides a possibility to bypass the TIME_WAIT state, but it is a very risky behavior and not recommended.

The methods mentioned above attempt to bypass the TIME_WAIT state, which is not ideal. Although the duration of the TIME_WAIT state is somewhat long and seems unfriendly, it is designed to avoid various issues.

The book "UNIX Network Programming" states: TIME_WAIT is our friend; it is beneficial to us, and we should not try to avoid this state but rather understand it.

If the server wants to avoid excessive TIME_WAIT states, it should never actively close connections, allowing clients to do so, letting the distributed clients bear the TIME_WAIT.

What Are the Reasons for the Server to Have a Large Number of TIME_WAIT States?#

First, it is important to know that the TIME_WAIT state only occurs for the party that actively closes the connection. Therefore, if the server has a large number of TIME_WAIT states, it indicates that the server has actively closed many TCP connections.

The question arises: In what scenarios will the server actively close connections?

The first scenario: HTTP does not use long connections.
The second scenario: HTTP long connections time out.
The third scenario: The number of requests for HTTP long connections reaches the upper limit.

First Scenario: HTTP Does Not Use Long Connections#

Let's first look at how the HTTP long connection (Keep-Alive) mechanism is enabled.

In HTTP/1.0, it is disabled by default. If the browser wants to enable Keep-Alive, it must add the following to the request header:

Connection: Keep-Alive

When the server receives the request and responds, it also adds this to the response header:

Connection: Keep-Alive

This way, the TCP connection will not be interrupted but will remain open. When the client sends another request, it will use the same TCP connection. This continues until either the client or server requests to close the connection.

From HTTP/1.1 onwards, Keep-Alive is enabled by default. Most browsers now use HTTP/1.1 by default, so Keep-Alive is generally enabled.

The HTTP long connection mechanism can be disabled by either side: if the HTTP header of either the client or the server contains Connection:close, the long connection mechanism cannot be used.
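
The rules above can be condensed into one decision function. This is a minimal sketch that assumes headers arrive as a plain dict with the exact key "Connection". The function name is hypothetical, and real HTTP parsers handle header names case-insensitively and allow multiple tokens.

```python
def connection_is_persistent(http_version, request_headers, response_headers):
    """Decide whether the TCP connection stays open after this exchange."""
    req = request_headers.get("Connection", "").lower()
    resp = response_headers.get("Connection", "").lower()
    if "close" in (req, resp):
        return False                       # either side opted out
    if http_version == "1.1":
        return True                        # Keep-Alive is the default
    # HTTP/1.0: both sides must ask for Keep-Alive explicitly
    return req == "keep-alive" and resp == "keep-alive"


print(connection_is_persistent("1.1", {}, {}))                         # True
print(connection_is_persistent("1.1", {"Connection": "close"}, {}))    # False
print(connection_is_persistent("1.0", {}, {}))                         # False
print(connection_is_persistent("1.0", {"Connection": "Keep-Alive"},
                               {"Connection": "Keep-Alive"}))          # True
```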

When the HTTP short connection mechanism is used, each request goes through this process: establish TCP -> request resource -> respond to resource -> release connection. This is known as HTTP short connection, as shown in the diagram:

From previous discussions, we know that as long as either party's HTTP header contains Connection:close, the HTTP long connection mechanism cannot be used. Once a single HTTP request/response exchange completes, the connection will be actively closed.

Q1: The question arises, is it the client or the server that actively closes the connection in this case?
A1: The RFC document does not specify who should close the connection. Both the request and response parties can actively close the TCP connection.

However, according to the implementation of most web services, regardless of which party disables HTTP Keep-Alive, it is usually the server that actively closes the connection. Therefore, at this point, the server will have a large number of TIME_WAIT state connections.

Q2: If the client disables HTTP Keep-Alive and the server enables HTTP Keep-Alive, who is the active closing party?
A2: When the client disables HTTP Keep-Alive, the HTTP request header will contain Connection:close. At this point, the server will actively close the connection after sending the HTTP response.