More often than not, latency, rather than throughput, turns out to be the system bottleneck. Yet a TCP socket enables the so-called Nagle algorithm by default, which delays an egress packet in order to coalesce it with data that may be written shortly afterwards into a single TCP segment. This effectively reduces the number of TCP segments and the bandwidth overhead spent on TCP headers, while potentially imposing latency on every network request (or response) being sent.
Lock, and his temperamental brother, Block, are two notorious villains in the world of programming. In the beginning, they always show up to help. But sooner or later, they will kick your back end really hard.
When I think about the Nagle algorithm, it looks to me like yet another scenario involving blocking operations that are meant to be helpful. So I decided to put my hands on a keyboard and test whether I was wrong.
Client OS: Debian 4.9.88
Server OS (LAN & WAN): Ubuntu 16.04
Hardware (or VM) setup
Server (LAN): Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, 4GB
Server (WAN): t2.micro, 1GB
First things first, the code of the client:
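The original listing is not reproduced here; below is a minimal sketch of such a client under the description that follows. The helper names, the IPv4-literal host handling, and the `'x'` filler payload are my own assumptions; what is grounded in the text is that it sends fixed-size payloads at a fixed interval over a socket with the Nagle algorithm left enabled.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP connection to an IPv4 address. TCP_NODELAY is never set,
 * so the Nagle algorithm stays at its default (enabled), as in the test. */
static int connect_to(const char *ipv4, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Write `count` payloads of `size` bytes each (size <= 64 here), sleeping
 * `interval_us` microseconds between writes; returns the number of
 * completed writes. */
static int send_tinygrams(int fd, int count, int size, unsigned interval_us)
{
    char buf[64];
    memset(buf, 'x', sizeof(buf));

    int sent = 0;
    for (int i = 0; i < count; i++) {
        if (write(fd, buf, size) != (ssize_t)size)
            break;
        sent++;
        if (interval_us)
            usleep(interval_us);
    }
    return sent;
}
```

A `main()` would parse the server address and the interval from the command line and call these two helpers with count = 1000 and size = 4.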
The client code given above sends 1000 packets of 4 bytes each, at an interval indicated by the last command line argument. As discussed, it keeps the default TCP behavior (Nagle enabled). The server is no different from a discard server, so its code is irrelevant here.
In this test, I record the number of packets that get aggregated at different intervals by adjusting that argument. This way, we can grasp the extent of the latency the Nagle algorithm can impose. The same test is conducted in both a LAN (RTT < 0.6 ms) and a WAN (RTT ≈ 200 ms).
As the figures show, the number of aggregated packets approaches 0 when the interval is greater than the RTT. This conforms to what is described in TCP/IP Illustrated:
This algorithm says that a TCP connection can have only one outstanding small segment that has not yet been acknowledged. No additional small segments can be sent until the acknowledgment is received.
Looking at the tcpdump output, we can also see that this algorithm effectively changes the sending interval to the RTT, regardless of the program's actual write(2) frequency. The packets between two sends are the ones being aggregated.
As a result, the delay the algorithm imposes on a packet is the RTT on average, and 2 * RTT in the worst case.
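As a back-of-envelope model (my own simplification, not from the book): while the interval between writes is below the RTT, each outgoing segment carries roughly RTT / interval of the application's writes, and the aggregation disappears once the interval exceeds the RTT, matching the figures above:

```c
/* Rough model: with the Nagle algorithm enabled and interval_ms < rtt_ms,
 * about rtt_ms / interval_ms consecutive writes are coalesced into one
 * segment; once interval_ms >= rtt_ms, the ACK returns before the next
 * write and each write goes out on its own. */
static double writes_per_segment(double rtt_ms, double interval_ms)
{
    if (interval_ms >= rtt_ms)
        return 1.0;          /* no aggregation */
    return rtt_ms / interval_ms;
}
```

For the WAN case (RTT ≈ 200 ms) and a 10 ms write interval, this predicts about 20 writes coalesced per segment.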
Delayed ACK is another, similar algorithm; here I will just borrow the lines from TCP/IP Illustrated to describe the mechanism:
TCP will delay an ACK up to 200 ms to see if there is data to send with the ACK.
Apparently, the Nagle algorithm is not happy with delayed ACK.
In some cases, when the back end does not reply to a request instantly, delayed ACK has to wait for another request, which in turn is potentially delayed by the Nagle algorithm waiting for the ACK. This scenario, where two parties wait for each other, is, in other words, a deadlock (albeit a temporary one, since the delayed-ACK timer eventually fires).
Remember the two brothers mentioned in the beginning?
Unfortunately, in my environments, delayed ACK seems to be disabled by default, and I failed to enable it by clearing TCP_QUICKACK:

int flags = 0;
setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK, (void *)&flags, sizeof(flags));
So I could not hand-test the compounded impact.
At the time of writing, apart from telnet, most other applications, including those on the front end (Firefox, Chromium), the back end (nginx, memcached), and telnet's substitute, ssh, disable the Nagle algorithm with code like the following,
int flags = 1;
setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, (void *)&flags, sizeof(flags));
which indicates that packets should be emitted as they are.
I think the reasons behind the prevalence of TCP_NODELAY are as follows:
1) increasing bandwidth makes the benefit of the Nagle algorithm more and more negligible: nowadays it takes hundreds of thousands of tinygrams to saturate even an edge node with mediocre bandwidth; and
2) applications that generate a lot of tinygrams tend to demand low latency.
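A quick sanity check on point 1 (the 64-byte frame size is my own rough estimate for a 4-byte payload plus TCP/IP and Ethernet headers):

```c
/* How many tinygrams per second it takes to saturate a link, assuming
 * each tinygram occupies about `frame_bytes` on the wire. */
static double tinygrams_per_second(double link_bits_per_s, double frame_bytes)
{
    return link_bits_per_s / 8.0 / frame_bytes;
}
```

At 100 Mbit/s this gives roughly 195,000 tinygrams per second, i.e. hundreds of thousands, as claimed.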
To conclude: technically speaking, it's probably not a good idea to turn a modern real-time online battle arena into some (200 ms) turn-based '80s RPG.