setsockopt, TCP_NODELAY and Packet Aggregation I

More often than not, latency, rather than throughput, turns out to be the system bottleneck. Yet a TCP socket enables the so-called Nagle algorithm by default, which delays an egress packet in order to coalesce it with ones that may be written in the near future into a single TCP segment. This effectively reduces the number of TCP segments and the bandwidth overhead spent on TCP headers, while potentially imposing latency on every network request (or response) being sent.
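In outline, the sender's per-write decision goes roughly like the sketch below (paraphrasing RFC 896; the helper names are made up for illustration, this is not kernel code):

/* Nagle's algorithm in outline (RFC 896); the helpers are hypothetical */
if (len < MSS && have_unacked_data(conn))
        buffer_for_later(conn, data, len);   /* coalesce with subsequent small writes */
else
        transmit_now(conn, data, len);       /* full segment, or nothing in flight    */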

Lock, and his temperamental brother, Block, are the two notorious villains of the programming world. In the beginning, they always show up to assist. But sooner or later, they will kick your back-end, really hard.

Thinking about the Nagle algorithm, it struck me as yet another scenario where a blocking behavior is meant to be helpful. So I decided to put my hands on a keyboard and test whether I was wrong.

Software setup
Client OS: Debian 4.9.88
Server OS (LAN & WAN): Ubuntu 16.04
gcc: 6.3.0

Hardware (or VM) setup
Server (LAN): Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, 4GB
Server (WAN): t2.micro, 1GB

The impact of the Nagle algorithm on latency

First things first, the client code:
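The original listing is not reproduced here, so below is a minimal sketch consistent with the description that follows; the argument order (host, port, interval in microseconds) and the payload bytes are my assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <host> <port> <interval-us>\n", argv[0]);
                return 1;
        }

        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        int err = getaddrinfo(argv[1], argv[2], &hints, &res);
        if (err != 0) {
                fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
                return 1;
        }

        int sfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (sfd < 0 || connect(sfd, res->ai_addr, res->ai_addrlen) < 0) {
                perror("socket/connect");
                return 1;
        }
        freeaddrinfo(res);

        /* Nagle is left enabled: TCP_NODELAY is never set. */
        useconds_t interval = (useconds_t)atoi(argv[3]);
        for (int i = 0; i < 1000; i++) {
                if (write(sfd, "ping", 4) != 4) {  /* 4-byte payload */
                        perror("write");
                        break;
                }
                usleep(interval);
        }

        close(sfd);
        return 0;
}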

and server:
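Its listing is likewise omitted; for completeness, a bare-bones discard server could look like this (single connection, port from the command line, all details assumed):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(argc > 1 ? atoi(argv[1]) : 6666);

        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(lfd, 1) < 0) {
                perror("bind/listen");
                return 1;
        }

        int cfd = accept(lfd, NULL, NULL);
        char buf[4096];
        while (read(cfd, buf, sizeof(buf)) > 0)
                ;               /* discard everything */

        close(cfd);
        close(lfd);
        return 0;
}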

The client code given above sends 1000 4-byte packets at an interval indicated by the last command-line argument, and, as discussed, it keeps the default TCP behavior, i.e. Nagle enabled. The server is nothing more than a discard server, so its details are irrelevant here.

In this test, I record the number of packets that get aggregated at different intervals by adjusting the aforementioned argument (a packet counts as aggregated when it shares a TCP segment with another). This way we can grasp the extent of the latency the Nagle algorithm can impose. The same test is conducted over both LAN (RTT < 0.6 ms) and WAN (RTT ≈ 200 ms).

[Figure: 1000 packets through LAN]

[Figure: 1000 packets through WAN]

As shown in the figures, the number of aggregated packets approaches 0 once the interval exceeds the RTT. This conforms to what is described in TCP/IP Illustrated:

This algorithm says that a TCP connection can have only one outstanding small segment that has not yet been acknowledged. No additional small segments can be sent until the acknowledgment is received.

Looking at the tcpdump output, we can also see that the algorithm effectively changes the sending interval to the RTT, regardless of the program's actual write(2) frequency; the packets in between two sends are the ones being aggregated.
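(A capture like the one below can be reproduced with something along the lines of tcpdump -i eth0 'tcp port 6666'; the interface name is a guess.)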

...
18:34:52.986972 IP debian.53700 > ******.compute.amazonaws.com.6666: Flags [P.], seq 4:12, ack 1, win 229, options [nop,nop,TS val 7541746 ecr 2617170332], length 8
18:34:53.178277 IP debian.53700 > ******.amazonaws.com.6666: Flags [P.], seq 12:20, ack 1, win 229, options [nop,nop,TS val 7541794 ecr 2617170379], length 8
18:34:53.369431 IP debian.53700 > ******.amazonaws.com.6666: Flags [P.], seq 20:32, ack 1, win 229, options [nop,nop,TS val 7541842 ecr 2617170427], length 12
18:34:53.560351 IP debian.53700 > ******.amazonaws.com.6666: Flags [P.], seq 32:40, ack 1, win 229, options [nop,nop,TS val 7541890 ecr 2617170475], length 8
18:34:54.325242 IP debian.53700 > ******.amazonaws.com.6666: Flags [P.], seq 68:80, ack 1, win 229, options [nop,nop,TS val 7542081 ecr 2617170666], length 12
...

As a result, the delay the algorithm imposes on each packet is the RTT on average, and 2 × RTT in the worst case. The WAN dump above bears this out: with an RTT of roughly 200 ms, consecutive sends are about 191 ms apart.

Combined with delayed ACK

Delayed ACK is another, similar algorithm; here I will just quote TCP/IP Illustrated to describe the mechanism:

TCP will delay an ACK up to 200 ms to see if there is data to send with the ACK.

Apparently, the Nagle algorithm is not happy with delayed ACK.

In cases where the back-end does not reply instantly to a request, the delayed ACK has to wait for another request, which in turn may be held back by the Nagle algorithm waiting for that very ACK. This scenario, where two parties wait on each other, is in other words a deadlock, one that only the delayed-ACK timer eventually breaks.
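The textbook trigger is a write-write-read exchange. A sketch of the client side, assuming a server that replies only after it has read both parts of the request:

/* write-write-read: the classic Nagle + delayed-ACK stall (sketch;      */
/* the header/body/reply buffers and lengths are assumed to exist)       */
write(sfd, header, hdr_len);   /* small segment, goes out immediately    */
write(sfd, body, body_len);    /* held by Nagle until the first is ACKed */
read(sfd, reply, reply_len);   /* server cannot reply without the body,  */
                               /* and it delays its ACK: both ends wait  */
                               /* until the delayed-ACK timer fires      */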

Remember the two brothers mentioned in the beginning?

Unfortunately, in my environments it seems delayed ACK is disabled by default, and I failed to enable it by clearing TCP_QUICKACK (note that on Linux this flag is not permanent; the kernel may flip it back internally):

int flags = 0;    /* 0 clears quickack mode, allowing ACKs to be delayed */
setsockopt(sfd, SOL_TCP, TCP_QUICKACK, (void *)&flags, sizeof(flags));

So I could not hand-test the compounded impact.

Discussion

At the moment of writing, except for telnet, most other applications, including those of the front-end (Firefox, Chromium), the back-end (nginx, memcached), and telnet's substitute, ssh, disable the Nagle algorithm with code like the following,

int flags = 1;    /* 1 turns TCP_NODELAY on, i.e. disables Nagle */
setsockopt(sfd, SOL_TCP, TCP_NODELAY, (void *)&flags, sizeof(flags));

which tells the kernel that packets should be emitted as they are, without coalescing.

...
18:22:38.983278 IP debian.43808 > 192.168.1.71.6666: Flags [P.], seq 1:5, ack 1, win 229, options [nop,nop,TS val 7358245 ecr 6906652], length 4
18:22:38.984149 IP debian.43808 > 192.168.1.71.6666: Flags [P.], seq 5:9, ack 1, win 229, options [nop,nop,TS val 7358246 ecr 6906652], length 4
18:22:38.985028 IP debian.43808 > 192.168.1.71.6666: Flags [P.], seq 9:13, ack 1, win 229, options [nop,nop,TS val 7358246 ecr 6906653], length 4
18:22:38.985897 IP debian.43808 > 192.168.1.71.6666: Flags [P.], seq 13:17, ack 1, win 229, options [nop,nop,TS val 7358246 ecr 6906653], length 4
18:22:38.986765 IP debian.43808 > 192.168.1.71.6666: Flags [P.], seq 17:21, ack 1, win 229, options [nop,nop,TS val 7358246 ecr 6906653], length 4
...

I think the reasons behind the prevalence of TCP_NODELAY are as follows:
1) ever-increasing bandwidth makes the benefit of the Nagle algorithm more and more negligible: nowadays it takes hundreds of thousands of tinygrams to saturate even an edge node with mediocre bandwidth (see the back-of-envelope figure below); and
2) applications that generate a lot of tinygrams tend to be exactly the ones demanding low latency.
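As for 1), a 4-byte payload still occupies a minimum-size 64-byte Ethernet frame, roughly 84 bytes on the wire once the preamble and inter-frame gap are counted, so saturating even a 100 Mbit/s link takes about 100,000,000 / (84 × 8) ≈ 148,000 tinygrams per second.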

To conclude: technically, it's probably not a good idea to turn a modern real-time online battle arena into some (200 ms) turn-based '80s RPG.

[Figure: '80s RPG]

References

TCP/IP Illustrated, Volume 1: The Protocols, W. Richard Stevens
RFC 896: Congestion Control in IP/TCP Internetworks, J. Nagle
Hacker News

That's it. Did I make a serious mistake, or miss out on anything important? Or did you simply like the read? Link me on -- I'd be chuffed to hear your feedback.