A few days before this writing, certain problems arose at a high-traffic (2.5 TB/day) and high-load (50-75%) site.
What were the symptoms? The original settings ultimately led to TCP retransmits and slowed the affected users' requests by approximately 67 seconds on Mac OS X, though only by a few seconds on Windows, which may be why the problem was so hard to spot in the first place.
However, LTE connections and home networks at other locations seemed to be unaffected by the issue, and an older Linux router, which seems to have compensated for the problem until recently, made it even harder to spot.
Normally it is the other way around, but this time a skilled customer took a deeper look into the issue (thanks, Alex!) and gave us a hint, having identified the probable involvement of the tcp_tw_recycle setting, which can be checked by e.g.
# cat /proc/sys/net/ipv4/tcp_tw_recycle
1
What it does:
“Enable fast recycling of sockets in TIME-WAIT status. The default value is 0 (disabled). It should not be changed without advice/request of technical experts.”
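The check above can be wrapped in a small helper, sketched here under a few assumptions: the function name show_sysctl and its optional directory argument are made up for illustration, and the "not available" branch is there because tcp_tw_recycle no longer exists on Linux 4.12 and later, where the file is simply missing.

```shell
#!/bin/sh
# show_sysctl NAME [DIR]
# Print one net.ipv4 setting, or a note if the kernel does not offer it.
# DIR defaults to /proc/sys/net/ipv4; it is only a parameter so that the
# function can be tried against a scratch directory (hypothetical helper).
show_sysctl() {
    name="$1"
    dir="${2:-/proc/sys/net/ipv4}"
    if [ -r "$dir/$name" ]; then
        printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
    else
        # e.g. tcp_tw_recycle on kernels 4.12 and later
        printf '%s = (not available)\n' "$name"
    fi
}
```

Calling show_sysctl tcp_tw_recycle on the affected machine should then report the value 1 seen above.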
Having mentioned tcp_tw_recycle, two other relevant settings are tcp_tw_reuse and tcp_max_tw_buckets. The latter seems to be set to 262144 on newer systems and to as low as 16384, 65536 or 131072 on older ones. So yes, again there could be a connection to a high-traffic site holding roughly 77k connection states for some while, which correlated with the appearance of the symptoms.
So, just to be safe, I would recommend setting tcp_max_tw_buckets to a high value by
echo 262144 > /proc/sys/net/ipv4/tcp_max_tw_buckets
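A slightly more careful variant of the command above only raises the limit and never lowers a value that is already higher. This is a minimal sketch: the function name raise_tw_buckets and its optional file argument are assumptions for illustration, and writing to the real /proc file requires root.

```shell
#!/bin/sh
# raise_tw_buckets TARGET [FILE]
# Raise tcp_max_tw_buckets to TARGET unless it is already at least that high.
# FILE defaults to the real /proc entry; passing another path is only meant
# for trying the function out (hypothetical helper).
raise_tw_buckets() {
    target="$1"
    file="${2:-/proc/sys/net/ipv4/tcp_max_tw_buckets}"
    current=$(cat "$file") || return 1
    if [ "$current" -lt "$target" ]; then
        echo "$target" > "$file"
    fi
    # Print the value that is in effect afterwards
    cat "$file"
}
```

Used as raise_tw_buckets 262144, it prints the resulting limit. A value written to /proc is lost on reboot; to keep it, the line net.ipv4.tcp_max_tw_buckets = 262144 can go into /etc/sysctl.conf.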
One could think that lots of TIME-WAIT states only arise if you have many clients with (too) idle connections coming from one NATed network, and that under normal circumstances fast recycling may make sense. However, with the connection tracking table's maximum entries set to, let's say, 512k, most systems perhaps never need to recycle their TIME-WAIT states.
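To judge whether a system gets anywhere near its limit, the current number of TIME-WAIT sockets can be counted. The following is a rough sketch, not a monitoring tool: the function name count_time_wait is made up, it reads /proc/net/tcp (IPv4 only; /proc/net/tcp6 has the same layout), and it relies on the state column holding 06 for TIME_WAIT.

```shell
#!/bin/sh
# count_time_wait [FILE]
# Count sockets in TIME-WAIT state. FILE defaults to /proc/net/tcp; the
# argument exists so that the function can be tried on a saved copy
# (hypothetical helper).
count_time_wait() {
    file="${1:-/proc/net/tcp}"
    # Skip the header line; column 4 is the socket state in hex, 06 = TIME_WAIT
    awk 'NR > 1 && $4 == "06" { n++ } END { print n + 0 }' "$file"
}
```

Comparing the output of count_time_wait with the value in /proc/sys/net/ipv4/tcp_max_tw_buckets gives a first impression of how much headroom is left.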
In the end, let's not forget that when we find a problem, some trouble may already have taken place, but finding it is also the basis for fixing it in most if not all cases.