Ok folks: we have an Answer!
On 30 Jun 2010, at 13:06, Stuart Purdie wrote:
> On 30 Jun 2010, at 12:25, Rob Fay wrote:
>
>> I then restored all settings to defaults apart from SACK and DSACK being disabled, and all transfers since then have been 100%. However, there haven't been that many transfers since then, so I don't think I can really say with certainty that SACK/DSACK are the issue, but the evidence so far would appear to indicate that may be the case, at Liverpool at least.
>
> That's Just Plain Weird! (Assuming that the problem still disappears when there is no NAT box).
>
> There's a known problem with SACK and Linux for LFNs - i.e. if the buffers get over 20 MB, then it takes too long for the kernel to search the buffers, and it misses the timeouts. (See, e.g. http://fasterdata.es.net/TCP-tuning/linux.html ). I didn't think that this would apply because SACK needs support on _both_ sides, and the target nodes will probably have it turned off (as YAIM is fond of doing). Except, of course, I'm assuming that YAIM tunes _all_ disk pool nodes, across all the SE types. That might not be a good assumption - we know that it tunes DPM pool nodes (and thus SACK is off), but if dCache and CASTOR nodes don't get the same treatment by default, that might put SACK back in the picture as the culprit.
The Castor disk server nodes for LHCb (at least) have SACK turned on. [0]
Therefore my conclusions are:
1. Unless SACK is explicitly disabled on the worker node, SACK is negotiated on the connection.
2. SACK packets are not getting past the NAT. [1]
3. This stalls retransmission until SACK recovery gives up and a conventional timeout-driven retransmission takes over.
4. That makes the transfers slow enough to hit the timeouts. [2]
This explains why DPM sites seemed unaffected: YAIM disables SACK on DPM pool nodes, but not on any other SE type. [3]
It also explains why bypassing the NAT fixes the problem (though that's not an option in general; there are too many nodes for the available IPs).
Therefore, the advice from me is to set "net.ipv4.tcp_sack = 0" - i.e. disable SACK - on the worker nodes behind NAT.
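For anyone wanting to apply the change Rob tested at Liverpool, a sketch of the sysctl steps follows. This assumes a standard SL/RHEL-style node where /etc/sysctl.conf is read at boot; the tcp_sack and tcp_dsack names are the stock Linux sysctls.

```shell
# Disable SACK (and DSACK) on the running kernel -- needs root.
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_dsack=0

# Persist across reboots; /etc/sysctl.conf is the usual place on SL/RHEL.
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
EOF

# Verify the running values (0 = disabled).
sysctl net.ipv4.tcp_sack net.ipv4.tcp_dsack
```

Note this only affects connections opened after the change; long-lived transfers in flight keep whatever was negotiated at setup.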
Stuart
[0] Ignacio Reguero, personal communication. The pool nodes are on SL 5.3 defaults, I think.
[1] This is worthy of further investigation. I'll put my CS hat on for that and take it to the other office.
[2] No other cases of failed TCP/IP connections reported, so we can reasonably assume that transfers would complete, eventually.
[3] That's worth poking at - why is YAIM so different for the different SE types?
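As a starting point for the investigation in [1], one could capture traffic on each side of the NAT and compare what survives translation. A sketch below; the interface name eth0 and POOL_NODE are placeholders, not details from this thread.

```shell
# Capture the TCP handshake to a transfer target and check (in the -v
# output) whether the "sackOK" option survives the NAT on both sides.
tcpdump -ni eth0 -v -c 10 'tcp[tcpflags] & tcp-syn != 0 and host POOL_NODE'

# During a stalled transfer, look for ACKs carrying TCP options
# (data offset > 5 words implies options present, e.g. SACK blocks):
tcpdump -ni eth0 -v 'tcp[tcpflags] & tcp-ack != 0 and tcp[12] & 0xf0 > 0x50'
```

If sackOK shows up in the SYN/SYN-ACK on the inside but SACK blocks never appear on ACKs crossing the NAT, that would pin the blame on the NAT box rather than the endpoints.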