On 1 Jul 2010, at 18:14, Rob Fay wrote:
> On 01/07/2010 16:43, Stuart Purdie wrote:
>> And that's why I was wanting the NAT configs, so that I can do a comparison. Lancaster are using -j MASQUERADE on POSTROUTING, we're using -j SNAT; and that might be enough. Certainly, the code for the two modules in 2.6.18 (SL5.3 default kernel) is markedly different; although I've not had time to fully digest it yet.
>
> MASQUERADE vs SNAT shouldn't affect the issue we're seeing though - MASQUERADE should really just get the IP address to use for translation dynamically and then hand off the rest of the NAT setup to generic routines. They should essentially be the same.
And SNAT shouldn't be dropping SACK packets either.
Specifically, there's a test in the MASQUERADE code of:
    IP_NF_ASSERT(ct && (ctinfo == IP_CT_NEW || ctinfo == IP_CT_RELATED
                        || ctinfo == IP_CT_RELATED + IP_CT_IS_REPLY));
that's missing in the SNAT code (SNAT is actually in a file called ipt_NETMAP.c in 2.6.18). My suspicion is that conntrack identifies the packets as RELATED, but also as INVALID (due to that bug). MASQUERADE explicitly checks whether the packet is RELATED, whereas SNAT has no such test, so the packet might be eaten as INVALID before it is forwarded. Sadly, my days as a kernel hacker are too dim and distant (and in the wrong part of the kernel) for me to be certain about that interpretation yet.
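To make that test concrete, here is a rough userspace sketch of the condition the assertion checks. It is not the kernel code: the enum values are reproduced from memory of the 2.6.18-era ip_conntrack headers and should be treated as an assumption, and masq_assert_holds() is a hypothetical helper name invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Conntrack states, as I recall them from the 2.6.18-era headers
 * (an assumption; check against the actual kernel source). */
enum ip_conntrack_info {
    IP_CT_ESTABLISHED,  /* part of an established connection */
    IP_CT_RELATED,      /* related to an existing connection */
    IP_CT_NEW,          /* starts a new connection */
    IP_CT_IS_REPLY,     /* offset added for reply-direction packets */
    IP_CT_NUMBER = IP_CT_IS_REPLY * 2 - 1
};

/* Hypothetical helper: returns true when the condition inside the
 * MASQUERADE assertion would hold for this packet's conntrack state. */
bool masq_assert_holds(const void *ct, int ctinfo)
{
    return ct != NULL &&
           (ctinfo == IP_CT_NEW ||
            ctinfo == IP_CT_RELATED ||
            ctinfo == IP_CT_RELATED + IP_CT_IS_REPLY);
}
```

Under these assumptions the condition holds only for NEW and RELATED packets (in either direction for RELATED), and fails for ESTABLISHED packets or for a packet with no conntrack entry at all.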
That's my best theory, at any rate.
If any other site affected by LHCb transfer problems uses MASQUERADE, we can eliminate it from the enquiries. I'd run a direct test, but we've no LHCb production up here.
I guess that at Liverpool, you're using SNAT, Rob?
Anyone else using MASQUERADE?