Hi all,
thank you for the tips. We finally found the cause of the problems (when
issuing a lot of put/get requests to DPM, some of the transfers failed).
It was a combination of factors: asymmetric routing between WNs and DPM
pools (WNs in a private network, DPM pools in both the public and the
private network, with the router's proxy ARP making this work), and
jumbo frames allowed only on the private network (the DPM pools->WNs
path). globus-url-copy sets the DF (don't fragment) bit when
communicating WN->DPM pools, and this traffic went through a Cisco
router where "icmp rate-limit unreachable" was enabled with a value of
500 ms (the default behavior).
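
For reference, the relevant IOS knob looks like the following; I am
writing the syntax from memory, so please verify it on your platform:

  ! default: at most one ICMP unreachable per 500 ms
  router(config)# ip icmp rate-limit unreachable 500
  ! a separate limit exists for the "frag needed and DF set" case
  router(config)# ip icmp rate-limit unreachable df 500
  ! disable the limit entirely (at the cost of more CPU load for ICMP)
  router(config)# no ip icmp rate-limit unreachable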
The result was that with many connections at once, the Cisco
rate-limited the "icmp unreachable - fragmentation needed and DF bit
set" messages, so the senders never learned they had to lower the MTU
and the transfers timed out.
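
In hindsight, this kind of PMTUD black hole can be reproduced with a
full-size probe that has DF set (the hostname below is just a
placeholder):

  # 9000-byte MTU minus 28 bytes of IP+ICMP headers = 8972 payload
  ping -M do -s 8972 dpmpool01.example.org

A single probe should come back with "Frag needed and DF set
(mtu = 1500)"; with many of these running in parallel (mimicking many
concurrent transfers), a rate-limiting router answers only a fraction
of them and the rest just look like packet loss.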
I am still not sure why globus-url-copy sets the DF bit without falling
back to path MTU discovery, but that's a different question.
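
For what it's worth, on Linux the DF bit is a side effect of the
per-socket path MTU discovery mode. A minimal sketch of that kernel
interface (just an illustration, not the actual Globus code):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <netinet/in.h>

  int main(void)
  {
      int s = socket(AF_INET, SOCK_STREAM, 0);
      int val = IP_PMTUDISC_DO;   /* set DF on every outgoing packet */

      if (s < 0) { perror("socket"); return 1; }

      /* With IP_PMTUDISC_DO the kernel lowers its cached path MTU
       * only when an ICMP "fragmentation needed" comes back.  If a
       * router rate-limits those ICMPs, the sender keeps
       * retransmitting oversized segments and stalls. */
      if (setsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER, &val,
                     sizeof(val)) < 0)
          perror("setsockopt");

      /* IP_PMTUDISC_DONT would clear DF and allow fragmentation. */
      return 0;
  }

With IP_PMTUDISC_DO the stack depends entirely on the ICMP feedback
that our router was suppressing.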
We solved it by deploying static routes to all DPM pools on the WNs, so
the WN->pool traffic stays on the private (jumbo-enabled) network.
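
Concretely, each WN got something along these lines (the address and
interface name are hypothetical):

  # pin the pool's private address to the private interface
  ip route add 192.168.10.21/32 dev eth1

plus a matching line in /etc/sysconfig/network-scripts/route-eth1 so
the route survives a reboot.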
Best regards
Jiri Horky
On 09/29/2011 11:23 AM, Jiri Horky wrote:
> Hi all,
>
> we switched part (one VLAN) of our local network to jumbo frames (an
> MTU payload size of 9000) last week. Since then, we have been seeing
> more or less random problems in the communication between worker
> nodes and all DPM pool servers, where the connections time out. For
> example (a snippet from the gridftp.log file):
>
> [8941] Wed Sep 28 03:53:26 2011 :: Server started in inetd mode.
> [8941] Wed Sep 28 03:53:26 2011 :: New connection from:
> saltix06.farm.particle.cz:42124
> [8941] Wed Sep 28 03:56:26 2011 :: saltix06.farm.particle.cz:42124:
> [SERVER]: 421 Idle Timeout: closing control connection.
> [8941] Wed Sep 28 03:56:26 2011 :: Closed connection from
> saltix06.farm.particle.cz:42124
>
> The problems occur only on worker nodes where jumbo frames were
> deployed, regardless of the network card (many different models). We
> are using SL 5.3 - SL 5.5 with the 2.6.18-238.1.1.el5 kernel and
> DPM-DSI-1.7.4-4sec.sl5 on the DPM pool servers, whose hardware ranges
> from IBM x3650 servers to Supermicro boxes.
>
> We think the network switches can be ruled out: we have two different
> 10Gbit switches and see the problems on both, even between WNs and
> DPM pools connected to a single switch. Could it be that we need to
> tweak some kernel network stack parameters?
>
> I would be grateful for any tips.
>
> Regards
> Jiri Horky
> FZU AS CR