On 07/03/12 15:59, Matt Doidge wrote:
> Thanks for the replies Sam & Leslie,
>> So, although you've really covered it, are the periods when you get the
>> failures correlated with load (or IOwait) on the disk server (even if it's
>> not apparently "high enough to break things"?)
>> I'm wondering if you were exhausting something like the available ports in
>> the GridFTP pool or something.
> I haven't succeeded in figuring out exactly when the failures occur from
> the FTS pages, but looking at the Ganglia monitoring for the node
> there's no recent periods of high load or IOwait. There are small
> increases in load when the transfers are coming in (as you'd expect),
> but I haven't seen the disk server's 1-minute load get past 0.5 in the
> past week, and there's been no appreciable IOwait.
> The gridftp port exhaustion thing was something that I considered;
> annoyingly, the number of connections is not something we currently
> monitor. I'll throw together something that I can keep an eye on, as this
> seems like a problem I need to catch in the act.
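A minimal sketch of the kind of connection count mentioned above, assuming
a Linux disk server where /proc/net/tcp is readable (IPv4 sockets only).
The function name count_states, the 20000-25000 port range and the sample
rows are all made up for illustration; substitute whatever range the
site's globus TCP port range is actually set to.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(lines, port_lo, port_hi):
    """Count sockets per TCP state whose local port is in [port_lo, port_hi]."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a socket entry.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        try:
            local_port = int(fields[1].rsplit(":", 1)[1], 16)
        except ValueError:
            continue
        if port_lo <= local_port <= port_hi:
            counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


# Made-up sample rows in /proc/net/tcp layout (truncated after the state column).
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:5F3F 0200007F:C350 01
   1: 0100007F:5F40 0200007F:C351 08
   2: 0100007F:0016 00000000:0000 0A
"""

# Hypothetical data-port range; on a real server pass open("/proc/net/tcp").
print(dict(count_states(SAMPLE.splitlines(), 20000, 25000)))
```

Run from cron every minute or so and logged with a timestamp, the counts
could then be lined up against the failure times reported by the FTS.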
>> I doubt this is the cause as the pathology is slightly different, but we
>> saw some weird gridftp errors for ATLAS transfers from Europe to the
>> CA-SCINET-T2 site in Toronto when we had asymmetric routes as the LHCONE
>> infrastructure was going in, but not all addresses were being advertized
>> properly through the VRFs. I don't know the source of the T2K data or if
>> there is any LHCONE work going on in the UK yet, but thought I would
>> mention it.
shows that QMUL had
globus_ftp_client: the server responded with an error 500 500-Command
failed. : an I/O operation was cancelled 500-globus_xio: Operation was
canceled 500 End
errors at 06-MAR-12 06.19.33.000000 PM +00:00 to 06-MAR-12
06.22.34.000000 PM +00:00
all for transfers from machines within usatlas.bnl.gov
> There's no LHCONE work in the UK, *but* Lancaster is sitting at the
> end of its own Lightpath to RAL. If the routing down the lightpath is
> causing transfer asymmetries, that could be causing a problem. It
> looks like I'll have to poke t2k and the FTS guys for some answers.
These are all ATLAS transfer errors though, not t2k.
The other thing you should be aware of is that Brian was going to tweak
the FTS settings to reduce the timeouts for traffic between Tier-1 and
Tier-2 sites. This was expected to result in a 1.5% failure rate.
>>> Heya guys,
>>> A good portion (about 30%) of t2k.org FTS transfers to Lancaster have
>>> been failing over the last fortnight with this error message:
>>> globus_ftp_client: the server responded with an error 500 500-Command
>>> failed. : globus_xio: Unable to connect to 220.127.116.11:24383
>>> 500-globus_xio: System error in connect: Connection timed out
>>> 500-globus_xio: A system call failed: Connection timed out 500 End.
>>> (the port number changes, but as t2k only have access to one disk
>>> server the IP address stays the same- which of course could be the
>>> root of the problem).
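One way to check whether the failing ports cluster (for instance, all
falling outside the range opened in iptables) is to tally the endpoints
named in the error strings. This is only a sketch: tally_endpoints is a
hypothetical helper, the regular expression assumes the "Unable to connect
to IP:port" wording quoted above, and the second message uses a made-up
port number.

```python
import re
from collections import Counter

# Matches the "Unable to connect to <ipv4>:<port>" part of the error text.
ENDPOINT = re.compile(r"Unable to connect to (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def tally_endpoints(messages):
    """Return (hosts, ports) Counters for the endpoints named in the errors."""
    hosts, ports = Counter(), Counter()
    for msg in messages:
        match = ENDPOINT.search(msg)
        if match:
            hosts[match.group(1)] += 1
            ports[int(match.group(2))] += 1
    return hosts, ports


# First message is the one quoted in this thread; the second port is invented.
messages = [
    "globus_xio: Unable to connect to 220.127.116.11:24383",
    "globus_xio: Unable to connect to 220.127.116.11:24390",
]
hosts, ports = tally_endpoints(messages)
print(dict(hosts))
print(sorted(ports))
```

Comparing the sorted port list with the configured globus TCP port range
and the iptables rules would show quickly whether the timeouts are tied to
particular ports or spread evenly across the range.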
>>> As can be seen here (for a limited time at least):
>>> I'm scratching my head trying to figure out this problem. The disk
>>> server in question seems to be busy but not heavily loaded. Network
>>> usage seems well within reasonable limits. There is no phenomenon
>>> causing a buildup of nasty CLOSE_WAIT connections blocking ports. The
>>> globus tcp port ranges and iptables appear to be correctly set (and
>>> most configuration problems would cause all t2k.org transfers to
>>> fail). The server-side logs are empty of anything useful. If it was a
>>> LAN network problem I'd expect to see some failures on disks within
>>> the same switch, which I don't (similar with any WAN problems). Only
>>> t2k.org are seeing this problem, but then they're the only "other VO"
>>> using the FTS to transfer large amounts of data into our "other" pool.
>>> And now I've gotten a bit stuck figuring this one out. Has anyone seen
>>> a problem like this before, or have any ideas what may be the cause of
>>> the problem? I thought I'd ask you chaps before I poked the FTS guys.
>>> Thanks in advance,