GRIDPP-STORAGE@JISCMAIL.AC.UK


GRIDPP-STORAGE March 2012
Subject: Re: t2k.org FTS transfer problems to Lancaster
From: "Christopher J.Walker" <[log in to unmask]>
Reply-To: Christopher J.Walker
Date: Wed, 7 Mar 2012 16:49:12 +0000
Content-Type: text/plain

On 07/03/12 15:59, Matt Doidge wrote:
> Thanks for the replies Sam & Leslie,
>> So, although you've really covered it, are the periods when you get the
>> failures correlated with load (or IOwait) on the disk server, even if it's
>> not apparently "high enough to break things"?
>> I'm wondering if you were exhausting something like the available ports in
>> the GridFTP pool.
> 
> I haven't succeeded in figuring out exactly when the failures occur from
> the FTS pages, but looking at the Ganglia monitoring for the node
> there have been no recent periods of high load or IOwait. There are small
> increases in load when the transfers are coming in (as you'd expect),
> but I haven't seen the disk server's 1-minute load get past 0.5 in the
> past week, and there's been no appreciable IOwait.
> The gridftp port exhaustion thing was something that I considered;
> annoyingly, the number of connections is not something we currently
> monitor. I'll throw together something that I can keep an eye on, as this
> seems like a problem I need to catch in the act.
> 
>> I doubt this is the cause, as the pathology is slightly different, but we
>> saw some weird gridftp errors for ATLAS transfers from Europe to the
>> CA-SCINET-T2 site in Toronto when we had asymmetric routes as the LHCONE
>> infrastructure was going in, but not all addresses were being advertised
>> properly through the VRFs. I don't know the source of the T2K data or if
>> there is any LHCONE work going on in the UK yet, but thought I would
>> mention it.
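
Matt notes above that the number of connections in the GridFTP port range is not currently monitored. A minimal sketch of such a check follows, assuming a Linux disk server (it reads /proc/net/tcp and /proc/net/tcp6) and a GLOBUS_TCP_PORT_RANGE-style "min,max" setting. The 20000,25000 fallback and the function names are hypothetical; substitute the site's real range.

```python
import os
from collections import Counter

# TCP state codes as printed in the "st" column of /proc/net/tcp
# (values from the Linux kernel's include/net/tcp_states.h).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def parse_port_range(value):
    """Parse a GLOBUS_TCP_PORT_RANGE-style "min,max" string into two ints."""
    low, high = (int(part) for part in value.split(","))
    return low, high


def count_sockets_in_range(proc_text, low, high):
    """Count sockets whose local port lies in [low, high], keyed by TCP state.

    proc_text is the full contents of /proc/net/tcp or /proc/net/tcp6.
    """
    counts = Counter()
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address is "HEXADDR:HEXPORT"; the port is the part after the last colon.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if low <= local_port <= high:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical fallback range; use the value configured on the server.
    low, high = parse_port_range(os.environ.get("GLOBUS_TCP_PORT_RANGE", "20000,25000"))
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as handle:
                total += count_sockets_in_range(handle.read(), low, high)
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled, or not running on Linux
    for state, n in total.most_common():
        print(f"{state:12s} {n}")
    print(f"{'TOTAL':12s} {sum(total.values())} sockets in a range of {high - low + 1} ports")
```

Run from cron and logged, this would show whether the range fills up, or CLOSE_WAIT sockets pile up, around the times the FTS failures occur.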

http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl?q=transfers&p=day&v=All&c=UKILT2QMUL&s=Failed&i=&.submit=Submit+Query

shows that QMUL had

globus_ftp_client: the server responded with an error 500 500-Command
failed. : an I/O operation was cancelled 500-globus_xio: Operation was
canceled 500 End

errors between 06-MAR-12 06.19.33.000000 PM +00:00 and 06-MAR-12
06.22.34.000000 PM +00:00, all for transfers from machines within
usatlas.bnl.gov.
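
To line the failures up against the Ganglia plots, the timestamps from the fts-mon page can be parsed and binned. A minimal sketch for Python 3.7+, assuming the timestamps have been copied out of the page as strings in the format shown above (06-MAR-12 06.19.33.000000 PM +00:00). The function names and the 10-minute bin width are illustrative choices, not part of any FTS tool.

```python
from collections import Counter
from datetime import datetime, timedelta

# Timestamp layout used by the fts-mon page quoted above, e.g.
# "06-MAR-12 06.19.33.000000 PM +00:00" (assumed to be the same for every row).
FTS_MON_FORMAT = "%d-%b-%y %I.%M.%S.%f %p %z"


def parse_fts_time(text):
    """Convert one fts-mon timestamp into a timezone-aware datetime."""
    return datetime.strptime(text.strip(), FTS_MON_FORMAT)


def bucket_failures(timestamps, minutes=10):
    """Count failures per fixed-width time bin, ordered by bin start."""
    width = timedelta(minutes=minutes)
    counts = Counter()
    for ts in map(parse_fts_time, timestamps):
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        # Floor the timestamp to the start of its bin within the day.
        counts[ts - (ts - midnight) % width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    qmul = ["06-MAR-12 06.19.33.000000 PM +00:00",
            "06-MAR-12 06.22.34.000000 PM +00:00"]
    first, last = parse_fts_time(qmul[0]), parse_fts_time(qmul[-1])
    print("window:", last - first)
    for start, n in bucket_failures(qmul).items():
        print(start.isoformat(), n)
```

For the QMUL example, the two timestamps parse to 18:19:33 and 18:22:34 UTC, a window of just over three minutes.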


> 
> There's no LHCONE work in the UK, *but* Lancaster is sitting at the
> end of its own lightpath to RAL. If the routing down the lightpath is
> causing transfer asymmetries, that could be causing a problem. It
> looks like I'll have to poke t2k and the FTS guys for some answers.

These are all ATLAS transfer errors, though, not t2k ones.

The other thing you should be aware of is that Brian was going to tweak
the FTS settings to reduce the timeouts for traffic between Tier-1 and
Tier-2 sites. This was expected to result in a 1.5% failure rate.

Chris

> 
> Cheers,
> Matt
> 
> 
> 
>>> Heya guys,
>>> A good portion (about 30%) of t2k.org FTS transfers to Lancaster have
>>> been failing over the last fortnight with this error message:
>>> globus_ftp_client: the server responded with an error 500 500-Command
>>> failed. : globus_xio: Unable to connect to 194.80.35.46:24383
>>> 500-globus_xio: System error in connect: Connection timed out
>>> 500-globus_xio: A system call failed: Connection timed out 500 End.
>>>
>>> (The port number changes, but as t2k only have access to one disk
>>> server, the IP address stays the same, which of course could be the
>>> root of the problem.)
>>>
>>> As can be seen here (for a limited time at least):
>>>
>>> http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl?q=transfers_count&p=week&v=All&c=RALLCG2-UKINORTHGRIDLANCSHEP&s=Failed&r=All&i=&.submit=Submit+Query
>>>
>>> I'm scratching my head trying to figure out this problem. The disk
>>> server in question seems to be busy but not heavily loaded. Network
>>> usage seems well within reasonable limits. There is no phenomenon
>>> causing a buildup of nasty CLOSE_WAIT connections blocking ports. The
>>> globus tcp port ranges and iptables appear to be correctly set (and
>>> most configuration problems would cause all t2k.org transfers to
>>> fail). The server-side logs are empty of anything useful. If it were a
>>> LAN problem I'd expect to see some failures on other disk servers on
>>> the same switch, which I don't (and likewise for any WAN problem). Only
>>> t2k.org are seeing this problem, but then they're the only "other VO"
>>> using the FTS to transfer large amounts of data into our "other" pool.
>>>
>>> And now I've gotten a bit stuck figuring this one out. Has anyone seen
>>> a problem like this before, or have any ideas what may be the cause?
>>> I thought I'd ask you chaps before I poked the FTS guys.
>>>
>>> Thanks in advance,
>>> Matt
>>
>>
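
Matt's check that the globus TCP port range and iptables are correctly set can be cross-checked against the failures themselves: if every "Unable to connect to" target in the error messages falls inside the configured data-port range, a simple firewall mismatch becomes less likely. A minimal sketch, assuming the error strings have been exported from the FTS monitoring as plain text. The function names and the 20000,25000 range used in the usage note are hypothetical.

```python
import re
from collections import Counter

# Matches the endpoint in messages such as
# "globus_xio: Unable to connect to 194.80.35.46:24383".
CONNECT_RE = re.compile(r"Unable to connect to (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def failed_endpoints(messages):
    """Yield (ip, port) for each error message that names a connect target."""
    for msg in messages:
        match = CONNECT_RE.search(msg)
        if match:
            yield match.group(1), int(match.group(2))


def summarise(messages, low, high):
    """Tally failing hosts and collect ports outside the [low, high] range."""
    hosts = Counter()
    out_of_range = []
    for ip, port in failed_endpoints(messages):
        hosts[ip] += 1
        if not low <= port <= high:
            out_of_range.append(port)
    return hosts, out_of_range
```

Calling summarise(errors, 20000, 25000) on a day's worth of failures returns the per-host tally and any ports that fell outside the range; an empty list, with every failure pointing at the one t2k disk server, would be consistent with the configuration being right and the problem being intermittent.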
