JISCMail - GRIDPP-STORAGE Archives

Subject: Re: help debugging transfer failures
From: Matt Doidge <[log in to unmask]>
Reply-To: Matt Doidge <[log in to unmask]>
Date: Thu, 10 May 2018 14:32:42 +0100
Hi Sam,
We're still on SL6 at Lancaster (and only on 1.9.0 - upgrading's on my 
to-do list).

Cheers,
Matt

On 10/05/18 14:23, Sam Skipsey wrote:
> Sneaking suspicion: which of you guys have IPv6 turned on on your storage?
> 
> I think Lancaster's also Centos 7 / DPM 1.9.x (Matt, am I remembering 
> right?), but Matt did some Exciting Things to fix odd IPv6 problems, as 
> I recall.
> 
> On Thu, May 10, 2018 at 2:17 PM Sam Skipsey <[log in to unmask] 
> <mailto:[log in to unmask]>> wrote:
> 
>     Okay, so everyone with an issue with a ticket is on Centos 7 and DPM
>     1.9.x... (this is a head node issue, so that's the important bit).
> 
>     I'll just check the sites I know aren't SL7/Centos 7 in the
>     monitoring and see if they are different.
> 
>     Sam
> 
>     On Thu, May 10, 2018 at 11:46 AM John Bland <[log in to unmask]
>     <mailto:[log in to unmask]>> wrote:
> 
>         At Liverpool all Centos7.4, DPM 1.9.2, puppet.
> 
>         On 10/05/2018 11:37, Govind Songara wrote:
>          > Thanks Simon, headnode is configured using puppet.  Pool node
>         still uses
>          > yaim.
>          >
>          > On Thu, 10 May 2018, 11:19 a.m. George, Simon,
>         <[log in to unmask] <mailto:[log in to unmask]>
>          > <mailto:[log in to unmask] <mailto:[log in to unmask]>>> wrote:
>          >
>          >     Hi Sam,
>          >
>          >     RHUL is running DPM 1.9.0 on Centos 7.3 on the SE head node.
>          >
>          >     The storage nodes are DPM 1.8.10 on SL6.9.
>          >
>          >     Simon
>          >
>          >
>          >
>          >
>          >   
>           ------------------------------------------------------------------------
>          >     *From:* Sam Skipsey <[log in to unmask]
>         <mailto:[log in to unmask]>
>          >     <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>>
>          >     *Sent:* 10 May 2018 11:12
>          >     *To:* George, Simon
>          >     *Cc:* [log in to unmask]
>         <mailto:[log in to unmask]>
>          >     <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >     *Subject:* Re: [GRIDPP-STORAGE] help debugging transfer
>         failures
>          >     Hello:
>          >
>          >     So, it looks like Oxford and RHUL  and the new ECDF-RDF have
>          >     something in common, as all of your transfer failures
>         look similar
>          >     from the ATLAS logs (they look like SOAP errors on PUT
>         DONE (error
>          >     code 500), on otherwise successful transfers).
>          >
>          >     I know Oxford is running on SL7 with DPM 1.9.2 - is there
>         anything
>          >     in common with the other two of you?
>          >
>          >     Sam
>          >
>          >     On Sun, May 6, 2018 at 12:33 PM George, Simon
>         <[log in to unmask] <mailto:[log in to unmask]>
>          >     <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>> wrote:
>          >
>          >         We got a new ticket for the same problem this weekend:
>          >
>          > https://ggus.eu/index.php?mode=ticket_info&ticket_id=134945
>          >
>          >         How can we move forward on this?
>          >
>          >         Change FTS parameters - how?
>          >
>          >
>          >         Thanks,
>          >
>          >         Simon
>          >
>          >
>          >
>          >       
>           ------------------------------------------------------------------------
>          >         *From:* GRIDPP2: Deployment and support of SRM and
>         local storage
>          >         management <[log in to unmask]
>         <mailto:[log in to unmask]>
>          >         <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>> on behalf of John Bland
>          >         <[log in to unmask]
>         <mailto:[log in to unmask]> <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>>
>          >         *Sent:* 03 May 2018 10:52
>          >         *To:* [log in to unmask]
>         <mailto:[log in to unmask]>
>          >         <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         *Subject:* Re: help debugging transfer failures
>          >         Hi,
>          >
>          >         This page has the majority of failed files where the
>         transfer
>          >         time is
>          >         300-600s (plus a few over that). Not one below 300s
>         that I've seen.
>          >
>          >
>         http://dashb-atlas-ddm.cern.ch/ddm2/#activity=(Data+Brokering,Data+Consolidation,Deletion,Express,Functional+Test,Group+Subscriptions,Production,Production+Input,Production+Output,Recovery,Staging,T0+Export,T0+Tape,User+Subscriptions,default,on)&d.error_code=154&d.state=(TRANSFER_FAILED)&date.from=201805021050&date.interval=0&date.to=201805021450&dst.cloud=(%22UK%22)&dst.site=(%22UKI-NORTHGRID-LIV-HEP%22)&dst.tier=(0,1,2)&dst.token=(-IPV6TEST,-DDMTEST,-CEPH,-PPSSCRATCHDISK)&grouping.dst=(cloud,site,token)&m.content=(d_dof,d_eff,d_faf,s_eff,s_err,s_suc,t_eff,t_err,t_suc)&samples=true&src.site=(-RUCIOTEST,-MWTEST,-RDF)&src.tier=(0,1,2)&src.token=(-IPV6TEST,-DDMTEST,-CEPH,-PPSSCRATCHDISK)&tab=details
>          >
>          >         John
>          >
>          >         On 03/05/2018 10:45, Duncan Rand wrote:
>          >         > John
>          >         >
>          >         > Do you have an example of one of those transfers? Here
>          >         >
>          >         >
>         https://fts106.cern.ch:8449/var/log/fts3/transfers/2018-05-03/srm.ndgf.org__se2.ppgrid1.rhul.ac.uk/2018-05-03-0856__srm.ndgf.org__se2.ppgrid1.rhul.ac.uk__761281463__e7e8646a-434c-59d0-b37f-a4d8917f1113
>          >
>          >         >
>          >         >
>          >         > I see a 10GB file taking about 42 minutes and then
>         failing. There are a
>          >         > number of FTS configurations here
>          >         >
>          >         >
>         https://fts3-pilot.cern.ch:8449/fts3/ftsmon/#/config/gfal2
>          >         >
>          >         > a couple are indeed set to 300s/5mins.
>          >         >
>          >         > Duncan
>          >         >
>          >         > On 03/05/2018 09:57, George, Simon wrote:
>          >         >> Thanks John.
>          >         >>
>          >         >> Who is able to check if FTS itself has a timeout
>         in place?
>          >         >>
>          >         >>
>          >         >>
>          >         >>
>         ------------------------------------------------------------------------
>          >         >> *From:* GRIDPP2: Deployment and support of SRM and
>         local storage
>          >         >> management <[log in to unmask]
>         <mailto:[log in to unmask]>
>          >         <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>> on behalf of John Bland
>          >         >> <[log in to unmask]
>         <mailto:[log in to unmask]> <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>>
>          >         >> *Sent:* 02 May 2018 23:10
>          >         >> *To:* [log in to unmask]
>         <mailto:[log in to unmask]>
>         <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         >> *Subject:* Re: help debugging transfer failures
>          >         >> Looking at some of the failed transfers we see at
>         Liverpool the SRM logs
>          >         >> show a 5minute timeout of some sort. SRM Put
>         starts, the gridftp server
>          >         >> transfers perfectly, but if the transfer takes
>         more than 5minutes the
>          >         >> SRM control connection gets terminated (but not
>         the GridFTP one that
>          >         >> I've seen). The client then appears to just delete
>         the file in these
>          >         >> circumstances.
>          >         >>
>          >         >> Although it's more than possible our uni firewall
>         is doing this, given
>          >         >> that at least a handful of sites are seeing
>         similar issues and that the
>          >         >> FTS logs themselves show an INFO error of "Timeout
>         stopped" I'd also be
>          >         >> eyeing the FTS servers suspiciously as well.
>          >         >>
>          >         >> It probably only shows up with big files (any I've
>         checked are >2GB at
>          >         >> least) or if the WAN is being saturated enough to
>         take the transfer over
>          >         >> 5mins.
>          >         >>
>          >         >> John
>          >         >>
>          >         >> On 02/05/18 17:18, Govind Songara wrote:
>          >         >>> Hi All,
>          >         >>>
>          >         >>> As mentioned in today's meeting, we still see this
>         error.
>          >         >>> It would be great if you can help on this problem.
>          >         >>>
>          >         >>> Thanks
>          >         >>> Govind
>          >         >>>
>          >         >>> On Tue, Apr 10, 2018 at 11:47 AM, George, Simon
>         <[log in to unmask] <mailto:[log in to unmask]>
>         <mailto:[log in to unmask] <mailto:[log in to unmask]>>
>          >         >>> <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>> wrote:
>          >         >>>
>          >         >>>      I found examples of the same type of error at
>         Lancaster if you're
>          >         >>>      interested:
>          >         >>>
>          >         >>>
>          >         >>>
>          >         >>>
>         http://dashb-atlas-ddm.cern.ch/ddm2/#activity=(Data+Brokering,Data+Consolidation,Deletion,Express,Functional+Test,Group+Subscriptions,Production,Production+Input,Production+Output,Recovery,Staging,T0+Export,T0+Tape,User+Subscriptions,default)&d.dst.cloud=%22UK%22&d.dst.site=%22UKI-NORTHGRID-LANCS-HEP%22&d.dst.token=%22DATADISK%22&d.error_code=229&d.src.cloud=%22CA%22&d.state=(TRANSFER_FAILED)&date.from=201804050000&date.interval=0&date.to=201804070000&dst.cloud=(%22UK%22)&dst.site=(%22UKI-NORTHGRID-LANCS-HEP%22)&dst.tier=(0,1,2)&dst.token=(-TEST,-CEPH,-PPS,-GRIDFTP)&grouping.dst=(cloud,site,token)&m.content=(d_dof,d_eff,d_faf,s_eff,s_err,s_suc,t_eff,t_err,t_suc)&p.grouping=src&samples=true&src.site=(-TEST,-RDF,-AWS,-CEPH)&src.tier=(0,1,2)&src.token=(-TEST,-CEPH,-PPS,-GRIDFTP)&tab=details
>          >
>          >         >>>
>          >         >>>
>          >         >>>
>          >
>          >         >>>
>          >         >>>
>          >         >>>
>          >         >>>
>          >         >>>
>          >         >>>
>         ------------------------------------------------------------------------
>          >         >>>      *From:* George, Simon
>          >         >>>      *Sent:* 06 April 2018 13:17
>          >         >>>      *To:* [log in to unmask]
>         <mailto:[log in to unmask]>
>         <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         >>>      <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         >>>      *Subject:* help debugging transfer failures
>          >         >>>
>          >         >>>      Dear storage experts, especially DPM
>         flavoured ones,
>          >         >>>
>          >         >>>      I'd be grateful if you could take a look at
>         this ticket and give
>          >         >>>      help and/or suggestions on how to get to the
>         bottom of it.
>          >         >>>
>          >         >>>
>         https://ggus.eu/index.php?mode=ticket_info&ticket_id=134144
>          >         >>>
>          >         >>>      Thanks,
>          >         >>>
>          >         >>>      Simon
>          >         >>>
>          >         >>>
>          >         >>
>          >         >>
>          >         >> --
>          >         >> John Bland [log in to unmask]
>         <mailto:[log in to unmask]> <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         >> System Administrator             office: 220
>          >         >> High Energy Physics Division     tel (int): 42911
>          >         >> Oliver Lodge Laboratory          tel (ext): +44
>         (0)151 794 2911 <tel:0151%20794%202911> <tel:0151%20794%202911>
>          >         >> University of Liverpool
>         http://www.liv.ac.uk/physics/hep/
>          >         >> "I canna change the laws of physics, Captain!"
>          >
>          >
>          >         --
>          >         John Bland [log in to unmask]
>         <mailto:[log in to unmask]> <mailto:[log in to unmask]
>         <mailto:[log in to unmask]>>
>          >         Research Fellow                  office: 220
>          >         High Energy Physics Division     tel (int): 42911
>          >         Oliver Lodge Laboratory          tel (ext): +44
>         (0)151 794 2911 <tel:0151%20794%202911>
>          >         <tel:0151%20794%202911>
>          >         University of Liverpool http://www.liv.ac.uk/physics/hep/
>          >         "I canna change the laws of physics, Captain!"
>          >
> 
> 
>         -- 
>         John Bland [log in to unmask] <mailto:[log in to unmask]>
>         Research Fellow                  office: 220
>         High Energy Physics Division     tel (int): 42911
>         Oliver Lodge Laboratory          tel (ext): +44 (0)151 794 2911
>         <tel:0151%20794%202911>
>         University of Liverpool http://www.liv.ac.uk/physics/hep/
>         "I canna change the laws of physics, Captain!"
>
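
The observation quoted above (failed transfers clustering at 300-600s, with none seen below 300s) can be checked against a list of transfer records. What follows is only a minimal sketch: the record format, the sample values, and the function names are all hypothetical and are not taken from the FTS, SRM, or dashboard logs mentioned in the thread.

```python
from collections import Counter

# Hypothetical sample: (duration in seconds, final state) per transfer.
# These values are illustrative only and do not come from any real log.
SAMPLE = [
    (120, "FINISHED"),
    (290, "FINISHED"),
    (310, "FAILED"),
    (450, "FAILED"),
    (610, "FAILED"),
    (95, "FINISHED"),
]

THRESHOLD_S = 300  # the suspected 5-minute timeout discussed in the thread


def summarise(transfers, threshold=THRESHOLD_S):
    """Count transfers by (side of threshold, final state)."""
    counts = Counter()
    for duration, state in transfers:
        side = "over" if duration > threshold else "under"
        counts[(side, state)] += 1
    return counts


def failures_only_over(transfers, threshold=THRESHOLD_S):
    """Return True if no failed transfer completed within the threshold."""
    return all(d > threshold for d, s in transfers if s == "FAILED")


if __name__ == "__main__":
    for key, n in sorted(summarise(SAMPLE).items()):
        print(key, n)
    print("failures only above threshold:", failures_only_over(SAMPLE))
```

If real records showed failures on both sides of the threshold, a fixed 5-minute timeout would be a less likely explanation; a clean split like the one in this made-up sample would be consistent with it but would not say which component (firewall, FTS, or SRM) is responsible.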