Hi,
Since updating the library we've had no transfers fail in the way they
were failing previously, and no new issues that we've noticed.
We'll give it another day to be sure, but I think this has fixed it.
Cheers,
John
On 29/05/2018 16:55, Sam Skipsey wrote:
> Hello everyone:
>
> The DPM devs have a (beta) fix for the problem, which you're welcome to
> test if you want. It's a modified version of the CGSI-GSOAP library for
> SL7/Centos7, which should "do the right thing" when a server gets a
> request to use a new connection from a client with an existing (but
> timed out) connection.
>
> http://dmc-repo.web.cern.ch/dmc-repo/rc/el7/x86_64/?C=M;O=D for the
> testing release. (You can roll back to the previous version safely if
> this doesn't work.)
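>
> If it helps, a minimal way to pull it in - assuming the package is still
> named CGSI-gSOAP and that you're happy pointing yum straight at that
> directory as a repo - would be something like:
>
>   # /etc/yum.repos.d/dmc-rc.repo
>   [dmc-rc]
>   name=DMC release candidates (el7)
>   baseurl=http://dmc-repo.web.cern.ch/dmc-repo/rc/el7/x86_64/
>   enabled=1
>   gpgcheck=0
>
> and then:
>
>   yum clean metadata && yum update CGSI-gSOAP
>
> followed by a restart of the SRM daemon. Do check the package name against
> the repo listing first, though.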
>
> Do feel free to test it, or not, and feed back if it changes anything
> for you.
>
> Sam
>
> On Wed, May 16, 2018 at 10:02 AM Sam Skipsey <[log in to unmask]> wrote:
>
>     Hi everyone: an update from the DPM devs - SRM has a hard-coded
>     (no, you can't modify it in a configuration script) 5-minute timeout
>     in it.
>     The difference between the SL6 and SL7 behaviour is that, for some
>     reason, with the SL6/RHEL6 version of gsoap the connection timeout
>     doesn't stop SRM responding to packets afterwards (!), whilst for
>     SL7/Centos7/RHEL7 it seems to.
>     Exactly why this happens - and how to resolve it - is still being
>     worked out.
>
> Sam
>
>     On Tue, May 15, 2018 at 8:35 AM John Bland <[log in to unmask]> wrote:
>
> Ditto for Liverpool.
>
> Anyone got any word from the FTS guys yet?
>
> John
>
> On 14/05/2018 22:34, Govind Songara wrote:
> > Hi Sam,
> >
> > I have made the changes this afternoon, but I'm still seeing the same
> > error.
> >
> > Thanks
> > Govind
> >
> > On Mon, May 14, 2018 at 4:40 PM, Sam Skipsey <[log in to unmask]> wrote:
> >
> > Hi Chaps,
> >
> >     Andrea at DPM has discovered the following problem with Centos 7
> >     versus SL6 - it ignores limits.conf settings for services.
> >     As there are a few changes to the ulimit max open files settings for
> >     srmv2.2, this could be causing the srm service to run out of file
> >     handles and time out on requests.
> >
> > If you'd like to test if this is the case for your sites
> (everyone
> > with an SRM 500 error), you can apply an increased max
> open files in
> > systemd by doing:
> >
> > systemctl edit srmv2.2.service
> >
> > then adding
> >
> > [Service]
> > LimitNOFILE=65000
> >
> > in the editor,
> >
> > and restarting the service when done.
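> >
> >     (A quick way to confirm the new limit has actually been picked up -
> >     assuming the unit really is called srmv2.2 on your head node - is
> >     something like:
> >
> >     systemctl restart srmv2.2
> >     systemctl show srmv2.2.service -p LimitNOFILE
> >     grep 'open files' /proc/$(pgrep -f srmv2.2 | head -1)/limits
> >
> >     The show command should report LimitNOFILE=65000, and the grep should
> >     show the raised limit on the running process.)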
> >
> > If you do this, please let me know how it goes so I can
> feed back to
> > Andrea.
> >
> > Sam
> >
> >
> > On Mon, May 14, 2018 at 3:48 PM Kashif Mohammad <[log in to unmask]> wrote:
> >
> >                 Hi Sam
> >
> >                 We are still failing transfers and, as you explained to me in
> >                 the ops meeting, it is the last stage which fails. I looked
> >                 through the logs for one example and saw the same file being
> >                 tried multiple times.
> >
> >                 It starts with PrepareToPut:
> >
> >                 [root@t2se01 srmv2.2]# grep -r "DAOD_EXOT8.13867737._000090.pool.root.1" log
> >
> >                 05/14 14:45:29.075 206398,7 PrepareToPut: SRM98 - PrepareToPut 0 srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 15:29:54.943 206398,0 PrepareToPut: SRM98 - PrepareToPut 0 srm://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >
> >                 Looking at the pool node, the file was actually copied:
> >
> >                 [root@t2se45 dpm-gsiftp]# grep -r DAOD_EXOT8.13867737._000090.pool.root.1 gridftp.log
> >
> >                 [41663] Mon May 14 14:45:30 2018 :: dmlite :: stat :: /t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0 :: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management :: fts106.cern.ch
> >                 [41663] Mon May 14 14:45:31 2018 :: dmlite :: user error :: 2 :: [#00.000002] Could not open t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0 :: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management :: fts106.cern.ch
> >                 [41663] Mon May 14 14:45:31 2018 :: Starting to transfer "/t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0".
> >                 [41663] Mon May 14 15:09:31 2018 :: Finished transferring "/t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0".
> >
> >                 Then it was deleted; probably the put_done process failed:
> >
> >                 05/14 14:45:28.731 4714,0 Cns_srv_delete: NS098 - delete /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 14:45:29.239 4714,0 Cns_srv_stat: NS098 - stat 0 /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 14:45:29.323 4714,0 Cns_srv_creat: NS098 - creat /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1 664 22
> >                 05/14 14:45:29.365 4714,0 Cns_srv_addreplica: NS098 - addreplica t2se45.physics.ox.ac.uk t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 14:45:31.157 4714,0 Cns_srv_statg: NS098 - statg /dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 14:45:31.199 4714,0 Cns_srv_statr: NS098 - statr t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 14:45:31.489 4714,0 Cns_srv_accessr: NS098 - accessr 2 t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 15:09:31.362 4714,0 Cns_srv_statr: NS098 - statr t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 15:09:31.444 4714,0 Cns_srv_getreplicax: NS098 - getreplicax /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 15:09:34.400 4714,0 Cns_srv_delreplica: NS098 - delreplica t2se45.physics.ox.ac.uk:/dpm/pool3/atlas/2018-05-14/DAOD_EXOT8.13867737._000090.pool.root.1.194527267.0
> >                 05/14 15:09:34.402 4714,0 Cns_srv_getreplicax: NS098 - getreplicax /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 15:09:34.403 4714,0 Cns_srv_unlink: NS098 - unlink /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 15:09:34.546 4714,0 Cns_srv_delete: NS098 - delete /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >                 05/14 15:29:54.602 4714,1 Cns_srv_delete: NS098 - delete /dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/rucio/mc16_13TeV/3f/c8/DAOD_EXOT8.13867737._000090.pool.root.1
> >
> >                 Should we open a ticket with the DPM developers?
> >
> >                 Cheers,
> >
> >                 Kashif
> >
> >                 *From:* GRIDPP2: Deployment and support of SRM and local storage
> >                 management [mailto:[log in to unmask]] *On Behalf Of* Sam Skipsey
> >                 *Sent:* 10 May 2018 15:11
> >                 *To:* [log in to unmask]
> >                 *Subject:* Re: help debugging transfer failures
> >
> >                 Okay so:
> >
> >                 Working sites (with no evidence of the signature SRM PUT-DONE
> >                 SOAP errors):
> >                 Glasgow (SL6, DPM 1.9.0, IPv4)
> >                 Lancaster (SL6, DPM 1.9.0, IPv6)
> >
> >                 Issue sites:
> >                 Liverpool (Centos 7, DPM 1.9.x, IPv4)
> >                 Oxford (SL7, DPM 1.9.x, IPv4)
> >                 RHUL (Centos 7, DPM 1.9.x, IPv4)
> >                 ECDF-RDF (Centos 7, DPM 1.9.x, IPv4, I think because 6 was problematic)
> >
> >                 So, assuming this means anything, the only common factor seems
> >                 to be a RHEL7-based release rather than a RHEL6-based one.
> >
> >                 Sam
> >
> >
> >                 On Thu, May 10, 2018 at 2:37 PM George, Simon <[log in to unmask]> wrote:
> >
> >                     No IPv6 on storage yet, still working on the perfsonar :-)
> >
> >                     On 10 May 2018 14:33, Matt Doidge <[log in to unmask]> wrote:
> >
> > Hi Sam,
> >                         We're SL6 at Lancaster still (and only on 1.9.0 -
> >                         upgrading's on my todo list).
> >
> > Cheers,
> > Matt
> >
> > On 10/05/18 14:23, Sam Skipsey wrote:
> >                         > Sneaking suspicion: which of you guys have IPv6
> >                         > enabled on your storage?
> >                         >
> >                         > I think Lancaster's also Centos 7 / DPM 1.9.x (Matt,
> >                         > am I remembering right?), but Matt did some Exciting
> >                         > Things to fix odd IPv6 problems, as I recall.
> >                         >
> >                         > On Thu, May 10, 2018 at 2:17 PM Sam Skipsey <[log in to unmask]> wrote:
> > >
> > > Okay, so everyone with an issue with a
> ticket is on Centos 7 and DPM
> > > 1.9.x... (this is a head node issue, so
> that's the important bit).
> > >
> > > I'll just check the sites I know aren't
> SL7/Centos 7 in the
> > > monitoring and see if they are different.
> > >
> > > Sam
> > >
> >                         >     On Thu, May 10, 2018 at 11:46 AM John Bland <[log in to unmask]> wrote:
> > >
> > > At Liverpool all Centos7.4, DPM 1.9.2,
> puppet.
> > >
> > > On 10/05/2018 11:37, Govind Songara wrote:
> >                         >         > Thanks Simon, headnode is configured
> >                         >         > using puppet. Pool node still uses yaim.
> >                         >         >
> >                         >         > On Thu, 10 May 2018, 11:19 a.m. George, Simon, <[log in to unmask]> wrote:
> > > >
> > > > Hi Sam,
> > > >
> > > > RHUL is running DPM 1.9.0 on
> Centos 7.3 on the SE head node.
> > > >
> > > > The storage nodes are DPM 1.8.10
> on SL6.9.
> > > >
> > > > Simon
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> ------------------------------------------------------------------------
> >                         >         >     *From:* Sam Skipsey <[log in to unmask]>
> >                         >         >     *Sent:* 10 May 2018 11:12
> >                         >         >     *To:* George, Simon
> >                         >         >     *Cc:* [log in to unmask]
> >                         >         >     *Subject:* Re: [GRIDPP-STORAGE] help debugging transfer failures
> >                         >         >
> > > > Hello:
> > > >
> > > > So, it looks like Oxford and
> RHUL and the new ECDF-RDF have
> > > > something in common, as all of
> your transfer failures
> > > look similar
> > > > from the ATLAS logs (they look
> like SOAP errors on PUT
> > > DONE (error
> > > > code 500), on otherwise
> successful transfers).
> > > >
> > > > I know Oxford is running on SL7
> with DPM 1.9.2 - is there
> > > anything
> > > > in common with the other two of you?
> > > >
> > > > Sam
> > > >
> >                         >         >     On Sun, May 6, 2018 at 12:33 PM George, Simon <[log in to unmask]> wrote:
> > > >
> >                         >         >         We got a new ticket for the
> >                         >         >         same problem this weekend:
> >                         >         >
> >                         >         >         https://ggus.eu/index.php?mode=ticket_info&ticket_id=134945
> > > >
> > > > How can we move forward on this?
> > > >
> > > > Change FTS parameters - how?
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Simon
> > > >
> > > >
> > > >
> > > >
> > >
> ------------------------------------------------------------------------
> >                         >         >         *From:* GRIDPP2: Deployment and support of SRM and
> >                         >         >         local storage management <[log in to unmask]>
> >                         >         >         on behalf of John Bland <[log in to unmask]>
> >                         >         >         *Sent:* 03 May 2018 10:52
> >                         >         >         *To:* [log in to unmask]
> >                         >         >         *Subject:* Re: help debugging transfer failures
> >                         >         >
> > > > Hi,
> > > >
> >                         >         >         This page has the majority of failed files
> >                         >         >         where the transfer time is 300-600s (plus a
> >                         >         >         few over that). Not one below 300s that I've seen.
> > > >
> > > >
> > >
> http://dashb-atlas-ddm.cern.ch/ddm2/#activity=(Data+Brokering,Data+Consolidation,Deletion,Express,Functional+Test,Group+Subscriptions,Production,Production+Input,Production+Output,Recovery,Staging,T0+Export,T0+Tape,User+Subscriptions,default,on)&d.error_code=154&d.state=(TRANSFER_FAILED)&date.from=201805021050&date.interval=0&date.to=201805021450&dst.cloud=(%22UK%22)&dst.site=(%22UKI-NORTHGRID-LIV-HEP%22)&dst.tier=(0,1,2)&dst.token=(-IPV6TEST,-DDMTEST,-CEPH,-PPSSCRATCHDISK)&grouping.dst=(cloud,site,token)&m.content=(d_dof,d_eff,d_faf,s_eff,s_err,s_suc,t_eff,t_err,t_suc)&samples=true&src.site=(-RUCIOTEST,-MWTEST,-RDF)&src.tier=(0,1,2)&src.token=(-IPV6TEST,-DDMTEST,-CEPH,-PPSSCRATCHDISK)&tab=details
> >
> > > >
> > > > John
> > > >
> > > > On 03/05/2018 10:45, Duncan
> Rand wrote:
> > > > > John
> > > > >
> > > > > Do you have an example of
> one of those transfers? Here
> > > > >
> > > > >
> > >
> https://fts106.cern.ch:8449/var/log/fts3/transfers/2018-05-03/srm.ndgf.org__se2.ppgrid1.rhul.ac.uk/2018-05-03-0856__srm.ndgf.org__se2.ppgrid1.rhul.ac.uk__761281463__e7e8646a-434c-59d0-b37f-a4d8917f1113
> >
> > > >
> > > > >
> > > > >
> >                         >         >         > I see a 10GB file taking about 42 minutes
> >                         >         >         > and then failing. There are a number of
> >                         >         >         > FTS configurations here
> > > > >
> > > > >
> > >
> https://fts3-pilot.cern.ch:8449/fts3/ftsmon/#/config/gfal2
> >
> > > > >
> > > > > a couple are indeed set to
> 300s/5mins.
> > > > >
> > > > > Duncan
> > > > >
> > > > > On 03/05/2018 09:57,
> George, Simon wrote:
> > > > >> Thanks John.
> > > > >>
> >                         >         >         >> Who is able to check if FTS itself has a
> >                         >         >         >> timeout in place?
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > >
> ------------------------------------------------------------------------
> >                         >         >         >> *From:* GRIDPP2: Deployment and support of SRM
> >                         >         >         >> and local storage management <[log in to unmask]>
> >                         >         >         >> on behalf of John Bland <[log in to unmask]>
> >                         >         >         >> *Sent:* 02 May 2018 23:10
> >                         >         >         >> *To:* [log in to unmask]
> >                         >         >         >> *Subject:* Re: help debugging transfer failures
> >                         >         >         >>
> >                         >         >         >> Looking at some of the failed transfers we
> >                         >         >         >> see at Liverpool, the SRM logs show a
> >                         >         >         >> 5-minute timeout of some sort. SRM Put
> >                         >         >         >> starts, the gridftp server transfers
> >                         >         >         >> perfectly, but if the transfer takes more
> >                         >         >         >> than 5 minutes the SRM control connection
> >                         >         >         >> gets terminated (but not the GridFTP one,
> >                         >         >         >> as far as I've seen). The client then
> >                         >         >         >> appears to just delete the file in these
> >                         >         >         >> circumstances.
> >                         >         >         >>
> >                         >         >         >> Although it's more than possible our uni
> >                         >         >         >> firewall is doing this, given that at least
> >                         >         >         >> a handful of sites are seeing similar issues
> >                         >         >         >> and that the FTS logs themselves show an
> >                         >         >         >> INFO error of "Timeout stopped", I'd also be
> >                         >         >         >> eyeing the FTS servers suspiciously as well.
> >                         >         >         >>
> >                         >         >         >> It probably only shows up with big files
> >                         >         >         >> (any I've checked are >2GB at least) or if
> >                         >         >         >> the WAN is being saturated enough to push
> >                         >         >         >> the transfer over 5 minutes.
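> >                         >         >         >>
> >                         >         >         >> One rough way to see whether your own
> >                         >         >         >> failures line up with that cutoff is to
> >                         >         >         >> pull the start/finish pairs out of the
> >                         >         >         >> gridftp log on a pool node and eyeball
> >                         >         >         >> the durations, something like:
> >                         >         >         >>
> >                         >         >         >> grep -E "Starting to transfer|Finished transferring" /var/log/dpm-gsiftp/gridftp.log
> >                         >         >         >>
> >                         >         >         >> (that path is just where our gridftp logs
> >                         >         >         >> live - adjust if yours are elsewhere).
> >                         >         >         >> Anything that starts but takes more than
> >                         >         >         >> five minutes to finish is a candidate.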
> > > > >>
> > > > >> John
> > > > >>
> > > > >> On 02/05/18 17:18, Govind
> Songara wrote:
> > > > >>> Hi All,
> > > > >>>
> >                         >         >         >>> As mentioned in today's meeting, we still
> >                         >         >         >>> see this error.
> >                         >         >         >>> It would be great if you could help with
> >                         >         >         >>> this problem.
> > > > >>>
> > > > >>> Thanks
> > > > >>> Govind
> > > > >>>
> >                         >         >         >>> On Tue, Apr 10, 2018 at 11:47 AM, George, Simon <[log in to unmask]> wrote:
> > > > >>>
> >                         >         >         >>>     I found examples of the same type of
> >                         >         >         >>>     error at Lancaster, if you're interested:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > >
> http://dashb-atlas-ddm.cern.ch/ddm2/#activity=(Data+Brokering,Data+Consolidation,Deletion,Express,Functional+Test,Group+Subscriptions,Production,Production+Input,Production+Output,Recovery,Staging,T0+Export,T0+Tape,User+Subscriptions,default)&d.dst.cloud=%22UK%22&d.dst.site=%22UKI-NORTHGRID-LANCS-HEP%22&d.dst.token=%22DATADISK%22&d.error_code=229&d.src.cloud=%22CA%22&d.state=(TRANSFER_FAILED)&date.from=201804050000&date.interval=0&date.to=201804070000&dst.cloud=(%22UK%22)&dst.site=(%22UKI-NORTHGRID-LANCS-HEP%22)&dst.tier=(0,1,2)&dst.token=(-TEST,-CEPH,-PPS,-GRIDFTP)&grouping.dst=(cloud,site,token)&m.content=(d_dof,d_eff,d_faf,s_eff,s_err,s_suc,t_eff,t_err,t_suc)&p.grouping=src&samples=true&src.site=(-TEST,-RDF,-AWS,-CEPH)&src.tier=(0,1,2)&src.token=(-TEST,-CEPH,-PPS,-GRIDFTP)&tab=details
> >
> > > >
> > > > >>>
> > > > >>>
> > > > >>>
> > >
> >
> > > >
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > >
> ------------------------------------------------------------------------
> > > > >>> *From:* George, Simon
> >                         >         >         >>> *Sent:* 06 April 2018 13:17
> >                         >         >         >>> *To:* [log in to unmask]
> >                         >         >         >>> *Subject:* help debugging transfer failures
> > > > >>>
> >                         >         >         >>>     Dear storage experts, especially DPM
> >                         >         >         >>>     flavoured ones,
> >                         >         >         >>>
> >                         >         >         >>>     I'd be grateful if you could take a look
> >                         >         >         >>>     at this ticket and give help and/or
> >                         >         >         >>>     suggestions on how to get to the bottom of it.
> > > > >>>
> > > > >>>
> > >
> https://ggus.eu/index.php?mode=ticket_info&ticket_id=134144
> >
> > > > >>>
> > >
> >
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Simon
> > > > >>>
> > > > >>>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> John Bland                        [log in to unmask]
> > > > >> System Administrator              office: 220
> > > > >> High Energy Physics Division      tel (int): 42911
> > > > >> Oliver Lodge Laboratory           tel (ext): +44 (0)151 794 2911
> > > > >> University of Liverpool           http://www.liv.ac.uk/physics/hep/
> > > > >> "I canna change the laws of physics, Captain!"
> > > >
> > > >
> > > > --
> > > > John Bland                        [log in to unmask]
> > > > Research Fellow                   office: 220
> > > > High Energy Physics Division      tel (int): 42911
> > > > Oliver Lodge Laboratory           tel (ext): +44 (0)151 794 2911
> > > > University of Liverpool           http://www.liv.ac.uk/physics/hep/
> > > > "I canna change the laws of physics, Captain!"
> > > >
> > >
> > >
> > > --
> > > John Bland                        [log in to unmask]
> > > Research Fellow                   office: 220
> > > High Energy Physics Division      tel (int): 42911
> > > Oliver Lodge Laboratory           tel (ext): +44 (0)151 794 2911
> > > University of Liverpool           http://www.liv.ac.uk/physics/hep/
> > > "I canna change the laws of physics, Captain!"
> >
> >
>
>
> --
> John Bland                        [log in to unmask]
> Research Fellow office: 220
> High Energy Physics Division tel (int): 42911
> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2911
> University of Liverpool http://www.liv.ac.uk/physics/hep/
> "I canna change the laws of physics, Captain!"
>
--
John Bland [log in to unmask]
Research Fellow office: 220
High Energy Physics Division tel (int): 42911
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2911
University of Liverpool http://www.liv.ac.uk/physics/hep/
"I canna change the laws of physics, Captain!"
To unsubscribe from the GRIDPP-STORAGE list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=GRIDPP-STORAGE&A=1