Hi Owen,

We eventually tracked down the cause of the issue: it was the mount
options on the pnfsdoors mount on the gridftp door node. It wasn't
mounted with the "noac" option, so it was caching some of the (p)nfs info.
Hence the file created a few seconds before wasn't there when the
gridftp door looked for it, which produced the (correct) error.

What we're not sure of is what's mounting the pnfsdoors area and how we
change the mount options. I've put it in /etc/fstab for now so it will
be mounted before the d-cache services start, which solves the problem
but may not be the best solution.
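
For anyone wanting to check their own door nodes, here is a minimal Python
sketch that scans mtab-format text for NFS mounts lacking "noac". The
function name is made up for illustration, and the sample input is just the
two /etc/mtab entries quoted further down in this thread; treat it as a
rough check under those assumptions rather than a definitive tool.

```python
def nfs_mounts_missing_noac(mtab_text):
    """Return the mount points of NFS filesystems mounted without 'noac'."""
    missing = []
    for line in mtab_text.splitlines():
        fields = line.split()
        # mtab/fstab layout: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ("nfs", "nfs4") and "noac" not in options.split(","):
            missing.append(mount_point)
    return missing


# Sample input: the two /etc/mtab entries quoted later in this thread.
SAMPLE = (
    "heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs "
    "rw,addr=130.246.47.204 0 0\n"
    "heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs "
    "rw,hard,intr,noac,addr=130.246.47.204 0 0\n"
)

if __name__ == "__main__":
    for mount_point in nfs_mounts_missing_noac(SAMPLE):
        print("mounted without noac:", mount_point)
```

On a real node you would pass in the contents of /etc/mtab instead of
SAMPLE.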

Yours,
Chris.

 

> -----Original Message-----
> From: Owen Synge [mailto:[log in to unmask]] 
> Sent: 12 June 2006 13:11
> To: Brew, CAJ (Chris); Patrick Fuhrmann
> Cc: [log in to unmask]
> Subject: Re: dCache SFT Failures
> 
> Hello all, 
> 
> Back in Sunny Oxfordshire, I am adding Patrick to this email 
> as I think this may aid the process of finding this issue. 
> Hopefully D-Cache can fix this issue upstream.
>  
> 
> On Thu, 25 May 2006 09:59:53 +0100
> "Brew, CAJ (Chris)" <[log in to unmask]> wrote:
> 
> > Hi,
> > 
> > The details are in the dcache user-forum but it looks to me like the
> > root cause is a combination of the many VOs (and their databases) and
> > having the gridftpdoor on the pool node.
> > 
> > If I move the door to the admin node the problem goes away but from my
> > previous transfer tests that limits my rates to about 200 Mb/s rather
> > than the 400 Mb/s I can get with the door on the pool node. That handicap
> > will only get worse when I add another 6-8 servers and get the 10GigE
> > connection to the Tier 1.
> > 
> > It looks like I've got three options:
> > 
> > Run the door on the admin node and accept slow transfers
> 
> I can't see this as a good long term solution.
> 
> > Run the door on the pool node and accept SFT failures (or try to get the
> > SFTs modified to wait between the upload and access)
> 
> This seems a pragmatic workaround. After all, these are "Site
> Functional Tests", and the site is functional. This bug may be found
> in production, but it is not a good indication of functionality. If
> service bugs are being caught, it should be a "functional regression
> test", in my opinion. RAL's ADS tape system, which has been in
> production for years, has exactly the same issue. This issue is part of
> the specs for an internal D-Cache bug. Of course, we can't let this
> issue fall away if we modify the site functional tests.
> 
> > Try to reconfigure the dCache to have fewer databases (say one for each
> > of the major VOs then one for every 4-6 smaller VOs). Is it even
> > possible to eliminate databases like that? But then again I used to see
> > these errors occasionally even before I increased the number of VOs so
> > that probably won't be a complete fix.
> 
> I think this is probably going to cause more pain and suffering in
> the long term. Until I went to DESY for the past two weeks, the
> D-Cache team were unaware that D-Cache setups contained as many as
> 24 VOs in a typical tier 2 install. This made the quota issue easier
> for them to understand also.
>  
> > Does anyone have any other ideas? The tier 1 doesn't seem to suffer from
> > this problem despite supporting almost as many VOs and running
> > gridftpdoors on the pools. Have they split off pnfs to its own server?
> > 
> > Thanks,
> > Chris.
> 
> I know people did have ideas about decomposing the services 
> better with separate pnfs, pool and door nodes, but I 
> understand this is a timing issue, found in an unusual 
> testing use case and am unsure if we should not just escalate 
> this bug and change the tests for now.
> 
> 
> Regards
> 
> Owen
> 
> 
> > 
> > > -----Original Message-----
> > > From: Greig A Cowan [mailto:[log in to unmask]] 
> > > Sent: 24 May 2006 18:13
> > > To: Brew, CAJ (Chris)
> > > Cc: [log in to unmask]
> > > Subject: RE: dCache SFT Failures
> > > 
> > > 
> > > Hi Chris,
> > > 
> > > I've just read your post on the user-forum. That's very 
> > > interesting what you've found. Could we be seeing a scaling 
> > > problem with dCache? I hadn't realised that you were 
> > > supporting 24 VOs, each with their own database.
> > > 
> > > I'll need to look into it, but there might be an option 
> > > within pnfs that lets you control things like this.
> > > 
> > > Greig
> > > 
> > > 
> > > On Wed, 24 May 2006, Brew, CAJ (Chris) wrote:
> > > 
> > > > > Hi Greig,
> > > > > 
> > > > > I've just been running some tests and have a bit more info, which I've
> > > > > just posted to the dcache user-forum. It looks like the file info isn't
> > > > > getting into the pnfs databases quickly enough.
> > > > > 
> > > > > That's probably why I'm failing the SFTs but haven't heard complaints
> > > > > from users.
> > > > > 
> > > > > I'm not sure where to take this from here unless I can tune the DB to
> > > > > get the info in quicker.
> > > > 
> > > > Yours,
> > > > Chris.
> > > > 
> > > > > -----Original Message-----
> > > > > From: GRIDPP2: Deployment and support of SRM and 
> local storage 
> > > > > management [mailto:[log in to unmask]] On 
> > > Behalf Of Greig 
> > > > > A Cowan
> > > > > Sent: 24 May 2006 17:56
> > > > > To: [log in to unmask]
> > > > > Subject: Re: dCache SFT Failures
> > > > > 
> > > > > Hi Chris,
> > > > > 
> > > > > > I see that you are still failing the SFTs; in fact, the situation
> > > > > > seems worse than before!
> > > > > > 
> > > > > > You are definitely using the correct pnfs mount options, aren't you?
> > > > > > Have you tried rebooting the machine?
> > > > > 
> > > > > Greig
> > > > > 
> > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: GRIDPP2: Deployment and support of SRM and 
> > > local storage 
> > > > > > > management [mailto:[log in to unmask]] On
> > > > > Behalf Of Greig
> > > > > > > A Cowan
> > > > > > > Sent: 23 May 2006 12:14
> > > > > > > To: [log in to unmask]
> > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > 
> > > > > > > Hi Chris,
> > > > > > > 
> > > > > > > > What are the permissions of the generated directory that the SFT is
> > > > > > > > trying to write into?
> > > > > > 
> > > > > > dteam001:dteam drwxr-xr-x
> > > > > > 
> > > > > > As all the dteam directories appear to be.
> > > > > > 
> > > > > > > > What options are you using when mounting pnfs on pool nodes?
> > > > > > 
> > > > > > Hmm, from /etc/mtab:
> > > > > > 
> > > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,hard,intr,noac,addr=130.246.47.204 0 0
> > > > > > 
> > > > > > > I had a problem earlier where the /fs filesystem hadn't mounted
> > > > > > > and the doors weren't working on the pool node. I ended up fixing
> > > > > > > it by putting it in /etc/fstab. I've remounted it with the same
> > > > > > > options as the pnfsdoors:
> > > > > > 
> > > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,addr=130.246.47.204 0 0
> > > > > > 
> > > > > > > Are the dCache filesystems in your fstab? What are the options?
> > > > > > 
> > > > > > Thanks,
> > > > > > Chris.
> > > > > > 
> > > > > > > Cheers,
> > > > > > > Greig
> > > > > > > 
> > > > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > > Removing the cron job doesn't seem to have solved the problem; the
> > > > > > > > > load on the machine is pretty low. Are there any other things I can
> > > > > > > > > try? My reliability is really low at the moment because of this.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Chris
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > > > > > > > Sent: 22 May 2006 12:33
> > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > Cc: [log in to unmask]
> > > > > > > > > Subject: RE: dCache SFT Failures
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > Hmmm, yes there's an hourly cron (on the hour so it's probably still
> > > > > > > > > > > running if the SFT gets through the queue quickly) that du's the
> > > > > > > > > > > dCache area to get a per VO breakdown of usage. I'll disable
> > > > > > > > > > > it and see if the SFT pass rate improves.
> > > > > > > > > 
> > > > > > > > > > You could run the cron at half past the hour instead. Do you really
> > > > > > > > > > need to run the cron every hour? The Tier-1 just runs a similar
> > > > > > > > > > command each night at 12pm.
> > > > > > > > > 
> > > > > > > > > > > p.s. Anyone know of another way of getting the information (a query
> > > > > > > > > > > on the DB perhaps)?
> > > > > > > > > > 
> > > > > > > > > > Unfortunately not. I asked about this, but it's not possible
> > > > > > > > > > with dCache at the moment. It should be available in a
> > > > > > > > > > future release...
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: GRIDPP2: Deployment and support of SRM and
> > > > > > > local storage
> > > > > > > > > > > management 
> [mailto:[log in to unmask]] On
> > > > > > > > > Behalf Of Greig
> > > > > > > > > > > A Cowan
> > > > > > > > > > > Sent: 22 May 2006 12:15
> > > > > > > > > > > To: [log in to unmask]
> > > > > > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > > > > > 
> > > > > > > > > > > Hi Chris,
> > > > > > > > > > > 
> > > > > > > > > > > > I've seen this before, but it's unclear to me what causes it.
> > > > > > > > > > > > Looking at your latest SFT failure (10:10), the lcg-cp command was
> > > > > > > > > > > > successful, but the subsequent lcg-rep failed.
> > > > > > > > > > > > 
> > > > > > > > > > > > Is there something else running on your dCache node which could be
> > > > > > > > > > > > interfering with pnfs? Maybe a cron job of some sort?
> > > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Greig
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, 22 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > > 
> > > > > > > > > > > > > I'm getting a lot of random failures in the SFTs from my dCache where
> > > > > > > > > > > > > the write of the file to the dCache appears successful but then when
> > > > > > > > > > > > > the SFT tries to read the file back you get:
> > > > > > > > > > > > 
> > > > > > > > > > > > > + lcg-cp -v --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722 file:///scratch/WMS_heplnx48_018249_https_3a_2f_2fgdrb02.cern.ch_3a9000_2fLxXmsliu9ehFjCWOYEcxQg/sft-lcg-rm-cp.txt
> > > > > > > > > > > > > the server sent an error response: 553 553 Permission denied, reason: CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> > > > > > > > > > > > > 
> > > > > > > > > > > > > lcg_cp: Permission denied
> > > > > > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > > > > 
> > > > > > > > > > > > > It appears that the write was indeed successful because the same SFT
> > > > > > > > > > > > > can later replicate it to CERN:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Replicate the file from the default SE to castorgrid.cern.ch
> > > > > > > > > > > > 
> > > > > > > > > > > > + lcg-rep -v --vo dteam -d castorgrid.cern.ch
> > > > > > > > > > > > 
> lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > 
> > > > > > > > > > > >             0 bytes      0.00 KB/sec 
> avg      0.00 
> > > > > > > KB/sec inst
> > > > > > > > > > > >             0 bytes      0.00 KB/sec 
> avg      0.00 
> > > > > > > KB/sec inst
> > > > > > > > > > > >             0 bytes      0.00 KB/sec avg      
> > > > > 0.00 KB/sec
> > > > > > > > > > > instUsing grid
> > > > > > > > > > > > catalog type: lfc
> > > > > > > > > > > > Using grid catalog : 
> > > > > > > prod-lfc-shared-central.cern.ch Source URL:
> > > > > > > > > > > > 
> > > > > > > > > 
> > > > > 
> lfn:/grid/dteam/SFT/sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.060522072
> > > > > > > > > 2
> > > > > > > > > > > > File size: 233
> > > > > > > > > > > > VO name: dteam
> > > > > > > > > > > > Destination specified: castorgrid.cern.ch Source
> > > > > > > URL for copy:
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/dteam/gen
> > > > > > > > > > > er
> > > > > > > > > > > > at
> > > > > > > > > > > > 
> > > ed/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > Destination URL for copy:
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006
> > > > > > > > > > > -0
> > > > > > > > > > > > 5- 22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > # streams: 1
> > > > > > > > > > > > # set timeout to 0
> > > > > > > > > > > > 
> > > > > > > > > > > > Transfer took 2020 ms
> > > > > > > > > > > > Destination URL registered in LRC:
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05
> > > > > > > > > > > -2
> > > > > > > > > > > > 2/ file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > + result=0
> > > > > > > > > > > > + set +x
> > > > > > > > > > > > 
> > > > > > > > > > > > > List replicas to check if replication was really successful
> > > > > > > > > > > > 
> > > > > > > > > > > > + lcg-lr --vo dteam
> > > > > > > > > > > lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05
> > > > > > > > > > > -2
> > > > > > > > > > > > 2/ file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/generated/20
> > > > > > > > > > > 06
> > > > > > > > > > > > -0
> > > > > > > > > > > > 5-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > + set +x
> > > > > > > > > > > > 
> > > > > > > > > > > > > I was always getting a few of these but since I added extra VOs a week
> > > > > > > > > > > > > ago I now seem to be failing between 30 and 50% of the SFT runs with this
> > > > > > > > > > > > > alone.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I haven't managed to replicate the error by copying files in and out
> > > > > > > > > > > > > multiple times, and the SFT deletes the file so I cannot check the
> > > > > > > > > > > > > status of the file that triggered the error.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Googling for the error seems to show that it's not uncommon but I
> > > > > > > > > > > > > don't see any indications of cause or solution. There doesn't seem to
> > > > > > > > > > > > > be anything in the logs.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Anyone know what I can do about this (other than install DPM)?
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Chris.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Examples taken from:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://lcg-sft.cern.ch/sft/info/heplnx201.pp.rl.ac.uk/sft_2006-05-22_07.10.05.html#sft-lcg-rm_2006-05-22_07:22:49
> > > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > ========================================================================
> > > > > > > > > > > > Dr Greig A Cowan                         http://www.ph.ed.ac.uk/~gcowan1
> > > > > > > > > > > > School of Physics, University of Edinburgh, James Clerk Maxwell Building
> > > > > > > > > > > > 
> > > > > > > > > > > > TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> > > > > > > > > > > > ========================================================================
> > > > > > > > > > > 