Hi Chris,

I've just read your post on the user-forum. What you've found is very
interesting. Could we be seeing a scaling problem with dCache? I hadn't
realised that you were supporting 24 VOs, each with their own database.

I'll need to look into it, but there might be an option within pnfs that 
lets you control things like this.
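
In the meantime, one way to put a number on the delay would be to time how
long a freshly written file takes to show up under the pnfs mount. The
following is only a rough sketch under my own assumptions: the function name
`wait_until_visible` and its defaults are made up for illustration, and it
only checks what the NFS client sees via `os.path.exists`, not the pnfs
database itself. On a mount without `noac` the client may also cache
attributes, which could skew the result.

```python
import os
import time


def wait_until_visible(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists on the local mount.

    Returns the number of seconds waited, or None if `timeout` seconds
    pass without the path appearing. This name and these defaults are
    hypothetical; they are not part of dCache or pnfs.
    """
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Calling it on the path of a file straight after the write finishes, and
logging the returned delay over a number of runs, might show whether the lag
grows with the number of VOs.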

Greig


On Wed, 24 May 2006, Brew, CAJ (Chris) wrote:

> Hi Greig,
> 
> I've just been running some tests and have a bit more info, which I've
> just posted to the dCache user-forum. It looks like the file info isn't
> getting into the pnfs databases quickly enough.
> 
> That's probably why I'm failing the SFTs but haven't heard complaints
> from users.
> 
> I'm not sure where to take this from here unless I can tune the DB to
> get the info in quicker.
> 
> Yours,
> Chris.
> 
> > -----Original Message-----
> > From: GRIDPP2: Deployment and support of SRM and local 
> > storage management [mailto:[log in to unmask]] On 
> > Behalf Of Greig A Cowan
> > Sent: 24 May 2006 17:56
> > To: [log in to unmask]
> > Subject: Re: dCache SFT Failures
> > 
> > Hi Chris,
> > 
> > I see that you are still failing the SFTs, in fact, the 
> > situation seems worse than before!
> > 
> > You are definitely using the correct pnfs mount options, 
> > aren't you? Have you tried rebooting the machine?
> > 
> > Greig
> > 
> > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > 
> > > Hi,
> > > 
> > > > -----Original Message-----
> > > > From: GRIDPP2: Deployment and support of SRM and local storage
> > > > management [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> > > > Sent: 23 May 2006 12:14
> > > > To: [log in to unmask]
> > > > Subject: Re: dCache SFT Failures
> > > > 
> > > > Hi Chris,
> > > > 
> > > > What are the permissions of the generated directory that the SFT is
> > > > trying to write into?
> > > 
> > > dteam001:dteam drwxr-xr-x
> > > 
> > > As all the dteam directories appear to be.
> > > 
> > > > What options are you using when mounting pnfs on pool nodes?
> > > 
> > > Hmm, from /etc/mtab:
> > > 
> > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,hard,intr,noac,addr=130.246.47.204 0 0
> > > 
> > > I had a problem earlier where the /fs filesystem hadn't mounted and
> > > the doors weren't working on the pool node. I ended up fixing it by
> > > putting it in /etc/fstab. I've remounted it with the same options as
> > > the pnfsdoors:
> > > 
> > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,addr=130.246.47.204 0 0
> > > 
> > > Are the dCache filesystems in your fstab? What are the options?
> > > 
> > > Thanks,
> > > Chris.
> > > 
> > > > Cheers,
> > > > Greig
> > > > 
> > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > Removing the cron job doesn't seem to have solved the problem; the
> > > > > load on the machine is pretty low. Is there anything else I can try?
> > > > > My reliability is really low at the moment because of this.
> > > > > 
> > > > > Thanks,
> > > > > Chris
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > > > > Sent: 22 May 2006 12:33
> > > > > > To: Brew, CAJ (Chris)
> > > > > > Cc: [log in to unmask]
> > > > > > Subject: RE: dCache SFT Failures
> > > > > > 
> > > > > > 
> > > > > > > Hmmm, yes there's an hourly cron (on the hour so it's probably still
> > > > > > > running if the SFT gets through the queue quickly) that du's the
> > > > > > > dCache area to get a per VO breakdown of usage. I'll disable it and
> > > > > > > see if the SFT pass rate improves.
> > > > > > 
> > > > > > You could run the cron at half past the hour instead. Do you really
> > > > > > need to run the cron every hour? The Tier-1 just run a similar
> > > > > > command each night at 12pm.
> > > > > > 
> > > > > > > p.s. Anyone know of another way of getting the information (A query on
> > > > > > > the DB perhaps)?
> > > > > > 
> > > > > > Unfortunately not. I asked about this, but it's not possible
> > > > > > with dCache at the moment. It should be available in a future
> > > > > > release...
> > > > > > 
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: GRIDPP2: Deployment and support of SRM and local storage
> > > > > > > > management [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> > > > > > > > Sent: 22 May 2006 12:15
> > > > > > > > To: [log in to unmask]
> > > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > > 
> > > > > > > > Hi Chris,
> > > > > > > > 
> > > > > > > > I've seen this before, but it's unclear to me what causes it.
> > > > > > > > Looking at your latest SFT failure (10:10), the lcg-cp command was
> > > > > > > > successful, but the subsequent lcg-rep failed.
> > > > > > > > 
> > > > > > > > Is there something else running on your dCache node which could be
> > > > > > > > interfering with pnfs? Maybe a cron job of some sort?
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > Greig
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Mon, 22 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > > 
> > > > > > > > > Hi All,
> > > > > > > > > 
> > > > > > > > > I'm getting a lot of random failures in the SFTs from my dCache where
> > > > > > > > > the write of the file to the dCache appears successful but then, when
> > > > > > > > > the SFT tries to read the file back, you get:
> > > > > > > > > 
> > > > > > > > > + lcg-cp -v --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722 file:///scratch/WMS_heplnx48_018249_https_3a_2f_2fgdrb02.cern.ch_3a9000_2fLxXmsliu9ehFjCWOYEcxQg/sft-lcg-rm-cp.txt
> > > > > > > > > the server sent an error response: 553 553 Permission denied, reason: CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> > > > > > > > > 
> > > > > > > > > lcg_cp: Permission denied
> > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > 
> > > > > > > > > It appears that the write was indeed successful because the same SFT
> > > > > > > > > can later replicate it to CERN:
> > > > > > > > > 
> > > > > > > > > Replicate the file from the default SE to castorgrid.cern.ch
> > > > > > > > > 
> > > > > > > > > + lcg-rep -v --vo dteam -d castorgrid.cern.ch lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > 
> > > > > > > > >             0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
> > > > > > > > >             0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
> > > > > > > > >             0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
> > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > Source URL: lfn:/grid/dteam/SFT/sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > File size: 233
> > > > > > > > > VO name: dteam
> > > > > > > > > Destination specified: castorgrid.cern.ch
> > > > > > > > > Source URL for copy: gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > Destination URL for copy: gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > # streams: 1
> > > > > > > > > # set timeout to 0
> > > > > > > > > 
> > > > > > > > > Transfer took 2020 ms
> > > > > > > > > Destination URL registered in LRC: sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > + result=0
> > > > > > > > > + set +x
> > > > > > > > > 
> > > > > > > > > List replicas to check if replication was really successful
> > > > > > > > > 
> > > > > > > > > + lcg-lr --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > + set +x
> > > > > > > > > 
> > > > > > > > > I was always getting a few of these but since I added extra VOs a week
> > > > > > > > > ago I now seem to be failing between 30 and 50% of the SFT runs with
> > > > > > > > > this alone.
> > > > > > > > > 
> > > > > > > > > I haven't managed to replicate the error by copying files in and out
> > > > > > > > > multiple times, and the SFT deletes the file so I cannot check the
> > > > > > > > > status of the file the error occurred with.
> > > > > > > > > 
> > > > > > > > > Googling for the error seems to show that it's not uncommon but I
> > > > > > > > > don't see any indications of cause or solution. There doesn't seem to
> > > > > > > > > be anything in the logs.
> > > > > > > > > 
> > > > > > > > > Anyone know what I can do about this (other than install DPM)?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Chris.
> > > > > > > > > 
> > > > > > > > > Examples taken from:
> > > > > > > > > 
> > > > > > > > > https://lcg-sft.cern.ch/sft/info/heplnx201.pp.rl.ac.uk/sft_2006-05-22_07.10.05.html#sft-lcg-rm_2006-05-22_07:22:49
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > ========================================================================
> > > > > > > > Dr Greig A Cowan                         http://www.ph.ed.ac.uk/~gcowan1
> > > > > > > > School of Physics, University of Edinburgh, James Clerk Maxwell Building
> > > > > > > > 
> > > > > > > > TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> > > > > > > > ========================================================================
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
> 

-- 
 =======================================================================
Dr Greig A Cowan                         http://www.ph.ed.ac.uk/~gcowan1
School of Physics, University of Edinburgh, James Clerk Maxwell Building

TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
 =======================================================================