Acknowledged, and thanks
Owen
On Mon, 12 Jun 2006 13:20:35 +0100
"Brew, CAJ \(Chris\)" <[log in to unmask]> wrote:
> Hi Owen,
>
> We eventually tracked down the cause of the issue: it was the mount
> options on the pnfsdoors mount on the gridftp door node. It wasn't
> mounted with the "noac" option, so it was caching some of the (p)nfs
> info. Hence the file created a few seconds before wasn't there when the
> gridftp door looked for it, which explains the (correct) error.
>
> What we're not sure of is what's mounting the pnfsdoors area and how we
> change the mount options. I've put it in /etc/fstab for now so it will
> be mounted before the d-cache services start which solves the problem
> but may not be the best solution.
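
For what it's worth, a quick way to spot this kind of thing is to scan
/etc/mtab for NFS mounts that lack "noac". A minimal sketch in Python,
assuming the usual mtab field layout; the function name is only
illustrative, and the sample entries are the ones quoted further down this
thread:

```python
def nfs_mounts_missing_option(mtab_text, option="noac"):
    """Return (device, mount_point) for each NFS mount lacking `option`."""
    missing = []
    for line in mtab_text.splitlines():
        fields = line.split()
        # mtab/fstab layout: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        if option not in options:
            missing.append((fields[0], fields[1]))
    return missing


# Sample entries copied from the /etc/mtab output quoted later in this thread.
SAMPLE_MTAB = """\
heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,hard,intr,noac,addr=130.246.47.204 0 0
"""

if __name__ == "__main__":
    for device, mount_point in nfs_mounts_missing_option(SAMPLE_MTAB):
        print(f"{device} on {mount_point} is mounted without noac")
```

On a real node you would pass in the contents of /etc/mtab rather than the
sample text.
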
>
> Yours,
> Chris.
>
>
>
> > -----Original Message-----
> > From: Owen Synge [mailto:[log in to unmask]]
> > Sent: 12 June 2006 13:11
> > To: Brew, CAJ (Chris); Patrick Fuhrmann
> > Cc: [log in to unmask]
> > Subject: Re: dCache SFT Failures
> >
> > Hello all,
> >
> > Back in Sunny Oxfordshire, I am adding Patrick to this email
> > as I think this may aid the process of finding this issue.
> > Hopefully D-Cache can fix this issue upstream.
> >
> >
> > On Thu, 25 May 2006 09:59:53 +0100
> > "Brew, CAJ (Chris)" <[log in to unmask]> wrote:
> >
> > > Hi,
> > >
> > > The details are in the dcache user-forum, but it looks to me like the
> > > root cause is a combination of the many VOs (and their databases) and
> > > having the gridftpdoor on the pool node.
> > >
> > > If I move the door to the admin node the problem goes away, but from my
> > > previous transfer tests that limits my rates to about 200 Mb/s rather
> > > than the 400 Mb/s I can get with the door on the pool node. That handicap
> > > will only get worse when I add another 6-8 servers and get the 10GigE
> > > connection to the Tier 1.
> > >
> > > It looks like I've got three options:
> > >
> > > Run the door on the admin node and accept slow transfers
> >
> > I can't see this as a good long-term solution.
> >
> > > Run the door on the pool node and accept SFT failures (or try to get the
> > > SFTs modified to wait between the upload and access)
> >
> > This seems a pragmatic workaround, provided the site is functional;
> > after all, these are "Site Functional Tests". This bug may be found
> > in production, but it is not a good indication of functionality. If
> > service bugs are being caught, it should be a "functional regression
> > test", in my opinion. RAL's ADS tape system, which has been in
> > production for years, has exactly the same issue. This issue is part
> > of the specs for an internal D-Cache bug. Of course, we can't let
> > this issue fall away if we modify the site functional tests.
> >
> > > Try to reconfigure the dCache to have fewer databases (say one for each
> > > of the major VOs then one for every 4-6 smaller VOs). Is it even
> > > possible to eliminate databases like that? But then again I used to see
> > > these errors occasionally even before I increased the number of VOs so
> > > that probably won't be a complete fix.
> >
> > I think this is probably going to cause more pain and
> > suffering in the long term. Until I went to DESY for the past
> > two weeks, the D-Cache team were unaware that D-Cache setups
> > contained as many as 24 VOs in a typical Tier 2 install. This
> > also made the quota issue easier for them to understand.
> >
> > > Does anyone have any other ideas? The Tier 1 doesn't seem to suffer from
> > > this problem despite supporting almost as many VOs and running
> > > gridftpdoors on the pools. Have they split off pnfs to its own server?
> > >
> > > Thanks,
> > > Chris.
> >
> > I know people did have ideas about decomposing the services
> > better with separate pnfs, pool and door nodes, but I
> > understand this is a timing issue, found in an unusual
> > testing use case and am unsure if we should not just escalate
> > this bug and change the tests for now.
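
To make the "wait between the upload and access" idea concrete, here is a
rough sketch in Python of a read-back step that retries after a pause. It
is not how the SFTs are actually written: read_file is a hypothetical
callable standing in for the read-back (for example a wrapper around
lcg-cp), and the attempt count and delay are made-up defaults. It only
papers over the delay and does not fix the underlying bug.

```python
import time


def read_with_retry(read_file, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call read_file() until it succeeds or the attempts run out.

    read_file is a hypothetical callable for the read-back step; it is
    expected to raise OSError when the file cannot be read yet.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return read_file()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                # Give the name space time to catch up before trying again.
                sleep(delay_seconds)
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

The sleep function is passed in as a parameter so the waiting can be
replaced with a stub when trying the logic out.
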
> >
> >
> > Regards
> >
> > Owen
> >
> >
> > >
> > > > -----Original Message-----
> > > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > > Sent: 24 May 2006 18:13
> > > > To: Brew, CAJ (Chris)
> > > > Cc: [log in to unmask]
> > > > Subject: RE: dCache SFT Failures
> > > >
> > > >
> > > > Hi Chris,
> > > >
> > > > I've just read your post on the user-forum. That's very
> > > > interesting what you've found. Could we be seeing a scaling
> > > > problem with dCache? I hadn't realised that you were
> > > > supporting 24 VOs, each with their own database.
> > > >
> > > > I'll need to look into it, but there might be an option
> > > > within pnfs that lets you control things like this.
> > > >
> > > > Greig
> > > >
> > > >
> > > > On Wed, 24 May 2006, Brew, CAJ (Chris) wrote:
> > > >
> > > > > > Hi Greig,
> > > > > >
> > > > > > I've just been running some tests and have a bit more info, which I've
> > > > > > just posted to the dcache user-forum. It looks like the file info isn't
> > > > > > getting into the pnfs databases quickly enough.
> > > > > >
> > > > > > That's probably why I'm failing the SFTs but haven't heard complaints
> > > > > > from users.
> > > > > >
> > > > > > I'm not sure where to take this from here unless I can tune the DB to
> > > > > > get the info in quicker.
> > > > >
> > > > > Yours,
> > > > > Chris.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: GRIDPP2: Deployment and support of SRM and
> > local storage
> > > > > > management [mailto:[log in to unmask]] On
> > > > Behalf Of Greig
> > > > > > A Cowan
> > > > > > Sent: 24 May 2006 17:56
> > > > > > To: [log in to unmask]
> > > > > > Subject: Re: dCache SFT Failures
> > > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > > I see that you are still failing the SFTs. In fact, the situation
> > > > > > > seems worse than before!
> > > > > > >
> > > > > > > You are definitely using the correct pnfs mount options, aren't you?
> > > > > > > Have you tried rebooting the machine?
> > > > > >
> > > > > > Greig
> > > > > >
> > > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: GRIDPP2: Deployment and support of SRM and
> > > > local storage
> > > > > > > > management [mailto:[log in to unmask]] On
> > > > > > Behalf Of Greig
> > > > > > > > A Cowan
> > > > > > > > Sent: 23 May 2006 12:14
> > > > > > > > To: [log in to unmask]
> > > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > >
> > > > > > > > > What are the permissions of the generated directory that the SFT is
> > > > > > > > > trying to write into?
> > > > > > >
> > > > > > > dteam001:dteam drwxr-xr-x
> > > > > > >
> > > > > > > As all the dteam directories appear to be.
> > > > > > >
> > > > > > > > > What options are you using when mounting pnfs on pool nodes?
> > > > > > > >
> > > > > > > > Hmm, from /etc/mtab:
> > > > > > > >
> > > > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,hard,intr,noac,addr=130.246.47.204 0 0
> > > > > > > >
> > > > > > > > I had a problem earlier where the /fs filesystem hadn't mounted
> > > > > > > > and the doors weren't working on the pool node. I ended up fixing
> > > > > > > > it by putting it in /etc/fstab. I've remounted it with the same
> > > > > > > > options as the pnfsdoors:
> > > > > > > >
> > > > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,addr=130.246.47.204 0 0
> > > > > > > >
> > > > > > > > Are the dCache filesystems in your fstab? What are the options?
> > > > > > > Thanks,
> > > > > > > Chris.
> > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Greig
> > > > > > > >
> > > > > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > > Removing the cron job doesn't seem to have solved the problem; the
> > > > > > > > > > load on the machine is pretty low. Are there any other things I can
> > > > > > > > > > try? My reliability is really low at the moment because of this.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > > > > > > > > Sent: 22 May 2006 12:33
> > > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > > Cc: [log in to unmask]
> > > > > > > > > > Subject: RE: dCache SFT Failures
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Hmmm, yes there's an hourly cron (on the hour, so it's probably still
> > > > > > > > > > > > running if the SFT gets through the queue quickly) that du's the
> > > > > > > > > > > > dCache area to get a per VO breakdown of usage. I'll disable
> > > > > > > > > > > > it and see if the SFT pass rate improves.
> > > > > > > > > >
> > > > > > > > > > > You could run the cron at half past the hour instead. Do you really
> > > > > > > > > > > need to run the cron every hour? The Tier-1 just run a similar
> > > > > > > > > > > command each night at 12pm.
> > > > > > > > > > >
> > > > > > > > > > > > p.s. Anyone know of another way of getting the information (a query on
> > > > > > > > > > > > the DB perhaps)?
> > > > > > > > > > >
> > > > > > > > > > > Unfortunately not. I asked about this, but it's not possible
> > > > > > > > > > > with dCache at the moment. It should be available in a future release...
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: GRIDPP2: Deployment and support of SRM and
> > > > > > > > local storage
> > > > > > > > > > > > management
> > [mailto:[log in to unmask]] On
> > > > > > > > > > Behalf Of Greig
> > > > > > > > > > > > A Cowan
> > > > > > > > > > > > Sent: 22 May 2006 12:15
> > > > > > > > > > > > To: [log in to unmask]
> > > > > > > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Chris,
> > > > > > > > > > > >
> > > > > > > > > > > > > I've seen this before, but it's unclear to me what causes it.
> > > > > > > > > > > > > Looking at your latest SFT failure (10:10), the lcg-cp command was
> > > > > > > > > > > > > successful, but the subsequent lcg-rep failed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is there something else running on your dCache node which could be
> > > > > > > > > > > > > interfering with pnfs? Maybe a cron job of some sort?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Greig
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 22 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm getting a lot of random failures in the SFTs from my dCache where
> > > > > > > > > > > > > > the write of the file to the dCache appears successful but then when
> > > > > > > > > > > > > > the SFT tries to read the file back you get:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > + lcg-cp -v --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722 file:///scratch/WMS_heplnx48_018249_https_3a_2f_2fgdrb02.cern.ch_3a9000_2fLxXmsliu9ehFjCWOYEcxQg/sft-lcg-rm-cp.txt
> > > > > > > > > > > > > > the server sent an error response: 553 553 Permission denied, reason: CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > lcg_cp: Permission denied
> > > > > > > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > > > > >
> > > > > > > > > > > > > > It appears that the write was indeed successful because the same SFT
> > > > > > > > > > > > > > can later replicate it to CERN:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Replicate the file from the default SE to castorgrid.cern.ch
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > + lcg-rep -v --vo dteam -d castorgrid.cern.ch lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > > > > > > Source URL: lfn:/grid/dteam/SFT/sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > > > File size: 233
> > > > > > > > > > > > > > VO name: dteam
> > > > > > > > > > > > > > Destination specified: castorgrid.cern.ch
> > > > > > > > > > > > > > Source URL for copy: gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > > > Destination URL for copy: gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > > > # streams: 1
> > > > > > > > > > > > > > # set timeout to 0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Transfer took 2020 ms
> > > > > > > > > > > > > > Destination URL registered in LRC: sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > > > + result=0
> > > > > > > > > > > > > > + set +x
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > List replicas to check if replication was really successful
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > + lcg-lr --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > > > srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > > > + set +x
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I was always getting a few of these, but since I added extra VOs a week
> > > > > > > > > > > > > > ago I now seem to be failing between 30 and 50% of the SFT runs with this
> > > > > > > > > > > > > > alone.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I haven't managed to replicate the error by copying files in and out
> > > > > > > > > > > > > > multiple times, and the SFT deletes the file, so I cannot check the
> > > > > > > > > > > > > > status of the file that the error occurred with.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Googling for the error seems to show that it's not uncommon, but I
> > > > > > > > > > > > > > don't see any indications of cause or solution. There doesn't seem to
> > > > > > > > > > > > > > be anything in the logs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Anyone know what I can do about this (other than install DPM)?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Chris.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Examples taken from:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://lcg-sft.cern.ch/sft/info/heplnx201.pp.rl.ac.uk/sft_2006-05-22_07.10.05.html#sft-lcg-rm_2006-05-22_07:22:49
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > ========================================================================
> > > > > > > > > > > > > Dr Greig A Cowan
> > > > > > > > > > > > > http://www.ph.ed.ac.uk/~gcowan1
> > > > > > > > > > > > > School of Physics, University of Edinburgh, James Clerk Maxwell Building
> > > > > > > > > > > > >
> > > > > > > > > > > > > TIER-2 STORAGE SUPPORT PAGES:
> > > > > > > > > > > > > http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> > > > > > > > > > > > > ========================================================================
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > > --
> > > > ==============================================================
> > > > ==========
> > > > Dr Greig A Cowan
> > > > http://www.ph.ed.ac.uk/~gcowan1
> > > > School of Physics, University of Edinburgh, James Clerk
> > > > Maxwell Building
> > > >
> > > > TIER-2 STORAGE SUPPORT PAGES:
> > > > http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> > > > ==============================================================
> > > > ==========
> > > >
> >