Hi Owen,

I think you will find that the problem was fixed when PNFS was properly
mounted on the doors. I think there is a thread on this list and in the
user-forum@dcache list that describes the solution.

Cheers,
Greig

On Mon, 12 Jun 2006, Owen Synge wrote:

> Hello all,
>
> Back in Sunny Oxfordshire, I am adding Patrick to this email as I think
> this may aid the process of finding this issue. Hopefully D-Cache can fix
> this issue upstream.
>
> On Thu, 25 May 2006 09:59:53 +0100
> "Brew, CAJ (Chris)" <[log in to unmask]> wrote:
>
> > Hi,
> >
> > The details are in the dcache user-forum but it looks to me like the
> > root cause is a combination of the many VOs (and their databases) and
> > having the gridftpdoor on the pool node.
> >
> > If I move the door to the admin node the problem goes away but from my
> > previous transfer tests that limits my rates to about 200 Mb/s rather
> > than the 400 Mb/s I can get with the door on the pool node. That handicap
> > will only get worse when I add another 6-8 servers and get the 10GigE
> > connection to the Tier 1.
> >
> > It looks like I've got three options:
> >
> > Run the door on the admin node and accept slow transfers
>
> I can't see this as a good long-term solution.
>
> > Run the door on the pool node and accept SFT failures (or try to get the
> > SFTs modified to wait between the upload and access)
>
> This seems a pragmatic workaround. After all, these are
> "Site Functional Tests": this bug may be found in production, but it is
> not a good indication of whether the site is functional. If service bugs
> are being caught, it should be a "functional regression test", in my
> opinion. RAL's ADS tape system, which has been in production for years,
> has exactly the same issue. This issue is part of the specs for an
> internal D-Cache bug. Of course, we can't let this issue fall away if we
> modify the site functional tests.
>
> > Try to reconfigure the dCache to have fewer databases (say one for each
> > of the major VOs then one for every 4-6 smaller VOs). Is it even
> > possible to eliminate databases like that? But then again I used to see
> > these errors occasionally even before I increased the number of VOs so
> > that probably won't be a complete fix.
>
> I think this is probably going to cause more pain and suffering in the
> long term. Until I went to DESY for the past two weeks, the D-Cache team
> were unaware that D-Cache setups contained as many as 24 VOs in a typical
> Tier 2 install. This also made the quota issue easier for them to
> understand.
>
> > Does anyone have any other ideas? The Tier 1 doesn't seem to suffer from
> > this problem despite supporting almost as many VOs and running
> > gridftpdoors on the pools? Have they split off pnfs to its own server?
> >
> > Thanks,
> > Chris.
>
> I know people did have ideas about decomposing the services better, with
> separate pnfs, pool and door nodes, but I understand this is a timing
> issue found in an unusual testing use case, and I wonder whether we
> should not simply escalate this bug and change the tests for now.
>
> Regards
>
> Owen
>
> > > -----Original Message-----
> > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > Sent: 24 May 2006 18:13
> > > To: Brew, CAJ (Chris)
> > > Cc: [log in to unmask]
> > > Subject: RE: dCache SFT Failures
> > >
> > > Hi Chris,
> > >
> > > I've just read your post on the user-forum. That's very
> > > interesting what you've found. Could we be seeing a scaling
> > > problem with dCache? I hadn't realised that you were
> > > supporting 24 VOs, each with their own database.
> > >
> > > I'll need to look into it, but there might be an option
> > > within pnfs that lets you control things like this.
> > >
> > > Greig
> > >
> > > On Wed, 24 May 2006, Brew, CAJ (Chris) wrote:
> > >
> > > > Hi Greig,
> > > >
> > > > I've just been running some tests and have a bit more info, which
> > > > I've just posted to the dcache user-forum. It looks like the file
> > > > info isn't getting into the pnfs databases quickly enough.
> > > >
> > > > That's probably why I'm failing the SFTs but haven't heard
> > > > complaints from users.
> > > >
> > > > I'm not sure where to take this from here unless I can tune the DB
> > > > to get the info in quicker.
> > > >
> > > > Yours,
> > > > Chris.
> > > >
> > > > > -----Original Message-----
> > > > > From: GRIDPP2: Deployment and support of SRM and local storage
> > > > > management [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> > > > > Sent: 24 May 2006 17:56
> > > > > To: [log in to unmask]
> > > > > Subject: Re: dCache SFT Failures
> > > > >
> > > > > Hi Chris,
> > > > >
> > > > > I see that you are still failing the SFTs. In fact, the situation
> > > > > seems worse than before!
> > > > >
> > > > > You are definitely using the correct pnfs mount options, aren't you?
> > > > > Have you tried rebooting the machine?
> > > > >
> > > > > Greig
> > > > >
> > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: GRIDPP2: Deployment and support of SRM and local storage
> > > > > > > management [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> > > > > > > Sent: 23 May 2006 12:14
> > > > > > > To: [log in to unmask]
> > > > > > > Subject: Re: dCache SFT Failures
> > > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > What are the permissions of the generated directory that the
> > > > > > > SFT is trying to write into?
> > > > > >
> > > > > > dteam001:dteam drwxr-xr-x
> > > > > >
> > > > > > As all the dteam directories appear to be.
> > > > > >
> > > > > > > What options are you using when mounting pnfs on pool nodes?
> > > > > >
> > > > > > Hmm, from /etc/mtab:
> > > > > >
> > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,hard,intr,noac,addr=130.246.47.204 0 0
> > > > > >
> > > > > > I had a problem earlier where the /fs filesystem hadn't mounted
> > > > > > and the doors weren't working on the pool node. I ended up fixing
> > > > > > it by putting it in /etc/fstab. I've remounted it with the same
> > > > > > options as the pnfsdoors:
> > > > > >
> > > > > > heplnx204.pp.rl.ac.uk:/pnfsdoors /pnfs/pp.rl.ac.uk nfs rw,addr=130.246.47.204 0 0
> > > > > > heplnx204.pp.rl.ac.uk:/fs /pnfs/fs nfs rw,addr=130.246.47.204 0 0
> > > > > >
> > > > > > Are the dCache filesystems in your fstab? What are the options?
> > > > > >
> > > > > > Thanks,
> > > > > > Chris.
> > > > > >
> > > > > > > Cheers,
> > > > > > > Greig
> > > > > > >
> > > > > > > On Tue, 23 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Removing the cron job doesn't seem to have solved the problem;
> > > > > > > > the load on the machine is pretty low. Are there any other
> > > > > > > > things I can try? My reliability is really low at the moment
> > > > > > > > because of this.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Greig A Cowan [mailto:[log in to unmask]]
> > > > > > > > > Sent: 22 May 2006 12:33
> > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > Cc: [log in to unmask]
> > > > > > > > > Subject: RE: dCache SFT Failures
> > > > > > > > >
> > > > > > > > > > Hmmm, yes there's an hourly cron (on the hour so it's
> > > > > > > > > > probably still running if the SFT gets through the queue
> > > > > > > > > > quickly) that du's the dCache area to get a per VO
> > > > > > > > > > breakdown of usage. I'll disable it and see if the SFT
> > > > > > > > > > pass rate improves.
> > > > > > > > >
> > > > > > > > > You could run the cron at half past the hour instead. Do you
> > > > > > > > > really need to run the cron every hour? The Tier-1 just run a
> > > > > > > > > similar command each night at 12pm.
> > > > > > > > >
> > > > > > > > > > p.s. Anyone know of another way of getting the information
> > > > > > > > > > (a query on the DB perhaps)?
> > > > > > > > >
> > > > > > > > > Unfortunately not. I asked about this, but it's not possible
> > > > > > > > > with dCache at the moment. It should be available in a future
> > > > > > > > > release...
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: GRIDPP2: Deployment and support of SRM and local storage
> > > > > > > > > > > management [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> > > > > > > > > > > Sent: 22 May 2006 12:15
> > > > > > > > > > > To: [log in to unmask]
> > > > > > > > > > > Subject: Re: dCache SFT Failures
> > > > > > > > > > >
> > > > > > > > > > > Hi Chris,
> > > > > > > > > > >
> > > > > > > > > > > I've seen this before, but it's unclear to me what causes it.
> > > > > > > > > > > Looking at your latest SFT failure (10:10), the lcg-cp command was
> > > > > > > > > > > successful, but the subsequent lcg-rep failed.
> > > > > > > > > > >
> > > > > > > > > > > Is there something else running on your dCache node which could be
> > > > > > > > > > > interfering with pnfs? Maybe a cron job of some sort?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Greig
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 22 May 2006, Brew, CAJ (Chris) wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm getting a lot of random failures in the SFTs from my dCache
> > > > > > > > > > > > where the write of the file to the dCache appears successful but
> > > > > > > > > > > > then when the SFT tries to read the file back you get:
> > > > > > > > > > > >
> > > > > > > > > > > > + lcg-cp -v --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722 file:///scratch/WMS_heplnx48_018249_https_3a_2f_2fgdrb02.cern.ch_3a9000_2fLxXmsliu9ehFjCWOYEcxQg/sft-lcg-rm-cp.txt
> > > > > > > > > > > > the server sent an error response: 553 553 Permission denied, reason: CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> > > > > > > > > > > >
> > > > > > > > > > > > lcg_cp: Permission denied
> > > > > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > > > >
> > > > > > > > > > > > It appears that the write was indeed successful because the same
> > > > > > > > > > > > SFT can later replicate it to CERN:
> > > > > > > > > > > >
> > > > > > > > > > > > Replicate the file from the default SE to castorgrid.cern.ch
> > > > > > > > > > > >
> > > > > > > > > > > > + lcg-rep -v --vo dteam -d castorgrid.cern.ch lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > >
> > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> > > > > > > > > > > > Using grid catalog type: lfc
> > > > > > > > > > > > Using grid catalog : prod-lfc-shared-central.cern.ch
> > > > > > > > > > > > Source URL: lfn:/grid/dteam/SFT/sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > File size: 233
> > > > > > > > > > > > VO name: dteam
> > > > > > > > > > > > Destination specified: castorgrid.cern.ch
> > > > > > > > > > > > Source URL for copy: gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > Destination URL for copy: gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > # streams: 1
> > > > > > > > > > > > # set timeout to 0
> > > > > > > > > > > >
> > > > > > > > > > > > Transfer took 2020 ms
> > > > > > > > > > > > Destination URL registered in LRC: sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > + result=0
> > > > > > > > > > > > + set +x
> > > > > > > > > > > >
> > > > > > > > > > > > List replicas to check if replication was really successful
> > > > > > > > > > > >
> > > > > > > > > > > > + lcg-lr --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> > > > > > > > > > > > sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/file8c15f735-de68-4949-aba5-33c9098462ff
> > > > > > > > > > > > srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/generated/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> > > > > > > > > > > > + set +x
> > > > > > > > > > > >
> > > > > > > > > > > > I was always getting a few of these but since I added extra VOs a
> > > > > > > > > > > > week ago I now seem to be failing between 30 and 50% of the SFT
> > > > > > > > > > > > runs with this alone.
> > > > > > > > > > > >
> > > > > > > > > > > > I haven't managed to replicate the error by copying files in and
> > > > > > > > > > > > out multiple times, and the SFT deletes the file so I cannot check
> > > > > > > > > > > > the status of the file I see the error with.
> > > > > > > > > > > >
> > > > > > > > > > > > Googling for the error seems to show that it's not uncommon but I
> > > > > > > > > > > > don't see any indications of cause or solution. There doesn't seem
> > > > > > > > > > > > to be anything in the logs.
> > > > > > > > > > > >
> > > > > > > > > > > > Anyone know what I can do about this (other than install DPM)?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Chris.
> > > > > > > > > > > >
> > > > > > > > > > > > Examples taken from:
> > > > > > > > > > > >
> > > > > > > > > > > > https://lcg-sft.cern.ch/sft/info/heplnx201.pp.rl.ac.uk/sft_2006-05-22_07.10.05.html#sft-lcg-rm_2006-05-22_07:22:49
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > ========================================================================
> > > > > > > > > > > Dr Greig A Cowan                         http://www.ph.ed.ac.uk/~gcowan1
> > > > > > > > > > > School of Physics, University of Edinburgh, James Clerk Maxwell Building
> > > > > > > > > > >
> > > > > > > > > > > TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> > > > > > > > > > > ========================================================================

--
========================================================================
Dr Greig A Cowan                         http://www.ph.ed.ac.uk/~gcowan1
School of Physics, University of Edinburgh, James Clerk Maxwell Building

TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
========================================================================