Hi Chris,
I've seen this before, but it's unclear to me what causes it. Looking at
your latest SFT failure (10:10), the lcg-cp command was successful, but
the subsequent lcg-rep failed.
Is there something else running on your dCache node which could be
interfering with pnfs? Maybe a cron job of some sort?
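A rough sketch of one way to check this, assuming a standard Linux cron layout (/etc/crontab, /etc/cron.d, /etc/cron.hourly, /etc/cron.daily, /var/spool/cron). The function names and the optional root-directory argument are hypothetical, added only for illustration, and nothing here is specific to dCache:

```shell
#!/bin/sh
# List the cron definition files found under a given root directory.
# The root argument is hypothetical (defaults to the real filesystem root);
# it only exists so the sketch can be tried against a scratch directory.
list_cron_files() {
    root="${1:-}"
    for d in etc/crontab etc/cron.d etc/cron.hourly etc/cron.daily var/spool/cron; do
        path="$root/$d"
        [ -e "$path" ] || continue
        # Works for both a single file (etc/crontab) and a directory.
        find "$path" -type f 2>/dev/null
    done
}

# Print the non-comment lines in those cron files that mention a pattern,
# e.g. "pnfs", prefixed with the file they came from.
grep_cron_for() {
    pattern="$1"
    root="${2:-}"
    list_cron_files "$root" | while IFS= read -r f; do
        grep -H -i -- "$pattern" "$f" 2>/dev/null \
            | grep -v '^[^:]*:[[:space:]]*#' || true
    done
}
```

Run as root on the dCache node, `grep_cron_for pnfs` would show any scheduled job that touches the pnfs tree, and the times in those entries can then be compared with the times of the failing SFT runs.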
Cheers,
Greig
On Mon, 22 May 2006, Brew, CAJ (Chris) wrote:
> Hi All,
>
> I'm getting a lot of random failures in the SFTs from my dCache where
> the write of the file to the dCache appears successful but then when the
> SFT tries to read the file back you get:
>
> + lcg-cp -v --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> file:///scratch/WMS_heplnx48_018249_https_3a_2f_2fgdrb02.cern.ch_3a9000_
> 2fLxXmsliu9ehFjCWOYEcxQg/sft-lcg-rm-cp.txt
> the server sent an error response: 553 553 Permission denied, reason:
> CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
>
> lcg_cp: Permission denied
> Using grid catalog type: lfc
> Using grid catalog : prod-lfc-shared-central.cern.ch
>
> It appears that the write was indeed successful because the same SFT can
> later replicate it to CERN:
>
> Replicate the file from the default SE to castorgrid.cern.ch
>
> + lcg-rep -v --vo dteam -d castorgrid.cern.ch
> lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
>
> 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
> Using grid catalog type: lfc
> Using grid catalog : prod-lfc-shared-central.cern.ch
> Source URL:
> lfn:/grid/dteam/SFT/sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> File size: 233
> VO name: dteam
> Destination specified: castorgrid.cern.ch
> Source URL for copy:
> gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/dteam/generat
> ed/2006-05-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> Destination URL for copy:
> gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-
> 22/file8c15f735-de68-4949-aba5-33c9098462ff
> # streams: 1
> # set timeout to 0
>
> Transfer took 2020 ms
> Destination URL registered in LRC:
> sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/
> file8c15f735-de68-4949-aba5-33c9098462ff
> + result=0
> + set +x
>
> List replicas to check if replication was really successful
>
> + lcg-lr --vo dteam lfn:sft-lcg-rm-cr-heplnx48.pp.rl.ac.uk.0605220722
> sfn://castorgrid.cern.ch/castor/cern.ch/grid/dteam/generated/2006-05-22/
> file8c15f735-de68-4949-aba5-33c9098462ff
> srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/generated/2006-0
> 5-22/file330985b9-5368-4e67-82ec-5ee6f6fd4fa8
> + set +x
>
> I was always getting a few of these, but since I added extra VOs a week
> ago I now seem to be failing between 30 and 50% of the SFT runs with this
> alone.
>
> I haven't managed to reproduce the error by copying files in and out
> multiple times, and the SFT deletes the file, so I cannot check the status
> of the file that triggered the error.
>
> Googling for the error seems to show that it's not uncommon, but I don't
> see any indications of cause or solution. There doesn't seem to be
> anything in the logs.
>
> Anyone know what I can do about this (other than install DPM)?
>
> Thanks,
> Chris.
>
> Examples taken from:
>
> https://lcg-sft.cern.ch/sft/info/heplnx201.pp.rl.ac.uk/sft_2006-05-22_07
> .10.05.html#sft-lcg-rm_2006-05-22_07:22:49
>
--
=======================================================================
Dr Greig A Cowan http://www.ph.ed.ac.uk/~gcowan1
School of Physics, University of Edinburgh, James Clerk Maxwell Building
TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
=======================================================================