17% of jobs failed at Sheffield, mostly with errors like this:
Error details: pilot: Copy command returned error code 256 and output:
/opt/lcg/bin/lcg-cr lcg_util-1.7.2-1 GFAL-client-1.11.4-1 Using grid
catalog type: lfc Using grid catalog : lfc0448.gridpp.rl.ac.uk SE type:
SRMv2 Destination SURL : srm://lcgse0.shef.ac.uk:8446/srm/ma
There are also "Staging input file failed" and "Failed to get LFC
replicas" errors.
All of these errors are related to network load. When we tried to reproduce
them, even when the system was loaded, the commands that had failed while the
jobs were running always completed successfully.
Errors of this kind also appeared for production jobs during STEP09 and
simply went away when the cluster was not heavily loaded.
Any ideas about what could lead to these errors would be greatly appreciated.
We have 100 WNs with 2 CPUs and 1G LAN. I also turned OFF the RFIO buffer
during the last days of STEP09.
Cheers
Elena
On Wed, 24 Jun 2009, Graeme Stewart wrote:
> So, a quick look here:
>
> http://panda.cern.ch:25980/server/pandamon/query?dash=analysis
>
> suggests
>
> Lancaster: Some internal storage problems:
>
> http://voatlas19.cern.ch:25980/server/pandamon/query?job=1012759079
> "Get error: rfcp failed: 512,
> /dpm/lancs.ac.uk/home/atlas/atlasmcdisk/mc08/AOD/mc08.106453.AMSB4_jimmy_susy.merge.AOD.e357_s462_r635_t53_tid068283/AOD.068283._00001.pool.root.1
> : No route to host"
>
>
> Manc-2: Some WNs short of scratch space:
>
> http://voatlas19.cern.ch:25980/server/pandamon/query?job=1012742162
> Too little space left on local disk to run job: 2051072 kB (need > 2097152 kB)
>
>
> Oxford: Some problems I don't understand
>
> http://panda.cern.ch:25980/server/pandamon/query?mode=archive&type=analysis&computingSite=ANALY_OX&jobStatus=failed&hours=24
>
> "task buffer expired"
>
> Are their analysis pilots running?
>
>
> Sheffield: LFC lookup problems and stage-in/out problems (network issues?):
>
> http://panda.cern.ch:25980/server/pandamon/query?job=1012762913
> http://panda.cern.ch:25980/server/pandamon/query?job=1012750235
> http://panda.cern.ch:25980/server/pandamon/query?job=1012742714
>
>
> Glasgow is ahead for now, but QMUL is coming up fast with their
> supercharged lustre system....
>
> Graeme
>
> On Wed, Jun 24, 2009 at 11:38, Daniel van der Ster <[log in to unmask]> wrote:
>> Test 479 set to start at 12:00 today.
>> Cheers,
>> Dan
>>
>>
>> 2009/6/24 Graeme Stewart <[log in to unmask]>:
>>> Hi Dan/Johannes
>>>
>>> We want to test file:/// access at QMUL, which is now setup in panda.
>>> Could you start a panda hammercloud cloud for the UK, to last until
>>> midnight tonight? This should allow the site(s) to do a good sweep
>>> through the number of running jobs in their systems.
>>>
>>> I'm anticipating general interest so please send to all the UK ANALY
>>> queues. (Any site which is in a bad shape for testing can shut off
>>> pilots or apply severe batch system limits.)
>>>
>>> Thanks
>>>
>>> Graeme
>>>
>>> PS. Sorry for the short notice, but RAL LFC is down tomorrow, so we'd
>>> like to get one test done today.
>>>
>>> --
>>> Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
>>> Department of Physics and Astronomy, University of Glasgow, Scotland
>>> DEATH TO MEETINGS!
>>>
>>
>
>
>
> --
> Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
> Department of Physics and Astronomy, University of Glasgow, Scotland
> DEATH TO MEETINGS!
>
____________________________________________________________________________
Dr Elena Korolkova
Email: [log in to unmask]
Tel.: +44 (0)114 2223553
Fax: +44 (0)114 2223555
Department of Physics and Astronomy
University of Sheffield
Sheffield, S3 7RH, United Kingdom