For your delight and delectation:
ANALY_BHAM: Bad. CE misconfigured and not accepting Role=pilot pilots?
ANALY_CAM: Bad. SE broken, "pilot: Get error: Copy command returned
error code 256 and output: globus_ftp_client: the server responded
with an error 530 Login incorrect. : Could not get virtual id!". Check
/etc/sysconfig/dpm-gsiftpd has DPM/DPNS hosts properly defined (we got
burned at Glasgow by that last week, DPM 1.7.2 cares about this).
ANALY_GLASGOW: Bad. DPM SRMv2.2 daemon fell over at 0100UT, restarted
at 0530UT, now fine. Damn! Lost many slots to LHCb so we'll probably
struggle to get the running numbers back up...
ANALY_LANCS: Fair. No major issues - not many jobs running though.
ANALY_LIV: Excellent.
ANALY_MANC1: Bad. Build job problems, which seem to be on stage-out
after the code compiles, e.g.,
http://panda.cern.ch:25980/server/pandamon/query?job=1016502474.
ANALY_MANC2: Fair. Running out of local stage space? "Error details:
pilot: Too little space left on local disk to run job: 2050048 kB
(need > 2097152 kB)" - maybe need to clean up disks or run fewer jobs
on these nodes? Otherwise ok.
ANALY_OX: Good, but again running short on local disk space: "Error
details: pilot: Too little space left on local disk to run job:
1555456 kB (need > 2097152 kB)"
ANALY_QMUL: Good, but some files not available in Lustre, first time I
have seen that. Lots of examples, e.g.,
http://panda.cern.ch:25980/server/pandamon/query?job=1016616476
22 Jul 2009 08:48:13| Mover is preparing to copy file #1/2 (lfn: AOD.069635._00001.pool.root.1 guid: DC096BD2-4651-DE11-92CF-0015C5E3EB36)
22 Jul 2009 08:48:13| gpfn=srm://se03.esc.qmul.ac.uk/atlas/atlasmcdisk/mc08/AOD/mc08.107363.AlpgenQcdbbJ2Np3_SUSYfilt_pt20.merge.AOD.e389_s462_r635_t53_tid069635/AOD.069635._00001.pool.root.1
22 Jul 2009 08:48:13| Extracted fsize: 67253098 fchecksum: 9585ef09 for guid: DC096BD2-4651-DE11-92CF-0015C5E3EB36
22 Jul 2009 08:48:13| csumtype: adler32, checksum: 9585ef09, fsize: 67253098
22 Jul 2009 08:48:13| Copying srm://se03.esc.qmul.ac.uk/atlas/atlasmcdisk/mc08/AOD/mc08.107363.AlpgenQcdbbJ2Np3_SUSYfilt_pt20.merge.AOD.e389_s462_r635_t53_tid069635/AOD.069635._00001.pool.root.1 to /scratch/tmp/condorg_kSY14560/pilot3/Panda_Pilot_14940_1248252462/PandaJob_1016616476_1248252462 (LRC checksum: \"9585ef09\", fsize: 67253098) using storm ()
22 Jul 2009 08:48:13| Calling get_data in Mover
22 Jul 2009 08:48:13| Using envsetup since envsetupin is not set
22 Jul 2009 08:48:13| Proxy verification turned off
22 Jul 2009 08:48:13| Executing command: export X509_USER_PROXY=/tmp/globus-tmp.cn463.12378.0; lcg-gt srm://se03.esc.qmul.ac.uk/atlas/atlasmcdisk/mc08/AOD/mc08.107363.AlpgenQcdbbJ2Np3_SUSYfilt_pt20.merge.AOD.e389_s462_r635_t53_tid069635/AOD.069635._00001.pool.root.1 file
22 Jul 2009 08:48:24| Command finished after 11.540000 s
22 Jul 2009 08:48:24| Creating link from /mnt/lustre_0/storm_3/atlas/atlasmcdisk/mc08/AOD/mc08.107363.AlpgenQcdbbJ2Np3_SUSYfilt_pt20.merge.AOD.e389_s462_r635_t53_tid069635/AOD.069635._00001.pool.root.1 to /scratch/tmp/condorg_kSY14560/pilot3/Panda_Pilot_14940_1248252462/PandaJob_1016616476_1248252462/AOD.069635._00001.pool.root.1
Ah, they all fail on cn463, so it looks like Lustre got unmounted on that node
(http://panda.cern.ch:25980/server/pandamon/query?overview=wnlist&type=analysis&hours=24&site=ANALY_QMUL&reload=yes).
Write a nagios test :-)
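
For what it's worth, a rough sketch of what such a test could look like, in Python. It only checks that the expected mount point shows up in /proc/mounts with filesystem type "lustre" and maps the result onto the usual Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The mount point /mnt/lustre_0 is a guess taken from the link path in the log above, and the function names are made up for illustration; this is not an existing plugin.

```python
OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def parse_mounts(text):
    """Return {mount_point: fs_type} from /proc/mounts-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # fields: device, mount point, filesystem type, options, ...
            mounts[fields[1]] = fields[2]
    return mounts


def check_mount(mounts_text, mount_point, fs_type="lustre"):
    """Return (exit_code, message) for one expected mount."""
    found = parse_mounts(mounts_text).get(mount_point)
    if found is None:
        return CRITICAL, "CRITICAL: %s is not mounted" % mount_point
    if found != fs_type:
        return CRITICAL, "CRITICAL: %s mounted as %s, expected %s" % (
            mount_point, found, fs_type)
    return OK, "OK: %s mounted (%s)" % (mount_point, fs_type)


def main(argv):
    # Assumed default mount point; pass the real one as the first argument.
    mount_point = argv[1] if len(argv) > 1 else "/mnt/lustre_0"
    try:
        with open("/proc/mounts") as handle:
            text = handle.read()
    except OSError as exc:
        print("UNKNOWN: cannot read /proc/mounts: %s" % exc)
        return UNKNOWN
    status, message = check_mount(text, mount_point)
    print(message)
    return status

# To run it as a plugin, end the script with:
#   import sys; sys.exit(main(sys.argv))
```

A mounted-but-hung filesystem would still pass this, so a real test would probably also want to stat a known file with a timeout.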
ANALY_RALPP: Good. A real user is running jobs that crash athena, which
accounts for most of the errors. Will change to dccp staging today.
ANALY_RHUL: Fair. Quite a lot of stage errors: "Error details: pilot:
Get error: rfcp failed: 512,
/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasmcdisk/mc08/AOD/mc08.106048.PythiaB_cce5X.merge.AOD.e401_s462_r635_t53_tid065339/AOD.065339._00013.pool.root.1__DQ2-1242825789
: No route to host" and so on.
ANALY_SHEF: Fair. Stage in/out problems and some LFC issues
(SE/network overloaded now?).
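
On the local disk errors at ANALY_MANC2 and ANALY_OX above: sites could run the same sort of check on their worker nodes before the pilot does. The sketch below is only an illustration, not the pilot's actual code. It assumes the pilot's "kB" means 1024 bytes (2097152 kB is then exactly 2 GiB) and that the comparison is strictly "greater than", as the error message suggests. The function names are invented.

```python
import shutil

REQUIRED_KB = 2097152  # threshold quoted in the pilot error messages


def free_kb(path):
    """Free space on the filesystem holding `path`, in units of 1024 bytes."""
    return shutil.disk_usage(path).free // 1024


def has_enough_space(available_kb, required_kb=REQUIRED_KB):
    """Mirror the 'need > N kB' wording: exactly N kB is not enough."""
    return available_kb > required_kb


# The two values reported in the errors above.
for site, available in [("ANALY_MANC2", 2050048), ("ANALY_OX", 1555456)]:
    if has_enough_space(available):
        print("%s: ok" % site)
    else:
        print("%s: short by %d kB" % (site, REQUIRED_KB - available))
```

With the numbers from the errors, MANC2 is only 47104 kB (46 MB) under the limit, while OX is 541696 kB (about 529 MB) short, so a modest clean-up may be enough at MANC2. On a node, has_enough_space(free_kb("/scratch")) would do the live check, with the scratch path adjusted to the local setup.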
Definitions:
Excellent: 95%+
Good: 90-95%
Fair: 80-90%
Bad: < 80%
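
In case anyone wants to apply the same bands automatically, here is a minimal sketch. The bands overlap at 90% and 95% as written, so the code makes an assumption: each boundary value goes to the higher rating (95 is Excellent, 90 is Good, 80 is Fair). The function name is made up.

```python
def rating(percent):
    """Map a percentage onto the bands listed above.

    Assumption: boundary values belong to the higher band.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percentage out of range: %r" % (percent,))
    if percent >= 95:
        return "Excellent"
    if percent >= 90:
        return "Good"
    if percent >= 80:
        return "Fair"
    return "Bad"


for value in (97.2, 95, 90, 85.5, 79.9):
    print("%5.1f%% -> %s" % (value, rating(value)))
```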
--
Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
Department of Physics and Astronomy, University of Glasgow, Scotland
DEATH TO MEETINGS!