Hi
Just to clarify that
WARNING: [IDLE->Cancelled/Purged [timeout/dropped]]
360 min timeout in 'IDLE' exceeded. Cancelling the job.
is not necessarily a problem. It means that the job stayed in the queue for more than 6 hours and was then cancelled by the Nagios instance. Nagios submits t2k jobs as a normal user, so it does not have any priority over other t2k jobs.
Cheers
Kashif
________________________________________
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Christopher J. Walker [[log in to unmask]]
Sent: Monday, February 11, 2013 3:19 PM
To: [log in to unmask]
Subject: T2K.org monitoring
T2K have recently reported job failures, but neither I, nor they, have
had the time to chase this.
However, it is perhaps a good opportunity to make use of the monitoring
provided and make sure the problem is not at our end. Can sites please
have a quick look at:
https://t2wlcgnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=VO_t2k.org&style=detail
I see problems with Manchester, Imperial, RalPP, Sheffield and Oxford:
ce02.tier2.hep.manchester.ac.uk
WARNING: [IDLE->Cancelled/Purged [timeout/dropped]]
gfe02.grid.hep.ph.ic.ac.uk
CRITICAL: METRIC FAILED [org.sam.SRM-Put]: CRITICAL: File was NOT copied to SRM.
heplnx204.pp.rl.ac.uk
CRITICAL: METRIC FAILED [org.sam.SRM-Put]: CRITICAL: File was NOT copied to SRM. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]]
lcgce2.shef.ac.uk
CRITICAL: ABORTED
lcgce3.shef.ac.uk
CRITICAL: ABORTED
t2ce02.physics.ox.ac.uk
WARNING: [IDLE->Cancelled/Purged [timeout/dropped]]
t2wlcgnagios.physics.ox.ac.uk
HealthyNodes CRITICAL - No healthy hosts found.
Thanks,
Chris