We see this sometimes here too, e.g.,
http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
As this job was causing all ops jobs to stall, I had to qdel it, but
the next time it happens I will try to investigate more closely.
I suspect the gatekeeper process had somehow managed to scramble or
remove some of the job's input files (possibly when cancelling the
job) but hadn't managed to qdel it from Torque.
Torque does seem to get upset by this - it doesn't handle it very
well at all.
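
To spot such a job before it stalls the queue, something along these
lines could tally the failed start attempts per job in maui.log. It is
only a sketch: the pattern is modelled on the ERROR lines in Ben's
extract below, it assumes each log entry sits on a single line in the
real file (unlike the wrapped copy in the mail), and the function names
are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the "cannot be started" ERROR lines in the quoted
# extract. Spacing in a real maui.log may differ, hence the \s+ / \s*.
START_FAILURE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started:\s+"
    r"\(rc:\s*(?P<rc>\d+)\s+errmsg:\s*'(?P<msg>[^']*)'"
)


def count_start_failures(lines):
    """Return a Counter keyed by (job id, rc, errmsg) for failed starts."""
    failures = Counter()
    for line in lines:
        match = START_FAILURE.search(line)
        if match:
            key = (match["job"], int(match["rc"]), match["msg"])
            failures[key] += 1
    return failures


def report(path="/var/spool/maui/log/maui.log"):
    """Print jobs with failed starts, most frequent first.

    The default path is the one mentioned in Ben's mail; adjust as needed.
    """
    with open(path, errors="replace") as log:
        for (job, rc, msg), n in count_start_failures(log).most_common():
            print(f"{job}: {n} failed starts, rc {rc}: {msg}")
```

Calling report() on the scheduler host would then list each job with
its failure count, so a job stuck on rc 15057 stands out without
reading the whole log.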
Cheers
Graeme
On 2 May 2007, at 16:42, Ben Waugh wrote:
> Hello All,
>
> We're having some trouble and wondering if anyone else has an idea
> what is going on.
>
> From time to time we get a job that Maui apparently cannot
> schedule. These jobs are all from ATLAS so far, but they are by
> far the largest user of UCL-HEP so it may not be a VO-specific
> problem.
>
> There are repeated entries in /var/spool/maui/log/maui.log as in
> the attached extract. The reason for not starting the job is
> "Cannot execute at specified host because of checkpoint or stagein
> files" but I don't know what it means!
>
> Sometimes it appears that no other jobs are scheduled until the
> offending job is deleted, although that doesn't appear to be
> happening at the moment.
>
> Any ideas?
>
> Ben
>
> --
> Dr Ben Waugh                       Tel. +44 (0)20 7679 7223
> Dept of Physics and Astronomy      Internal: 37223
> University College London
> London WC1E 6BT
>
> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
> 05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
> 05/02 16:39:59 MJobStart(897092)
> 05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
> 05/02 16:39:59 MAMAllocJReserve(897092,RIndex,ErrMsg)
> 05/02 16:39:59 MRMJobStart(897092,Msg,SC)
> 05/02 16:39:59 MPBSJobStart(897092,PC72.HEP.UCL.AC.UK,Msg,SC)
> 05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,farm9)
> 05/02 16:39:59 ERROR: job '897092' cannot be started: (rc: 15057 errmsg: 'Cannot execute at specified host because of checkpoint or stagein files' hostlist: 'farm9')
> 05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,1)
> 05/02 16:39:59 ALERT: cannot start job 897092 (RM 'PC72.HEP.UCL.AC.UK' failed in function 'jobstart')
> 05/02 16:39:59 WARNING: cannot start job '897092' through resource manager
> 05/02 16:39:59 ALERT: job '897092' deferred after 305 failed start attempts (API failure on last attempt)
> 05/02 16:39:59 MJobSetHold(897092,16,00:00:00,RMFailure,cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files')
> 05/02 16:39:59 INFO: defer disabled
> 05/02 16:39:59 ERROR: cannot start job '897092' in partition DEFAULT
> 05/02 16:39:59 MJobPReserve(897092,DEFAULT,ResCount,ResCountRej)
> 05/02 16:39:59 MJobReserve(897092,Priority)
> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
> 05/02 16:39:59 INFO: located resources for 1 tasks (65) in best partition DEFAULT for job 897092 at time 00:00:01
> 05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
> 05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
> 05/02 16:39:59 MResJCreate(897092,MNodeList,00:00:01,Priority,Res)
> 05/02 16:39:59 INFO: job '897092' reserved 1 tasks (partition DEFAULT) to start in 00:00:01 on Wed May 2 16:40:00 (WC: 345600)
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/