We would welcome any suggestions on how to investigate further the
next time this happens. If files are indeed missing, how can we find
out which ones?
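
In the meantime, here is a minimal sketch of how the affected jobs could
be picked out of maui.log, so that we at least know which job IDs and
hosts to look at. It assumes each log entry sits on a single line in the
file itself (the extract below was wrapped by the mail client) and that
the ERROR entry always has the form shown in the extract. The function
name, the regular expression and the comma as hostlist separator are my
own assumptions, not anything taken from the Maui documentation. It does
not tell us which files are missing, only which jobs keep failing.

```python
import re
from collections import defaultdict

# Pattern modelled on the ERROR entry in the quoted extract.
# Assumption: one log entry per line in the real maui.log.
START_FAILURE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started:\s+"
    r"\(rc:\s*(?P<rc>\d+)\s+errmsg:\s*'(?P<msg>[^']*)'\s+"
    r"hostlist:\s*'(?P<hosts>[^']*)'\)"
)


def summarise_start_failures(lines):
    """Group 'cannot be started' errors by job ID.

    Returns {job_id: {"count": int, "rc": set, "msg": set, "hosts": set}}.
    """
    summary = defaultdict(
        lambda: {"count": 0, "rc": set(), "msg": set(), "hosts": set()}
    )
    for line in lines:
        match = START_FAILURE.search(line)
        if match is None:
            continue  # not a start-failure entry
        entry = summary[match.group("job")]
        entry["count"] += 1
        entry["rc"].add(int(match.group("rc")))
        entry["msg"].add(match.group("msg"))
        # Assumption: several hosts would be comma-separated.
        entry["hosts"].update(h for h in match.group("hosts").split(",") if h)
    return dict(summary)


def print_report(path="/var/spool/maui/log/maui.log"):
    """Print one line per failing job: ID, failure count, codes, hosts."""
    with open(path, errors="replace") as handle:
        for job, info in sorted(summarise_start_failures(handle).items()):
            print(job, info["count"], sorted(info["rc"]), sorted(info["hosts"]))
```

Calling print_report() on the server should list each stuck job once with
the number of failed start attempts. With a job ID in hand, qstat -f and
tracejob on the Torque side and checkjob on the Maui side seem the
obvious next places to look, though I have not confirmed that any of
them reports the missing files directly.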
Cheers,
Ben
Graeme Stewart wrote:
> We see this sometimes here too, e.g.,
>
> http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
>
> As this job was causing all ops jobs to stall I had to qdel it, but the
> next time it happens I will try and investigate more closely.
>
> I suspect the gatekeeper process had somehow managed to scramble/remove
> some of the job's input files (possibly when cancelling the job) but
> hadn't managed to qdel it from torque.
>
> Torque does seem to get upset by this - it doesn't handle it very well
> at all.
>
> Cheers
>
> Graeme
>
> On 2 May 2007, at 16:42, Ben Waugh wrote:
>
>> Hello All,
>>
>> We're having some trouble and wondering if anyone else has an idea
>> what is going on.
>>
>> From time to time we get a job that Maui apparently cannot schedule.
>> These jobs are all from ATLAS so far, but they are by far the largest
>> user of UCL-HEP so it may not be a VO-specific problem.
>>
>> There are repeated entries in /var/spool/maui/log/maui.log as in the
>> attached extract. The reason for not starting the job is "Cannot
>> execute at specified host because of checkpoint or stagein files" but
>> I don't know what it means!
>>
>> Sometimes it appears that no other jobs are scheduled until the
>> offending job is deleted, although that doesn't appear to be happening
>> at the moment.
>>
>> Any ideas?
>>
>> Ben
>>
>> --
>> Dr Ben Waugh Tel. +44 (0)20 7679 7223
>> Dept of Physics and Astronomy Internal: 37223
>> University College London
>> London WC1E 6BT
>>
>> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
>> 05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
>> 05/02 16:39:59 MJobStart(897092)
>> 05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
>> 05/02 16:39:59 MAMAllocJReserve(897092,RIndex,ErrMsg)
>> 05/02 16:39:59 MRMJobStart(897092,Msg,SC)
>> 05/02 16:39:59 MPBSJobStart(897092,PC72.HEP.UCL.AC.UK,Msg,SC)
>> 05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,farm9)
>> 05/02 16:39:59 ERROR: job '897092' cannot be started: (rc: 15057 errmsg: 'Cannot execute at specified host because of checkpoint or stagein files' hostlist: 'farm9')
>> 05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,1)
>> 05/02 16:39:59 ALERT: cannot start job 897092 (RM 'PC72.HEP.UCL.AC.UK' failed in function 'jobstart')
>> 05/02 16:39:59 WARNING: cannot start job '897092' through resource manager
>> 05/02 16:39:59 ALERT: job '897092' deferred after 305 failed start attempts (API failure on last attempt)
>> 05/02 16:39:59 MJobSetHold(897092,16,00:00:00,RMFailure,cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files')
>> 05/02 16:39:59 INFO: defer disabled
>> 05/02 16:39:59 ERROR: cannot start job '897092' in partition DEFAULT
>> 05/02 16:39:59 MJobPReserve(897092,DEFAULT,ResCount,ResCountRej)
>> 05/02 16:39:59 MJobReserve(897092,Priority)
>> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
>> 05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
>> 05/02 16:39:59 INFO: located resources for 1 tasks (65) in best partition DEFAULT for job 897092 at time 00:00:01
>> 05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
>> 05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
>> 05/02 16:39:59 MResJCreate(897092,MNodeList,00:00:01,Priority,Res)
>> 05/02 16:39:59 INFO: job '897092' reserved 1 tasks (partition DEFAULT) to start in 00:00:01 on Wed May 2 16:40:00 (WC: 345600)
>
> --
> Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT