Hello All,
We're having some trouble and wondering if anyone else has an idea what
is going on.
From time to time we get a job that Maui apparently cannot schedule.
These jobs are all from ATLAS so far, but ATLAS is by far the largest
user of UCL-HEP, so it may not be a VO-specific problem.
There are repeated entries in /var/spool/maui/log/maui.log as in the
attached extract. The reason for not starting the job is "Cannot execute
at specified host because of checkpoint or stagein files" but I don't
know what it means!
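
In case it helps anyone check their own logs, here is a rough Python sketch
for tallying these start failures per job. The regular expression is only
assumed from the ERROR lines in the attached extract, so other Maui versions
may need a different pattern, and the function names (count_start_failures,
report) are made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the ERROR lines in the attached extract; it is not
# taken from any Maui documentation and may need adjusting.
START_ERROR = re.compile(
    r"ERROR:\s+job '(?P<job>[^']+)' cannot be started: "
    r"\(rc: (?P<rc>\d+) errmsg: '(?P<msg>[^']*)' hostlist: '(?P<hosts>[^']*)'\)"
)


def count_start_failures(lines):
    """Count failed start attempts, keyed by (job id, return code, hostlist)."""
    failures = Counter()
    for line in lines:
        match = START_ERROR.search(line)
        if match:
            key = (match.group("job"), match.group("rc"), match.group("hosts"))
            failures[key] += 1
    return failures


def report(path="/var/spool/maui/log/maui.log"):
    """Print one line per (job, rc, hostlist), most frequent first."""
    with open(path, errors="replace") as log:
        for (job, rc, hosts), n in count_start_failures(log).most_common():
            print(f"{job}\trc={rc}\thosts={hosts}\tfailures={n}")
```

Calling report() on the scheduler host should show whether the failures are
tied to particular jobs or to particular hosts in the hostlist.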
Sometimes it appears that no other jobs are scheduled until the
offending job is deleted, although that doesn't appear to be happening
at the moment.
Any ideas?
Ben
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT
05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
05/02 16:39:59 MJobStart(897092)
05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
05/02 16:39:59 MAMAllocJReserve(897092,RIndex,ErrMsg)
05/02 16:39:59 MRMJobStart(897092,Msg,SC)
05/02 16:39:59 MPBSJobStart(897092,PC72.HEP.UCL.AC.UK,Msg,SC)
05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,farm9)
05/02 16:39:59 ERROR: job '897092' cannot be started: (rc: 15057 errmsg: 'Cannot execute at specified host because of checkpoint or stagein files' hostlist: 'farm9')
05/02 16:39:59 MPBSJobModify(897092,Resource_List,Resource,1)
05/02 16:39:59 ALERT: cannot start job 897092 (RM 'PC72.HEP.UCL.AC.UK' failed in function 'jobstart')
05/02 16:39:59 WARNING: cannot start job '897092' through resource manager
05/02 16:39:59 ALERT: job '897092' deferred after 305 failed start attempts (API failure on last attempt)
05/02 16:39:59 MJobSetHold(897092,16,00:00:00,RMFailure,cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files')
05/02 16:39:59 INFO: defer disabled
05/02 16:39:59 ERROR: cannot start job '897092' in partition DEFAULT
05/02 16:39:59 MJobPReserve(897092,DEFAULT,ResCount,ResCountRej)
05/02 16:39:59 MJobReserve(897092,Priority)
05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
05/02 16:39:59 INFO: 100 feasible tasks found for job 897092:0 in partition DEFAULT (1 Needed)
05/02 16:39:59 INFO: located resources for 1 tasks (65) in best partition DEFAULT for job 897092 at time 00:00:01
05/02 16:39:59 INFO: tasks located for job 897092: 1 of 1 required (65 feasible)
05/02 16:39:59 MJobDistributeTasks(897092,PC72.HEP.UCL.AC.UK,NodeList,TaskMap)
05/02 16:39:59 MResJCreate(897092,MNodeList,00:00:01,Priority,Res)
05/02 16:39:59 INFO: job '897092' reserved 1 tasks (partition DEFAULT) to start in 00:00:01 on Wed May 2 16:40:00 (WC: 345600)