Hi,

I'm getting intermittent job aborts with this error:

Got a job held event, reason: Globus error 94: the jobmanager does not
accept any new requests (shutting down)

The GOC Wiki suggests that the most likely cause is a problem in the
batch system: either the CE cannot submit the job or it fails to track
it properly. Since the failures are only intermittent, I am guessing it
is not a general configuration problem.

Looking at the batch system accounting logs, I can see the jobs being
submitted fine, but then something on the CE is deleting them before
they get a chance to run:

05/30/2008 11:37:51;Q;1578437.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:37:54;S;1578437.heplnx201.pp.rl.ac.uk;user=dteam003
group=dteam jobname=STDIN queue=dteam ctime=1212143871 qtime=1212143871
etime=1212143871 start=1212143874 exec_host=heplnc109.pp.rl.ac.uk/0
Resource_List.cput=24:00:00 Resource_List.mem=985mb
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.walltime=24:00:00 
05/30/2008 11:38:00;Q;1578438.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:38:01;Q;1578439.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:38:51;Q;1578440.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:38:51;D;1578436.heplnx201.pp.rl.ac.uk;[log in to unmask].rl.ac.uk
05/30/2008 11:38:52;Q;1578441.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:39:00;D;1578438.heplnx201.pp.rl.ac.uk;[log in to unmask].rl.ac.uk
05/30/2008 11:39:01;Q;1578442.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:39:01;Q;1578443.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:39:01;D;1578439.heplnx201.pp.rl.ac.uk;[log in to unmask].rl.ac.uk
05/30/2008 11:39:02;Q;1578444.heplnx201.pp.rl.ac.uk;queue=dteam
05/30/2008 11:39:03;E;1578437.heplnx201.pp.rl.ac.uk;user=dteam003
group=dteam jobname=STDIN queue=dteam ctime=1212143871 qtime=1212143871
etime=1212143871 start=1212143874 exec_host=heplnc109.pp.rl.ac.uk/0
Resource_List.cput=24:00:00 Resource_List.mem=985mb
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.walltime=24:00:00 session=15554 end=1212143943
Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=17324kb
resources_used.vmem=695080kb resources_used.walltime=00:01:13

So my question is: what is deleting the jobs, and how can I find out
what is causing it to do that?
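
For anyone wanting to pull the same pattern out of their own logs, here
is a rough sketch (not a tested tool) that lists jobs with a D (delete)
record but no earlier S (start) record, and how long each sat in the
queue first. It assumes the usual accounting layout of
"MM/DD/YYYY HH:MM:SS;TYPE;JOBID;message" with one record per line; the
function names are made up for illustration.

```python
from datetime import datetime


def parse_records(lines):
    """Yield (timestamp, record_type, job_id) for each accounting record.

    Assumes 'MM/DD/YYYY HH:MM:SS;TYPE;JOBID;message' on a single line.
    Lines that do not start with a full timestamp (blank lines, or
    continuation lines wrapped by a mail client) are skipped.
    """
    for line in lines:
        parts = line.strip().split(";", 3)
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
        yield stamp, parts[1], parts[2]


def deleted_before_start(lines):
    """Return [(job_id, seconds_queued_or_None)] for jobs deleted unstarted.

    seconds_queued is None when the Q record is not in the input.
    """
    queued = {}      # job id -> time of its Q (queued) record
    started = set()  # job ids that have an S (started) record
    result = []
    for stamp, rtype, job in parse_records(lines):
        if rtype == "Q":
            queued[job] = stamp
        elif rtype == "S":
            started.add(job)
        elif rtype == "D" and job not in started:
            q_time = queued.get(job)
            wait = (stamp - q_time).total_seconds() if q_time else None
            result.append((job, wait))
    return result


if __name__ == "__main__":
    # A few records taken from the excerpt above, message fields trimmed.
    sample = [
        "05/30/2008 11:37:54;S;1578437.heplnx201.pp.rl.ac.uk;user=dteam003",
        "05/30/2008 11:38:00;Q;1578438.heplnx201.pp.rl.ac.uk;queue=dteam",
        "05/30/2008 11:39:00;D;1578438.heplnx201.pp.rl.ac.uk;",
    ]
    for job, wait in deleted_before_start(sample):
        print(job, wait)
```

Run over the excerpt above, this flags 1578436, 1578438 and 1578439.
The last two were each deleted 60 seconds after being queued, which may
point at a timeout rather than something random.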

I can run showq, qstat, qstat -f etc. manually multiple times without
any failures or long delays in returning output, and the load on the
CEs and the Torque server is low.

Any help appreciated.

Thanks,
Chris.