JISCMail - LCG-ROLLOUT Archives

Hi Chris,

> I'm getting intermittent job aborts with this error:
> 
> Got a job held event, reason: Globus error 94: the jobmanager does not
> accept any new requests (shutting down)
> 
> The GOC Wiki suggests that the most likely cause of this is a problem in
> the batch system, either the CE cannot submit the job or fails to track
> it properly. Since it is only intermittent I am guessing it is not a
> general configuration problem.
> 
> Looking at the batch system accounting logs I can see the jobs being
> submitted fine but then something on the CE is deleting them before
> they get a chance to run:

The lcgpbs job manager will delete jobs reported with 'W' status.

Torque will put a job into that state when the stage-in has failed,
e.g. because there were too many concurrent ssh sessions on the CE.
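
To catch such jobs before the job manager removes them, one option is to
poll qstat on the CE and note which jobs show 'W' in the state column. The
following is only a rough sketch: it assumes the default six-column qstat
layout (Job id, Name, User, Time Use, S, Queue), and the function name and
the sample job IDs, users and queue are made up for illustration.

```python
def waiting_jobs(qstat_output):
    """Return the IDs of jobs whose state column is 'W'.

    Assumes the default `qstat` layout: Job id, Name, User, Time Use, S, Queue.
    """
    jobs = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows have exactly six fields; the header row splits into more
        # and the dashed separator row never has 'W' in the state column.
        if len(fields) == 6 and fields[4] == "W":
            jobs.append(fields[0])
    return jobs


# Hypothetical sample output, not taken from a real CE.
SAMPLE = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
101.ce.example.org        STDIN            user001                0 W short
102.ce.example.org        STDIN            user002         00:01:10 R short
"""

if __name__ == "__main__":
    for job_id in waiting_jobs(SAMPLE):
        print(job_id)
```

In practice the text would come from running qstat on the CE, and the IDs
it reports could then be matched against the deleted jobs in the accounting
logs.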