Hi Maarten
Thanks, that was enough to help me find it - it turned out to be a WN with
a full /home partition, so it couldn't stage in the job files.
I wish Torque reported this error more clearly; I ended up having to grep
/var/log/messages on 100-odd worker nodes to find the one with the
problem.
Still, it shouldn't have happened - our monitoring wasn't checking for
this as a problem (fixed now), but we also have quotas set up on the WN
home disks to stop users filling them, and I still need to work out why
that didn't prevent it.
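(The per-node check could be as simple as the Python sketch below; the 90%
threshold and the Nagios-style exit codes are assumptions, not what our
monitoring actually uses.)

    #!/usr/bin/env python
    # Sketch of a per-WN monitoring probe: warn when /home crosses a usage
    # threshold. Threshold and exit codes are illustrative only.
    import os
    import sys

    st = os.statvfs("/home")
    used_pct = 100.0 * (1.0 - float(st.f_bavail) / st.f_blocks)

    if used_pct >= 90:
        print("WARNING: /home %.0f%% full" % used_pct)
        sys.exit(1)
    print("OK: /home %.0f%% full" % used_pct)
    sys.exit(0)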
Thanks,
Chris.
> -----Original Message-----
> From: LHC Computer Grid - Rollout
> [mailto:[log in to unmask]] On Behalf Of Maarten Litmaath
> Sent: 30 May 2008 12:10
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Job Manager shutting down errors
>
> Hi Chris,
>
> > I'm getting intermittent job aborts with this error:
> >
> > Got a job held event, reason: Globus error 94: the jobmanager does not
> > accept any new requests (shutting down)
> >
> > The GOC Wiki suggests that the most likely cause of this is a problem
> > in the batch system: either the CE cannot submit the job or it fails
> > to track it properly. Since it is only intermittent I am guessing it
> > is not a general configuration problem.
> >
> > Looking at the batch system accounting logs I can see the jobs being
> > submitted fine, but then something on the CE is deleting them before
> > they get a chance to run:
>
> The lcgpbs job manager will delete jobs reported with 'W' status.
>
> Torque will put a job into that state when the stagein failed,
> e.g. because there were too many concurrent ssh sessions on the CE.
>
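(For reference, a quick way to spot jobs sitting in that 'W' state on the CE
is something like the Python sketch below; the column index assumes the
default short qstat output, so adjust it if yours differs.)

    #!/usr/bin/env python
    # Sketch: list Torque jobs in the 'W' (waiting) state, i.e. the state
    # the lcgpbs job manager reacts to. Assumes the default short qstat
    # layout where the job state is the fifth column.
    import subprocess

    out = subprocess.check_output(["qstat"]).decode()
    for line in out.splitlines()[2:]:      # skip the two header lines
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "W":
            print("job %s (%s) is in W state" % (fields[0], fields[1]))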