On Thu, 3 Aug 2006, Steve Thorn wrote:
> We have 914 jobs queued waiting to run on 6 worker nodes and more
> coming. 809 are from a single LHCb user submitted over the last four
> days.
We get this sort of thing a lot on the small cluster at Brunel too - often
when the queues are first opened and the reported ERT is still low.
> Is there a sensible way to handle this or diagnose why we are
> getting so many?
In PBS you can set "max_queuable" for the queue to a sensible number (N.B.
the count includes running jobs) - once that limit is reached, PBS refuses
further submissions. This solves the problem for you ... but unfortunately
the Grid middleware doesn't understand the refusal and gives the generic
"Unspecified gridmanager error" message, so the submitter may not realise
what the problem is.
I've suggested before that bulk job submission should be done in blocks of
max_running jobs (which is published in the infosystem) with a cool-off
(e.g. 15 mins) in between to let the new ERT value get through the
information system.
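
A rough sketch of that block-and-cool-off idea, in Python. This is only an
illustration, not an existing tool: the names split_into_blocks and
submit_in_blocks are made up here, "submit" stands in for whatever actually
sends one job to the Grid, and max_running is assumed to have been read from
the infosystem beforehand.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_blocks(jobs: Sequence[T], max_running: int) -> List[List[T]]:
    """Cut the job list into consecutive blocks of at most max_running jobs."""
    if max_running < 1:
        raise ValueError("max_running must be at least 1")
    return [list(jobs[i:i + max_running])
            for i in range(0, len(jobs), max_running)]


def submit_in_blocks(
    jobs: Sequence[T],
    max_running: int,
    submit: Callable[[T], object],
    cooloff_seconds: float = 15 * 60,          # the 15 min suggested above
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Submit jobs block by block, pausing between blocks.

    `submit` is a hypothetical callback that sends a single job; `sleep` is
    injectable so the pause can be skipped when trying the code out.
    Returns the number of blocks submitted.
    """
    blocks = split_into_blocks(jobs, max_running)
    for index, block in enumerate(blocks):
        for job in block:
            submit(job)
        # Cool off after every block except the last, so the new ERT
        # has time to propagate through the information system.
        if index < len(blocks) - 1:
            sleep(cooloff_seconds)
    return len(blocks)
```

For example, 14 jobs with max_running of 6 would go out as blocks of 6, 6
and 2 jobs, with two cool-off pauses in between.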
Thanks
Henry
--
Dr. Henry Nebrensky
http://people.brunel.ac.uk/~eesrjjn
"The opossum is a very sophisticated animal.
It doesn't even get up until 5 or 6 p.m."