On Thu, 20 Mar 2003, Ian Stokes-Rees wrote:
> I am able to globus-job-run, ldap search, and qsub on the CE. Furthermore,
> the CE has been continuously running jobs for the last several weeks. Right
> now every CPU is running at 100% with jobs from
>
> C=fr,o=cnrs,ou=cppm,cn=vincent garonne,[log in to unmask]
The LHCb system works by starting up user-mode daemons on the WNs,
which then pull in jobs to run, so they may stay there a long time if you
let them. There are 15 jobs idle at Oxford; they could well be all
the monitoring jobs!
> From looking at it more carefully (qstat -f), it appears that these jobs
> were submitted on Monday and have been "hogging" the queue for 4 days. How
> do other people deal with this sort of problem?
You seem to have only one queue, with no limits on CPU time or wallclock
time. It's probably more useful to have several queues with different
limits; even an "infinite" queue should probably have some sort of limit.
If you put maximum job limits on the long queues, you limit the hogging
by long jobs (although with 4 CPUs you don't have a lot to play with).
Also, you should be sure to define wallclock time limits as well as CPU
limits. For one thing, that will kill jobs which go into some kind of
blocking state. The EstimatedTraversalTime (ETT) algorithm also uses the
wallclock limit. At the moment your published ETT is about 8 hours, which
may well be very wrong if those jobs are never going to let go.
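
As a rough sketch of that kind of setup, something like the following
qmgr input would create a short and a long queue, each with both CPU and
wallclock limits, and cap the number of running jobs in the long one.
The queue names ("short", "long") and every number here are assumptions
chosen for illustration, not recommended values; adjust them to your own
site and PBS version.

```
# Hypothetical example: queue names and limits are illustrative only.
# Feed to qmgr as a PBS manager, e.g.  qmgr < queues.txt

# Short queue: tight limits, no cap on running jobs
create queue short queue_type = execution
set queue short resources_max.cput = 01:00:00
set queue short resources_max.walltime = 02:00:00
set queue short enabled = true
set queue short started = true

# Long queue: generous but finite limits, at most 2 jobs running at once
create queue long queue_type = execution
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long max_running = 2
set queue long enabled = true
set queue long started = true
```

With 4 CPUs, a max_running of 2 on the long queue would always leave two
slots free for shorter jobs, at the cost of idle CPUs when only long jobs
are waiting.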
Stephen