JISCMail - TB-SUPPORT Archives

> The way the LHCb system works is to start up user-mode
> daemons on the WNs which then suck in jobs to run, so they
> may stay there a long time if you let them ... there are 15
> jobs idle at Oxford, they could well be all the monitoring jobs!

Gotcha.

> You seem to have only one queue with no limits on cpu time or
> wallclock time. It's probably more useful to have several
> queues with different limits - even an "infinite" queue
> should probably have some sort of limit. If you put maximum
> job limits on the long queues you limit the hogging from long
> jobs (although with 4 cpus you don't have a lot to play with).

PBS is set up the way the EDG installation guide told me to set it up.  I am
now trying to get a better configuration.  What I would like is:

1) To leave workq as a long queue but with an 8 hour wallclock limit.  Is
this sufficient given that the EDG sites are still in the early stages of
"development"?

2) Allow no more than two jobs from the same user (since we only have four
worker nodes at the moment), to avoid resource hogging.

3) Have a "test" queue with a CPU time limit of 1 minute, and wallclock
limit of 10 minutes.

4) Allow at any given time up to 10 test queue jobs to execute, regardless
of the number of active "workq" jobs.

5) Allow up to 4 "workq" jobs active at any one time, where 4 would be
updated to be the number of processors available.

The idea is that "test" queue jobs would have a very high chance of
executing right away, and could be used for monitoring purposes.  These
would be allowed to execute concurrently with (and on the same worker node
as) the long "workq" jobs.  "workq" jobs would be limited to one active job
per available worker node.

Now, is there any kind soul who will give an outline of the qmgr PBS
commands to set that environment up?
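
For reference, here is a rough sketch of how I would guess the five points
above map onto qmgr directives.  It assumes the usual OpenPBS queue
attributes (resources_max.walltime, resources_max.cput, max_running,
max_user_run), and it assumes the two-jobs-per-user limit belongs on workq
rather than on the whole server.  The function name is made up for
illustration.  The script only prints the directives, so nothing changes
until the output is reviewed and passed to qmgr on the server.

```shell
#!/bin/sh
# Print qmgr directives for the queue layout described in points 1-5.
# Nothing is applied here: review the output first, then pipe it into
# qmgr on the PBS server.

# Assumption: one active workq job per processor (4 at the moment).
NCPUS="${1:-4}"

# pbs_queue_directives is a hypothetical helper name, not a PBS command.
# Argument 1 is the number of processors, used as the workq job cap.
pbs_queue_directives() {
    cat <<EOF
set queue workq resources_max.walltime = 08:00:00
set queue workq max_user_run = 2
set queue workq max_running = $1
create queue test
set queue test queue_type = Execution
set queue test resources_max.cput = 00:01:00
set queue test resources_max.walltime = 00:10:00
set queue test max_running = 10
set queue test enabled = True
set queue test started = True
EOF
}

pbs_queue_directives "$NCPUS"
```

These queue limits only cap how many jobs run at once.  Whether a "test" job
can actually start on a node that is already busy with a "workq" job
probably depends on how the nodes and the scheduler are configured, which is
not covered here.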

I am also confused: the worker nodes have shown a load of 1.0 continuously
for several days, and PBS reports that the jobs were submitted on Monday or
Tuesday, yet qstat says they have consumed only 30 minutes (or even 30
hours...) of CPU time.

Finally, what is the best way to get rid of these LHCb production jobs,
which are probably going to stay there forever?

Cheers,

Ian.