Hi Baptiste,

On 05/11/2015 23:18, Maarten Litmaath wrote:
>> At one of our UMD 3 sites using Cream CE with torque and maui, we have some
>> Scientific Linux 5 Worker Nodes that are overloaded by some jobs using too
>> much CPU. It can even sometimes take Worker Nodes down.
> 
> Can't they let Torque kill those jobs based on queue parameters?

If that has already been done, you may also be suffering
from jobs that run more concurrent threads or processes
than the number of CPU cores they requested. So a job
stating it needs a single CPU actually forks and then occupies
many more. This is the one scenario that would actually
bring a worker node down.

The solution there is to pin each job to a specific set of cores,
corresponding to the CPUs actually allocated through torque.
The script (by RonaldS) at
  https://wiki.nikhef.nl/grid/images/a/af/Mom-taskset.txt
does that for you when installed as a prologue (by default in
/var/spool/pbs/mom_priv/prologue, runs as root).

Take care of the assumption stated at the top of the script: you
should not 'oversubscribe' the cores on a WN, or jobs will be
denied boarding ;-)
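
To illustrate the pinning idea only (this is NOT Ronald's script, just
a minimal sketch): the helper names pin_to_cores and allowed_cores are
made up for this mail, and I assume a taskset from util-linux that
understands -c and -p. How the prologue learns which cores belong to
a job is left out entirely.

```shell
#!/bin/sh
# Sketch: restrict an existing process to a list of cores.
# Children forked afterwards inherit the affinity, so a job that
# forks many workers stays confined to the cores it was given.

# pin_to_cores PID CORELIST   (hypothetical helper, e.g. pin_to_cores 12345 0,1)
pin_to_cores() {
    pid=$1
    cores=$2
    # -p: operate on an existing pid, -c: take a core list, not a bitmask
    taskset -c -p "$cores" "$pid" >/dev/null
}

# allowed_cores PID   (hypothetical helper)
# Print the core list the process may currently run on, e.g. "0-7".
allowed_cores() {
    taskset -c -p "$1" | sed 's/.*: *//'
}
```

For example, "pin_to_cores $$ 0" followed by "allowed_cores $$" should
report just core 0 for the current shell and everything it starts.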

	Cheers,
	DavidG.

> These example commands for "qmgr" are taken from YAIM:
> 
>     set queue $QUEUE resources_max.cput = 48:00:00
>     set queue $QUEUE resources_max.walltime = 72:00:00
> 
>> We are already managing/limiting memory usage and are now aiming at
>> configuring some (default) resource limit policies for the CPU usage with
>> torque/maui (or anything else that could help us) and would like to know if
>> there are some known good practices, advice or experience feedback on this
>> subject.
> 
> If Torque's functionality is not sufficient for some reason,
> the WN could enforce hard limits e.g. via a script in /etc/profile.d
> (the correct values would depend on what queues are served by the WN).
> 
> Example for a hard limit on the CPU time:
> 
> $ ulimit -t 7
> $ time -p yes > /dev/null
> Killed
> real 6.72
> user 6.93
> sys 0.06
> 
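
A profile.d script along the lines Maarten describes could look
roughly like the sketch below. The file name and the variable
CPU_LIMIT_SECONDS are assumptions for illustration; the 48 hours merely
mirror the resources_max.cput example above, and the right value
depends on the queues served by the WN. It assumes a shell whose
ulimit supports -t (bash and dash do).

```shell
# Hypothetical /etc/profile.d/cpulimit.sh (sketch only)
# Assumed site-chosen limit: 48 hours of CPU time, in seconds.
CPU_LIMIT_SECONDS=$((48 * 3600))

# Only ever lower the limit: a hard limit cannot be raised again
# by an unprivileged process, and a stricter one may already be set.
current=$(ulimit -H -t)
if [ "$current" = "unlimited" ] || [ "$current" -gt "$CPU_LIMIT_SECONDS" ]; then
    # Without -H or -S this sets both the soft and the hard limit.
    ulimit -t "$CPU_LIMIT_SECONDS"
fi
```

Note that this is a per-process limit, so a job that forks many
processes still gets the full allowance for each of them.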


-- 
David Groep

** Nikhef, Dutch National Institute for Sub-atomic Physics,PDP/Grid group **
** Room: H1.50 Phone: +31 20 5922179, PObox 41882, NL-1009DB Amsterdam NL **