Hello Baptiste,
> At one of our UMD 3 sites using Cream CE with torque and maui, we have some
> Scientific Linux 5 Worker Nodes that are overloaded by some jobs using too
> much CPU. It can even sometimes take Worker Nodes down.
Can't they let Torque kill those jobs based on queue parameters?
These example commands for "qmgr" are taken from YAIM:
set queue $QUEUE resources_max.cput = 48:00:00
set queue $QUEUE resources_max.walltime = 72:00:00
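As a rough sketch of how those two settings could be generated per queue and reviewed before they are applied: the helper name queue_limit_cmds and the queue name "long" are made up for illustration, and the limit values are just the ones from the YAIM example above.

```shell
# Print the qmgr commands that cap CPU time and wall time on one queue.
# Arguments: queue name, max CPU time, max wall time (both HH:MM:SS).
# (queue_limit_cmds is a hypothetical helper, not part of Torque or YAIM.)
queue_limit_cmds() {
    printf 'set queue %s resources_max.cput = %s\n' "$1" "$2"
    printf 'set queue %s resources_max.walltime = %s\n' "$1" "$3"
}

# Review the generated commands first ("long" is a hypothetical queue name).
queue_limit_cmds long 48:00:00 72:00:00

# On the Torque server, as a manager, the same output could then be fed to qmgr:
#   queue_limit_cmds long 48:00:00 72:00:00 | qmgr
# and the result checked with:
#   qmgr -c "list queue long"
```

Printing the commands instead of running them directly makes it easy to check the values for each queue before anything changes on the server.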
> We are already managing/limiting memory usage and are now aiming at
> configuring some (default) resource limit policies for the CPU usage with
> torque/maui (or anything else that could help us) and would like to know if
> there are any known good practices, advice or experience feedback on this
> subject.
If Torque's functionality is not sufficient for some reason,
the WN could enforce hard limits e.g. via a script in /etc/profile.d
(the correct values would depend on what queues are served by the WN).
Example for a hard limit on the CPU time:
$ ulimit -t 7
$ time -p yes > /dev/null
Killed
real 6.72
user 6.93
sys 0.06
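A minimal sketch of what such a profile.d script might look like, under a few assumptions: the file name, the helper names and the 48:00:00 value are only illustrative, and the check on PBS_JOBID assumes that batch jobs on the WN go through the login profile with the Torque job environment already set.

```shell
# Hypothetical /etc/profile.d/cpu-limit.sh (name and values are assumptions).

# Convert an HH:MM:SS limit, as used in the queue settings, to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Set the CPU-time limit (soft and hard) for this shell and its children.
apply_cpu_limit() {
    ulimit -t "$(hms_to_seconds "$1")"
}

# Only touch batch jobs, not interactive logins on the WN.
# Assumption: the value matches resources_max.cput of the queues served.
if [ -n "$PBS_JOBID" ]; then
    apply_cpu_limit 48:00:00
fi
```

Since a process cannot raise its own hard limit again without privileges, the value should not be lower than the largest cput limit of the queues that the WN serves.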