On 06/11/15 at 08:57, David Groep wrote:
>Hi Baptiste,
Hi David, Maarten,
>On 05/11/2015 23:18, Maarten Litmaath wrote:
>>> At one of our UMD 3 sites using Cream CE with torque and maui, we have some
>>> Scientific Linux 5 Worker Nodes that are overloaded by some jobs using too
>>> much CPU. It can even sometimes take Worker Nodes down.
>> Can't they let Torque kill those jobs based on queue parameters?
>In case that has already been done, you may also be suffering
>from the fact that jobs actually have more concurrent threads
>than the number of CPU cores requested. So a job
>stating it needs a single CPU actually forks and then occupies
>many more. This is the one scenario that would actually
>bring a worker node down.
Yes, we have already configured those cput and walltime Torque parameters
(as well as passing memory requirements to MAUI and having MAUI kill jobs
eating more memory than requested). That helped address some problems,
but I think we are now more or less facing the problem you are
describing (most of our jobs are "pipeline scripts" calling multiple
binaries).
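For anyone wanting to set up the same limits: the queue-level Torque
knobs are set with qmgr. A sketch, with a made-up queue name and
example values (adapt to your site):

```shell
# Hypothetical queue "grid" -- substitute your own queue name and limits.
# Hard per-job CPU-time and walltime caps, enforced by pbs_mom:
qmgr -c "set queue grid resources_max.cput = 48:00:00"
qmgr -c "set queue grid resources_max.walltime = 72:00:00"
# Defaults applied when a job does not request these resources itself:
qmgr -c "set queue grid resources_default.cput = 48:00:00"
qmgr -c "set queue grid resources_default.walltime = 72:00:00"
```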
>The solution there is to pin each job to a specific set of cores,
>corresponding to the CPUs actually allocated through torque.
>The script (by RonaldS) at
> https://wiki.nikhef.nl/grid/images/a/af/Mom-taskset.txt
>does that for you when installed as a prologue (by default in
>/var/spool/pbs/mom_priv/prologue, runs as root).
>Take care of the assumption stated at the top of the script: you
>should not 'oversubscribe' the cores on a WN, or jobs will be
>denied boarding ;-)
David, thanks a lot for this link, we will look at this carefully.
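For reference, the primitive that script relies on is taskset(1); basic
usage looks like this (the PID and core numbers below are made up):

```shell
# Show the current CPU affinity of process 12345:
taskset -pc 12345
# Restrict an already-running process (PID 12345) to cores 0 and 1:
taskset -pc 0,1 12345
# Or launch a command pinned from the start, so forked threads
# inherit the same two cores:
taskset -c 0,1 ./pipeline.sh
```

With pinning in place, a single-core job that forks many threads can
only oversubscribe its own core, not the whole worker node.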
Maarten, thanks for the Torque suggestion and limits example; we will
also check this.
Is someone else using those solutions? Or other alternatives?
> Cheers,
> DavidG.
Cheers,
Baptiste
--
Baptiste Grenier | gnúbila France | Software architect/developer
http://gnubila.fr | Mobile: +33 786 341 687