On 15-03-12 17:12, Eygene Ryabinkin wrote:
> Wed, Mar 14, 2012 at 11:20:25AM +0100, Ronald Starink wrote:
>> - In Torque, we defined per queue a limit for the maximum physical memory
>> being used (pmem) by the job and a per-process limit on the virtual memory
>> (pvmem):
>>
>> set queue <QUEUE> resources_max.pmem = 3000mb
>> set queue <QUEUE> resources_max.pvmem = 3800mb
>>
>> The nice thing about the pvmem limitation is that it limits the virtual
>> memory available to each process: ulimit -v returns 3891200 (/ 1024 = 3800).
>> Consequently, individual processes cannot allocate more memory and get the
>> opportunity to deal with allocation failures. The batch system does not
>> actually kill the jobs.
> [...]
>> These changes do not protect against jobs that happily spawn tons of
>> memory-hungry child processes.
>
> 'set queue <QUE> resources_max.vmem = <amount>' will protect against
> this; and violating jobs will be killed. That's what we use at our
> cluster.
We also did that in the beginning. It was very effective ;-) but caused
confusion among the users: their jobs failed with the reason "job cancelled by
admin", which generated support tickets etc. Another point was that the users
did not know in advance how much memory their jobs were allowed to use. The
per-process pvmem limit avoids that confusion for the user (the limit is
visible via ulimit -v) and effectively solved the problem for us. It is of
course possible to add a (higher) vmem limit on top of the pvmem limit, so
that the case where multiple processes together use a lot of memory is handled
properly as well. It is then probably necessary to use a larger value for
Maui's memory overcommit factor; otherwise the scheduler reserves the full
(large) vmem amount per job and job slots may go unused.
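[Editorial aside: a sketch of the combined setup described above, in the same qmgr style as the earlier snippet. The pvmem value matches the one quoted earlier; the vmem value (here twice pvmem) is purely illustrative — each site would pick its own, along with a matching increase of Maui's memory overcommit factor.]

```
set queue <QUEUE> resources_max.pvmem = 3800mb
set queue <QUEUE> resources_max.vmem = 7600mb
```

With this, single runaway processes hit the per-process pvmem cap and see allocation failures they can handle, while a job whose many children together exceed the vmem cap is still killed by the batch system.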
Cheers,
Ronald