Some updates on job memory limits, from site experience and
chat on the ATLAS cloud mailing list (GraemeS's replies were
particularly informative); most of this will already be familiar:
* Some ILC 'Mokka' jobs here grew to many GB, eventually
overwhelming some of our WNs.
* So I looked at mandatory memory limits. I am using Torque.
* The 'resources_default' values for Torque queues apply to
jobs that don't explicitly specify a reservation. For those
jobs they seem to be enforced.
* The 'resources_max' values cap the limits explicitly
specified by jobs, and the latter are also enforced.
* Torque distinguishes between real and virtual memory
reservations, and between per-job and per-process ones
('mem' and 'vmem' versus 'pmem' and 'pvmem'). In my case
the latter two are the same.
* Torque enforces these by setting 'ulimit -m' and '-v', as
can be checked from inside a job (see the sketch after this
list). ATLAS job wrappers have their own limits (see the Feb
chat on memory limits on this mailing list).
* Conventionally WNs are sized to have 2GiB physical memory
per core.
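A quick way to see what a job actually gets is to submit a
trivial script with explicit reservations and print its limits;
a minimal sketch (script name and values are mine, queue as
below):
> $ cat limits-test.sh
> #!/bin/sh
> # Print the per-process limits Torque set for this job:
> # -m is the resident-set limit (from pmem) and -v the
> # address-space limit (from pvmem), both reported in kiB.
> ulimit -m
> ulimit -v
> $ qsub -q q2d -l pmem=2000mb,pvmem=4000mb limits-test.sh
Note that as far as I know 2.6 kernels don't actually enforce
the RSS limit ('-m'); the address-space limit ('-v') is the one
that bites.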
Some subtleties:
* ATLAS require ~2GB of physical memory and ~4GB of virtual
memory per job:
> https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration#Virtual_Memory
The VM limit has to be larger because of roundups in memory
mappings; the virtual/resident gap is easy to see on any
process (see the check after this list).
* MAUI takes reservations into account when scheduling, so
if 8 jobs each reserve 2GiB of physical memory and 4GiB of
VM then the WN must have at least 16GiB of memory and
32GiB of swap to run them all.
* In particular a WN running 8 such jobs with a 2GiB physical
memory reservation *must* have more than 16GiB, because
several hundred MiB of memory are used by the kernel, its
tables, the page cache, etc. Similarly for swap.
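For the virtual/resident gap mentioned above, any Linux process
will do; here 'grep' reads its own status:
> $ grep -E '^Vm(Size|RSS)' /proc/self/status
On a typical process VmSize (address space mapped) is far
larger than VmRSS (pages actually resident), hence the VM
reservation has to exceed the physical one.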
In practice I have rounded down the ATLAS requirement to 2000MiB
of physical memory and 4000MiB of virtual memory (instead of
2048 and 4096) to allow running 8 jobs on a 16GiB WN or 12 on a
24GiB one. Example from one of the latter, extracted from
'diagnose -n | less -S':
> Name State Procs Memory Disk Swap
> n99.dur.scotgrid.ac. Busy 0:12 98:24098 384246:440815 33392:79392
Note the 98MiB of memory free out of 24098: that's because the
reserved memory is 12x2000MiB. Note also that 24GiB is
24576MiB, so 478MiB are unavailable to jobs. Most jobs don't
actually use 2GiB; this is the current memory usage on the same
WN:
> total used free shared buffers cached
> Mem: 24676608 9421048 15255560 0 574576 4519376
> -/+ buffers/cache: 4327096 20349512
> Swap: 56621748 5584 56616164
And the average memory used per process is around 0.5GiB (they
are mostly ATLAS production jobs).
One thing that I had to change: the WNs were originally
configured with around 20-28GiB of swap space, but if ATLAS
require ~4GB of virtual memory reserved per job that is not
sufficient, and jobs won't fill the available cores.
Since swap files under 2.6 kernels are fairly efficient,
especially if mostly contiguous, I have added a 50GB swap file
to all WNs.
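For reference, a swap file can be added along these lines (the
path is just an example; writing the file out with 'dd' rather
than creating it sparse keeps it reasonably contiguous):
> dd if=/dev/zero of=/var/swapfile bs=1M count=51200  # ~50GB
> chmod 600 /var/swapfile   # must not be world-readable
> mkswap /var/swapfile      # write the swap signature
> swapon /var/swapfile      # enable it immediately
> echo '/var/swapfile swap swap defaults 0 0' >> /etc/fstab
The fstab entry makes it persist across reboots.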
My current Torque settings for a typical queue are:
> set queue q2d resources_default.mem = 2000mb
> set queue q2d resources_default.pmem = 2000mb
> set queue q2d resources_default.pvmem = 4000mb
> set queue q2d resources_max.mem = 16000mb
> set queue q2d resources_max.pmem = 16000mb
> set queue q2d resources_max.pvmem = 16000mb
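These are qmgr directives; what a queue currently has set can
be dumped back on the Torque server with e.g.:
> $ qmgr -c "print queue q2d"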
I am somewhat unsure as to whether I should set 'vmem' too, or
'mem' at all, or vice versa.
Any the "Shell Process Limits" section here seems to match:
> http://svr017.gla.scotgrid.ac.uk/factory/auto/logs/2011-06-21/UKI-SCOTGRID-DURHAM-ce01/261112.0.out