Hi, so far I've used separate, more private threads, but this is becoming confusing, so I'll start a single one on TB-SUPPORT. Apologies if I speak a bit of ATLAS-ese.

So far I've enabled the new rss(+swap) scheme [1] on a multicore queue at RAL, Glasgow and Brunel, where the ARC/HTCondor combination made it straightforward: it is basically a renaming of an AGIS parameter (maxmemory -> maxrss). I have the impression each site has a slightly different way of handling jobs that exceed the memory limit, though, and as discussed at an ops meeting we should document that in the wiki. I worked on the CREAM/torque side too: Manchester has rss+swap enabled on all the queues, and the parameters are passed to one cluster. I've attached the script for torque.

Some observations:

* I left the _Min entries in case other users use them (biomed every now and then sends jobs with the parameters set), but ATLAS doesn't.
* The parameters can be set in the AGIS PandaQueues: maxrss, maxswap, maxtime. If any of the three is not defined, the old scheme is used.
* We used Glue1 in the end; no point in changing that.
* cputime and walltime in Glue1 are in minutes and need to be converted back to seconds, hence the factor of 60 in the script.
* cputime = ncores * walltime, if you want to use it; if you want to keep a 48h limit (or whatever you have), you need to set maxtime=6h on the multicore queue.
* GlueHostMainMemoryRAMSize is assigned to mem, but torque/maui cannot kill on mem anymore, because RLIMIT_RSS is no longer used by the kernel. If you want to use this to limit the jobs, you need to use vmem.
* GlueHostMainMemoryVirtualSize = maxrss+maxswap is assigned to vmem, but without cgroups a process's vmem is not rss+swap anymore: it is the address space, which is slightly larger than the nominal 4GB ATLAS asks for. Overcommitting works only for the mem parameter, not for vmem. That said, if you have a large unused swap on your nodes, you can get away with increasing maxswap.
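To make the mapping above concrete, here is a minimal sketch (not the attached script; function name and units are my assumptions, taking maxrss/maxswap in MB, maxtime in seconds and Glue1 times in minutes) of how the AGIS parameters turn into the limits handed to torque:

```python
# Hypothetical sketch of the AGIS -> torque limit mapping described above.
# This is NOT the attached script; names and units are illustrative.

def torque_limits(maxrss_mb, maxswap_mb, maxtime_s, glue_walltime_min):
    """Derive torque/maui resource limits from AGIS-style queue parameters."""
    return {
        # Glue1 publishes walltime in minutes; torque wants seconds,
        # hence the factor of 60.
        "walltime_s": glue_walltime_min * 60,
        # GlueHostMainMemoryRAMSize -> mem. Advisory only: modern kernels
        # ignore RLIMIT_RSS, so torque/maui cannot actually kill on mem.
        "mem_mb": maxrss_mb,
        # GlueHostMainMemoryVirtualSize = maxrss + maxswap -> vmem,
        # the limit the batch system can enforce without cgroups.
        "vmem_mb": maxrss_mb + maxswap_mb,
        "maxtime_s": maxtime_s,
    }
```

For the multicore queue below (maxrss=16GB, maxswap=16GB) this gives vmem=32GB, which is why a large unused swap lets you stretch maxswap.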
So far only single-core jobs have needed this treatment. On my queues I now have:

* single core, short analysis: maxrss=2GB, maxswap=2GB, maxtime=4h
* single core, analysis and production: maxrss=2GB, maxswap=3GB, maxtime=48h
* multicore production: maxrss=16GB, maxswap=16GB, maxtime=6h

I don't know how many UK sites using torque/maui would benefit from this work, since we are trying to eliminate it, but if you want the jobs to pass the parameters, you can now do that. This was also done with an eye to the SoGE sites, which may not move to another batch system but may have similar memory problems.

Possible next steps:

* Would RAL like to go ahead with other queues?
* Would any other ARC-CE/HTCondor site like to try?
* Would any site which still has CREAM/torque want to try this script?
* Would any SGE site want to adapt it to their setup? (I asked Matt, but he has UGE with the possibility of enabling cgroups, so Sussex is in a situation more akin to the ARC/HTCondor sites.)

cheers
alessandra

[1] https://drive.google.com/file/d/0B_tp6usAhDinWDFzU1F1dXk0b0U/view?usp=sharing

--
Respect is a rational process. \\//