Hi,

so far I've used separate, more private threads, but this is becoming 
confusing, so I'll start a single one on TB-SUPPORT. Apologies if I 
speak a bit of ATLAS-ese.

So far I've enabled the new rss(+swap) scheme [1] on a multicore queue 
at RAL, Glasgow and Brunel, where the ARC/HTCondor combination made it 
straightforward because it is basically a renaming of an AGIS parameter 
(maxmemory -> maxrss). Even so, I have the impression each site has a 
slightly different way of tackling jobs that exceed the memory limit, 
and as discussed at an ops meeting we should document that in the wiki.
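
Just to illustrate what the rename means in practice (the numbers below 
are placeholders, not what any of the three sites actually set), the 
per-queue memory description goes from a single maxmemory figure to 
separate rss and swap figures:

# Illustration only: placeholder values, not any site's actual AGIS entry.
old_scheme = {"maxmemory": 4000}    # one number covering everything (MB)
new_scheme = {"maxrss": 2000,       # resident memory the job may use (MB)
              "maxswap": 2000}      # swap allowance on top of the rss (MB)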

I've worked on the CREAM/torque side too: Manchester has rss+swap 
enabled on all the queues and the parameters are passed to one cluster. 
I've attached the script for torque. Some observations:

* I left the _Min entries in case other users use them (biomed every 
now and then sends jobs with parameters set) but ATLAS doesn't.
* The parameters can be set in the AGIS PandaQueues: maxrss, maxswap, 
maxtime. If any one of the three is not defined, the old scheme will be 
used.
* We used Glue1 in the end; no point in changing that.
* cputime and walltime in Glue1 are in minutes and need to be converted 
back to seconds; that's why there is a factor of 60 there (see the 
sketch after the queue settings below).
* cputime = ncores*walltime if you want to use it; if you want to keep 
a 48h cpu limit (or whatever you have), you need to set maxtime=6h on 
the multicore queue (e.g. 8 cores x 6h = 48h).
* GlueHostMainMemoryRAMSize is assigned to mem, but torque/maui can no 
longer kill on mem because RLIMIT_RSS isn't used by the kernel any 
more. If you want to use this to limit the jobs you need to use vmem.
* GlueHostMainMemoryVirtualSize = maxrss+maxswap is assigned to vmem, 
but without cgroups the vmem of a process is not rss+swap; it is the 
address space, which is slightly larger than the nominal 4GB ATLAS asks 
for. Overcommitting works only for the mem parameter, not for vmem. 
That said, if you have a large unused swap on your nodes you can get 
away with increasing maxswap. So far only single-core jobs have needed 
this treatment.

So on my queues I now have:

* single core short analysis: maxrss=2GB, maxswap=2GB, maxtime=4h
* single core analysis and production: maxrss=2GB, maxswap=3GB, maxtime=48h
* multicore production: maxrss=16GB, maxswap=16GB, maxtime=6h
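
To make the conversions above concrete, here is a minimal sketch of the 
same arithmetic in Python. It is not the attached torque script; the 
function name and the 8-core count for the multicore queue are my own 
assumptions, it just illustrates how the Glue1 values map onto torque 
limits.

# Sketch of the Glue1 -> torque limit arithmetic described above.
# Not the attached script: the function name and the 8-core count for
# the multicore queue are assumptions made for illustration.

def torque_limits(maxrss_mb, maxswap_mb, walltime_min, cputime_min):
    """Turn Glue1-style values (MB, minutes) into torque -l limits."""
    walltime_s = walltime_min * 60      # Glue1 walltime is in minutes
    cputime_s = cputime_min * 60        # same factor of 60 for cputime
    vmem_mb = maxrss_mb + maxswap_mb    # GlueHostMainMemoryVirtualSize
    return {
        "mem": "%dmb" % maxrss_mb,      # GlueHostMainMemoryRAMSize
        "vmem": "%dmb" % vmem_mb,       # rss+swap: what maui can kill on
        "walltime": walltime_s,
        "cput": cputime_s,
    }

# Multicore production queue: maxrss=16GB, maxswap=16GB, maxtime=6h;
# cputime = ncores*walltime, i.e. 8 x 6h = 48h of cpu time.
ncores = 8
print(torque_limits(16 * 1024, 16 * 1024, 6 * 60, ncores * 6 * 60))

With the single-core analysis and production settings above (maxrss=2GB, 
maxswap=3GB, maxtime=48h) the same arithmetic gives vmem=5GB and 
cput = walltime = 48h.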

I don't know how many sites using torque/maui in the UK would benefit 
from this work, since we are trying to eliminate it, but if you want 
the jobs to pass the parameters, now you can do that. Also, this was 
done with an eye on the SoGE sites, which may not move to another batch 
system but may have similar memory problems.

Possible other steps:

* Would RAL like to go ahead with other queues?
* Would any other ARC-CE/HTCondor site like to try?
* Would any site which still has CREAM/torque want to try this script?
* Would any SGE site want to adapt it to their setup? (I asked Matt, 
but he has UGE with the possibility of enabling cgroups, so Sussex is 
in a situation more akin to the ARC/HTCondor sites.)

cheers
alessandra

[1] 
https://drive.google.com/file/d/0B_tp6usAhDinWDFzU1F1dXk0b0U/view?usp=sharing

-- 
Respect is a rational process. \\//