Hi Matt,
I understand your circumspection. That said, you have one of the few
batch systems that can allegedly handle it (note that until you test it,
it will remain "allegedly"). These are the sites that can handle this if
they have cgroups enabled:
1) ARC-CE/HTCondor sites: RAL, RALPP, Oxford, Liverpool, Glasgow, Brunel
--> tested on some queues, would like to extend --> use only rss and
walltime
2) ARC-CE/Slurm sites: Durham --> not tested, but NDGF sites work --> use
only rss and walltime
3) CREAM/UGE: Sussex --> not tested, but enabled on local queues --> can
use rss, vmem, walltime, cputime(?)
At CREAM/Torque and CREAM/SoGE sites rss, vmem, cputime and walltime can
all be passed and used for scheduling or limiting, but my
recommendation is still to avoid vmem. Yesterday I was ambivalent about
it, but today I'm a bit less so: even with 5GB (2GB maxrss + 3GB
maxswap) Manchester still killed 7 jobs yesterday that shouldn't have
been killed. Fair enough, it was 7/2748 total prod jobs, but it could
have been 0/2748. Of course you may get the runaway job that brings
down a machine with 24/32/48/... jobs on it, and this is why it is
important to be able to limit the jobs, but these batch systems don't do
the right thing, as they cut on the wrong value.
cheers
alessandra
On 26/03/2015 09:54, Matt Rásó-Barnett wrote:
> On Wed, Mar 25, 2015 at 01:23:42PM +0000, Alessandra Forti wrote:
>> * Would any site which still has CREAM/Torque want to try this script?
>> * Would any SGE site want to adapt it to their site (I asked Matt but
>> he has UGE with the possibility to enable cgroups so Sussex is in a
>> situation more akin to ARC/Htcondor sites)?
>
> I am interested to try this with UGE, however I've been hesitant to
> change anything this month as I just want to have March be uneventful
> and Sussex not to be on the low-reliability list again :) And I
> remember having a lot of problems when I first started because of jobs
> being killed due to the h_vmem limit in the SGE script until Chris W
> kindly informed me about the changes QMUL made to this script.
>
> So I'm going to be a coward and look at this in a couple of weeks once
> the March figures are in, but will definitely do it before GridPP34.
>
> Sorry,
> Matt
--
Respect is a rational process. \\//