Hi,

I am pretty sure that here at NIKHEF, MAXPROC=160 really means MAXPROC=160 ... I will keep an eye on it.

Hmm, just checked: according to the following page

   http://www.clusterresources.com/products/maui/docs/6.2throttlingpolicies.shtml

(see "hard and soft limits" near the bottom), if only one number is specified, it is a hard limit. Indeed, you can specify SOFT,HARD if you like.

Pretty cool: the first Google hit for 'maxproc maui' is a NIKHEF page ;-)

			J "google is your friend" T
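To make the two forms concrete, here is a minimal maui.cfg sketch of that syntax, reusing the atlas line from the configuration quoted further down the thread. The numbers are only illustrative, and the soft-limit behaviour is as Antun describes below (the soft cap can be exceeded only when the farm would otherwise sit idle), so treat this as a sketch rather than a recommendation:

   # hard limit only: atlas never runs on more than 160 processors
   GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160

   # soft,hard: normally capped at 160, allowed up to 230 when idle
   # resources would otherwise go unused
   GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160,230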
Antun Balaz wrote:
> Hi,
> I thought that MAXPROC defines a soft limit if you write it as MAXPROC=160,
> i.e. if the rest of the farm is empty, jobs can be executed on more than 160
> CPUs, depending on the decisions made by maui. You can also define a hard
> limit in the form MAXPROC=160,230. Does this work in practice?
>
> Regards, Antun
>
> -----
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3160260, Ext. 152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade
> Serbia and Montenegro
> -----
>
>
> ---------- Original Message -----------
> From: Jeff Templon <[log in to unmask]>
> To: [log in to unmask]
> Sent: Fri, 9 Sep 2005 23:54:42 +0200
> Subject: Re: [LCG-ROLLOUT] Atlas with atlas, dteam with dteam, Kodak with Kodak, etc.
>
>> Hi,
>>
>> MAXPROC indeed means no more than 160. We tend to adjust these things
>> depending on what's happening on the farm, for instance if we notice that
>> things are essentially empty except for one group. We wind up adjusting it
>> roughly once every two weeks. Not too bad.
>>
>>   JT
>>
>> Dan Schrager wrote:
>>
>>> Does MAXPROC mean that you won't be able to run more than 160 atlas jobs,
>>> even if the rest of the farm is empty?
>>> And does it mean that if 230 simulation jobs came in at the same time
>>> (no free slot left), the next SFT dteam job would run 48 hours later?
>>>
>>> I would still keep one reserved dteam node.
>>>
>>> If having (almost) all dteam jobs run on the same node is not good, then
>>> add a minus sign after the reserved class name:
>>>
>>> SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam-
>>>
>>> This way all nodes will be selected for dteam jobs and the reserved node
>>> will be used just as a last resort -- but will (almost) always be there.
>>>
>>> Jeff Templon wrote:
>>>
>>>> Yo,
>>>>
>>>> thinking about this, I tentatively conclude that it's a bad idea to
>>>> dedicate a single worker node to dteam jobs. The reason is that this WN
>>>> may not be representative of your whole farm. We've seen often enough in
>>>> the past that some worker nodes are fine while others have problems.
>>>>
>>>> Go for fair shares.
>>>>
>>>> A problem worker node will eat jobs, so there is a reasonable chance that
>>>> if it is open to dteam, it will eat a dteam job too ... which is what you
>>>> want to happen. If you have a node dedicated to dteam jobs, its
>>>> utilization will likely be lower than the rest of your farm, so things
>>>> that die under stress will not die as quickly on this node ... you get
>>>> the picture.
>>>>
>>>> Something else: smaller sites should be careful about making long queues.
>>>> In the best case, the number of jobs you should expect to end in any
>>>> period t is only
>>>>
>>>>   N * t / T
>>>>
>>>> where N is the number of jobs you have running and T is how long these
>>>> jobs run on average. This assumes these N jobs have all started at random
>>>> times during the last period T (not before, since they would by
>>>> definition already have finished, and not after, since then they would
>>>> not have started yet ;-)
>>>>
>>>> 10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ... expect
>>>> 0.07 jobs to end in this period ... in other words, you should expect a
>>>> slot to open up on average every two hours or so. In reality it will be
>>>> worse, since jobs tend to come in batches.
>>>>
>>>>   J "Friday night Grid Philosophy" T
>>>>
>>>> Jeff Templon wrote:
>>>>
>>>>> Yo,
>>>>>
>>>>> we use process caps. Here is an abbreviated example:
>>>>>
>>>>> GROUPCFG[dteam]   FSTARGET=2   PRIORITY=5000 MAXPROC=32
>>>>> GROUPCFG[alice]   FSTARGET=15  PRIORITY=100  MAXPROC=100 ADEF=lhc
>>>>> GROUPCFG[atlas]   FSTARGET=50  PRIORITY=100  MAXPROC=160 ADEF=lhc
>>>>> GROUPCFG[atlsgm]  FSTARGET=50  PRIORITY=100  MAXPROC=160 ADEF=lhc
>>>>> GROUPCFG[lhcb]    FSTARGET=35  PRIORITY=100  MAXPROC=230 ADEF=lhc
>>>>> GROUPCFG[lhcbsgm] FSTARGET=35  PRIORITY=100  MAXPROC=230 ADEF=lhc
>>>>> GROUPCFG[cms]     FSTARGET=1-  PRIORITY=1    MAXPROC=10  ADEF=lhc
>>>>>
>>>>> GROUPCFG[esr]     FSTARGET=5   PRIORITY=50   MAXPROC=32  ADEF=nlgrid
>>>>> GROUPCFG[ncf]     FSTARGET=40  PRIORITY=100  MAXPROC=120 ADEF=nlgrid
>>>>> GROUPCFG[asci]    FSTARGET=40  PRIORITY=100  MAXPROC=120 ADEF=nlgrid
>>>>> GROUPCFG[pvier]   FSTARGET=5   PRIORITY=100  MAXPROC=12  ADEF=nlgrid
>>>>>
>>>>> ACCOUNTCFG[lhc]    FSTARGET=50 MAXPROC=230
>>>>> ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
>>>>>
>>>>> Note that we give dteam a very high priority but a very low fair share
>>>>> and a rather severe process cap. On the other hand, the LHC groups all
>>>>> have a rather high fair share and are limited to 230 processes in total.
>>>>> Right now we have 246 CPUs in the farm, so it is impossible for the LHC
>>>>> groups alone to take all our CPUs. Sometimes they are all full, but that
>>>>> is during times when we have e.g. 180 LHC jobs running, 50 from biomed,
>>>>> and 16 from dzero. In most cases, though, we are not full, so dteam jobs
>>>>> run immediately.
>>>>>
>>>>> Even when we are full it's not a problem. For a big site, being full
>>>>> isn't so bad, because with lots of jobs you have a relatively large
>>>>> number of jobs ending during any given time period.
>>>>>
>>>>>   JT
>>>>>
>>>>> Mario David wrote:
>>>>>
>>>>>> Hi Dan,
>>>>>> how do you set a WN to dteam only with pbs/maui?
>>>>>>
>>>>>> We are having problems because all nodes are full of atlas and cms jobs
>>>>>> and the dteam SFT doesn't get in, despite fair shares in maui.conf.
>>>>>> In the past I had tried to assign specific nodes to specific groups in
>>>>>> qmgr, but was not successful.
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Mario
>>>>>>
>>>>>> Quoting Dan Schrager <[log in to unmask]>:
>>>>>>
>>>>>>> Dear Christine,
>>>>>>>
>>>>>>> I have deleted your simulation(?) job, run as user dteam at my site,
>>>>>>> because it was blocking the unique WN reserved for short dteam (SFT
>>>>>>> kind) jobs.
>>>>>>> In the future, please use an atlas certificate for such purposes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Dan
>>>>>>>
> ------- End of Original Message -------
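To make Dan's standing-reservation suggestion above concrete (it is also the most direct answer to Mario's question about getting SFT jobs in), here is a minimal maui.cfg sketch built around the single SRCFG line quoted in the thread. The hostname is Dan's example, and the effect of the trailing minus sign is stated as Dan describes it, not independently verified:

   # Reserve one worker node for the dteam class at all times; only
   # dteam jobs may use the reserved resources.
   SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam

   # Same reservation, but with a trailing '-' after the class name:
   # per Dan's description, dteam jobs are then scheduled across the
   # whole farm and fall back to the reserved node only as a last
   # resort, so the node is not left idle-but-dedicated.
   SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam-

Either variant only guarantees that dteam can get a slot somewhere; caps such as the GROUPCFG[dteam] MAXPROC=32 line above are still what keep dteam from taking more.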
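And to spell out the arithmetic behind Jeff's "slot every two hours or so" estimate, using his numbers (this is the expectation only; the batching he mentions makes the real wait longer):

   Expected jobs ending in a window t:  N * t / T = 10 * 10 min / 1440 min ~ 0.07
   Mean time between job completions:   T / N = 1440 min / 10 = 144 min ~ 2.4 hours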