Hi
I am pretty sure that here at NIKHEF, MAXPROC=160 really means no more than
160 ... I will keep an eye on it.
hmm, just checked: according to the following page
http://www.clusterresources.com/products/maui/docs/6.2throttlingpolicies.shtml
(see "hard and soft limits" near the bottom): if only one number is
specified, it is a hard limit. You can indeed specify SOFT,HARD if you like.
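
For illustration only (the numbers below are just the ones from Antun's
example, not our actual config), the SOFT,HARD form in maui.cfg would look
something like

  GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160,230 ADEF=lhc

i.e. atlas would normally be capped at 160 processes, but maui could let it
climb to 230 if the rest of the farm would otherwise sit idle.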
pretty cool, the first google hit for 'maxproc maui' is a NIKHEF page ;-)
J "google is your friend" T
Antun Balaz wrote:
> Hi,
> I thought that MAXPROC defines a soft limit if you write it as MAXPROC=160,
> i.e. if the rest of the farm is empty, jobs can be executed on more than 160
> CPUs, depending on the decisions made by maui. You can also define a hard
> limit in the form MAXPROC=160,230. Does this work in practice?
>
> Regards, Antun
>
> -----
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3160260, Ext. 152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade
> Serbia and Montenegro
> -----
>
>
> ---------- Original Message -----------
> From: Jeff Templon <[log in to unmask]>
> To: [log in to unmask]
> Sent: Fri, 9 Sep 2005 23:54:42 +0200
> Subject: Re: [LCG-ROLLOUT] Atlas with atlas, dteam with dteam, Kodak with
> Kodak, etc.
>
>
>>Hi,
>>
>>MAXPROC indeed means no more than 160. We tend to adjust these
>>things depending on what's happening on the farm, like if we notice
>>that things are essentially empty except for one group. We wind up
>>adjusting it roughly once every two weeks. Not too bad.
>>
>> JT
>>
>>Dan Schrager wrote:
>>
>>>Does MAXPROC mean that you won't be able to run more than 160 atlas jobs,
>>>even if the rest of the farm is empty?
>>>And does it mean that if 230 simulation jobs arrive at the same time
>>>(no free slot left), the next SFT dteam job will run 48 hours later?
>>>
>>>I would still keep one dteam reserved node.
>>>
>>>If having (almost) all dteam jobs run on the same node is not good, then
>>>add a minus sign after the reserved class name:
>>>
>>>SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam-
>>>
>>>This way all nodes can be selected for dteam jobs, and the reserved node
>>>will be used only as a last resort -- but it is (almost) always there.
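>>>
>>>(For comparison, and only as a sketch: without the minus sign, i.e.
>>>
>>>SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam
>>>
>>>dteam jobs would get positive affinity for the reservation and prefer the
>>>reserved node; the trailing minus gives them negative affinity, so they
>>>fall back to it only when nothing else is free.)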
>>>
>>>
>>>
>>>
>>>
>>>Jeff Templon wrote:
>>>
>>>
>>>>Yo
>>>>
>>>>Thinking about this, I tentatively conclude that it's a bad idea to
>>>>dedicate a single worker node to dteam jobs. The reason is that this
>>>>WN may not be representative of your whole farm. We've seen often
>>>>enough in the past that some worker nodes are fine while others have
>>>>problems.
>>>>
>>>>Go for fair shares.
>>>>
>>>>A problem worker node will eat jobs, thus there is a reasonable chance
>>>>that if it is open to dteam, it will eat a dteam job too ... which is
>>>>what you want to happen. If you have a node dedicated to dteam jobs,
>>>>its utilization will likely be lower than the rest of your farm, so
>>>>things that die under stress will not die as quickly on this node ...
>>>>you get the picture.
>>>>
>>>>Something else: smaller sites should be careful about making long
>>>>queues. In the best case, the number of jobs you should expect to be
>>>>ending in any period t will only be
>>>>
>>>> N * t / T
>>>>
>>>>where N is the number of jobs you have running, and T is how long
>>>>these jobs run on average. This assumes these N jobs have all started
>>>>at random times during the last period T (not before, since they would
>>>>have by definition already finished, and not after, since then they
>>>>would not have started yet ;-)
>>>>
>>>>10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ... expect
>>>>0.07 jobs to end in this period ... in other words, you should expect
>>>>on average a slot to open up every two hours or so. In reality it will
>>>>be worse, since jobs tend to come in batches.
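>>>>
>>>>(Spelling out the arithmetic: N = 10 jobs, T = 24 h = 1440 min, and
>>>>t = 10 min, so N * t / T = 10 * 10 / 1440 ~ 0.07 jobs ending in that
>>>>ten-minute window; put differently, a job ends on average only every
>>>>T / N = 2.4 hours, hence "every two hours or so" above.)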
>>>>
>>>> J "Friday night Grid Philosophy" T
>>>>
>>>>
>>>>Jeff Templon wrote:
>>>>
>>>>
>>>>>yo,
>>>>>
>>>>>we use process caps. here is an abbreviated example:
>>>>>
>>>>>GROUPCFG[dteam] FSTARGET=2 PRIORITY=5000 MAXPROC=32
>>>>>GROUPCFG[alice] FSTARGET=15 PRIORITY=100 MAXPROC=100 ADEF=lhc
>>>>>GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>>>>>GROUPCFG[atlsgm] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>>>>>GROUPCFG[lhcb] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>>>>>GROUPCFG[lhcbsgm] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>>>>>GROUPCFG[cms] FSTARGET=1- PRIORITY=1 MAXPROC=10 ADEF=lhc
>>>>>
>>>>>GROUPCFG[esr] FSTARGET=5 PRIORITY=50 MAXPROC=32 ADEF=nlgrid
>>>>>GROUPCFG[ncf] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>>>>>GROUPCFG[asci] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>>>>>GROUPCFG[pvier] FSTARGET=5 PRIORITY=100 MAXPROC=12 ADEF=nlgrid
>>>>>
>>>>>ACCOUNTCFG[lhc] FSTARGET=50 MAXPROC=230
>>>>>ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
>>>>>
>>>>>Note that we give dteam a very high priority but a very low fair
>>>>>share and a rather severe process cap. On the other hand, the LHC
>>>>>groups all have a rather high fair share, and are limited to 230
>>>>>processes in total. Right now we have 246 CPUs in the farm, so it is
>>>>>impossible for just LHC to take all our CPUs. Sometimes they are all
>>>>>full, but this is during times when we have e.g. 180 LHC jobs running,
>>>>>50 from biomed, and 16 from dzero. But in most cases we are not full,
>>>>>so dteam jobs run immediately.
>>>>>
>>>>>Even when we are full it's not a problem. For a big site, being full
>>>>>isn't so bad because with lots of jobs, you have a relatively large
>>>>>number of jobs ending during any given time period.
>>>>>
>>>>> JT
>>>>>
>>>>>Mario David wrote:
>>>>>
>>>>>
>>>>>>Hi Dan
>>>>>>how do you dedicate a WN to dteam only with pbs/maui?
>>>>>>
>>>>>>we are having problems because all nodes are full of atlas and cms
>>>>>>jobs and dteam SFT doesn't get in, despite fairshares in maui.conf.
>>>>>>In the past I had tried to set specific nodes to specific groups in
>>>>>>qmgr but was not successful.
>>>>>>
>>>>>>cheers
>>>>>>
>>>>>>Mario
>>>>>>
>>>>>>Quoting Dan Schrager <[log in to unmask]>:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Dear Christine,
>>>>>>>
>>>>>>>I have deleted your simulation(?) job, run as user dteam at my site,
>>>>>>>because it was blocking the single WN reserved for short dteam (SFT
>>>>>>>kind) jobs.
>>>>>>>In the future, please use an atlas certificate for such purposes.
>>>>>>>
>>>>>>>Regards,
>>>>>>>Dan
>>>>>>>
>>>>>>>
>>>
> ------- End of Original Message -------