Hi
I am pretty sure that here at NIKHEF, MAXPROC=160 really means no more than
160 ... I will keep an eye on it.
hmm, just checked: according to the following page
http://www.clusterresources.com/products/maui/docs/6.2throttlingpolicies.shtml
(see "hard and soft limits" near the bottom): if only one number is
specified, it is a hard limit. You can indeed specify SOFT,HARD if you like.
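
For illustration only (the numbers below are just the ones from Antun's
example, not our actual config), the SOFT,HARD form in maui.cfg would look
something like

  GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160,230 ADEF=lhc

i.e. atlas would normally be capped at 160 processes, but maui could let it
climb to 230 if the rest of the farm would otherwise sit idle.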
pretty cool, the first google hit for 'maxproc maui' is a NIKHEF page ;-)
J "google is your friend" T
Antun Balaz wrote:
> Hi,
> I thought that MAXPROC defines a soft limit if you write it as MAXPROC=160,
> i.e. if the rest of the farm is empty, jobs can be executed on more than 160
> CPUs, depending on the decisions made by maui. You can also define a hard
> limit in the form MAXPROC=160,230. Does this work in practice?
>
> Regards, Antun
>
> -----
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3160260, Ext. 152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade
> Serbia and Montenegro
> -----
>
>
> ---------- Original Message -----------
> From: Jeff Templon <[log in to unmask]>
> To: [log in to unmask]
> Sent: Fri, 9 Sep 2005 23:54:42 +0200
> Subject: Re: [LCG-ROLLOUT] Atlas with atlas, dteam with dteam, Kodak with
> Kodak, etc.
>
>
>>Hi,
>>
>>MAXPROC indeed means no more than 160. We tend to adjust these
>>things depending on what's happening on the farm, like if we notice
>>that things are essentially empty except for one group. We wind up
>>adjusting it roughly once every two weeks. Not too bad.
>>
>> JT
>>
>>Dan Schrager wrote:
>>
>>>Does MAXPROC mean that you won't be able to run more than 160 atlas jobs,
>>>even if the rest of the farm is empty?
>>>And does it mean that if 230 simulation jobs arrive at the same time
>>>(no free slot left), the next SFT dteam job will run 48 hours later?
>>>
>>>I would still keep one dteam reserved node.
>>>
>>>If having (almost) all dteam jobs run on the same node is not good, then
>>>add a minus sign after the reserved class name:
>>>
>>>SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam-
>>>
>>>This way all nodes can be selected for dteam jobs, and the reserved node
>>>will be used only as a last resort -- but it is (almost) always there.
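>>>
>>>(For comparison, and only as a sketch: without the minus sign, i.e.
>>>
>>>SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam
>>>
>>>dteam jobs would get positive affinity for the reservation and prefer the
>>>reserved node; the trailing minus gives them negative affinity, so they
>>>fall back to it only when nothing else is free.)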
>>>
>>>
>>>
>>>
>>>
>>>Jeff Templon wrote:
>>>
>>>
>>>>Yo
>>>>
>>>>Thinking about this, I tentatively conclude that it's a bad idea to
>>>>dedicate a single worker node to dteam jobs. The reason is that this
>>>>WN may not be representative of your whole farm. We've seen often
>>>>enough in the past that some worker nodes are fine while others have
>>>>problems.
>>>>
>>>>Go for fair shares.
>>>>
>>>>A problem worker node will eat jobs, thus there is a reasonable chance
>>>>that if it is open to dteam, it will eat a dteam job too ... which is
>>>>what you want to happen. If you have a node dedicated to dteam jobs,
>>>>its utilization will likely be lower than the rest of your farm, so
>>>>things that die under stress will not die as quickly on this node ...
>>>>you get the picture.
>>>>
>>>>Something else: smaller sites should be careful about making long
>>>>queues. In the best case, the number of jobs you should expect to be
>>>>ending in any period t will only be
>>>>
>>>> N * t / T
>>>>
>>>>where N is the number of jobs you have running, and T is how long
>>>>these jobs run on average. This assumes these N jobs have all started
>>>>at random times during the last period T (not before, since they would
>>>>have by definition already finished, and not after, since then they
>>>>would not have started yet ;-)
>>>>
>>>>10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ... expect
>>>>0.07 jobs to end in this period ... in other words, you should expect
>>>>on average a slot to open up every two hours or so. In reality it will
>>>>be worse, since jobs tend to come in batches.
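>>>>
>>>>(Spelling out the arithmetic: N = 10 jobs, T = 24 h = 1440 min, and
>>>>t = 10 min, so N * t / T = 10 * 10 / 1440 ~ 0.07 jobs ending in that
>>>>ten-minute window; put differently, a job ends on average only every
>>>>T / N = 2.4 hours, hence "every two hours or so" above.)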
>>>>
>>>> J "Friday night Grid Philosophy" T
>>>>
>>>>
>>>>Jeff Templon wrote:
>>>>
>>>>
>>>>>yo,
>>>>>
>>>>>we use process caps. here is an abbreviated example:
>>>>>
>>>>>GROUPCFG[dteam] FSTARGET=2 PRIORITY=5000 MAXPROC=32
>>>>>GROUPCFG[alice] FSTARGET=15 PRIORITY=100 MAXPROC=100 ADEF=lhc
>>>>>GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>>>>>GROUPCFG[atlsgm] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>>>>>GROUPCFG[lhcb] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>>>>>GROUPCFG[lhcbsgm] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>>>>>GROUPCFG[cms] FSTARGET=1- PRIORITY=1 MAXPROC=10 ADEF=lhc
>>>>>
>>>>>GROUPCFG[esr] FSTARGET=5 PRIORITY=50 MAXPROC=32 ADEF=nlgrid
>>>>>GROUPCFG[ncf] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>>>>>GROUPCFG[asci] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>>>>>GROUPCFG[pvier] FSTARGET=5 PRIORITY=100 MAXPROC=12 ADEF=nlgrid
>>>>>
>>>>>ACCOUNTCFG[lhc] FSTARGET=50 MAXPROC=230
>>>>>ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
>>>>>
>>>>>Note that we give dteam a very high priority but a very low fair
>>>>>share and a rather severe process cap. On the other hand, the LHC
>>>>>groups all have a rather high fair share, and are limited to 230
>>>>>processes in total. Right now we have 246 CPUs in the farm, so it is
>>>>>impossible for just LHC to take all our CPUs. Sometimes they are all
>>>>>full, but this is during times when we have e.g. 180 LHC jobs running,
>>>>>50 from biomed, and 16 from dzero. But in most cases we are not full,
>>>>>so dteam jobs run immediately.
>>>>>
>>>>>Even when we are full it's not a problem. For a big site, being full
>>>>>isn't so bad because with lots of jobs, you have a relatively large
>>>>>number of jobs ending during any given time period.
>>>>>
>>>>> JT
>>>>>
>>>>>Mario David wrote:
>>>>>
>>>>>
>>>>>>Hi Dan
>>>>>>how do you dedicate a WN to dteam only with pbs/maui?
>>>>>>
>>>>>>we are having problems because all nodes are full of atlas and cms
>>>>>>jobs and dteam SFT doesn't get in, despite fairshares in maui.conf.
>>>>>>In the past I had tried to set specific nodes to specific groups in
>>>>>>qmgr but was not successful.
>>>>>>
>>>>>>cheers
>>>>>>
>>>>>>Mario
>>>>>>
>>>>>>Quoting Dan Schrager <[log in to unmask]>:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Dear Christine,
>>>>>>>
>>>>>>>I have deleted your simulation(?) job, run as user dteam at my site,
>>>>>>>because it was blocking the single WN reserved for short dteam (SFT
>>>>>>>kind) jobs.
>>>>>>>In the future, please use an atlas certificate for such purposes.
>>>>>>>
>>>>>>>Regards,
>>>>>>>Dan
>>>>>>>
>>>>>>>
>>>
> ------- End of Original Message -------