Does MAXPROC mean that you won't be able to run more than 160 atlas jobs,
even if the rest of the farm is empty?
And does it mean that if 230 simulation jobs arrived at the same time
(no free slot left), the next SFT dteam job would run 48 hours later?
I would still keep one node reserved for dteam.
If having (almost) all dteam jobs run on the same node is undesirable, then
add a minus sign after the reserved class name:
SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam-
This way all nodes will be selected for dteam jobs and the reserved node
will be used just as a last resort -- but (almost) always there.
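
For contrast, a minimal sketch of the default form (this assumes Maui's
reservation ACL affinity syntax; without the trailing minus, dteam jobs are
steered to the reserved node first instead of using it only as a last resort):

# default affinity: dteam jobs land on the reserved node whenever it is free
SRCFG[dteam] PERIOD=INFINITY HOSTLIST=eio99.pp.weizmann.ac.il CLASSLIST=dteam
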
Jeff Templon wrote:
> Yo
>
> Thinking about this, I tentatively conclude that it's a bad idea to
> dedicate a single worker node to dteam jobs. The reason is that this
> WN may not be representative of your whole farm. We've seen often
> enough in the past that some worker nodes are fine while others have
> problems.
>
> Go for fair shares.
>
> A problem worker node will eat jobs, thus there is a reasonable chance
> that if it is open to dteam, it will eat a dteam job too ... which is
> what you want to happen. If you have a node dedicated to dteam jobs,
> its utilization will likely be lower than the rest of your farm, so
> things that die under stress will not die as quickly on this node ...
> you get the picture.
>
> Something else: smaller sites should be careful about making long
> queues. In the best case, the number of jobs you should expect to be
> ending in any period t will only be
>
> N * t / T
>
> where N is the number of jobs you have running, and T is how long
> these jobs run on average. This assumes these N jobs have all started
> at random times during the last period T (not before, since they would
> have by definition already finished, and not after, since then they
> would not have started yet ;-)
>
> 10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ... expect
> 0.07 jobs to end in this period ... in other words you should expect
> on average a slot to open up every two hours or so. In reality it will
> be worse since jobs tend to come in batches.
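>
> To spell out that arithmetic (same figures): N = 10 running jobs,
> t = 10 minutes = 1/6 hour, T = 24 hours, so
>
>    N * t / T = 10 * (1/6) / 24 ~= 0.07
>
> jobs expected to end in those ten minutes; equivalently, one slot frees
> up roughly every T / N = 24 h / 10 = 2.4 hours on average.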
>
> J "Friday night Grid Philosophy" T
>
>
> Jeff Templon wrote:
>
>> yo,
>>
>> we use process caps. here is an abbreviated example:
>>
>> GROUPCFG[dteam] FSTARGET=2 PRIORITY=5000 MAXPROC=32
>> GROUPCFG[alice] FSTARGET=15 PRIORITY=100 MAXPROC=100 ADEF=lhc
>> GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>> GROUPCFG[atlsgm] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
>> GROUPCFG[lhcb] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>> GROUPCFG[lhcbsgm] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
>> GROUPCFG[cms] FSTARGET=1- PRIORITY=1 MAXPROC=10 ADEF=lhc
>>
>> GROUPCFG[esr] FSTARGET=5 PRIORITY=50 MAXPROC=32 ADEF=nlgrid
>> GROUPCFG[ncf] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>> GROUPCFG[asci] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
>> GROUPCFG[pvier] FSTARGET=5 PRIORITY=100 MAXPROC=12 ADEF=nlgrid
>>
>>
>> ACCOUNTCFG[lhc] FSTARGET=50 MAXPROC=230
>> ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
>>
>> Note that we give dteam a very high priority but a very low fair
>> share and a rather severe process cap. On the other hand, the LHC
>> groups all have a rather high fair share, and are limited to 230
>> processes in total. Right now we have 246 CPUs in the farm, so it is
>> impossible for LHC alone to take all our CPUs. Sometimes the farm is
>> completely full, but only at times when we have e.g. 180 LHC jobs
>> running, 50 from biomed, and 16 from dzero. But in most cases we are
>> not full, so dteam jobs run immediately.
>>
>> Even when we are full it's not a problem. For a big site, being full
>> isn't so bad because with lots of jobs, you have a relatively large
>> number of jobs ending during any given time period.
>>
>> JT
>>
>> Mario David wrote:
>>
>>> Hi Dan
>>> how do you restrict a WN to dteam only with pbs/maui?
>>>
>>> We are having problems because all nodes are full of atlas and cms jobs,
>>> so the dteam SFT jobs don't get in, despite the fairshares in maui.conf.
>>> In the past I tried to assign specific nodes to specific groups in
>>> qmgr, but was not successful.
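>>>
>>> (For reference, the usual qmgr route for that is roughly as follows --
>>> with a hypothetical node name wn01, untested here:
>>>
>>>   qmgr -c "set node wn01 properties = dteam"
>>>   qmgr -c "set queue dteam resources_default.neednodes = dteam"
>>>
>>> i.e. tag the node with a property and tie the dteam queue to it.)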
>>>
>>> cheers
>>>
>>> Mario
>>>
>>> Quoting Dan Schrager <[log in to unmask]>:
>>>
>>>
>>>> Dear Christine,
>>>>
>>>> I have deleted your simulation(?) job, run as user dteam at my site,
>>>> because it was blocking the single WN reserved for short dteam (SFT
>>>> type) jobs.
>>>> In the future, please use an atlas certificate for such purposes.
>>>>
>>>> Regards,
>>>> Dan
>>>>
>>>>