Yo
Thinking about this, I tentatively conclude that it's a bad idea to
dedicate a single worker node to dteam jobs. The reason is that this WN
may not be representative of your whole farm. We've seen often enough
in the past that some worker nodes are fine while others have problems.
Go for fair shares.
A problem worker node will eat jobs, thus there is a reasonable chance
that if it is open to dteam, it will eat a dteam job too ... which is
what you want to happen. If you have a node dedicated to dteam jobs,
its utilization will likely be lower than the rest of your farm, so
things that die under stress will not die as quickly on this node ...
you get the picture.
Something else: smaller sites should be careful about making long
queues. In the best case, the number of jobs you should expect to be
ending in any period t will only be
N * t / T
where N is the number of jobs you have running, and T is how long these
jobs run on average. This assumes these N jobs have all started at
random times during the last period T (not before, since they would have
by definition already finished, and not after, since then they would not
have started yet) ;-)
Example: 10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ...
expect 10 * 10 / 1440 = 0.07 jobs to end in this period ... in other
words you should expect on average a slot to open up every two and a
half hours or so (T / N = 144 minutes). In reality it will be worse
since jobs tend to come in batches.
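
For anyone who wants to plug in their own site's numbers, here is a
rough sketch of the arithmetic above in Python. The function names are
made up for illustration and are not part of maui, pbs or any other
tool, and it only covers the best case (uniformly spread start times,
no batching). The 246-CPU line borrows the farm size quoted below and
assumes, purely for comparison, that jobs there also run 24 hours.

```python
def expected_job_ends(n_running, window, mean_runtime):
    """Best-case expected number of jobs ending within `window`.

    Implements N * t / T, assuming the N running jobs started at
    uniformly random times during the last period T.
    `window` and `mean_runtime` must use the same time unit.
    """
    return n_running * window / mean_runtime


def mean_time_between_free_slots(n_running, mean_runtime):
    """Average time between job ends (T / N), same unit as mean_runtime."""
    return mean_runtime / n_running


DAY = 24 * 60  # 24-hour jobs, expressed in minutes

# Small site: 10 CPUs, ten-minute wait.
print(round(expected_job_ends(10, 10, DAY), 2))        # 0.07
print(mean_time_between_free_slots(10, DAY) / 60)      # 2.4 (hours)

# Big site (assumption: 246 CPUs all running 24-hour jobs).
print(round(expected_job_ends(246, 10, DAY), 2))       # 1.71
```

With the same job length, the big farm sees more than one job end in
any ten-minute window, while the small one waits hours for a slot.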
J "Friday night Grid Philosophy" T
Jeff Templon wrote:
> yo,
>
> we use process caps. here is an abbreviated example:
>
> GROUPCFG[dteam] FSTARGET=2 PRIORITY=5000 MAXPROC=32
> GROUPCFG[alice] FSTARGET=15 PRIORITY=100 MAXPROC=100 ADEF=lhc
> GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
> GROUPCFG[atlsgm] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
> GROUPCFG[lhcb] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
> GROUPCFG[lhcbsgm] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
> GROUPCFG[cms] FSTARGET=1- PRIORITY=1 MAXPROC=10 ADEF=lhc
>
> GROUPCFG[esr] FSTARGET=5 PRIORITY=50 MAXPROC=32 ADEF=nlgrid
> GROUPCFG[ncf] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
> GROUPCFG[asci] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
> GROUPCFG[pvier] FSTARGET=5 PRIORITY=100 MAXPROC=12 ADEF=nlgrid
>
>
> ACCOUNTCFG[lhc] FSTARGET=50 MAXPROC=230
> ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
>
> Note that we give dteam a very high priority but a very low fair share
> and a rather severe process cap. On the other hand, the LHC groups all
> have a rather high fair share, and are limited to 230 processes in
> total. Right now we have 246 CPUs in the farm, so it is impossible for
> just LHC to take all our CPUs. Sometimes they are all full, but this is
> during times that we have e.g. 180 LHC jobs running, 50 from biomed,
> and 16 from dzero. But in most cases we are not full, so dteam jobs run
> immediately.
>
> Even when we are full it's not a problem. For a big site, being full
> isn't so bad because with lots of jobs, you have a relatively large
> number of jobs ending during any given time period.
>
> JT
>
> Mario David wrote:
>
>> Hi Dan
>> how do you set a WN only to dteam with pbs/maui?
>>
>> we are having problems because all nodes are full of atlas and cms jobs
>> and dteam sft doesn't get in, despite fairshares in maui.conf.
>> In the past I had tried to set specific nodes to specific groups in
>> the qmgr but was not successful.
>>
>> cheers
>>
>> Mario
>>
>> Quoting Dan Schrager:
>>
>>
>>> Dear Christine,
>>>
>>> I have deleted your simulation(?) job run as user dteam at my site
>>> because it was blocking the only WN reserved for short dteam (SFT
>>> kind) jobs.
>>> In future, use an atlas certificate for such purposes.
>>>
>>> Regards,
>>> Dan
>>>
>>>