Hi Jeff and all,
My colleague has implemented one of the solutions given in this thread.
The point you make is quite right; we presently have 16 CPUs (a small site).
We configured the queues according to the time needed for atlas and cms jobs
to run.
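For context, tuning those per-queue limits with Torque's qmgr looks roughly
like the sketch below; the queue names and walltime values are only
placeholders, not our exact settings:

  # placeholder queue names and limits, not our real configuration
  qmgr -c "set queue atlas resources_max.walltime = 48:00:00"
  qmgr -c "set queue cms   resources_max.walltime = 36:00:00"
  qmgr -c "set queue dteam resources_max.walltime = 00:30:00"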
So, up to now, the dteam SFT jobs sometimes fail; this means JL or JS,
not the specific tests.
The atlas and cms jobs seem to run quite well: I see them in the pbs
status with high running times, and (hopefully) most important of all,
nobody from atlas or cms has ever complained about our site or stopped
sending jobs here (I mean blacklisted the site).
And that's it for the status.
cheers
Mario David
Quoting Jeff Templon <[log in to unmask]>:
> Yo
>
> Thinking about this, I tentatively conclude that it's a bad idea to
> dedicate a single worker node to dteam jobs. The reason is that this WN
> may not be representative of your whole farm. We've seen often enough
> in the past that some worker nodes are fine while others have problems.
>
> Go for fair shares.
>
> A problem worker node will eat jobs, thus there is a reasonable chance
> that if it is open to dteam, it will eat a dteam job too ... which is
> what you want to happen. If you have a node dedicated to dteam jobs,
> its utilization will likely be lower than the rest of your farm, so
> things that die under stress will not die as quickly on this node ...
> you get the picture.
>
> Something else: smaller sites should be careful about making long
> queues. In the best case, the number of jobs you should expect to be
> ending in any period t will only be
>
> N * t / T
>
> where N is the number of jobs you have running, and T is how long these
> jobs run on average. This assumes these N jobs have all started at
> random times during the last period T (not before, since they would have
> by definition already finished, and not after, since then they would not
> have started yet ;-)
>
> 10 CPUs, ten minutes waiting for a job to end, 24-hour jobs ... expect
> 0.07 jobs to end in this period ... in other words you should expect on
> average a slot to open up every two hours or so. In reality it will be
> worse since jobs tend to come in batches.
>
> J "Friday night Grid Philosophy" T
>
>
> Jeff Templon wrote:
> > yo,
> >
> > we use process caps. here is an abbreviated example:
> >
> > GROUPCFG[dteam] FSTARGET=2 PRIORITY=5000 MAXPROC=32
> > GROUPCFG[alice] FSTARGET=15 PRIORITY=100 MAXPROC=100 ADEF=lhc
> > GROUPCFG[atlas] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
> > GROUPCFG[atlsgm] FSTARGET=50 PRIORITY=100 MAXPROC=160 ADEF=lhc
> > GROUPCFG[lhcb] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
> > GROUPCFG[lhcbsgm] FSTARGET=35 PRIORITY=100 MAXPROC=230 ADEF=lhc
> > GROUPCFG[cms] FSTARGET=1- PRIORITY=1 MAXPROC=10 ADEF=lhc
> >
> > GROUPCFG[esr] FSTARGET=5 PRIORITY=50 MAXPROC=32 ADEF=nlgrid
> > GROUPCFG[ncf] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
> > GROUPCFG[asci] FSTARGET=40 PRIORITY=100 MAXPROC=120 ADEF=nlgrid
> > GROUPCFG[pvier] FSTARGET=5 PRIORITY=100 MAXPROC=12 ADEF=nlgrid
> >
> >
> > ACCOUNTCFG[lhc] FSTARGET=50 MAXPROC=230
> > ACCOUNTCFG[nlgrid] FSTARGET=50 MAXPROC=110
> >
> > Note that we give dteam a very high priority but a very low fair share
> > and a rather severe process cap. On the other hand, the LHC groups all
> > have a rather high fair share, and are limited to 230 processes in
> > total. Right now we have 246 CPUs in the farm, so it is impossible for
> > just LHC to take all our CPUs. Sometimes they are all full, but this is
> > during times that we have e.g. 180 LHC jobs running, 50 from biomed,
> > and 16 from dzero. But in most cases we are not full, so dteam jobs run
> > immediately.
> >
> > Even when we are full it's not a problem. For a big site, being full
> > isn't so bad because with lots of jobs, you have a relatively large
> > number of jobs ending during any given time period.
> >
> > JT
> >
> > Mario David wrote:
> >
> >> Hi Dan
> >> how do you set a WN only to dteam with pbs/maui?
> >>
> >> we are having problems because all the nodes are full of atlas and cms
> >> jobs and the dteam SFT jobs don't get in, despite the fair shares in
> >> maui.cfg. In the past I had tried to restrict specific nodes to specific
> >> groups in the qmgr, but was not successful.
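(For reference, what that usually looks like in Torque is a node property
plus a group-ACL queue, sketched here with placeholder names; as Jeff argues
above, fair shares avoid the need for this.)

  # server_priv/nodes: tag the WN with a property (placeholder hostname/property)
  wn01.example.org np=2 dteamonly
  # qmgr: a dteam-only queue steered to that property
  qmgr -c "set queue dteam acl_group_enable = true"
  qmgr -c "set queue dteam acl_groups = dteam"
  qmgr -c "set queue dteam resources_default.neednodes = dteamonly"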
> >>
> >> cheers
> >>
> >> Mario
> >>
> >> Quoting Dan Schrager <[log in to unmask]>:
> >>
> >>
> >>> Dear Christine,
> >>>
> >>> I have deleted your simulation(?) job, run as user dteam at my site,
> >>> because it was blocking the single WN reserved for short dteam
> >>> (SFT-type) jobs.
> >>> In the future, please use an atlas certificate for such purposes.
> >>>
> >>> Regards,
> >>> Dan
> >>>
> >>>
>