> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
>
> > I think it's harmless, but aren't the factories supposed to
> > limit the number so there's only a modest amount queuing at
> > any one time?
> >
>
> ANALY_OX: 587 activated jobs, 597 inactive pilots queued
>
> This is normal, if we don't queue 'em up we won't get through the work.
>
I have a few problems with this:
- Firstly, it's not true; provided you have at least one queued pilot on
the site when a suitable job slot comes free, the work will run. You don't
do any better by having lots sat in the queue.
- Secondly, the last time we were on this topic Graeme said "It's dangerous
to send too many pilots to sites, they get stale."
- Thirdly, I suspect you're scaring off other jobs. At the moment we're
limiting the total number of running pilot jobs to a fraction of the site
so as to avoid saturating the storage, an approach that seems to be working.
However, that means that we can have empty slots on the cluster and still
not run ATLAS pilots in them; under normal circumstances we'd expect to
fill those slots with other work, but as of yesterday afternoon we didn't
have any. I suspect that this is because the large backlog of queued pilots
was making the site's general expected response time very high, so the
WMSes were sending things elsewhere (and yes, that's clearly a bug in the
calculation - if there are free slots the expected response time should
be zero). I restricted our CEs to taking pilots from the Glasgow factory,
which does throttle itself back to maintain a sensible queue size; our
expected response time went down, and we picked up a load of WMS-allocated
work.
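
To make the two behaviours above concrete, here is a rough Python sketch.
It is not the actual factory or WMS code: the function names, the
target_queue default and the response-time formula are all made up for
illustration, and the real calculations may well differ.

```python
def pilots_to_submit(activated_jobs, queued_pilots, target_queue=5):
    """How many new pilots a self-throttling factory would send (hypothetical)."""
    # No waiting work means there is no reason to queue any pilots.
    if activated_jobs <= 0:
        return 0
    # Keep a small buffer queued, but never more than the work available.
    wanted = min(target_queue, activated_jobs)
    # Only top the queue up; never go negative if it is already over target.
    return max(0, wanted - queued_pilots)


def expected_response_time(free_slots, queued_jobs, total_slots, mean_runtime_s):
    """Toy estimate, in seconds, of how long a new job waits to start (assumed formula)."""
    # If there are free slots, a new job can start straight away.
    if free_slots > 0:
        return 0.0
    # Otherwise assume the queue drains at total_slots jobs per mean runtime.
    return queued_jobs * mean_runtime_s / total_slots
```

With the ANALY_OX numbers quoted above (587 activated jobs, 597 pilots
already queued), pilots_to_submit returns 0 for any small target, so a
factory behaving like this would stop adding to the backlog.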
Ewan