On Thu, Aug 6, 2009 at 01:53, Ewan MacMahon wrote:
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>
>> > I think it's harmless, but aren't the factories supposed to
>> > limit the number so there's only a modest amount queuing at
>> > any one time?
>> >
>>
>> ANALY_OX: 587 activated jobs, 597 inactive pilots queued
>>
>> This is normal, if we don't queue 'em up we won't get through the work.
>>
> I have a few problems with this:
>
> - Firstly, it's not true; provided you have at least one queued pilot on
> the site when a suitable job slot comes free, the work will run. You don't
> do any better by having lots sat in the queue.
Agreed - I think Peter is using the last but one version of the
factory which had a tendency to pile up pilots a bit too much.
Peter, you should upgrade to the Totoro release:
http://svr017.gla.scotgrid.ac.uk/factory/release/pyfactory-totoro.tgz
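As a rough illustration of the throttling being discussed, a factory can decide each cycle how many pilots to submit from the number of activated jobs and the number of pilots already queued. This is only a sketch and not the actual pyfactory code: the function name, its parameters and the cap of 20 queued pilots are all assumptions made for the example.

```python
def pilots_to_submit(activated_jobs, queued_pilots, max_queued=20):
    """Return how many new pilots to submit in this cycle.

    Hypothetical throttle, not pyfactory's real logic. The cap
    `max_queued` is an assumed, illustrative value.
    """
    if activated_jobs <= 0:
        # No work waiting, so no point sending pilots that would go stale.
        return 0
    # Never aim for more queued pilots than there are jobs to run,
    # and never more than the cap.
    target = min(activated_jobs, max_queued)
    # Only top the queue up to the target; never go negative.
    return max(0, target - queued_pilots)
```

With the ANALY_OX figures quoted above (587 activated jobs, 597 queued pilots), this sketch would submit nothing further until the queue had drained below the cap.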
>
> - Secondly, the last time we were on this topic Graeme said "It's dangerous
> to send too many pilots to sites, they get stale."
Less of a problem with 40min analysis jobs, but generally true.
>
> - Thirdly, I suspect you're scaring off other jobs. At the moment we're
> limiting the total number of running pilot jobs to a fraction of the site
> so as to avoid saturating the storage, an approach that seems to be working.
> However, that means that we can have empty slots on the cluster and still
> not run atlas pilots in them; under normal circumstances we'd expect to
> fill those slots with other work, but as of yesterday afternoon we didn't
> have any. I suspect that this is because the large backlog of queued pilots
> was making the site's general expected response time very high, so the
> WMSes were sending things elsewhere (and yes, that's clearly a bug in the
> calculation - if there are free slots the expected response time should
> be zero). I restricted our CEs to taking pilots from the Glasgow factory,
> which does throttle itself back to maintain a sensible queue size, our
> expected response time went down, and we picked up a load of WMS allocated
> work.
This one is not our problem. I thought that the WMS used the VO view
which defended against this brokering error? Or was it ATLAS user jobs
via WMS which you got?
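For reference, the behaviour Ewan argues for (zero expected response time whenever there are free slots) could be sketched as below. This is a hypothetical illustration and not the real information-provider calculation: the function name, its parameters and the simple "queued work divided by running slots" estimate are all assumptions.

```python
def estimated_response_time(free_slots, queued_jobs, running_slots,
                            avg_job_seconds):
    """Hypothetical expected response time, in seconds.

    Not the real published calculation; the backlog estimate used
    here is an assumed, deliberately naive model.
    """
    if free_slots > 0:
        # A new job could start straight away, whatever the backlog.
        return 0.0
    if running_slots <= 0:
        # Nothing is running and nothing is free: no sensible estimate.
        return float("inf")
    # Naive estimate: the queued work drains through the running slots.
    return queued_jobs * avg_job_seconds / running_slots
```

Under this sketch a site with free slots reports zero even with 597 pilots queued, so a large pilot backlog alone would not push brokered work elsewhere.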
Graeme
--
Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
Department of Physics and Astronomy, University of Glasgow, Scotland
DEATH TO MEETINGS!