Hi David,
Here are some figures and plots on scheduling efficiency at Liverpool
using (say) ARC/Condor (see attached).
The first plot covers the last month of operations. The scheduling
inefficiency shows up as the gap between the blue line (CPUs available)
and the red line (running jobs). It's efficient enough, but there is a
gap: the efficiency was about 93% overall. It was dragged down by
lengthy periods with no single-core jobs to fill the gaps left by the
multi-core jobs (I have some ideas to improve on that).
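For reference, the efficiency figure can be read as the occupied
fraction of the available slots over the period. Below is a minimal
sketch of that calculation, assuming equally spaced samples of the blue
and red lines; the function name and the sample values are hypothetical
and are not the Liverpool data.

```python
def scheduling_efficiency(available, running):
    """Fraction of available CPU slots that were occupied over all samples.

    Assumes both series are sampled at the same, equally spaced times.
    """
    if len(available) != len(running):
        raise ValueError("series must have the same length")
    total_available = sum(available)
    if total_available == 0:
        raise ValueError("no CPU slots available in this period")
    return sum(running) / total_available


# Hypothetical samples for illustration only (not the Liverpool data).
available = [1000, 1000, 1000, 1000]  # blue line: CPUs available
running = [950, 900, 930, 940]        # red line: running jobs

print(f"{scheduling_efficiency(available, running):.1%}")  # prints 93.0%
```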
Things get much better in the second plot, which covers almost the last
week of operations. There was an almost constant, healthy queue of both
multi-core and single-core jobs (which is essential), and scheduling is
more efficient because of that: about 98% in this period. The gap
between red and blue is tiny.
When the supply of single-core jobs is interrupted, even briefly (e.g.
on the 16th April), you can see the scheduling efficiency drop off and
then recover. (The lack of single-core jobs shows where the black line,
total queued, and the yellow line, multi-core queued, touch each other,
i.e. no single-core jobs are queued.) Even so, the draining system
(called Fallow) reacts and order is soon restored (i.e. low loss due to
draining). BTW, the number of multi-cores running is held near its
setpoint, currently 800 slots (that's what I was after).
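The "lines touch" reading of the plot amounts to total queued minus
multi-core queued reaching zero. Here is a small sketch of that check,
assuming the two queue series are sampled at the same times and counted
in the same units; the function name and the values are made up for
illustration.

```python
def single_core_starved(total_queued, mcore_queued):
    """Return the sample indices where no single-core jobs were queued.

    Single-core queued is taken as total queued minus multi-core queued,
    so the two lines "touch" when the difference is zero.
    """
    return [
        i
        for i, (total, mcore) in enumerate(zip(total_queued, mcore_queued))
        if total - mcore <= 0
    ]


# Hypothetical samples for illustration only.
total_queued = [120, 80, 40, 40, 90]  # black line
mcore_queued = [40, 40, 40, 40, 40]   # yellow line

print(single_core_starved(total_queued, mcore_queued))  # prints [2, 3]
```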
So, in summary, usually better than 95% efficiency, with upwards of 98%
or even 99% with a good supply of both job types.
Cheers,
Ste
On 14/04/17 08:30, sjones wrote:
> Hi David,
>
> I have given a couple of talks on that and I have figures going back a
> few years. I also have some comparative figures using some different
> techniques for draining. The results are varied, but it is quite
> possible to tune a system to maintain a "low" loss due to draining.
>
> I'll do some formal numbers and pictures after Easter, but at
> Liverpool, less than 5% loss is common, with a score/mcore slot
> allocation ratio of (say) 60/40. The main issue, once the system is
> well set-up, is job consistency - there must be a constant, ready
> supply of score and mcore jobs. If there is, long stretches of high
> efficiency can be obtained.
>
> (Of course, very high efficiency can also be obtained if there is no
> supply of either mcore or score, but not both!)
>
> Ste
>
>
> On 2017-04-13 12:43, David Colling wrote:
>> Dear All,
>>
>> There is an argument going on on the CMS computing management list about
>> whether or not CMS' internal management of single and multicore jobs
>> inside a single multicore pilot is more or less efficient than the
>> draining of the node that a batch system must do in order to run a
>> mixture of single and multicore jobs. Thanks to Andrew L. I have figures
>> (or rather graphs) for the inefficiency inside the CMS multicore pilots
>> but does anybody have figures for the inefficiency caused by batch
>> system draining for a mixture of jobs? Perhaps through looking at Atlas
>> jobs where they run single core jobs in single core pilots and multicore
>> jobs in multicore pilots.
>>
>> Any information would be very very welcome.
>>
>> Best,
>> david
--
Steve Jones
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/