Yo,
So now that at least one site has the new ERT installed, and the
framework has been put in the LCG 2.7.0 hit list, it's time to revive an
old discussion about truth in advertising with ERT.
Advertising '0' for ERT is essentially always wrong. There is no
correlation (at least not by design) between the cycle time of a site's
information refresh at the top-level BDIIs and the cycle time of scheduler
passes at that same site. The best you can say is "last time we spoke,
there were no ALICE jobs waiting in the queue, so if you submit now, and
the situation does not change drastically in the short term, your job
will run at the start of the next scheduler cycle".
This means a fair "immediate" estimate for the ERT is one-half the
scheduler cycle (call this T/2 for short below). The framework I built
takes this period into account. It will report an ERT of T/2 for a
completely empty system; it will also report an ERT of T/2 for a system
in which there are waiting jobs, but none of them have been waiting for
longer than T ... because it's reasonable to assume that given enough
free CPUs, those jobs are simply waiting for the next scheduler cycle to
start. When jobs have been waiting for longer than T, then the system
starts more complex estimation.
T is a parameter that you are supposed to set in the conf file.
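
For anyone who wants the rule above spelled out, here is a rough Python
sketch. It is not the framework's actual code: the function name, its
arguments, and the complex_estimate hook are made up for illustration,
and the more complex estimation is deliberately left out.

```python
def estimate_ert(wait_times, cycle_time, complex_estimate=None):
    """Sketch of the 'immediate' ERT rule; all names here are hypothetical.

    wait_times       -- seconds each queued job has waited so far
    cycle_time       -- scheduler cycle T in seconds (the conf-file value)
    complex_estimate -- stand-in callable for the more complex estimation
    """
    half_cycle = cycle_time / 2

    # Empty queue, or no job has waited longer than T: assume any queued
    # jobs are just waiting for the next scheduler pass, so report T/2.
    if not wait_times or max(wait_times) <= cycle_time:
        return half_cycle

    # Some job has waited longer than T: hand off to the complex estimate,
    # which this sketch does not attempt to reproduce.
    if complex_estimate is None:
        raise NotImplementedError("complex estimation is not sketched here")
    return complex_estimate(wait_times, cycle_time)
```

With T set to 360 (a six-minute cycle) an empty queue gives 180 seconds;
with T set to 20 it gives 10 seconds.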
We need an agreement on how to do this. If any sites report fairly,
then all of them should. There is some advantage to doing this if we
ever get 'high-priority' jobs going; those jobs will naturally rather go
to a site where the scheduler cycle is 20 seconds rather than six minutes.
The other option is to fake the scheduler cycle with 0. I am pretty
sure my system can handle this OK, and it does have some advantages in
that the more complex calculation will then kick in immediately when
jobs start to queue, even if they will run soon after; this means that
all the sites with lots of action, but still some free CPUs, will see
random fluctuations in the reported ERT between zero and T, which will
naturally lead to some load balancing.
Is this a topic for Abingdon, or is it too early?
The other question is, is anyone using the VOView stuff yet? I notice
it's awful quiet here, and this might be because we're reporting the
truth right now in the VOView, rather than 'zero'.
J "your turn" T