JISCMail - LCG-ROLLOUT Archives

Yo,

So now that at least one site has the new ERT installed, and the 
framework has been put in the LCG 2.7.0 hit list, it's time to revive an 
old discussion about truth in advertising with ERT.

Advertising '0' for ERT is essentially always wrong.  There is no 
correlation (at least not by design) between the cycle time of a site's 
information refresh at the top-level BDIIs and the cycle time of scheduler 
passes at that same site.  The best you can say is "last time we spoke, 
there were no ALICE jobs waiting in the queue, so if you submit now, and 
the situation does not change drastically in the short term, your job 
will run at the start of the next scheduler cycle".

This means a fair "immediate" estimate for the ERT is one-half the 
scheduler cycle T (call this T/2 for short below), since a job submitted 
at a random moment waits half a cycle on average.  The framework I built 
takes this period into account.  It will report an ERT of T/2 for a 
completely empty system; it will also report an ERT of T/2 for a system 
in which there are waiting jobs, but none of them have been waiting for 
longer than T ... because it's reasonable to assume that given enough 
free CPUs, those jobs are simply waiting for the next scheduler cycle to 
start.  When jobs have been waiting for longer than T, then the system 
starts more complex estimation.
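
To make the rule above concrete, here is a minimal sketch in Python. It is not the framework's actual code: the function names, the argument layout, and especially `placeholder_busy_estimate` are assumptions made up for illustration, since the "more complex estimation" is not described in this mail.

```python
from typing import Callable, Sequence


def placeholder_busy_estimate(wait_times: Sequence[float]) -> float:
    """Hypothetical stand-in for the more complex estimation.

    It just returns the longest current wait; the real calculation
    is not specified here.
    """
    return float(max(wait_times))


def estimate_ert(
    wait_times: Sequence[float],
    cycle: float,
    busy_estimate: Callable[[Sequence[float]], float],
) -> float:
    """Estimated response time in seconds.

    wait_times    -- how long each queued job has been waiting (seconds)
    cycle         -- scheduler cycle T (seconds)
    busy_estimate -- fallback used once some job has waited longer than T
    """
    # Empty queue, or every job has waited at most one cycle:
    # assume jobs are only waiting for the next scheduler pass.
    if not wait_times or max(wait_times) <= cycle:
        return cycle / 2
    # Some job has waited longer than T: hand over to the complex estimate.
    return busy_estimate(wait_times)


T = 120  # example scheduler cycle in seconds (assumed value)
print(estimate_ert([], T, placeholder_busy_estimate))         # 60.0
print(estimate_ert([30, 90], T, placeholder_busy_estimate))   # 60.0
print(estimate_ert([30, 400], T, placeholder_busy_estimate))  # 400.0
```

The first two calls give T/2 because no job has waited longer than T; only the third falls through to the stand-in estimate.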

T is a parameter that you are supposed to set in the conf file.

We need an agreement on how to do this.  If some sites report fairly, 
all of them should.  There is some advantage to doing this if we 
ever get 'high-priority' jobs going; those jobs will naturally prefer 
a site where the scheduler cycle is 20 seconds over one where it is six minutes.
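
As a rough illustration of the 20-second versus six-minute case, the sketch below ranks two idle sites by the T/2 estimate. The site names and the `idle_ert` helper are hypothetical, and real matchmaking would of course look at more than this one number.

```python
def idle_ert(cycle_seconds: float) -> float:
    """ERT advertised by an idle site: half its scheduler cycle."""
    return cycle_seconds / 2


# Hypothetical sites mapped to their scheduler cycle T in seconds.
sites = {"site_slow": 360, "site_fast": 20}

# Lowest advertised ERT first.
ranked = sorted(sites, key=lambda name: idle_ert(sites[name]))

for name in ranked:
    print(name, idle_ert(sites[name]))
# site_fast 10.0
# site_slow 180.0
```

If both sites advertised '0' instead, the ranking would carry no information and the job could just as well land on the six-minute site.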

The other option is to fake the scheduler cycle by setting T to 0.  I am 
pretty sure my system can handle this OK, and it does have some 
advantages: the more complex calculation will then kick in as soon as 
jobs start to queue, even if they will run soon after.  This means that 
all the sites with lots of action, but still some free CPUs, will see 
random fluctuations in the reported ERT between zero and T, which will 
naturally lead to some load balancing.

Is this a topic for Abingdon, or is it too early?

The other question is, is anyone using the VOView stuff yet?  I notice 
it's awful quiet here, and this might be because we're reporting the 
truth right now in the VOView, rather than 'zero'.

	J "your turn" T