Hi Stephen,
Burke, S (Stephen) wrote:
>> #define CE_IP_SI00 500
>>in site-cfg.h. Is this reasonable?
>
> On the whole it's probably best to advertise the slowest speed, if a job
> actually runs faster than expected you could regard it as a bonus, but as
> we've seen, if it's unexpectedly slow it may get killed by the cpu limit on
> the queue. What may be less obvious is what you do if you have, say, 10 slow
> processors and 90 fast ones.
WARNING WARNING DANGER DANGER. Doing this can create major problems, as
LHCb has discovered.
If you have queues with wall clock limits of 2 days (a typical maximum
queue time in LCG), but declare the CPUs to be the slowest in the
cluster, then users whose jobs approach that 2 day limit on a *normal*
(i.e. average-speed) CPU will conclude that their jobs cannot run
there at all. This is exactly what has happened to us with many jobs
going to NIKHEF, CERN, and RAL (and possibly other sites I'm not
aware of).
The strategy you propose only works if you don't ever expect people to
submit long jobs which push the queue time limit. Otherwise, queue time
limits need to be representative.
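To make the arithmetic concrete, here is a minimal sketch of the mismatch. All numbers (the 1000-SI00 reference rating, the 500-SI00 advertised rating, the job and queue lengths) are invented for illustration and are not from any real site configuration:

```python
# Hypothetical illustration: why advertising the slowest CPU speed
# makes long jobs look unrunnable. All numbers are invented.

def job_fits(job_hours_on_reference, advertised_si00, reference_si00,
             queue_wall_limit_hours):
    """Estimate run time on the advertised CPU and compare to the limit."""
    # A lower SpecInt rating means a proportionally longer run time.
    estimated_hours = job_hours_on_reference * reference_si00 / advertised_si00
    return estimated_hours <= queue_wall_limit_hours

# A 30-hour job (timed on a 1000-SI00 reference machine) fits a
# 48-hour queue if the site advertises its true average speed...
print(job_fits(30, 1000, 1000, 48))   # True
# ...but appears not to fit if the site advertises its slowest CPUs,
# even though most nodes would actually finish it well within the limit.
print(job_fits(30, 500, 1000, 48))    # False
```

The second call is the situation described above: the job would finish in 30 hours on most of the cluster, but the published information says 60 hours, so the user (or the broker) gives up.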
In my Ideal World: all queue times would be normalised. The batch
system manager would say "This queue is 24 normalised hours". Everyone
would know what the normalisation standard was, and so could estimate
how long their jobs would take (in normalised time units). If a site
wanted to run multiple processes on the same processor (via HT or
otherwise), it would simply have the local responsibility to adjust
the wall clock limit for the job accordingly. Similarly, normalised
CPU time would be nice.
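The scheme above can be sketched in a few lines. This is only an illustration of the idea, not an existing mechanism; the function name, the 1000-SI00 reference value, and the slot-scaling rule are all assumptions:

```python
# Sketch of the normalisation idea: queues are advertised in
# reference-CPU hours, and each node translates that into the local
# wall clock limit it should enforce. All names/values are assumed.

REFERENCE_SI00 = 1000  # the agreed normalisation standard (assumed)

def local_wall_limit(normalised_hours, node_si00, slots_per_cpu=1):
    """Wall clock limit a node should enforce for a queue advertised
    in normalised hours. A slower node, or one running several job
    slots per processor (e.g. via hyper-threading), grants more wall
    time so the job still gets the same normalised CPU allocation."""
    return normalised_hours * (REFERENCE_SI00 / node_si00) * slots_per_cpu

# A "24 normalised hours" queue on a half-speed (500 SI00) node:
print(local_wall_limit(24, 500))        # 48.0
# The same queue on a full-speed node running two slots per CPU:
print(local_wall_limit(24, 1000, 2))    # 48.0
```

The point is that the user only ever reasons in normalised hours; whether a site runs slow CPUs or oversubscribes fast ones becomes a purely local bookkeeping matter.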
Why isn't such a system in place? I do understand that there are many
cases where hard wall clock limits are required, so those should
probably be kept as well.
Cheers,
Ian
--
Ian Stokes-Rees [log in to unmask]
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes