On 4 Apr 2011, at 14:51, Andrew McNab wrote:
> On 04/04/2011 13:35, Ewan MacMahon wrote:
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>>
>>> That was the initial idea, but the HEPSPEC06 figure depends on more than
>>> the CPU model and MHz (eg the kernel version) so it would need to be at
>>> least per-site, and really per subcluster. That just gets us back to the
>>> CE-queue mapping, which is something Steve's script can do without having
>>> to modify what ATLAS makes available via the dashboard jobsummary.
>>
>> The per-resource mapping is going to be a worse approximation than the
>> cpuid is.
>
> Clearly that's not true for many sites (eg ones that have identical machines behind a particular CE), and given we're talking about the average HEPSPEC06 of the machines in a particular subcluster, averaged over hundreds or thousands of jobs, the statistical fluctuations arising from which machines jobs happen to land on are washed out.
That's not always true.
I/O-bound jobs will take disproportionately longer on the faster CPUs, relative to the slower ones, than the average HEPSPEC06 might suggest.
One could also put more explicit biases into the local cluster scheduler. Sometimes these may be unintentional (e.g. keeping HTC jobs away from the Infiniband machines where possible), yet they still affect the distribution of CPU allocation across job types.
As to the total size of these effects? I don't know.
Sam's point about mapping (cpuid, queue name) to a HEPSPEC06 value would prevent any such systematic errors from being a problem, of course. In the light of an unknown, surely that's the better way to proceed?
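For what it's worth, that mapping could be as simple as a benchmark table keyed on the (cpuid, queue) pair, falling back to a per-queue average when a particular CPU model hasn't been benchmarked. A minimal sketch, with all table contents, names, and figures purely illustrative assumptions (not real benchmark data):

```python
# Hypothetical sketch of a (cpuid, queue name) -> HEPSPEC06 lookup.
# The cpuid strings and HEPSPEC06 figures below are made up for illustration.

# Benchmarked (cpuid, queue) pairs.
HEPSPEC06_BY_CPU_AND_QUEUE = {
    ("Intel(R) Xeon(R) CPU E5520 @ 2.27GHz", "long"): 8.1,
    ("Intel(R) Xeon(R) CPU L5420 @ 2.50GHz", "long"): 6.9,
}

# Per-queue average, used when a cpuid has no benchmark entry.
HEPSPEC06_BY_QUEUE = {"long": 7.5}

def hepspec06(cpuid, queue):
    """Return the HEPSPEC06 figure for a (cpuid, queue) pair.

    Falls back to the per-queue average, and finally to None,
    when the specific CPU model has not been benchmarked.
    """
    specific = HEPSPEC06_BY_CPU_AND_QUEUE.get((cpuid, queue))
    if specific is not None:
        return specific
    return HEPSPEC06_BY_QUEUE.get(queue)
```

The fallback keeps the scheme no worse than the existing per-queue approximation wherever the per-CPU data is missing.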
Stuart