I agree. The environment variable solution will take some time to roll out even if sites are motivated by GridPP to do it. Go with the lookup table.
John
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Andrew McNab
Sent: 05 April 2011 11:18
To: [log in to unmask]
Subject: Re: HEPSPEC06 numbers for GridPP metrics
On 04/04/2011 23:00, John Gordon wrote:
> Alastair, that is the plan. The problem is that panda and the ATLAS dashboards only record raw CPU time. The discussion here has been (or should have been) about how to weight the panda numbers better to reflect the goodness of the site for ATLAS.
>
> For the others who weren't there it's worth pointing out that this isn't an exercise for tb-support. ATLAS will decide what metrics they want to use to reward the performance of sites supporting them. TB-support, or rather Andrew and Alessandra, were volunteered(*) to work out a method for giving ATLAS the information to weight by HS06.
>
> John
>
> (*) yes, volunteer is a transitive verb now:-)
And the ATLAS dashboard can give you CPU time used per-site or per
CE/queue combination, which gets us back to the original email.
Going by what is visible today, are there any other options for getting
CPU time for analysis and for production jobs per site (ideally
per-subcluster, e.g. via the CE/queue name, since CPU model isn't
available on the dashboard)? If not, I think we should collect the list
of CE-queue name to HEPSPEC06/core mappings from sites so Steve can get
that working.
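To make the lookup-table idea concrete, here is a minimal sketch of what the mapping and the weighting could look like. All queue names and HS06 figures below are invented placeholders, not real site data, and the function name is purely illustrative:

```python
# Hypothetical mapping from CE/queue name to HEPSPEC06 per core,
# as would be collected from sites. The entries are made-up
# placeholders, not real measurements.
HS06_PER_CORE = {
    "ce01.example.ac.uk/atlas-long": 9.8,
    "ce02.example.ac.uk/atlas-short": 11.2,
}

def weighted_cpu(queue, cpu_seconds):
    """Scale raw CPU seconds from the dashboard into HS06-seconds
    using the per-queue lookup table."""
    return cpu_seconds * HS06_PER_CORE[queue]
```

The per-site total would then just be the sum of weighted_cpu over that site's queues.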
Ideally, yes, something like an environment variable with the HEPSPEC06
figure to be accounted by each job, recorded by the job and passed up
the chain and delivered as per-site totals would be best. But that is
going to involve changes in multiple levels of ATLAS's chain of stuff
and we need something up and running pretty much now to give people fair
notice before the start of May.
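For comparison, the per-job accounting under the environment-variable route might look something like the sketch below. The variable name and the fallback value are purely illustrative assumptions, not anything a site currently exports:

```python
import os

def hs06_seconds(cpu_seconds, default=10.0):
    """Convert a job's raw CPU seconds to HS06-seconds using a
    hypothetical site-advertised HEPSPEC06_PER_CORE environment
    variable, falling back to a nominal default if unset."""
    per_core = float(os.environ.get("HEPSPEC06_PER_CORE", default))
    return cpu_seconds * per_core
```

The recorded HS06-seconds would then need to be carried up through the pilot and dashboard layers, which is exactly the multi-level change that makes this option slow to deploy.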
Andrew
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Alastair Dewhurst
> Sent: 04 April 2011 22:41
> To: [log in to unmask]
> Subject: Re: HEPSPEC06 numbers for GridPP metrics
>
> Hi
>
> I know I am coming quite late to the discussion and I also wasn't at
> GridPP so apologies if this has already been suggested and discounted
> but if you want a metric of how much work has been done surely we
> should just get this direct from panda.
>
> For any month period it shouldn't be too difficult to get the number
> of production and user jobs each site has run. Given that ATLAS
> requests that Tier 2s run 50% user and 50% production jobs, it would
> then be possible to weight them accordingly (and if ATLAS changed
> their request the weighting could be redone). It would then be
> possible to see which sites were doing the work that got sent to the
> UK. Most of the other suggestions seem to rely on some assumption
> that sites will be mostly full, or the Tier 1 will be mostly up, or
> there will be a constant stream of jobs, and that's not going to be
> the case.
>
> Alastair
>
>
>
> On 4 Apr 2011, at 18:20, Ewan MacMahon wrote:
>
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>> [log in to unmask]] On Behalf Of Stephen Burke
>>>
>>> Alessandra Forti [mailto:[log in to unmask]] said:
>>>> and we are trying to find a metric that expresses how best a site is
>>>> performing.
>>>
>>> So what is the argument that says that one site is performing
>>> better than
>>> another if it gets some jobs that either site could have run? If
>>> the other
>>> site is full, down or blacklisted that should count against it, but
>>> otherwise it's just chance.
>>>
>> Given the increase in experimental computing requirements and the
>> decrease in budgets, I think the underlying assumption is that
>> under normal circumstances (i.e. not when the Tier 1 is dead, for
>> example) there will be work enough to go round, and it's just a
>> question of what we can get through.
>>
>>> (Well, in practice it probably correlates with storage but presumably
>>> that's another metric.)
>>>
>> For ATLAS I think that's actually rather the point - if a site
>> has, for example, a very fast site network link and can get fresh
>> interesting data sets into its storage and ready to analyse faster
>> than other sites can, then that's a genuine improvement in throughput
>> and they should get the credit for it.
>>
>> If we can stop measuring by artificial metrics and actually measure
>> real results delivered then it automatically gives credit for
>> everything a site can do to help and eliminates perverse incentives
>> for things like turning turbo mode off. Or delivering lots of very
>> slow disk to get the 'terabyte days' metric up. Or whatever.
>>
>> Ewan
--
Cheers,
Andrew
--------------------------------------------------------------
Dr Andrew McNab, High Energy Physics, University of Manchester