Alastair, that is the plan. The problem is that panda and the ATLAS dashboards only record raw CPU time. The discussion here has been (or should have been) about how best to weight the panda numbers to reflect how well a site is serving ATLAS.
For the others who weren't there, it's worth pointing out that this isn't an exercise for TB-support. ATLAS will decide what metrics they want to use to reward the performance of sites supporting them. TB-support, or rather Andrew and Alessandra, were volunteered(*) to work out a method for giving ATLAS the information to weight by HS06.
John
(*) yes, volunteer is a transitive verb now:-)
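The HS06 weighting described above could look something like this minimal sketch. The site names, HS06-per-core figures, and data layout are all hypothetical placeholders for illustration; the real inputs would come from panda's accounting records and each site's published benchmark numbers:

```python
# Hypothetical sketch: weight raw panda CPU seconds by each site's
# HS06-per-core figure, so that an hour on faster hardware is credited
# as more delivered work than an hour on slower hardware.
# All numbers below are made up for illustration.

raw_cpu_seconds = {"UKI-SITE-A": 3_600_000, "UKI-SITE-B": 3_600_000}
hs06_per_core = {"UKI-SITE-A": 10.0, "UKI-SITE-B": 14.0}

def hs06_hours(site):
    """Convert a site's raw CPU seconds into HS06-hours."""
    return raw_cpu_seconds[site] / 3600.0 * hs06_per_core[site]

for site in sorted(raw_cpu_seconds):
    print(site, hs06_hours(site))
```

With equal raw CPU time, the site with the faster (higher-HS06) cores is credited with proportionally more HS06-hours, which is the point of the normalisation.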
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Alastair Dewhurst
Sent: 04 April 2011 22:41
To: [log in to unmask]
Subject: Re: HEPSPEC06 numbers for GridPP metrics
Hi
I know I am coming quite late to the discussion, and I also wasn't at
GridPP, so apologies if this has already been suggested and discounted,
but if you want a metric of how much work has been done, surely we
should just get this directly from panda.
For any given month it shouldn't be too difficult to get the number
of production and user jobs each site has run. Given that ATLAS
requests that Tier 2s run 50% user and 50% production jobs, it would
then be possible to weight them accordingly (and if ATLAS changed
their request, the weighting could be redone). It would then be
possible to see which sites were doing the work that got sent to the
UK. Most of the other suggestions seem to rely on some assumption
that sites will be mostly full, or the Tier 1 will be mostly up, or
there will be a constant stream of jobs, and that's not going to be
the case.
Alastair
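Alastair's suggestion above could be sketched roughly as follows. The site names and job counts are invented, and the 50/50 weights simply mirror the stated ATLAS request; they would be updated if that request changed:

```python
# Hypothetical sketch of the per-site monthly job-count metric:
# pull production and user job counts from panda for a month, then
# weight the two categories by the fractions ATLAS asks Tier 2s to run.
# All figures below are illustrative, not real panda data.

PROD_WEIGHT, USER_WEIGHT = 0.5, 0.5  # redo these if the ATLAS request changes

monthly_jobs = {
    "UKI-SITE-A": {"production": 8000, "user": 2000},
    "UKI-SITE-B": {"production": 5000, "user": 5000},
}

def weighted_score(counts):
    """Weighted job count for one site over the month."""
    return PROD_WEIGHT * counts["production"] + USER_WEIGHT * counts["user"]

for site, counts in sorted(monthly_jobs.items()):
    print(site, weighted_score(counts))
```

Because the score is built from work actually delivered rather than from capacity assumptions, it stays meaningful even when sites are not full or the job stream is uneven, which was the objection to the other proposals.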
On 4 Apr 2011, at 18:20, Ewan MacMahon wrote:
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>> [log in to unmask]] On Behalf Of Stephen Burke
>>
>> Alessandra Forti [mailto:[log in to unmask]] said:
>>> and we are trying to find a metric that expresses how well a site is
>>> performing.
>>
>> So what is the argument that says that one site is performing better
>> than another if it gets some jobs that either site could have run? If
>> the other site is full, down or blacklisted that should count against
>> it, but otherwise it's just chance.
>>
> Given the increase in experimental computing requirements and the
> decrease in budgets, I think the underlying assumption is that
> under normal circumstances (i.e. not when the Tier 1 is dead, for
> example) there will be work enough to go round, and it's just a
> question of what we can get through.
>
>> (Well, in practice it probably correlates with storage but presumably
>> that's another metric.)
>>
> For ATLAS I think that's actually rather the point - if a site
> has, for example, a very fast site network link and can get fresh
> interesting data sets into its storage and ready to analyse faster
> than other sites can, then that's a genuine improvement in throughput
> and they should get the credit for it.
>
> If we can stop measuring by artificial metrics and actually measure
> real results delivered, then it automatically gives credit for
> everything a site can do to help and eliminates perverse incentives
> for things like turning turbo mode off. Or delivering lots of very
> slow disk to get the 'terabyte days' metric up. Or whatever.
>
> Ewan