On 22/01/13 15:36, Alessandra Forti wrote:
> The only explanation Chris has given about how representative they are
> is that these jobs use the WMS. My reply is: so do ops, atlas, lhcb and
> CMS nagios tests.
Consider
ce01: passes 100% of jobs
ce02: fails 100% of jobs.
Ops test availability is the "OR" of CE availability. If you have one CE
that fails every job and one that passes every job, ops availability is
100%, even though a user sees 50% of jobs failing (assuming they hit the
CEs equally). Steve's test represents that experience.
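
As a rough illustration of the ce01/ce02 case above, here is a minimal
Python sketch. It is not the actual ops or Steve's test logic; the
function names and the True/False statuses are made up for the example,
and it assumes jobs are spread equally across the CEs.

```python
# Hypothetical CE statuses from the example: True = passes every job,
# False = fails every job.
ce_ok = {"ce01": True, "ce02": False}


def ops_availability(statuses):
    """Site is 'available' if at least one CE passes (the "OR")."""
    return any(statuses.values())


def user_success_rate(statuses):
    """Fraction of jobs that succeed if jobs hit each CE equally often."""
    return sum(statuses.values()) / len(statuses)


print(ops_availability(ce_ok))   # True
print(user_success_rate(ce_ok))  # 0.5
```

The site looks fully available to ops, while only half of the jobs
actually succeed.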
And yes, QMUL is now using the Imperial BDII (and causing it some
stress, I'm afraid), to avoid the BDII issues mentioned below.
Chris
>
> cheers
> alessandra
>
> On 22/01/2013 10:58, Peter Gronbech wrote:
>> Hi Alessandra et al,
>> These tests have been a little unreliable in the past but I think
>> Chris's explanation of how they represent jobs from smaller VOs points
>> out that they can provide useful data.
>> I am curious as to why some sites attract more of these jobs than others.
>>
>> RALPP, Manchester and Oxford account for 85% of the jobs.
>>
>> Is this because we have more CEs? Oxford has 3, RALPP has 3, and
>> Manchester has 3 but only 2 in production.
>>
>> I then wondered why, although Oxford and RALPP are doing well (97%
>> success), we still fail sometimes.
>> All the errors from our sites are down to a failed lcg-cp, which Kashif
>> believes is down to a timeout from the top-level BDII at RAL. Chris W
>> has already opened a ticket about this.
>>
>> However, the errors at Manchester seem to be down to a CVMFS issue on
>> wn2206180, as the error message is:
>> Trying to source:
>> /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc5-gcc43-opt/17.6.0/cmtsite/asetup.sh
>> AtlasOffline 17.6.0
>> Failed to find asetup.sh
>>
>> Alessandra, can you check if this is the case? If so, your score would
>> probably go to 100%. The question is why you don't see the lcg-cp
>> error. Are you using your own top BDII?
>>
>> Thanks Pete
>>
>
>