We seem to be conflating two issues here.

Stuart Purdie wrote:
> Downside to that

Firstly, I've had a bad day, so (to quote Neil McGovern):

A. Because it breaks the logical sequence of discussion
Q. Why is top posting bad?



>  is that these tests are not designated part of an 'official' monitoring framework [0].  This distinction matters for shared clusters where, in particular, at Edinburgh, the SAM tests are prioritised, as they're classed as a 'necessary level of monitoring', but the SLL tests are not.  Running the SLL tests as 'ops' would effectively bless them, making an end run around the official designation.
> 


Steve Lloyd intended his tests to test what a normal user would 
experience. In his case, an ATLAS user.

Looking at the wall of red on
http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html

at the moment, a normal user would be experiencing pretty poor performance.

> In other words: politics on that one are thorny.

My view on this is that Steve's tests should not (must not?) be
prioritised. To do so would just replicate the ops tests.

Why is this?
There seem to be 3 types of errors at the moment:

1) Site problem (horizontal stripes of red) - a site is failing all the
tests.
AIUI, Steve's recent move to AtlasSetup
(http://stevelloydatlastests.blogspot.com/) only works for releases >16,
which may be the cause at some sites.


2) Vertical stripes of red covering most of the sites.

Looking behind them you get:
"Reason	Failed to submit job"

Imperial's WMS looks like the cause of one of these stripes.


3) Then there is the general scattering of red. The error is:
"Failed to retrieve output"

Normally I'd think my CE was broken, but given the number of sites with
this problem, there's something more general going on (or we've all got
overloaded CEs). I've had a quick look, and they all seem to be
lcgwms02.gridpp.rl.ac.uk.

http://pprc.qmul.ac.uk/~lloyd/gridpp/rbtest.html also hints at that WMS
having a problem (though there are also failures from wms01).


> 
> [0] Last time I was aware, anyway.

Kashif's nagios tests, however, do run as the ops VO, and they test a
different thing - that a site can run the various simple jobs.
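
Just to illustrate what I mean (a sketch only - glite-wms-job-submit is the
real gLite client, but the flags, file names and JDL below are from memory
and purely illustrative), those simple test jobs amount to something like:

# Rough sketch of a trivial "can this site run anything at all?" probe.
import subprocess

SIMPLE_JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

def submit_simple_test(jdl_path="simple_test.jdl", jobid_file="jobids.txt"):
    with open(jdl_path, "w") as f:
        f.write(SIMPLE_JDL)
    # -a delegates the (ops) proxy automatically; -o records the job id
    subprocess.run(["glite-wms-job-submit", "-a", "-o", jobid_file, jdl_path],
                   check=True)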



> 
> On 3 Nov 2010, at 17:40, Kashif Mohammad wrote:
> 
>> Hi
>>
>>> The only real way I can think of to alleviate that issue would be to have
>>> a CE and a worker node somewhere that was reserved for these tests; to
>>> guarantee fast response.  (Also: LCG-CE's have about a 15 minute latency -
>>> CREAM is a lot better in this respect).
>>
>> The other option is to start testing the WMS using an ops proxy, and as every
>> site has to give priority to ops jobs, it will definitely hit the worker node.
>>
>> Regards
>> Kashif
>>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Stuart Purdie
>> Sent: 03 November 2010 17:14
>> To: [log in to unmask]
>> Subject: Re: WMS problem
>>
>>
>> On 3 Nov 2010, at 16:35, Peter Gronbech wrote:
>>
>>> Kashif has noticed that the WMS's in the UK are not working very well.
>>>
>>> See Steve Lloyd's page: http://pprc.qmul.ac.uk/~lloyd/gridpp/rbtest.html
>>>
>>> It looks like they may all be overloaded; this is affecting the jobs he
>>> submits from gridppnagios. This means that the status information for some
>>> sites is old (Waiting for a successful job run), and consequently could
>>> adversely affect availability and reliability figures.
>>> Does anybody know why the WMSs are overloaded? Presumably this is
>>> affecting all grid job submission, not just our monitoring jobs.
>>
>>
>> It's not necessarily the WMSen that are the problem here.
>>
>> A monitoring job has to hit a worker node before it completes
>> successfully.  If all the CEs are full (like we are), then it's going to
>> have to queue, irrespective of the WMS.  (Certainly, one of our WMSs -
>> svr022 - looks perfectly calm; the other is only moderately busy).
>>
>> The root problem is that the test _doesn't_ test the WMS only - it (by its
>> very nature) has to be an end-to-end test.
>>
>> The only real way I can think of to alleviate that issue would be to have a
>> CE and a worker node somewhere that was reserved for these tests; to
>> guarantee fast response.  (Also: LCG-CE's have about a 15 minute latency -
>> CREAM is a lot better in this respect).
>>

There's nothing stopping the nagios tests being submitted directly to my
site (via a private WMS that doesn't get overloaded, or whatever). You do
still have the LCG-CE latency and job start time to contend with.
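
As a sketch only (the CE id below is made up, and I'm going from memory on
the -r flag, which sends the job straight to a named CE and bypasses the
WMS match-making):

import subprocess

# Hypothetical CE id - substitute the real queue at whichever site is tested.
CE_ID = "ce01.example.ac.uk:8443/cream-pbs-ops"

def submit_direct(jdl_path, jobid_file="direct_jobids.txt"):
    # -r pins the job to the named CE, so the test exercises the submission
    # path and the CE itself rather than the WMS match-making.
    subprocess.run(["glite-wms-job-submit", "-a", "-r", CE_ID,
                    "-o", jobid_file, jdl_path], check=True)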

The rate of errors on Steve Lloyd's ATLAS tests does seem to indicate
either a problem common to many sites or a WMS problem, though.

Chris