On 3 Nov 2010, at 19:56, Christopher J. Walker wrote:
> We seem to be conflating two issues here.
No - we're discussing the WMS tests.
Everything else you wrote seems to be germane only to the other tests; not the WMS ones.
> Stuart Purdie wrote:
>> Downside to that
>
> Firstly, I've had a bad day, so (to quote Neil McGovern):
>
> A. Because it breaks the logical sequence of discussion
> Q. Why is top posting bad?
>
>> is that these tests are not designated part of an 'official' monitoring framework [0]. This distinction matters for shared clusters: at Edinburgh, in particular, the SAM tests are prioritised, as they're classed as a 'necessary level of monitoring', but the SLL tests are not. Running the SLL tests as 'ops' would effectively bless them, making an end run around the official designation.
>
>
> Steve Lloyd intended his tests to test what a normal user would experience. In his case, an ATLAS user.
>
> Looking at the wall of red on http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
>
> at the moment a normal user would be experiencing pretty poor performance.
>
> > In other words: politics on that one are thorny.
>
> My view on this is that Steve's tests should not (must not?) be prioritised. To do so would replicate the ops tests.
>
> Why is this?
> There seem to be 3 types of errors at the moment:
>
> 1) Site problem (Horizontal stripes of red) - a site is failing all the tests.
> AIUI, Steve's recent move to AtlasSetup http://stevelloydatlastests.blogspot.com/ only works for releases >16 - which may be the cause at some sites.
>
>
> 2) Vertical stripes of red covering most of the sites.
>
> Looking behind them you get:
> "Reason Failed to submit job"
>
> Imperial's WMS looks like the cause of one of these stripes.
>
>
> 3) Then there is the general scattering of red. The error is:
> "Failed to retrieve output"
>
> Normally I'd think my CE was broken, but given the number of sites with this problem, there's something more general going on (or we've all got overloaded CEs). I've had a quick look, and they all seem to be from lcgwms02.gridpp.rl.ac.uk
>
> http://pprc.qmul.ac.uk/~lloyd/gridpp/rbtest.html also hints at that WMS having a problem (but there are also failures from wms01).
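To confirm that one WMS dominates the failures, a minimal sketch along these lines could tally failed jobs per WMS host. This assumes the usual gLite convention that a job ID is a URL whose hostname names the WMS/LB endpoint; the job IDs below are hypothetical, for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse

def wms_failure_counts(failed_job_ids):
    """Tally failed jobs by the WMS host embedded in each gLite job ID.

    gLite job IDs are URLs like https://<wms-host>:9000/<unique-id>,
    so the hostname identifies the WMS/LB endpoint that handled the job.
    """
    return Counter(urlparse(job_id).hostname for job_id in failed_job_ids)

# Hypothetical failed-job IDs, for illustration only.
failed = [
    "https://lcgwms02.gridpp.rl.ac.uk:9000/aaa111",
    "https://lcgwms02.gridpp.rl.ac.uk:9000/bbb222",
    "https://lcgwms01.gridpp.rl.ac.uk:9000/ccc333",
]
print(wms_failure_counts(failed).most_common(1))
# → [('lcgwms02.gridpp.rl.ac.uk', 2)]
```

Fed with the job IDs behind the red squares, this would show at a glance whether lcgwms02 really accounts for most of the "Failed to retrieve output" errors.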
>
>
>> [0] Last time I was aware, anyway.
>
> Kashif's nagios tests, however, do run as the ops VO, and test a different thing: that a site can run the various simple jobs.
>
>
>
>> On 3 Nov 2010, at 17:40, Kashif Mohammad wrote:
>>> Hi
>>>
>>>> The only real way I can think of to alleviate that issue would be to have
>>>> a CE and a worker node somewhere that was reserved for these tests; to
>>>> guarantee fast response. (Also: LCG-CE's have about a 15 minute latency -
>>>> CREAM is a lot better in this respect).
>>>
>>> The other option is to start testing the WMS using an ops proxy; as every
>>> site has to give priority to ops jobs, it will definitely hit a worker node.
>>>
>>> Regards
>>> Kashif
>>>
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes
>>> [mailto:[log in to unmask]] On Behalf Of Stuart Purdie
>>> Sent: 03 November 2010 17:14
>>> To: [log in to unmask]
>>> Subject: Re: WMS problem
>>>
>>>
>>> On 3 Nov 2010, at 16:35, Peter Gronbech wrote:
>>>
>>>> Kashif has noticed that the WMS's in the UK are not working very well.
>>>>
>>>> See Steve Lloyds page: http://pprc.qmul.ac.uk/~lloyd/gridpp/rbtest.html
>>>> It looks like they may all be overloaded; this is affecting the jobs he
>>>> submits from gridppnagios. This means that the status information for some
>>>> sites is old (waiting for a successful job run), and consequently could
>>>> adversely affect availability and reliability figures.
>>>> Does anybody know why the WMSs are overloaded? Presumably this is
>>>> affecting all grid job submission, not just our monitoring jobs.
>>>
>>>
>>> It's not necessarily the WMSen that are the problem here.
>>>
>>> A monitoring job has to hit a worker node before it completes
>>> successfully. If all the CEs are full (like ours are), then it's going to
>>> have to queue, irrespective of the WMS. (Certainly, one of our WMSes -
>>> svr022 - looks perfectly calm; the other is only moderately busy.)
>>>
>>> The root problem is that the test _doesn't_ test the WMS only - by its
>>> very nature it has to be an end-to-end test.
>>>
>>> The only real way I can think of to alleviate that issue would be to have a
>>> CE and a worker node somewhere that was reserved for these tests; to
>>> guarantee fast response. (Also: LCG-CE's have about a 15 minute latency -
>>> CREAM is a lot better in this respect).
>>>
>
> There's nothing stopping nagios tests being submitted directly to my site (via a private WMS that doesn't get overloaded, or whatever). You would still have the LCG-CE latency and job start time, though.
>
> The rate of errors on Steve Lloyd's ATLAS tests does seem to indicate a problem common to many sites, or a WMS problem, though.
>
> Chris