On 16/01/12 20:42, Alessandra Forti wrote:
> Dear all,
>
> I've been asked to contribute to two sessions in the F2F operation TEG
> meeting that will take place next Monday.
>
> https://indico.cern.ch/conferenceDisplay.py?confId=161833
>
> 1) Site perspective and requirements on monitoring in the medium term in
> the morning with a 20' minutes talk
Steve Lloyd's tests are very effective at running a simple test and
reporting whether it succeeded or failed, and what the error was.
GridPP Nagios is poor at this (and that's the fault of the upstream WLCG
Nagios, I think). Perhaps it is user error on my part, but I still have
difficulty working out how many job failures contribute to a period of
failure, what the test is, and how it is failing.
There also sometimes seems to be a lack of understanding that testing
several things at once finds failures, but doesn't make them easy to
debug. E.g. "Why don't we randomly check different VOMS servers?" Well
yes, but then a random 1/3 of jobs fail. To a site, that could equally
mean problems with 1/3 of my worker nodes, etc.
Tests for non-LHC VOs would also be extremely useful (but perhaps not
within the scope of the TEG).
> 2) Site priorities proposal on each area covered by the operation TEG:
> the 3 most important. In this session I'll be also part of the
> discussion panel that will set the medium long term priorities with
> other site representatives (including Ian although I'm not sure he has
> given his availability).
>
> If you want to contribute with your views feel free to contact me and
> I'll try to put together a UK view. We can start to talk about it
> tomorrow at the Ops meeting.
>