I can't speak to the UK, but ...

When it comes to monitoring, all I want is:
a) something that emails me automatically when something goes wrong
and
b) that has a link for further information in it.

Basically nagios.

Don't make me check a webpage, it never ever works and I am speaking from dire experience here.
And don't include a generic link either where I then have to guess which of the n settings I have to check/change to figure out where the error comes from. 

CMS is as guilty of that as Atlas.

Try running tests on a site that is not a member of the experiment (i.e. a T3) and see if that site can understand the error. If it can, you'll do just fine.

Bonus points for a site being able to initiate a test (to check something has been fixed), but that's really a bonus.

Cheers,
Daniela





On 17 September 2013 14:01, Alessandra Forti <[log in to unmask]> wrote:
I sent this to Jeremy thinking he would put it on the agenda, but he told me he wasn't there either.


-------- Original Message --------
Subject: Re: Ops meeting @ 11am
Date: Tue, 17 Sep 2013 10:01:05 +0100
From: Alessandra Forti <[log in to unmask]>
CC: Jeremy Coles <[log in to unmask]>


Hi Jeremy,

as an engineer is repairing the central switch this morning, I 
don't know if I can make it to the meeting or be there reliably.

SL6:

* Bristol postponed
* Glasgow and Lancaster are now in test with atlas queues
* Manchester has brought forward the upgrade by 2 weeks and we have 
declared a week of downtime from the 30th of September until the 7th of 
October.
* Birmingham is done.

* There are problems with the java voms-proxy-info again affecting atlas 
jobs on sites that limit the memory to 3GB (few UK sites are doing 
that). Atlas is thinking of replacing voms-proxy-info with arcproxy. I'm 
giving a talk at the ADC meeting later today to decide what to do.

https://ggus.eu/ws/ticket_info.php?ticket=97230

Monitoring:

I started a discussion about nagios on the sites monitoring 
consolidation list. Only Jeff Templon replied. We need a UK point of 
view. If sites show no interest I don't blame the monitoring people for 
going their own way. If we don't speak, they are right to take these 
decisions almost without consultation.

cheers
alessandra





On 17/09/2013 09:38, Jeremy Coles wrote:
> Dear All
>
> The agenda for today's ops meeting is available at http://indico.cern.ch/conferenceDisplay.py?confId=273350. The plan is to review the GDB updates from last week and check again on the SL6 status (especially to bring out any issues or concerns).
>
> Pete has kindly agreed to chair this week - though if Pete is unable to connect from RAL, please could someone else from the core ops team take control. As Matt mentioned in the tickets email, there will not be an ops meeting next week due to GridPP31 (https://www.gridpp.ac.uk/gridpp31/).
>
> For minutes the list is Mark=6 Wahid=8 Daniela=7 Kashif=7 Matt=7 Chris=7 Alessandra=7 Pete=7 Rob=7 Ewan=7 Brian=7.
>
> regards,
> Jeremy


-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)






--
Sent from the pit of despair

-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/