Thanks for the summary, David. It looks fine to me and I have not seen any other responses.
Jeremy
On 26 Sep 2013, at 11:46, David Crooks wrote:
> Hi all,
>
> After talking a bit with Alessandra, I've tried to summarise everyone's positions into one email that we can send on to the monitoring consolidation group. If anyone has anything they'd like to add to the following, could you let me know? I'll send the summary on tomorrow.
>
> "
> I've attempted to summarise the position of the UK sites that have been in touch about Nagios. The points that were raised were (as a list for clarity):
>
> 1) A desire for a monitoring solution that gives automatic notifications and links to further information, and doesn't require checking additional webpages (which describes Nagios). We noted that Nagios could be used to import central Nagios tests and repurpose them for local testing.
>
> 2) In addition, it would be useful if the further details could include details of the testing execution commands (even including the test itself) for local diagnosis.
>
> 3) We wondered whether (and where) there might be common ground with the WLCG Nagios project - while this may have been discussed, it would be useful to clarify this.
>
> 4) It's important to have a clear and documented messaging/transport layer for any solution that's decided on, for integration with future monitoring solutions.
> "
>
> Cheers,
> David
>
>
> On 19 Sep 2013, at 15:34, David Crooks <[log in to unmask]> wrote:
>
>> Hi,
>>
>> The advantage of copying the probe execution commands into the test results (or having them directly accessible), at least, is that it would be useful with a (hopefully) relatively low documentation overhead.
>>
>> Cheers,
>> David
>>
>> On 19 Sep 2013, at 14:27, Christopher J. Walker <[log in to unmask]> wrote:
>>
>>> On 19/09/13 14:15, L Kreczko wrote:
>>>> Hi,
>>>>
>>>>
>>>> I agree with Daniela and would add to b): when debugging an issue, I
>>>> would find it useful to know the details of the test.
>>>
>>> Yes, I whinge about this every time.
>>>
>>> Also, with the old SAM infrastructure, you could see what the test was,
>>> when tests had been submitted, and whether they had succeeded. With
>>> Nagios, much of that seems hidden.
>>>
>>>
>>>>
>>>> As an example I have the service org.arc.GRIDFTP:
>>>> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=lcgce01.phy.bris.ac.uk&service=org.arc.GRIDFTP-%2Fops%2FRole%3Dlcgadmin
>>>> I can see in the upper right a link labelled 'extra notes' which points
>>>> to http://wiki.nordugrid.org/index.php/Nagios_Tests#org.arc.GRIDFTP
>>>> Here I would like to see what this test consists of. This would ideally
>>>> be a command I can execute locally to get a detailed (and hopefully
>>>> meaningful) error message.
>>>
>>> It doesn't even need to be that - the commands executed by the script -
>>> and their output would probably be sufficient.
>>>
>>> What _I_ want is an overview that says I have a problem, and the ability
>>> to drill down and find what that problem is. The ability to do this for
>>> intermittent failures is also useful.
>>>
>>> Steve Lloyd's tests are actually very good in this regard. You can see
>>> pretty much all of what he does - and by looking at other sites you can
>>> probably determine whether it is a site problem, or a more global problem.
>>>
>>> On a number of occasions, I have seen no problems with Nagios, but used
>>> Steve's tests to debug nodes with an issue.
>>>
>>> Chris
>>>
>>>>
>>>>
>>>> Cheers,
>>>> Luke
>>>>
>>>>
>>>>
>>>>
>>>> On 19 September 2013 13:23, Alessandra Forti <[log in to unmask]> wrote:
>>>>
>>>> In particular, one of the reasons Nagios was chosen was that it is
>>>> possible to import test results from the central Nagios boxes into
>>>> the local one and tailor the alarms according to local taste.
>>>>
>>>> I'd like to know how many sites are doing this. I remember RalPP
>>>> being one of the biggest supporters of this.
>>>>
>>>>
>>>> On 19/09/2013 12:36, Alessandra Forti wrote:
>>>>> Hi Daniela,
>>>>>
>>>>> thanks for the feedback. Does anybody else have an opinion on this?
>>>>>
>>>>> cheers
>>>>> alessandra
>>>>>
>>>>> On 17/09/2013 14:17, Daniela Bauer wrote:
>>>>>> I can't speak to the UK, but ...
>>>>>>
>>>>>> When it comes to monitoring, all I want is:
>>>>>> a) something that emails me automatically when something goes wrong
>>>>>> and
>>>>>> b) that has a link for further information in it.
>>>>>>
>>>>>> Basically nagios.
>>>>>>
>>>>>> Don't make me check a webpage, it never ever works and I am
>>>>>> speaking from dire experience here.
>>>>>> And don't include a generic link either where I then have to
>>>>>> guess which of the n settings I have to check/change to figure
>>>>>> out where the error comes from.
>>>>>>
>>>>>> CMS is as guilty of that as Atlas.
>>>>>>
>>>>>> Try running tests on a site that is not a member of the
>>>>>> experiment (i.e. a T3) and see if this site can understand the
>>>>>> error and you'll do just fine.
>>>>>>
>>>>>> Bonus points for a site being able to initiate a test (to check
>>>>>> something has been fixed), but that's really a bonus.
>>>>>>
>>>>>> Cheers,
>>>>>> Daniela
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *********************************************************
>>>> Dr Lukasz Kreczko +44 (0)117 928 8724
>>>> CMS Group
>>>> School of Physics
>>>> University of Bristol
>>>> *********************************************************
>>
>