Hi,
The advantage of copying the probe execution commands into the test results (or having them directly accessible) is, at least, that it would be useful at a (hopefully) relatively low documentation overhead.
Cheers,
David
On 19 Sep 2013, at 14:27, Christopher J. Walker <[log in to unmask]> wrote:
> On 19/09/13 14:15, L Kreczko wrote:
>> Hi,
>>
>>
>> I agree with Daniela and would add to b): I would find it useful, when
>> debugging an issue, to know the details of the test.
>
> Yes, I whinge about this every time.
>
> Also, with the old SAM infrastructure, you could see what the test was,
> when tests had been submitted, and whether they had succeeded. With
> Nagios, much of that seems hidden.
>
>
>>
>> As an example I have the service org.arc.GRIDFTP:
>> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=lcgce01.phy.bris.ac.uk&service=org.arc.GRIDFTP-%2Fops%2FRole%3Dlcgadmin
>> I can see in the upper right a link labelled 'extra notes' which points
>> to http://wiki.nordugrid.org/index.php/Nagios_Tests#org.arc.GRIDFTP
>> Here I would like to see what this test consists of. This would ideally
>> be a command I can execute locally to get a detailed (and hopefully
>> meaningful) error message.
>
> It doesn't even need to be that - the commands executed by the script,
> and their output, would probably be sufficient.
>
> What _I_ want is an overview that says I have a problem, and the ability
> to drill down and find what that problem is. The ability to do this for
> intermittent failures is also useful.
>
> Steve Lloyd's tests are actually very good in this regard. You can see
> pretty much all of what he does - and by looking at other sites you can
> probably determine whether it is a site problem, or a more global problem.
>
> On a number of occasions, I have seen no problems with Nagios, but used
> Steve's tests to debug nodes with an issue.
>
> Chris
>
>>
>>
>> Cheers,
>> Luke
>>
>>
>>
>>
>> On 19 September 2013 13:23, Alessandra Forti <[log in to unmask]> wrote:
>>
>> In particular, one of the reasons nagios was chosen was that it is
>> possible to import test results from the central nagios boxes into
>> the local one and tailor the alarms according to local taste.
>>
>> I'd like to know how many sites are doing this. I remember RalPP
>> being one of the biggest supporters of this.
>>
>>
>> On 19/09/2013 12:36, Alessandra Forti wrote:
>>> Hi Daniela,
>>>
>>> thanks for the feedback. Does anybody else have an opinion on this?
>>>
>>> cheers
>>> alessandra
>>>
>>> On 17/09/2013 14:17, Daniela Bauer wrote:
>>>> I can't speak to the UK, but ...
>>>>
>>>> When it comes to monitoring, all I want is:
>>>> a) something that emails me automatically when something goes wrong
>>>> and
>>>> b) that has a link for further information in it.
>>>>
>>>> Basically nagios.
>>>>
>>>> Don't make me check a webpage, it never ever works and I am
>>>> speaking from dire experience here.
>>>> And don't include a generic link either where I then have to
>>>> guess which of the n settings I have to check/change to figure
>>>> out where the error comes from.
>>>>
>>>> CMS is as guilty of that as Atlas.
>>>>
>>>> Try running tests on a site that is not a member of the
>>>> experiment (i.e. a T3) and see if this site can understand the
>>>> error and you'll do just fine.
>>>>
>>>> Bonus points for a site being able to initiate a test (to check
>>>> something has been fixed), but that's really a bonus.
>>>>
>>>> Cheers,
>>>> Daniela
>>>>
>>
>>
>>
>> --
>> *********************************************************
>> Dr Lukasz Kreczko +44 (0)117 928 8724
>> CMS Group
>> School of Physics
>> University of Bristol
>> *********************************************************