JISCMail - TB-SUPPORT Archives

David,

Some of these points may be a bit obvious - sorry if that's the case.

Anyway, some architectural principles might be needed. For example:

a) Security and privacy concerns trump others.

b) Any significant condition that can be conveniently caught by a site 
should be caught by a site.

c) Any condition that is caught at a site should be distributed via a 
publish and subscribe pattern.

d) Any concerned monitoring authority should be able to subscribe to get 
notification of any condition caught by a site.

e) Any significant condition that can't be caught by a site may be 
caught by a concerned monitoring authority.

f) Any condition that is caught by a concerned monitoring authority 
should be distributed via a publish and subscribe pattern.

g) Any concerned site should be able to subscribe to get notification of 
any condition caught by a concerned monitoring authority.

Etc.

If we built that kind of flexible, event-driven test and monitoring 
system on those sorts of publish and subscribe principles, 
the graph of alerts could be managed by those actually concerned, rather 
than by any central diktat. For instance, a whole 
aggregating web site could be constructed merely by subscribing to the 
appropriate services to acquire the necessary signals.
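
As a rough illustration of principles (b) to (g), here is a minimal in-memory 
sketch in Python. It is not a proposal for an implementation: the Broker 
class, the topic names and the payload fields are all hypothetical, and a 
real system would sit on top of a proper messaging service.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Everything here is illustrative: Broker, the topic names and the
# payload fields are assumptions, not part of any existing tool.
Handler = Callable[[str, dict], None]


class Broker:
    """In-memory publish/subscribe hub (push pattern)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, condition: dict) -> int:
        """Push a condition to every subscriber; return how many were notified."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, condition)
        return len(handlers)


broker = Broker()
authority_inbox: List[tuple] = []
aggregator_page: List[str] = []

# A monitoring authority subscribes to conditions caught at a site (d).
broker.subscribe("site/example/cvmfs", lambda t, c: authority_inbox.append((t, c)))

# An aggregating page subscribes to a site feed and an authority feed (d, g).
for topic in ("site/example/cvmfs", "authority/example/cvmfs"):
    broker.subscribe(topic, lambda t, c: aggregator_page.append(f"{t}: {c['status']}"))

# The site catches a condition locally and publishes it (b, c).
notified = broker.publish("site/example/cvmfs", {"status": "repository stale"})
```

The point of the sketch is that the site publishing the condition does not 
need to know who is listening; the set of subscriptions is decided by the 
subscribers themselves.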

Note: Obviously, publish and subscribe is a "push pattern", where a 
service notices an event and pushes the alert to all who
have subscribed to it. The alternative is a pull pattern, where 
interested parties poll services to extract the data. I can see
pros and cons for both, but it would be a very big deal to write an 
email on them all!
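
For contrast, the pull alternative might look roughly like the sketch below. 
Again, this is only an assumption-laden illustration: PollableService and 
its cursor-based poll() are hypothetical names, not an existing API.

```python
from typing import List, Tuple


class PollableService:
    """Keeps an append-only log of conditions that interested parties poll.

    Hypothetical sketch of the pull pattern; not an existing API.
    """

    def __init__(self) -> None:
        self._log: List[dict] = []

    def record(self, condition: dict) -> None:
        self._log.append(condition)

    def poll(self, cursor: int = 0) -> Tuple[List[dict], int]:
        """Return conditions recorded since `cursor`, plus the next cursor."""
        return self._log[cursor:], len(self._log)


service = PollableService()
service.record({"status": "repository stale"})

new_items, cursor = service.poll()          # first poll sees the one condition
service.record({"status": "recovered"})
later_items, cursor = service.poll(cursor)  # next poll sees only what is new
idle_items, cursor = service.poll(cursor)   # nothing new: an empty list comes back
```

Here the service keeps no list of subscribers at all, but an interested 
party learns nothing until it next polls, and each one has to remember its 
own cursor.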

Anyway, it's just an idea to make things flexible. There are other ways 
to do it, of course.

Steve






On 07/22/2014 03:31 PM, David Crooks wrote:
> Dear all,
>
> As we talked about in Ops, I'd like to give a bump to my question about feedback on the CVMFS monitoring proposal. As a reminder, Maarten's talk which covers it is here: https://indico.cern.ch/event/305362/session/1/contribution/10/material/slides/0.pdf, pages 34-36.
>
> The points that were made when we talked about this last week (hopefully paraphrasing Ian's points accurately):
>
> 1. The proposal suggests the gathering of low level systems data like CPU and memory usage. In our discussion we felt that this was more detailed than would be necessary for many sites. A suggestion was given that this could be made opt-in so that sites that would find it useful could ask for it.
>
> 2. Ian noted that it would be useful to focus on the functional tests used and make sure that they test the most appropriate things - the existing CVMFS nagios probe might be a useful place to start.
>
> Please let me know by the end of Wednesday if you want to suggest any amendments or additions to these points; subsequently we'll pass them on to the WLCG Ops and monitoring consolidation meetings.
>
> Best wishes,
> David


-- 
Steve Jones
System Administrator                    office: 220
High Energy Physics Division            tel (int): 42334
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 2334
University of Liverpool                 http://www.liv.ac.uk/physics/hep/