David,
Some of these points may be a bit obvious - sorry if that's the case.
Anyway, some architectural principles might be needed. For example:
a) Security and privacy concerns trump others.
b) Any significant condition that can be conveniently caught by a site
should be caught by a site.
c) Any condition that is caught at a site should be distributed via a
publish and subscribe pattern.
d) Any concerned monitoring authority should be able to subscribe to get
notification of any condition caught by a site.
e) Any significant condition that can't be caught by a site may be
caught by a concerned monitoring authority.
f) Any condition that is caught by a concerned monitoring authority
should be distributed via a publish and subscribe pattern.
g) Any concerned site should be able to subscribe to get notification of
any condition caught by a concerned monitoring authority.
Etc.
If we built that kind of flexible, event-driven test and monitoring
system based on those sorts of publish and subscribe principles,
a graph of alerts could be managed by those actually concerned, rather
than by any central diktat. For instance, a whole
aggregating web site could be constructed merely by subscribing to the
appropriate services to acquire the necessary signals.
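To make principles (c) to (g) a bit more concrete, here is a minimal sketch of the publish and subscribe idea in Python. Everything in it is an assumption for illustration only: the AlertBroker class, the topic names "site/cvmfs" and "authority/cvmfs", the site name and the event fields are all hypothetical, and a real system would presumably sit on a proper message broker rather than an in-process dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler receives the topic name and the event payload.
Handler = Callable[[str, dict], None]


class AlertBroker:
    """Hypothetical in-process broker: maps a topic to its subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register interest in a topic (principles d and g)."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Push an event to every subscriber of the topic (principles c and f).

        Returns the number of subscribers that were notified.
        """
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, event)
        return len(handlers)


broker = AlertBroker()

# Two hypothetical consumers: a monitoring authority and an aggregating web site.
authority_inbox: List[tuple] = []
aggregator_inbox: List[tuple] = []

broker.subscribe("site/cvmfs", lambda t, e: authority_inbox.append((t, e)))
broker.subscribe("site/cvmfs", lambda t, e: aggregator_inbox.append((t, e)))
broker.subscribe("authority/cvmfs", lambda t, e: aggregator_inbox.append((t, e)))

# A condition caught at a site (principle b) is published by the site.
delivered_site = broker.publish(
    "site/cvmfs", {"site": "EXAMPLE-SITE", "condition": "repository not mounted"}
)

# A condition only visible from outside (principle e) is published by the authority.
delivered_auth = broker.publish(
    "authority/cvmfs", {"site": "EXAMPLE-SITE", "condition": "stale catalogue"}
)
```

Note that the aggregator here is nothing more than a set of subscriptions; neither publisher needs to know it exists.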
Note: Obviously, publish and subscribe is a "push pattern", where a
service notices an event and pushes the alert to all who
have subscribed to it. The alternative is a pull pattern, where
interested parties poll services to extract the data. I can see
pros and cons for both, but it would be a very big deal to write an
email on them all!
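For comparison, the pull alternative might be sketched roughly as below. Again this is only an illustration under stated assumptions: the PollableService class, its events_since method and the event fields are hypothetical names, not part of any existing probe or service.

```python
from typing import List, Tuple


class PollableService:
    """Hypothetical service that stores events and lets clients poll for them."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def record(self, event: dict) -> None:
        """The service notices a condition and stores it; nothing is pushed."""
        self._events.append(event)

    def events_since(self, cursor: int) -> Tuple[List[dict], int]:
        """Return the events recorded after `cursor`, plus the new cursor value."""
        return self._events[cursor:], len(self._events)


service = PollableService()
service.record({"condition": "probe failed"})

# The interested party keeps its own cursor and polls when it chooses.
cursor = 0
first_batch, cursor = service.events_since(cursor)   # picks up the first event
second_batch, cursor = service.events_since(cursor)  # nothing new yet

service.record({"condition": "probe recovered"})
third_batch, cursor = service.events_since(cursor)   # picks up the second event
```

In this sketch the service holds no list of interested parties; each poller tracks its own position and decides how often to ask.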
Anyway, it's just an idea to make things flexible. There are other ways
to do it, of course.
Steve
On 07/22/2014 03:31 PM, David Crooks wrote:
> Dear all,
>
> As we talked about in Ops, I'd like to give a bump to my question about feedback on the CVMFS monitoring proposal. As a reminder, Maarten's talk which covers it is here: https://indico.cern.ch/event/305362/session/1/contribution/10/material/slides/0.pdf, pages 34-36.
>
> The points that were made when we talked about this last week (hopefully paraphrasing Ian's points accurately):
>
> 1. The proposal suggests the gathering of low level systems data like CPU and memory usage. In our discussion we felt that this was more detailed than would be necessary for many sites. A suggestion was given that this could be made opt-in so that sites that would find it useful could ask for it.
>
> 2. Ian noted that it would be useful to focus on the functional tests used and make sure that they test the most appropriate things - the existing CVMFS nagios probe might be a useful place to start.
>
> Please let me know by the end of Wednesday if you want to suggest any amendments or additions to these points; subsequently we'll pass them on to the WLCG Ops and monitoring consolidation meetings.
>
> Best wishes,
> David
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/