Hi,
I know several sites are using Ganglia for fabric monitoring at some
level ( http://ganglia.sourceforge.net/ ), and it has the useful feature
of being able to aggregate results from multiple sites.
There is already interest from within BaBarGrid in using it in this way,
and so, to get some experience of doing this, I've set up a Ganglia gmetad
daemon and the web scripts on the GridPP website:
http://www.gridpp.ac.uk/ganglia/
This currently has feeds from only two Manchester farms. I suggest we
proceed by adding any more sites that are interested, and then later sort
out how we want to structure things in terms of experiments, Tier-2
centres etc. (The detailed views can be on your own site, so there is a lot
of scope for devolving parts of this, and even offering different views
for different experiments.)
If your site is running Ganglia and you want to participate, please send
me either the host names and port numbers of 2 or 3 of your gmond daemons
(in which case we store all the history on www.gridpp.ac.uk), or the host
name and port number of your gmetad and the URL of your own detailed view
(this avoids any need to have incoming IP access to your worker nodes).
Currently there is a long list of possible DNS names for the web server's
gmetad: gppwww1, gppwww2, grid1, grid2, grid3, all under .hep.man.ac.uk.
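
For anyone wanting to check that a gmond is reachable before sending its
details, a minimal Python sketch follows. It assumes gmond is answering TCP
connections on the usual default port 8649, and that it sends its XML dump
and then closes the connection; your site's configuration may differ. The
host name node01.example.ac.uk and the function names are made up for
illustration only.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=10.0):
    """Connect to a gmond TCP port and return the XML dump it sends.

    Assumption: 8649 is the port your gmond listens on (the usual default).
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the remote end closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def summarise(xml_bytes):
    """Return {cluster name: [host names]} found in a Ganglia XML dump."""
    root = ET.fromstring(xml_bytes)
    return {
        cluster.get("NAME"): [host.get("NAME") for host in cluster.iter("HOST")]
        for cluster in root.iter("CLUSTER")
    }


if __name__ == "__main__":
    # Hypothetical host name: replace with one of your own gmond nodes.
    dump = read_gmond_xml("node01.example.ac.uk")
    for cluster, hosts in summarise(dump).items():
        print(cluster, len(hosts), "hosts")
```

If this lists your cluster and its nodes when run from outside your site,
the same host name and port should be usable as a feed.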
I do also have an RPM which adds CPU and PSU fan speed and temperature
monitoring to Ganglia, along with monitoring of PBS job occupancy for each
node: let me know if you're interested in using this and I'll put together
some notes about it.
Cheers,
Andrew
-------------------------------------------------------------------------
http://www.hep.man.ac.uk/u/mcnab/
+44-161-275-4227 "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=Andrew McNab"
Grid Research, High Energy Physics Group, University of Manchester, UK
-------------------------------------------------------------------------