It seems to me that what is required is some kind of simple independent
watchdog service which can push information "up" regarding the status of
local services, and which is not tied in to other Globus or EDG "pieces"
such that it will continue to function even if Globus and everything else
falls over. The "only" requirements would be that the machine is running,
the network connection is up, and this watchdog service is available.
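
To make the idea concrete, here is a rough sketch of such a watchdog in
Python, using only the standard library so that it does not depend on any
Globus or EDG component. Everything specific in it is my own assumption for
illustration: the function names, the port numbers, the JSON report layout,
and the choice of a single UDP datagram as the way to push the report "up"
to some collector.

```python
import json
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_report(services, host="127.0.0.1", probe=port_is_open):
    """Probe each named service and return a JSON-serialisable report.

    `services` maps a label to a TCP port; both are illustrative.
    """
    return {
        "node": socket.gethostname(),
        "time": int(time.time()),
        "status": {
            name: ("up" if probe(host, port) else "down")
            for name, port in services.items()
        },
    }


def push_report(report, collector):
    """Send the report as one UDP datagram to the (host, port) collector."""
    payload = json.dumps(report).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, collector)


if __name__ == "__main__":
    # Hypothetical service list and collector address; adjust per site.
    SERVICES = {"gatekeeper": 2119, "gridftp": 2811, "gris": 2135}
    COLLECTOR = ("127.0.0.1", 9999)
    while True:
        push_report(build_report(SERVICES), COLLECTOR)
        time.sleep(300)
```

A port check only says whether something is listening, not why a service is
failing, so this would be a starting point rather than a diagnosis tool.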
Isn't this what the GRIS/GIIS/R-GMA/MDS architecture is supposed to do?
Ian.
-----Original Message-----
From: Andrew McNab
> I think there's a more general point here. What we really need is
> some monitoring which not only tells you that a site is not working,
> but tells you why! Detailed testbed monitoring is something which is
> sorely needed at the moment; there are a fair number of web pages
> which give some kind of view of the system, but nothing which really
> enables problems to be diagnosed.
Yes, although that's quite hard to do for job submission problems. Once
a site becomes accessible via the RBs, it's easier to put a series of, say,
SE tests into a script and get diagnostic output. i.e. it's easier to query
an SE about a file when you can actually run a job at that site than to
query the pool account lock files when you can't even get in to run a job.