Hi Pablo,
> Well, I don't think making the sysadmin less aware of the things that happen
> inside the cluster is good at all. If a sysadmin doesn't want to see if a user
> fails his jobs, he/she can just avoid looking at the PBS exit status... but
> what if I want to know? What if I make statistics of how good my cluster is?
> There could be a user problem, but it could also be your SE giving trouble,
> and that affects the sysadmin. Or a timeout, or a full disk somewhere...
>
> Some months ago, I developed a ganglia metric to measure errors in our pbs
> system:
> http://ganglia.lcg.cscs.ch/ganglia//pbserrors.html
> Then I also plot it, and show it in our main monitoring page. It turned out to
> be VERY useful. If you see one failure here or there, you can just assume it's
> the normal grid stuff. But if suddenly the number of errors/time rises, that
> means something bad is going on. If it's all from the same user, it's probably
> a user problem. If it's all from a worker node, or from a CE... well, you can
> see patterns.
>
> There are many more possibilities around this. It's information, usable,
> filterable, and IMHO shouldn't be masked.
Those are good arguments indeed. Feel free to open an RFE in GGUS.
Maybe it could even be made configurable in CREAM...
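
In case it helps when writing the RFE, here is a rough sketch of the kind of
per-user / per-node failure breakdown you describe. It is not your ganglia
metric, just an illustration. It assumes Torque/PBS-style accounting records
of the form "timestamp;E;jobid;key=value ..." carrying user, exec_host and
Exit_status fields; the function names and the sample job IDs, users and
hosts are made up for the example.

```python
from collections import Counter


def parse_end_record(line):
    """Return (user, first_exec_host, exit_status) for a job-end ('E') record.

    Assumed record layout: "timestamp;type;jobid;key=value key=value ...".
    Returns None for any other record type or if Exit_status is missing.
    """
    parts = line.strip().split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    # Turn the space-separated key=value message into a dict.
    fields = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
    try:
        status = int(fields["Exit_status"])
    except (KeyError, ValueError):
        return None
    # exec_host looks like "wn02/0+wn02/1"; keep only the first node name.
    host = fields.get("exec_host", "unknown").split("/")[0]
    return fields.get("user", "unknown"), host, status


def count_failures(lines):
    """Count jobs with a non-zero exit status, grouped by user and by node."""
    by_user, by_host = Counter(), Counter()
    for line in lines:
        record = parse_end_record(line)
        if record is None:
            continue
        user, host, status = record
        if status != 0:
            by_user[user] += 1
            by_host[host] += 1
    return by_user, by_host


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        "03/10/2010 12:00:00;E;101.ce01;user=alice exec_host=wn01/0 Exit_status=0",
        "03/10/2010 12:01:00;E;102.ce01;user=bob exec_host=wn02/0+wn02/1 Exit_status=271",
        "03/10/2010 12:03:00;E;104.ce01;user=bob exec_host=wn03/1 Exit_status=1",
    ]
    users, hosts = count_failures(sample)
    print("failures by user:", dict(users))
    print("failures by node:", dict(hosts))
```

If all failures pile up under one user, it is probably a user problem; if
they pile up under one node, it is more likely a site problem. That
distinction is lost if the exit status is masked.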