On 13 Jun 2007, at 09:42, Burke, S (Stephen) wrote:
> Testbed Support for GridPP member institutes
>> On Behalf Of Graeme Stewart said:
>> Check external networking has to be another basic test ;-)
>
> And also simply monitoring jobs per WN to see if one WN has a lot of
> failures or short jobs.
Indeed. The Nagios test is a good one to alarm on straight away.
The attached script prints, for each node, the node name, the number
of jobs processed and the average cpu/wall time ratio, taken from a
Torque accounting file.
Try something like:
# nodestat.py 20070607 | sort -k 2 -n
(If someone can nagiosify that, be my guest ;-)
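
For anyone reading this without the attachment, here is a rough sketch
of the same idea. It is not the attached nodestat.py. It assumes the
usual Torque accounting layout (semicolon-separated records, with "E"
end-of-job records carrying exec_host, resources_used.cput and
resources_used.walltime as HH:MM:SS), and it takes the path to an
accounting file rather than a date. The function names are made up for
illustration.

```python
import sys
from collections import defaultdict


def hms_to_seconds(text):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Return (node, cpu_seconds, wall_seconds) for an 'E' record, else None."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    # The last field is assumed to be space-separated key=value pairs.
    fields = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
    try:
        # exec_host looks like node/slot+node/slot; keep the first node only.
        node = fields["exec_host"].split("+")[0].split("/")[0]
        cpu = hms_to_seconds(fields["resources_used.cput"])
        wall = hms_to_seconds(fields["resources_used.walltime"])
    except (KeyError, ValueError):
        return None
    return node, cpu, wall


def node_stats(lines):
    """Map node name to (job count, mean cpu/wall ratio over its jobs)."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        record = parse_end_record(line)
        if record is None:
            continue
        node, cpu, wall = record
        # Jobs with zero walltime count as ratio 0 to avoid dividing by zero.
        ratio = cpu / wall if wall > 0 else 0.0
        totals[node][0] += 1
        totals[node][1] += ratio
    return {node: (jobs, total / jobs) for node, (jobs, total) in totals.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as accounting_file:
        for node, (jobs, ratio) in sorted(node_stats(accounting_file).items()):
            print(f"{node} {jobs} {ratio:.2f}")
```

Piping the output through sort -k 2 -n, as above, orders nodes by job
count, so a node eating lots of short jobs ends up at the bottom.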
g
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/