On Tue, 18 Mar 2003, Ian Stokes-Rees wrote:
> One of our worker nodes (tbgen01) has had close to zero load on it for the
> last 24 hours, however it has constantly had active jobs. Currently it is
> executing "globus-url-copy". I am wondering how many of these "active but
> doing nothing" jobs there are, and if there is any way to allow other PBS
> jobs to come in if these "inactive" type jobs are running.
Jobs getting stuck on I/O are a problem on any farm. It just
happens more with EDG farms since there is more to go wrong.
The job will eventually be killed by your max wall time but normally
I try and contact the person and then qdel their job if I am sure
it really is stuck.
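
As a rough illustration of how one might list candidate "active but doing nothing" jobs before contacting anyone, the sketch below parses text in the style of `qstat -f` output and flags running jobs whose CPU time is tiny compared with their wall time. The attribute names `job_state`, `resources_used.cput` and `resources_used.walltime` are the ones PBS normally reports; the job ids, the one-hour minimum and the 1% threshold are made-up values for the example, and wrapped continuation lines in real output are simply ignored here.

```python
import re


def hms_to_seconds(value):
    """Convert an HH:MM:SS string (as printed by qstat -f) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Job Id:\s*(\S+)", line)
        if header:
            current = jobs.setdefault(header.group(1), {})
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
    return jobs


def suspect_jobs(jobs, min_walltime=3600, max_cpu_fraction=0.01):
    """Return ids of running jobs that used almost no CPU for their walltime."""
    suspects = []
    for job_id, attrs in jobs.items():
        if attrs.get("job_state") != "R":
            continue
        try:
            cput = hms_to_seconds(attrs["resources_used.cput"])
            wall = hms_to_seconds(attrs["resources_used.walltime"])
        except (KeyError, ValueError):
            continue  # usage not reported yet, or an unexpected format
        if wall >= min_walltime and cput <= max_cpu_fraction * wall:
            suspects.append(job_id)
    return suspects


# Hypothetical sample in the shape of `qstat -f` output (ids are invented).
SAMPLE = """Job Id: 101.tbce01
    job_state = R
    resources_used.cput = 00:00:03
    resources_used.walltime = 23:10:00
Job Id: 102.tbce01
    job_state = R
    resources_used.cput = 05:30:00
    resources_used.walltime = 06:00:00
"""

if __name__ == "__main__":
    print(suspect_jobs(parse_qstat_full(SAMPLE)))
```

A list like this is only a hint: a job waiting on a slow but healthy transfer looks the same as a stuck one, so it is still worth checking with the owner before using qdel.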
Steve
>
> See the cyan line at http://pptb01.physics.ox.ac.uk/graphs/load.gif (not yet
> dynamically updated) for the load profile for the last several days.
>
> Oxford active jobs can be viewed at:
>
> http://tbce01.physics.ox.ac.uk/cgi-bin/lsh/qstat
>
> And the state of the Worker Nodes is summarised at:
>
> http://tbce01.physics.ox.ac.uk/cgi-bin/lsh/pbsnodes
>
> In other news, I am working on a script which can run from the command line
> or as a CGI which will return in text, XML, HTML, and "single field mode"
> different status indicators. The current state of this can be seen at:
>
> http://tbce01.physics.ox.ac.uk/cgi-bin/lsh/
>
> Simply add one of the commands to the end of the URL and append options
> after that to see what it does. I _believe_ it is quite safe, given the
> filtering I do on all input parameters, and the mapping of a command name to
> an executable via a hash table. I'd be happy for any feedback. I'm hoping
> to come up with something which ties this in with RRDtool and Nagios for
> Grid site monitoring -- Yes, I know, Yet Another Grid Monitoring Project...
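
For readers unfamiliar with the pattern described above (a command name looked up in a hash table, plus filtering of every input parameter), here is a minimal sketch of the idea in Python. It is not the actual script: the table contents, the executable paths, the slash-separated URL layout and the allowed character set are all assumptions made for illustration.

```python
import re

# Hypothetical whitelist: command name taken from the URL -> fixed executable.
# Anything not listed here can never be run, whatever the request says.
COMMANDS = {
    "qstat": ["/usr/bin/qstat"],
    "pbsnodes": ["/usr/bin/pbsnodes"],
}

# Assumed filter: options may contain only these characters.
SAFE_OPTION = re.compile(r"[A-Za-z0-9_.@-]+")


def build_argv(path_info):
    """Turn '/command/opt1/opt2' into an argument list, or raise ValueError."""
    parts = [part for part in path_info.split("/") if part]
    if not parts or parts[0] not in COMMANDS:
        raise ValueError("unknown command")
    options = parts[1:]
    for option in options:
        if not SAFE_OPTION.fullmatch(option):
            raise ValueError("rejected option: %r" % option)
    # The result is a list, intended for execution without a shell.
    return COMMANDS[parts[0]] + options


if __name__ == "__main__":
    print(build_argv("/qstat/-f"))
```

The two checks are complementary: the table limits which programs can be started, and the character filter keeps shell metacharacters out of the options. Passing the resulting list to something like `subprocess.run` without a shell avoids re-interpreting the options.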
>
> Ian.
>
> --
> Ian Stokes-Rees [log in to unmask]
> Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes/
>
--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/