On Mon, 9 Jun 2008 14:10:13 +0200
Manfred Alef wrote:
> Hi,
Hi Manfred,
[...]
> at GridKa a node is set offline when a job which was
> started on that particular node has finished with an
> Exit_status < 0.
>
> (GridKa is using the PBS Pro batch system. The Exit_status
> should work similar to Torque: > 0 := user error, e.g.,
> time limits exceeded; < 0 := system issues.)
Yep, that's our criterion too.
But are you using PBS Pro with Maui?
If so, have you found any job that Maui knows about but Torque doesn't?
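
For reference, the "Exit_status < 0" criterion could be sketched roughly like this. It is only an illustration, not what either site actually runs: it assumes Torque-style accounting records of the form "date;E;jobid;key=value ...", and the function name nodes_to_offline is made up for the example.

```python
def nodes_to_offline(record):
    """Return the hosts named in one accounting 'E' (job end) record
    whose Exit_status is negative, i.e. a suspected system problem.

    Assumes the layout "date;type;jobid;key=value key=value ..." and
    that values contain no spaces; real records may need more care.
    """
    parts = record.strip().split(";", 3)
    if len(parts) < 4 or parts[1] != "E":
        return set()  # not a job-end record

    # Turn "key=value key=value" into a dictionary.
    fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)

    try:
        status = int(fields.get("Exit_status", "0"))
    except ValueError:
        return set()
    if status >= 0:
        return set()  # 0 is success, > 0 is treated as a user error

    # exec_host looks like "wn001/0+wn002/1"; keep only the host names.
    hosts = set()
    for chunk in fields.get("exec_host", "").split("+"):
        host = chunk.split("/", 1)[0]
        if host:
            hosts.add(host)
    return hosts
```

The returned hosts could then be handed to something like "pbsnodes -o <node>" to mark them offline. Note that a job which only goes Queued -> Dequeued never produces an "E" record, so this criterion would miss that case.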
> >
> > It worked for "standard" corrupted nodes (AKA black_holes), but last
> > Thursday we found some new kind of black_holes.
> > Torque just saw their jobs as Queued (;Q;) -> Dequeued (;D;), and our
>
> (Check your home directories for free space and free inodes.)
Yep, that was the problem, a corrupted filesystem... we must add some
Nagios checks to detect this kind of problem, but we are looking
for something at the Torque/Maui level...
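
As a starting point for such a check, something along these lines might do. It is a minimal sketch only: the function name filesystem_problems and the 5% threshold are arbitrary choices for the example, and it only covers the free space / free inodes part, so it would not catch every kind of filesystem corruption.

```python
import os


def filesystem_problems(path, min_free_fraction=0.05):
    """Return a list of problems found on the filesystem holding `path`.

    Checks the fraction of free blocks and free inodes available to
    unprivileged users against `min_free_fraction` (an arbitrary
    example threshold, not a recommended value).
    """
    st = os.statvfs(path)
    problems = []

    # f_bavail / f_blocks: free blocks for non-root users vs. total blocks.
    if st.f_blocks > 0 and st.f_bavail / st.f_blocks < min_free_fraction:
        problems.append("low free space")

    # f_favail / f_files: free inodes for non-root users vs. total inodes.
    # Some filesystems report 0 total inodes; skip the check in that case.
    if st.f_files > 0 and st.f_favail / st.f_files < min_free_fraction:
        problems.append("low free inodes")

    return problems
```

A wrapper could call this on the home directory filesystem of each worker node and raise a Nagios alert (or set the node offline) whenever the list is not empty.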
> One common cause of black hole nodes at GridKa are jobs
> which run into an endless loop writing messages to a log
> file. These huge log files can fill up every disk within
> a few minutes. Therefore it's very difficult to detect
> this type of black hole node until the whole disk is
> occupied :-(
And do those nodes start "eating" jobs uncontrollably, or is the node
just marked as busy (or some other status) at the PBS level?
> Best regards,
> M.
Thanks for the reply,
Arnau