Hi,
> I'd like to ask for some advice on detecting corrupted nodes that
> execute jobs on Torque / Maui.
>
> We're currently using a script that parses torque logs and looks for
> nodes that execute more than 5 jobs in 10 minutes with an exit
> status != 0.
At GridKa a node is set offline when a job that was started
on that particular node has finished with an
Exit_status < 0.
(GridKa uses the PBS Pro batch system; its Exit_status
should behave like Torque's: > 0 := user error, e.g.,
time limits exceeded; < 0 := system issue.)
>
> It worked for "standard" corrupted nodes (AKA black_holes), but last
> Thursday we found some new kind of black_holes.
> Torque just saw their jobs as Queued (;Q;) -> Dequeud (;D;), and our
(Check your homedirectories for free space and free inodes.)
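Checking space and inode headroom can be done directly from Python via
`os.statvfs`; the thresholds below are illustrative and should be tuned
for your site.

```python
import os

def fs_headroom(path):
    """Return (free_bytes, free_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize, st.f_favail

def is_starved(path, min_bytes=1 << 30, min_inodes=10_000):
    """True if the filesystem is low on space or inodes.

    The 1 GiB / 10k-inode thresholds are examples, not recommendations.
    """
    free_bytes, free_inodes = fs_headroom(path)
    return free_bytes < min_bytes or free_inodes < min_inodes
```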
One common cause of black hole nodes at GridKa is jobs
which run into an endless loop writing messages to a log
file. These huge log files can fill up every disk within
a few minutes. Therefore it's very difficult to detect
this type of black hole node until the whole disk is
occupied :-(
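One way to catch such jobs before the disk fills is to sample file sizes
in the job spool/log directory twice and flag files growing unusually
fast. A rough sketch (directory and sampling interval are placeholders):

```python
import os
import time

def growth_rates(directory, interval=5.0):
    """Measure bytes-per-second growth for each regular file in
    directory over a short sampling window."""
    def sizes():
        out = {}
        for name in os.listdir(directory):
            p = os.path.join(directory, name)
            if os.path.isfile(p):
                out[p] = os.path.getsize(p)
        return out

    before = sizes()
    time.sleep(interval)
    after = sizes()
    # Only files present in both samples have a meaningful rate.
    return {p: (after[p] - before[p]) / interval
            for p in after if p in before}
```

A cron job could run this on each worker node's spool directory and
offline the node (or kill the job) when some file grows faster than,
say, a few MB/s for several consecutive samples.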
Best regards,
M.