Hi,
I'd like to ask for some advice on detecting corrupted nodes that
execute jobs on Torque / Maui.
We're currently using a script that parses the Torque logs and looks
for nodes that execute more than 5 jobs in 10 minutes with an exit
status != 0.
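
For reference, a minimal sketch of a check of this kind (not our actual
script). It assumes the usual Torque accounting record layout,
"MM/DD/YYYY HH:MM:SS;E;jobid;key=value ...", with exec_host= and
Exit_status= fields in the E (exit) records, and that the lines arrive
in chronological order. The function name and the defaults are only
illustrative.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_black_holes(lines, window=timedelta(minutes=10), max_failed=5):
    """Return nodes with more than max_failed non-zero exits inside window."""
    recent = defaultdict(deque)  # node -> timestamps of recent failed jobs
    suspects = set()
    for line in lines:
        # Assumed layout: timestamp;record_type;job_id;attributes
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) < 4 or parts[1] != "E":
            continue  # only exit records carry Exit_status
        status = re.search(r"\bExit_status=(-?\d+)", parts[3])
        hosts = re.search(r"\bexec_host=(\S+)", parts[3])
        if not status or not hosts or int(status.group(1)) == 0:
            continue
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        # exec_host looks like "node01/0+node01/1"; count each node once per job
        nodes = {h.split("/")[0] for h in hosts.group(1).split("+")}
        for node in nodes:
            stamps = recent[node]
            stamps.append(when)
            # Drop failures that fell out of the sliding window
            while when - stamps[0] > window:
                stamps.popleft()
            if len(stamps) > max_failed:
                suspects.add(node)
    return suspects
```

Since it only reads E records, a check like this never sees a job that
has no exit record at all.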
It worked for "standard" corrupted nodes (AKA black_holes), but last
Thursday we found a new kind of black_hole.
Torque just saw their jobs go from Queued (;Q;) to Dequeued (;D;), and
our check was not able to detect it (it looks for Exited jobs). Maui did
not report any error either. We only determined that we had a
black_hole by looking at the number of jobs that Maui scheduled to the
same host that day...
So, the simplest check is one that counts how many jobs are executed
on a node in a given period of time, and then marks that node as a
possible black_hole. We had this check some time ago, but we got many
false positives...
How are other sites detecting this kind of problem?
TIA,
Arnau