Hi,
Over the last week or so we've had a few nodes go offline because of a
steadily increasing load on them.
The boxes remain responsive and you can log into them but ps -ef or ps
aux just hang halfway through (indeed the load is actually caused by
hanging ps processes spawned by the nagios monitoring). Shutdown then
fails to restart them and we have to force a hard bounce to get the node
back.
Investigating on a few of them, it seems the problem is probably related
to some Atlas user jobs: on all the nodes I've looked at, I've found
a python process running as the same pool account _even though the batch
system thinks no jobs by that user are running on the node_. The python
process is clocking up cputime according to top but is immune to kill
-9.
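For anyone who wants to check their own nodes without running ps, here is
a rough sketch that lists processes in the R or D state by reading only
/proc/<pid>/stat. It assumes (I have not confirmed this) that ps is
blocking while reading per-process details such as the command line, and
that the stat file stays readable. The function names are made up for
illustration, and it is written for a current Python 3, so it would need
adapting for the older Python shipped on the WNs.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def list_processes(proc_root="/proc"):
    """Yield (pid, uid, comm, state) for each process, reading only stat."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        path = os.path.join(proc_root, entry)
        try:
            uid = os.stat(path).st_uid  # owner of the process directory
            with open(os.path.join(path, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited (or stat was unparsable) mid-scan
        yield pid, uid, comm, state


if __name__ == "__main__":
    # Assumption: the stuck processes show up as R (running) or
    # D (uninterruptible sleep); adjust the filter as needed.
    for pid, uid, comm, state in list_processes():
        if state in ("D", "R"):
            print(pid, uid, state, comm)
```

Comparing the uids it prints against what the batch system reports for
the node should show up any orphaned pool-account processes.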
Is anyone else seeing anything like this? The WNs are running a fully up
to date version of SL44.
Yours,
Chris.