Hi,
We have had a number of atlas production jobs failing to exit over the
last week or so.
On every node where we've seen this, a single zombie sh process belonging
to a production atlas pool user is left behind. It can't be killed, but it
uses no CPU time. pbs_mom still claims the machine is busy, so the node
is out of commission for as long as the zombie is there.
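For anyone wanting to check their own nodes, here is a minimal sketch (not
something we run in production) that lists processes in zombie (Z) or
uninterruptible sleep (D) state together with their parent PID. It assumes
a Linux /proc filesystem and reads /proc/<pid>/stat directly instead of
calling ps; whether that avoids the ps hang Chris describes is untested.
The function names parse_stat and find_stuck are made up for illustration.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left, _, rest = text.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    return int(left), comm, fields[0], int(fields[1])


def find_stuck(proc_root="/proc", states=("Z", "D")):
    """Return (pid, comm, state, ppid) for every process in one of `states`."""
    if not os.path.isdir(proc_root):
        return []
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away or stat was unreadable
        if state in states:
            stuck.append((pid, comm, state, ppid))
    return stuck


if __name__ == "__main__":
    for pid, comm, state, ppid in find_stuck():
        print(f"{pid}\t{state}\tppid={ppid}\t{comm}")
```

The parent PID is the interesting column: a zombie stays in the process
table until its parent collects its exit status, which fits with the
zombie going away once pbs_mom is killed.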
Killing pbs_mom allows the atlas process to die. To be on the safe side
we reboot the node.
This is on SL4.4 32-bit nodes.
John
Brew, CAJ (Chris) wrote:
> Hi,
>
> Over the last week or so we've had a few nodes go offline because of a
> steadily increasing load on them.
>
> The boxes remain responsive and you can log into them but ps -ef or ps
> aux just hang halfway through (indeed the load is actually caused by
> hanging ps processes spawned by the nagios monitoring). Shutdown then
> fails to restart them and we have to force a hard bounce to get the node
> back.
>
> Investigating on a few of them, it seems the problem is probably related
> to some Atlas user jobs; at least, on all the nodes I've looked at I've
> found a python process running as the same pool account _even though the
> batch system thinks no jobs by that user are running on the node_. The
> python process is clocking up CPU time according to top but is immune to
> kill -9.
>
> Is anyone else seeing anything like this? The WNs are running a fully up
> to date version of SL44.
>
> Yours,
> Chris.
--
Dr John Bland, Systems Administrator
Room 210, Oliver Lodge
Particle Physics Group, University of Liverpool
Tel : 0151 794 3396