There were no jobs to kill as such; the actual production jobs had been
and gone days before on our nodes. All they had left behind was a
single, recalcitrant 'sh' process. The jobs were only spotted because
they were days over their wall-time limit while the node itself was
still up.
In future I'll quarantine an example, but this time I didn't want 20+
job slots tied up doing nothing. The only information we could glean was
that the previous jobs had been using >1GB of RAM (and 0.5-1.0GB of
swap) but were not running out of total memory.
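For reference, next time one turns up I'll grab a snapshot from /proc
before touching anything. A minimal sketch, assuming the leftover
process's PID is known (12345 below is a placeholder):

   # Placeholder PID of the leftover 'sh' process
   P=12345
   cat /proc/$P/status                     # state plus VmSize/VmRSS memory figures
   cat /proc/$P/wchan; echo                # kernel function the process is waiting in
   ls -l /proc/$P/fd                       # open file descriptors
   tr '\0' ' ' < /proc/$P/cmdline; echo    # full command line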
Graeme Stewart wrote:
> Hi Chris/John
> Please don't just kill problematic jobs off - that way bugs never get
> fixed! (If there's a group of jobs hanging in the same way then it's
> fine to kill all but one of them.)
> Production shouldn't stall, but if it does then dump the process tree
> and look for open file handles and network connections to try and work
> out what the problem is.
> For user jobs the parameter space is wider, but the same principle
> applies. Especially if the job is using the ganga framework, it's
> essential we get information to debug the problem. Remember that user
> analysis usually accesses data using rfio or dcap, so there are failure
> modes here that we're less experienced with - and the jobs may also be
> using the storage system in ways sites have little experience with.
> If a particular user's jobs are really problematic then it's perfectly
> permissible to ban them from the site until we get to the bottom of
> the problem - but please raise a GGUS ticket and CC atlas operations
> or the UK operations lists.
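The dump Graeme suggests above can be done with standard tools. A rough
sketch, assuming the hung job's top PID is known and that pstree and
lsof are installed on the WN (they aren't always):

   # Hypothetical PID of the hung job's top process
   P=12345
   pstree -alp $P                           # process tree with arguments and PIDs
   ls -l /proc/$P/fd                        # open file handles
   lsof -p $P                               # the same, with file types and offsets
   netstat -tunp 2>/dev/null | grep "$P/"   # network connections owned by the process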
> On Wed, Jun 18, 2008 at 10:24 AM, John Bland <[log in to unmask]> wrote:
>> We have had a number of ATLAS production jobs failing to exit over the last
>> week or so.
>> On all the nodes we've seen this on, a single zombie sh process belonging to
>> a production ATLAS pool user is left behind; it can't be killed, but it uses
>> no CPU time. pbs_mom still claims the machine is busy, so the node is out of
>> commission for as long as the process is there.
>> Killing pbs_mom allows the ATLAS process to die. To be on the safe side we
>> then reboot the node.
>> This is on SL4.4 32bit nodes.
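For what it's worth, that behaviour fits pbs_mom being the zombie's
parent: a zombie stays in the process table until its parent reaps it,
and killing the parent reparents the zombie to init, which reaps it
straight away. Something like this (PID 12345 is a placeholder) would
confirm the relationship:

   # Placeholder PID of the defunct sh process
   P=12345
   ps -o pid,ppid,stat,comm -p $P   # STAT 'Z' marks a zombie
   PP=$(ps -o ppid= -p $P)
   ps -o pid,comm -p $PP            # the parent - pbs_mom, presumably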
>> Brew, CAJ (Chris) wrote:
>>> Over the last week or so we've had a few nodes go offline because of a
>>> steadily increasing load on them.
>>> The boxes remain responsive and you can log into them, but ps -ef or ps
>>> aux just hangs halfway through (indeed the load is actually caused by
>>> hanging ps processes spawned by the nagios monitoring). Shutdown then
>>> fails to restart them and we have to force a hard bounce to get the node
>>> back.
>>> Investigating a few of them, it seems the problem is probably related
>>> to some ATLAS user jobs; on all the nodes I've looked at I've found
>>> a python process running as the same pool account _even though the batch
>>> system thinks no jobs by that user are running on the node_. The python
>>> process is clocking up CPU time according to top but is immune to kill.
>>> Is anyone else seeing anything like this? The WNs are running a fully
>>> up-to-date version of SL4.4.
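A process that ignores kill (even kill -9) is usually either a zombie or
stuck in uninterruptible sleep ('D' state) inside the kernel - SIGKILL
isn't delivered until the syscall returns, and a process wedged like
that can also explain ps hanging as it walks /proc. Reading the stuck
process's entries directly avoids the full /proc walk; a sketch, with
12345 as a hypothetical PID:

   # Hypothetical PID of the orphaned python process
   P=12345
   awk '{print $3}' /proc/$P/stat   # one-letter state: D = uninterruptible, Z = zombie
   cat /proc/$P/wchan; echo         # kernel function it is blocked in
   head /proc/$P/status             # the same, with readable field names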
Dr John Bland, Systems Administrator
Room 210, Oliver Lodge
Particle Physics Group, University of Liverpool
Mail: [log in to unmask]
Tel : 0151 794 3396