> I found a few processes owned by LCG cms VO users on our
> (SGE 5.3) cluster that seem to have survived after the
> associated job was terminated for exceeding its time allocation.
> The program command line begins cmkin-bmm-mb.
> I've terminated the process now, but has anyone else found processes
> like this misbehaving?
>
> William Hay, UKI-LT2-UCL-CENTRAL site admin, Information systems, EISD,UCL
>
I've now tracked this down and (hopefully) fixed it. The problem
arose because the jobs were creating a new process group within
themselves. SGE normally kills a job by sending a SIGKILL to
the process group it created for the job, so the signal never
reaches a process that has moved into a secondary group.
I have therefore replaced the default method with a script which
generates a list of process groups in the session created by
SGE for the job and then sends a SIGKILL to all of them.
There is a slight chance that a process group may be created
between the creation of the list and the sending of the SIGKILL
but I suspect this can be ignored in practice.
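
For anyone wanting to try the same approach, here is a minimal sketch of the
idea in Python. It is not the script actually deployed here. The function
names pgids_in_session and kill_session are made up for illustration, and it
assumes a ps (such as the Linux procps one) that accepts the pgid and sid
output keywords. How the session ID is obtained and how the script is hooked
into SGE are not shown.

```python
import os
import signal
import subprocess


def pgids_in_session(ps_output, sid):
    """Return the sorted process group IDs found in session `sid`.

    `ps_output` is text with one "PGID SID" pair per line.
    """
    pgids = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pgid, line_sid = int(fields[0]), int(fields[1])
        if line_sid == sid:
            pgids.add(pgid)
    return sorted(pgids)


def kill_session(sid):
    """Send SIGKILL to every process group currently in session `sid`."""
    # Assumption: this ps understands the pgid and sid keywords.
    # The trailing "=" suppresses the column headers.
    out = subprocess.run(
        ["ps", "-e", "-o", "pgid=", "-o", "sid="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pgid in pgids_in_session(out, sid):
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the listing and the kill
```

Calling kill_session with the job's session ID does the listing and the kill
in one pass. The race mentioned above is still there: a group created after
ps has run will be missed.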
Anyway, in case anyone else is having similar problems, I thought I'd let you
know.
William Hay, UKI-LT2-UCL-CENTRAL site admin, Information systems, EISD,UCL