On Thursday 17 March 2011 15:34:51 Ewan MacMahon wrote:
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes [mailto:TB-
> > [log in to unmask]] On Behalf Of Simon Fayer
> > Sent: 17 March 2011 15:16
> > To: [log in to unmask]
> > Subject: glexec batch system interoperability
> >
> > Hi everyone,
> >
> > While doing some tests with the glexec "suexec" test program (
> > http://www.nikhef.nl/grid/lcaslcmaps/glexec/osinterop ) I've
> > noticed that it provokes some strange behaviour with SGE...
> > Normally after a job terminates, all child processes are also
> > killed (no matter how much a user tries to disown them). When
> > using suexec, SGE seems to fail to kill the child process,
> > leaving the process running on the node indefinitely.
>
> I have no direct SGE experience at all, however, according to:
> http://www.sysadmin.hep.ac.uk/wiki/ProcessesOnBatchNodes
> if you're using the ENABLE_ADDGRP_KILL parameter it adds a per-job
> supplementary group ID to keep track of even daemon child
> processes, so anything that doesn't preserve those (like
> glexec-ing) would defeat it. You could probably get the same
> effect without glexec by having your test script explicitly drop
> any supplementary group IDs before forking.
Yes, this does seem to be what's happening. According to setgroups(2),
unprivileged users can't normally remove supplementary groups, so this
is a new problem created by glexec/suexec. Does this create similar
problems for other batch systems as well, or do they have some clever
way to work around it?
Regards,
Simon
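
The setgroups(2) restriction discussed above can be checked with a short
test. This is only a minimal sketch, assuming Linux and Python 3; the
function name try_drop_supplementary_groups is made up for illustration
and is not part of glexec, suexec or SGE. It tries to clear the
supplementary group list, which should fail for an unprivileged process
and succeed for a privileged one (which is roughly the position
glexec/suexec is in when it switches identity).

```python
import os


def try_drop_supplementary_groups():
    """Try to clear this process's supplementary group list.

    Returns (groups_before, groups_after, dropped). The function name is
    hypothetical and only used for this illustration.
    """
    before = os.getgroups()
    try:
        # setgroups(2) needs privilege (CAP_SETGID on Linux), so an
        # ordinary batch job is expected to get EPERM here.
        os.setgroups([])
        dropped = True
    except PermissionError:
        dropped = False
    return before, os.getgroups(), dropped


if __name__ == "__main__":
    before, after, dropped = try_drop_supplementary_groups()
    print("groups before:", before)
    print("groups after: ", after)
    print("dropped supplementary groups:", dropped)
```

If this is run inside a normal job, the per-job group ID added by the
batch system should still be present in the "after" list, so any child
it forks stays traceable. If it is run with privilege, the list is
cleared and a forked child would no longer carry the tracking group.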