On 17/03/11 19:07, Simon Fayer wrote:
> On Thursday 17 March 2011 15:34:51 Ewan MacMahon wrote:
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>> [log in to unmask]] On Behalf Of Simon Fayer
>>> Sent: 17 March 2011 15:16
>>> To: [log in to unmask]
>>> Subject: glexec batch system interoperability
>>>
>>> Hi everyone,
>>>
>>> While doing some tests with the glexec "suexec" test program (
>>> http://www.nikhef.nl/grid/lcaslcmaps/glexec/osinterop ) I've
>>> noticed that it provokes some strange behaviour with SGE...
>>> Normally after a job terminates, all child processes are also
>>> killed (no matter how much a user tries to disown them). When
>>> using suexec, SGE seems to fail to kill the child process,
>>> leaving the process running on the node indefinitely.
I do occasionally see some processes running on our nodes even after
jobs have finished. This is clearly a waste of resources (at best).
>>
>> I have no direct SGE experience at all, however, according to:
>> http://www.sysadmin.hep.ac.uk/wiki/ProcessesOnBatchNodes
>> if you're using the ENABLE_ADDGRP_KILL parameter it adds a per-job
>> supplementary group ID to keep track of even daemon child
>> processes, so anything that doesn't preserve those (like
>> glexec-ing) would defeat it. You could probably get the same
>> effect without glexec by having your test script explicitly drop
>> any supplementary group IDs before forking.
>
> Yes, this does seem to be what's happening. It appears that standard
> users can't normally remove supplemental groups (according to
> setgroups(2)), so this is a new problem created by glexec/suexec. Does
> this create similar problems for other batch systems as well or do
> they have some clever way to work around this?
I'd be grateful if you would file a GGUS ticket for this. It will be a
problem for QMUL too.
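
For anyone wanting to check a node by hand, here is a rough sketch of the
idea Ewan describes: find processes by the supplementary group ID attached
to a job. It is not SGE's actual implementation, just an illustration that
assumes a Linux /proc filesystem. The function names and the GID 20017 are
made up for the example.

```python
import os

def supplementary_groups(status_text):
    """Return the set of supplementary GIDs listed in /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Line looks like "Groups:\t100 20017"; skip the label itself.
            return {int(g) for g in line.split()[1:]}
    return set()

def pids_with_group(gid, proc_root="/proc"):
    """List PIDs whose supplementary groups include gid (Linux only)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                if gid in supplementary_groups(fh.read()):
                    found.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return found

# Hypothetical tracking GID for one job; a real value would come from
# the batch system.
# print(pids_with_group(20017))
```

A process that has dropped its supplementary groups (as appears to happen
via glexec) would not show up in such a scan, which matches the behaviour
described above.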
Chris