On Mon, 14 Feb 2005, Burke, S (Stephen) wrote:
> Steve Traylen said:
> > I would submit it to the torque bugzilla, clearly a big hole in PBS.
>
> Didn't we already know that? We've had dangling processes in the past. I
> think PBS kills (or tries to kill) everything in the process group, but
> if the group changes it doesn't get caught ... it's not entirely obvious
> how you would do it cleanly given that one user might be running two or
> more jobs on the same node.

Yes, this is all standard "public" batch farm management stuff that any
competent system manager knows about: you turn off at and cron, you have
something to clean up after users (especially big temporary files), and you
have something in place to deal with processes that become detached from
jobs (a wall-clock time limit at the very least).
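
As a rough illustration of that last point, here is a minimal sketch of the
decision a node-cleanup script might make. It is not anything PBS or LCG
ships: the names Proc, find_strays, users_with_jobs and wall_limit_s are all
made up for the example, and it assumes you already have the process list
and the set of users with jobs on the node from wherever you normally get
them. Because one user may have two or more jobs on the same node, it never
tries to tie a process to a particular job; it only flags processes whose
owner has no job on the node at all, or which have outlived the wall-clock
limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One row of a process listing (hypothetical shape for this sketch)."""
    pid: int
    user: str
    elapsed_s: int  # seconds since the process started


def find_strays(procs, users_with_jobs, wall_limit_s,
                protected_users=frozenset({"root"})):
    """Return the PIDs that look detached from any batch job.

    A process is flagged if its owner has no job on this node, or if it
    has been running longer than the wall-clock limit. Processes owned
    by protected (system) users are never flagged.
    """
    strays = []
    for p in procs:
        if p.user in protected_users:
            continue
        owner_has_no_job = p.user not in users_with_jobs
        over_wall_clock = p.elapsed_s > wall_limit_s
        if owner_has_no_job or over_wall_clock:
            strays.append(p.pid)
    return strays


if __name__ == "__main__":
    # Made-up data: pool001 has a job on the node, pool002 does not.
    listing = [
        Proc(pid=101, user="root", elapsed_s=10_000_000),
        Proc(pid=202, user="pool001", elapsed_s=3_600),
        Proc(pid=203, user="pool001", elapsed_s=400_000),
        Proc(pid=304, user="pool002", elapsed_s=60),
    ]
    print(find_strays(listing, {"pool001"}, wall_limit_s=259_200))
```

What you then do with the flagged PIDs (log, warn, kill) is a site policy
decision; the sketch only covers the selection.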

The thing that's new is that automated installation of sites makes people
think they get all that for free: that's not the case, and you still need
an admin who knows what they're doing.

The pool accounts issue is largely irrelevant, since I'm not aware of LCG
advertising a "how to recycle pool accounts" recipe. (If there is one,
I'll have a look and give some feedback.)

Furthermore, the relevance of these issues to account recycling was raised
repeatedly by the developers (i.e. me) when the pool accounts system was
developed and added to EDG - in my first email about it, dammit! :)

Cheers,
Andrew
-------------------------------------------------------------------------
[log in to unmask] http://www.hep.man.ac.uk/u/mcnab/
+44-161-275-4227 "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=Andrew McNab"
Grid Security Research Fellow, University of Manchester, UK
-------------------------------------------------------------------------