On Thu, Jan 20, 2005 at 03:15:06pm +0100, Fokke Dijkstra wrote:
>
> Another problem running jobs using mpirun is that pbs_mom does not really keep track of what a job is doing on the other worker nodes. When a job crashes or is killed from outside, this may then only kill the processes on the master worker node, leaving garbage processes on the other worker nodes involved.
>
This is true, but I do not believe that mpirun is the culprit. In fact,
MPICH-GM (the port of MPICH to Myricom's GM message-passing system) blames
ssh for not reaping remote processes when the ssh client (the one local to
mpirun) dies. MPICH-GM also includes specific code which keeps track of
remote process IDs and runs 'kill' commands remotely in order to clean up
the mess after abnormal termination of processes. Unlike ssh, rsh takes care to
propagate signals received by the local client to the remote side, so
that no stray processes are left if the local rsh client receives
a SIGINT or a SIGTERM signal. (Well, SIGSTOP and SIGKILL are not handled,
but that's really difficult since they cannot be caught! :)
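
To make the signal-propagation idea concrete, here is a minimal sketch in
Python of a wrapper that relays the catchable termination signals to the
process it started, instead of dying and leaving it behind. This is not
the rsh or MPICH-GM code; the names run_and_forward and FORWARDED are
hypothetical, and a local child process stands in for the remote side.

```python
import signal
import subprocess

# Only catchable signals can be relayed; SIGKILL and SIGSTOP never reach a handler.
FORWARDED = (signal.SIGINT, signal.SIGTERM)


def run_and_forward(cmd):
    """Run cmd, relay SIGINT/SIGTERM to it, and return its exit status."""
    child = subprocess.Popen(cmd)

    def relay(signum, _frame):
        # Pass the signal on to the child if it is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the relay handlers, remembering the previous ones.
    # (A signal arriving between Popen and this point is not relayed;
    # the sketch ignores that small window.)
    previous = {sig: signal.signal(sig, relay) for sig in FORWARDED}
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were in place before.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

Calling run_and_forward(["sleep", "600"]) and then sending SIGTERM to the
wrapper should terminate the sleep as well. Sending SIGKILL to the wrapper
leaves the child running, which is exactly the case that cannot be handled.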
--
Vangelis Koukis
OpenPGP public key ID:
pub 1024D/1D038E97 2003-07-13 Vangelis Koukis <[log in to unmask]>
Key fingerprint = C5CD E02E 2C78 7C10 8A00 53D8 FBFC 3799 1D03 8E97