On Wed, Mar 3, 2010 at 9:31 AM, Matt Hodges wrote:
>>>>>> Chris Brew writes:
>
> > I've not worked out the precise cause but running yaim on a worker
> > node does a stop and start of the pbs_mom service which seems to
> > delete jobs on occasion.
>
> > If anyone has any idea what's going on and how to stop it I'm all
> > ears.
>
> If you want to avoid the pbs_mom restart from killing jobs, it should be
> sufficient to export `previous' to the environment when running yaim.
> This will result in pbs_mom restarting with the `-p' option:
>
>   -p  Specifies the impact on jobs which were in execution when the
>       mini-server shut down. On any restart of MOM, the new
>       mini-server will not be the parent of any running jobs, MOM
>       has lost control of her offspring (not a new situation for a
>       mother). With the -p option, Mom will allow the jobs to
>       continue to run and monitor them indirectly via polling. The
>       -p option is mutually exclusive with the -r option.
>
> At least, that's what we use at the RAL Tier-1 to avoid pbs_mom restarts
> triggered by Quattor from deleting jobs.
>
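For anyone wanting to try Matt's suggestion, a minimal sketch of exporting
`previous' before running yaim might look like the following. The value 1 is
only a placeholder (the thread does not say what value, if any, the init
script expects), and the yaim command line in the final comment is an
assumption about a typical invocation, not something taken from this thread.

```shell
#!/bin/sh
# Sketch only: the value "1" is a placeholder, not taken from the thread.

# Export `previous' so that child processes (yaim, and the pbs_mom
# init script that yaim calls) inherit it from this shell.
previous=1
export previous

# Check that a child shell really sees the exported variable.
sh -c 'echo "previous=${previous:-unset}"'

# Then run yaim as usual from this same shell. Assumed invocation,
# adjust the path, site-info file and node type for your site:
# /opt/glite/yaim/bin/yaim -c -s site-info.def -n <node-type>
```

The variable has to be exported in the same shell that launches yaim;
setting it without export would not reach the init script.
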
Also note that the very latest Torque release slightly changed the
logic here:
http://www.clusterresources.com/torquedocs21/changelog.shtml#2310
b - Fixed pbs_mom's default restart behavior. On a restart the MOM is
suppose to terminate jobs that were in a running state while the MOM
was up and report them to the batch server where the job will be reset
to a queued state. But it should not try and kill any of the running
processes that were associated with the job. Prior to this fix the MOM
would try and kill running processes associated with any running jobs.
I presume this only applies to jobs marked as re-runnable.
Steve
> Matt
>
--
Steve Traylen