Hi Steve,
the Cluster Resources page is not accessible. I'll try again later; in the
meantime, could you pinpoint the version in which this change happened?
Thanks,
cheers
alessandra
Steve Traylen wrote:
> On Wed, Mar 3, 2010 at 9:31 AM, Matt Hodges <[log in to unmask]> wrote:
>
>> Chris Brew writes:
>>
>> > I've not worked out the precise cause but running yaim on a worker
>> > node does a stop and start of the pbs_mom service which seems to
>> > delete jobs on occasion.
>>
>> > If anyone has any idea what's going on and how to stop it I'm all
>> > ears.
>>
>> If you want to avoid the pbs_mom restart from killing jobs, it should be
>> sufficient to export `previous' to the environment when running yaim.
>> This will result in pbs_mom restarting with the `-p' option:
>>
>>   -p   Specifies the impact on jobs which were in execution when the
>>        mini-server shut down. On any restart of MOM, the new
>>        mini-server will not be the parent of any running jobs, MOM
>>        has lost control of her offspring (not a new situation for a
>>        mother). With the -p option, Mom will allow the jobs to
>>        continue to run and monitor them indirectly via polling. The
>>        -p option is mutually exclusive with the -r option.
>>
>> At least, that's what we use at the RAL Tier-1 to avoid pbs_mom restarts
>> triggered by Quattor from deleting jobs.
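
For reference, a minimal sketch of the approach described above, not a
tested recipe. It assumes the pbs_mom init script only checks whether
`previous' is present in the environment; the value 1 and the helper name
run_with_previous are illustrative assumptions, not something stated in
this thread.

```shell
# Hypothetical helper: run any command with `previous' exported.
# Assumption: a non-empty value is enough; the thread does not say which
# value (if any) the init script expects.
run_with_previous() {
    (
        # The subshell keeps the variable from leaking into the caller's shell.
        previous=1
        export previous
        "$@"
    )
}
```

The yaim command line itself is site-specific, so it is passed in as
arguments (run_with_previous followed by your usual yaim invocation)
rather than spelled out here.
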
>>
>>
> Also note in the very latest version there was a slight change in the
> logic here.
> http://www.clusterresources.com/torquedocs21/changelog.shtml#2310
>
> b - Fixed pbs_mom's default restart behavior. On a restart the MOM is
> suppose to terminate jobs that were in a running state while the MOM
> was up and report them to the batch server where the job will be reset
> to a queued state. But it should not try and kill any of the running
> processes that were associated with the job. Prior to this fix the MOM
> would try and kill running processes associated with any running jobs.
>
> I presume this only applies to jobs marked as re-runnable.
>
> Steve
>
>> Matt
>>
--
The most effective way to do it, is to do it. (Amelia Earhart)
Northgrid Tier2 Technical Coordinator
http://www.hep.manchester.ac.uk/computing/tier2