Chris Brew wrote
> I'm pretty wary of running yaim across all the worker nodes now after a
> couple of instances of it causing torque to kill all the jobs.
>
> I've not worked out the precise cause but running yaim on a worker node does
> a stop and start of the pbs_mom service which seems to delete jobs on
> occasion.
>
> If anyone has any idea what's going on and how to stop it I'm all ears.
Saw this yesterday. I was concerned by this note on the gLite 3.2 updates page:
07.01.2010 - 3.2 Update 07
"This update introduces the complete Torque (version 2.3.6) and Maui
(version 3.2.6p21) for gLite 3.2 and updates the also the Torque clients.
Note that Torque server and client versions have to be the same for a
proper setup. Keep this in mind for the case of mixed SL4/SL5 Torque
installations."
One SL4 lcg-CE (behind a few updates) has
torque-2.3.0-snap.200801151629.2cri.slc4.i386.rpm
Most of its SL5 WN (behind at least 1 update) have
torque-2.3.0-snap.200801151629.2cri.sl5.x86_64.rpm
Yesterday, while re-integrating a WN into this CE's cluster, I decided to
first upgrade it to the latest everything & rerun yaim (I want to believe
that can be done as often as needed & is safe - hollow laugh), so
that WN now has torque-2.3.6-2cri.el5.x86_64.rpm
And yes it was VERY unstable for a while - pbs_mom on most WN died several
times & pbs_server on lcg-CE died MANY times & jobs ended up showing in qstat
but not existing on the WN.
It seems stable now, & on the to-do list (alas, behind other things) is
upgrading that lcg-CE & its other SL5 WN to the latest.... =:o
Another SL4 lcg-CE (running Dr Metson's Hadoop WN), built later, has 2.3.6 &
all its WN (built later) have 2.3.6, & it doesn't exhibit any instability.
Yet.
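
For anyone wanting to spot the same kind of mix before rerunning yaim, here is a rough sketch of the check I have in mind. It only compares the upstream version embedded in the RPM file names quoted above; the helper names and the node labels ("wn-old", "wn-upgraded") are made up for illustration, and it deliberately ignores the snapshot/release part of the name, which is an assumption about what "the same version" means.

```python
import re

# Hypothetical helper: match the upstream version at the start of a
# Torque RPM file name, e.g. "torque-2.3.6-2cri.el5.x86_64.rpm" -> "2.3.6".
_TORQUE_RPM = re.compile(r"^torque-(\d+(?:\.\d+)*)-")


def torque_version(rpm_name):
    """Return the upstream Torque version embedded in an RPM file name."""
    match = _TORQUE_RPM.match(rpm_name)
    if match is None:
        raise ValueError(f"not a torque RPM name: {rpm_name!r}")
    return match.group(1)


def mismatched_nodes(server_rpm, node_rpms):
    """Return {node: version} for nodes whose version differs from the server's."""
    server_version = torque_version(server_rpm)
    return {
        node: torque_version(rpm)
        for node, rpm in node_rpms.items()
        if torque_version(rpm) != server_version
    }


if __name__ == "__main__":
    # RPM names taken from this mail; the node labels are invented.
    server = "torque-2.3.0-snap.200801151629.2cri.slc4.i386.rpm"
    nodes = {
        "wn-old": "torque-2.3.0-snap.200801151629.2cri.sl5.x86_64.rpm",
        "wn-upgraded": "torque-2.3.6-2cri.el5.x86_64.rpm",
    }
    print(mismatched_nodes(server, nodes))
```

With the names above this flags only the upgraded WN (2.3.6 against a 2.3.0 server), which matches the situation described. In practice the names would have to be collected from each node's package database; that part is left out here.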