Hello all,
It sometimes happens that worker nodes become unresponsive because grid
jobs have misbehaved and eaten all the memory. The worker node might
still be pingable, but connections to its daemons, especially pbs_mom,
hang or are impossible.
Sometimes Torque/MAUI reacts very badly and no new jobs are scheduled
to run. Jobs accumulate in the waiting queue even though there are
actually free slots.
It is not easy to correct this situation. If I can find the faulty
node(s) and put them offline, Torque may recover eventually.
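
One rough way to look for candidate faulty nodes is to try a TCP connection to the pbs_mom port on each worker node with a short timeout. The sketch below is only an illustration under stated assumptions: the port number 15002 is the usual Torque default for pbs_mom and may differ at a given site, and the function names are hypothetical. It is a coarse check, since a node can still accept the connection at the kernel level while pbs_mom itself hangs.

```python
import socket

# Assumption: pbs_mom listens on TCP port 15002 (the usual Torque default).
MOM_PORT = 15002


def mom_reachable(host, port=MOM_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def find_suspect_nodes(hosts, port=MOM_PORT, timeout=5.0):
    """Return the hosts whose pbs_mom port could not be reached."""
    return [h for h in hosts if not mom_reachable(h, port, timeout)]
```

Calling find_suspect_nodes() with the site's list of worker node names returns the ones that did not answer; these are the candidates to mark offline (for example with pbsnodes -o) after a manual check.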
I also observed a correlation between this state and the following
message in the Torque log:
PBS_Server;Svr;PBS_Server;socket_to_handle, internal socket table full
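
To check how often this message shows up around the time scheduling stalls, the server log can be scanned for it. This is a minimal sketch: it only matches the text "internal socket table full" quoted above and makes no assumption about the rest of the log line format. The function name is hypothetical.

```python
MARKER = "internal socket table full"


def socket_table_full_lines(lines, marker=MARKER):
    """Return (line_number, text) pairs for log lines that contain the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]
```

An open file object for the pbs_server log can be passed directly as the lines argument; the length of the returned list gives the number of occurrences.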
My main question, before I start trying to understand this, is:
=> Have other site administrators observed similar behaviour?
We are running:
Server : torque-2.3.6-2cri.el5 / maui-3.2.6p21-snap.1234905291.5.el5
Mom : torque-mom-2.3.6-2cri.el5
Thanks.
JM
--
------------------------------------------------------------------------
Jean-michel BARBET | Tel: +33 (0)2 51 85 84 86
Laboratoire SUBATECH Nantes France | Fax: +33 (0)2 51 85 84 79
CNRS-IN2P3/Ecole des Mines/Universite | E-Mail: [log in to unmask]
------------------------------------------------------------------------