Hi,
yes, same here. We restart Maui automatically when it becomes
unresponsive because of that. That helps at least for a few minutes ...
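
A minimal sketch of such a watchdog, not our actual script. It assumes
that a Maui client command such as `showq` hanging or failing is a good
enough sign that Maui is unresponsive, and that Maui can be restarted
with `service maui restart` (the init script name depends on how Maui
was packaged). The function names and the 30-second timeout are
illustrative only.

```python
import subprocess


def is_responsive(cmd, timeout_s=30):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a command that cannot be started both count
        # as "not responsive".
        return False
    return result.returncode == 0


def check_and_restart(check_cmd, restart_cmd, timeout_s=30):
    """Run the probe; if it fails, run the restart command.

    Returns True if a restart was attempted, False otherwise.
    """
    if is_responsive(check_cmd, timeout_s):
        return False
    subprocess.run(restart_cmd, check=False)
    return True


# Assumed commands, adjust to the local installation:
#   check_and_restart(["showq"], ["service", "maui", "restart"])
```

Run from cron every few minutes, this only treats the symptom. The
unresponsive worker nodes still have to be found and set offline.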
Cheers,
Andreas
On Wed, 2011-02-16 at 08:22 +0100, Jean-Michel Barbet wrote:
> Hello all,
>
> It happens that some worker nodes become unresponsive because grid jobs
> have misbehaved and eaten all the memory. The worker node might be
> pingable, but connections to daemons, especially pbs_mom, hang or are
> impossible.
>
> Sometimes Torque/MAUI reacts very badly and no new jobs are scheduled
> to run. Jobs accumulate in the waiting queue even though there are
> actually free slots.
>
> It is not easy to correct this situation. If I can find the faulty
> node(s) and put them offline, Torque may recover eventually.
>
> I also observed a correlation between this state and messages in the
> Torque log:
> PBS_Server;Svr;PBS_Server;socket_to_handle, internal socket table full
>
> My main question, before I start trying to understand this, is:
>
> => Have other site administrators observed similar behaviour?
>
> We are running:
> Server: torque-2.3.6-2cri.el5 / maui-3.2.6p21-snap.1234905291.5.el5
> Mom: torque-mom-2.3.6-2cri.el5
>
> Thanks.
>
> JM
>
--
| Andreas Haupt |
| DESY Zeuthen | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6 | Phone: +49/33762/7-7359
| D-15738 Zeuthen | Fax: +49/33762/7-7216
|