Hello

Thank you for your solutions. I would like to add one too; it is a 
primitive one, but it has helped me a lot.

It is a script that runs every hour and forces jobs to run on 
different worker nodes, without any testing.

Cheers
Felix


On 21.10.2011 13:03, Arnau Bria wrote:
> On Fri, 21 Oct 2011 11:28:55 +0200
> Alessandra Doria wrote:
>
>> Hi Felix and all,
>> the issue you describe happens quite often at our site; we are looking
>> for a fix.
>> We have Torque 2.3 and Maui 3.2 (server on SL4, moms on SL5).
>> pbs_mom on one node goes into a bad state, but the node is not down.
>> The scheduler continues to assign jobs to the bad node and they all
>> go to Waiting.
> We have a local script that parses qstat -n -i -1 and sets offline any
> node with a job in Q status, and any node with more than 5 jobs in W
> status. It is the only way we found to deal with this issue. (We used to
> parse maui logs, but they changed in version 3.1.)
>
>
>> Just this morning I tried to set the guilty node offline (pbsnodes -o
>> wnXXX) and all the W jobs were reassigned.
>> I'm not sure this works every time: in case of some "persistent" W
>> jobs, a massive qdel or the brute-force recipe by Maarten is better.
>> Cheers
>> Alessandra
> Cheers,
> Arnau
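For anyone who wants to try the approach Arnau describes above (parse
qstat -n -i -1, then set offline any node that has a job in Q status or
more than 5 jobs in W status), here is a minimal Python sketch. It is not
Arnau's actual script. It assumes that each job line in the qstat output
has at least 12 whitespace-separated columns, with the job state in the
10th column and the assigned node list (for example wn01/0+wn01/1) in the
12th; please check this against the output of your own Torque version.
The function name nodes_to_offline and the sample host names are made up
for illustration.

```python
from collections import Counter


def nodes_to_offline(qstat_text, w_threshold=5):
    """Return a sorted list of node names that should be set offline.

    ASSUMPTION: every job line has at least 12 whitespace-separated
    fields, with the job state in field 10 and the assigned node list
    (e.g. "wn01/0+wn01/1") in field 12. Verify this on your own system.
    """
    q_nodes = set()        # nodes holding at least one job in Q status
    w_counts = Counter()   # number of W jobs seen per node

    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # banner, separator, or a job with no node column
        state, exec_host = fields[9], fields[11]
        if state not in ("Q", "W") or exec_host == "--":
            continue  # header line, other state, or no node assigned
        # "wn01/0+wn01/1" -> {"wn01"}; count each job once per node
        nodes = {part.split("/")[0] for part in exec_host.split("+")}
        if state == "Q":
            q_nodes.update(nodes)
        else:
            w_counts.update(nodes)

    too_many_w = {n for n, count in w_counts.items() if count > w_threshold}
    return sorted(q_nodes | too_many_w)


# Hypothetical sample input, only to show the assumed column layout.
SAMPLE = "\n".join(
    ["1.srv user batch job -- 1 -- -- 48:00 Q -- wn01/0"]
    + ["%d.srv user batch job -- 1 -- -- 48:00 W -- wn02/0" % i
       for i in range(2, 8)]
    + ["8.srv user batch job -- 1 -- -- 48:00 W -- wn03/0"]
)

if __name__ == "__main__":
    # Only print the commands; run them by hand once you trust the list.
    for node in nodes_to_offline(SAMPLE):
        print("pbsnodes -o " + node)
```

The sketch only prints the pbsnodes -o commands instead of running them,
so the list can be reviewed first. On a real system you would feed it the
text captured from qstat -n -i -1 instead of the sample.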