Hello,

Thank you for your solutions. I would like to add one too; it is a primitive one, but it helped me a lot: a script that runs every hour and forces jobs to run on different worker nodes, without any testing.

Cheers,
Felix

On 21.10.2011 13:03, Arnau Bria wrote:
> On Fri, 21 Oct 2011 11:28:55 +0200
> Alessandra Doria wrote:
>
>> Hi Felix and all,
>> the issue you describe happens quite often at our site, and we are looking
>> for a fix.
>> We have Torque 2.3, Maui 3.2 (server SL4, moms SL5).
>> Pbs_mom on one node goes into a bad state, but the node is not down.
>> The scheduler continues to assign jobs to the bad node and they all
>> go to Waiting.
> We have a local script that parses qstat -n -i -1 and sets offline any
> node with a job in Q status, and any node with more than 5 jobs in W
> status. It is the only way we found to deal with this issue. (We used to
> parse maui logs, but they changed in version 3.1.)
>
>> Just this morning I tried to set the guilty node offline (pbsnodes -o
>> wnXXX) and all the W jobs were reassigned.
>> I'm not sure this works every time: in case of some "persistent" W
>> jobs, a massive qdel or the brute-force recipe by Maarten is better.
>> Cheers
>> Alessandra
> Cheers,
> Arnau
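
For anyone who wants to try the approach Arnau describes above (parse qstat -n -i -1, then offline any node with a job in Q status or with more than 5 jobs in W status), here is a minimal sketch in Python. It is not Arnau's actual script. The column positions (STATE_COL, HOSTS_COL), the "--" placeholder for a missing host list, and the function names are assumptions about the qstat output layout; check them against your Torque version before use. Job names containing spaces would also shift the columns.

```python
import subprocess
from collections import Counter

# Assumed layout of one whitespace-split job line from `qstat -n -i -1`:
# the state letter is the 10th field and the exec host list is the 12th.
# These indices are assumptions; verify them against your own output.
STATE_COL = 9
HOSTS_COL = 11


def nodes_to_offline(qstat_text, max_waiting=5):
    """Return a sorted list of nodes holding a Q job or more than max_waiting W jobs."""
    waiting = Counter()
    bad = set()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip headers, separators, and jobs with no node assigned.
        if len(fields) <= HOSTS_COL or fields[HOSTS_COL] == "--":
            continue
        state = fields[STATE_COL]
        if state not in ("Q", "W"):
            continue
        # "wn001/0+wn001/1" -> {"wn001"}; each job counts once per node.
        nodes = {slot.split("/")[0] for slot in fields[HOSTS_COL].split("+")}
        for node in nodes:
            if state == "Q":
                bad.add(node)
            else:
                waiting[node] += 1
    bad.update(node for node, count in waiting.items() if count > max_waiting)
    return sorted(bad)


def set_offline(node):
    """Mark one node offline with `pbsnodes -o`, as mentioned in the thread."""
    subprocess.run(["pbsnodes", "-o", node], check=True)


def main():
    # Run from cron on the Torque server; requires qstat and pbsnodes in PATH.
    out = subprocess.run(
        ["qstat", "-n", "-i", "-1"], capture_output=True, text=True, check=True
    ).stdout
    for node in nodes_to_offline(out):
        set_offline(node)
```

Calling main() from an hourly cron job would match the schedule Felix mentions. The parsing is kept in a separate function so it can be checked against saved qstat output before any node is actually set offline.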