Hi Felix and all,
the issue you describe happens quite often at our site; we are looking
for a fix.
We have Torque 2.3, Maui 3.2 (server on SL4, moms on SL5).
The pbs_mom on one node goes into a bad state, but the node is not
marked down. The scheduler keeps assigning jobs to the bad node and they
all go into the Waiting (W) state.
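A quick way to see this from the server (just a sketch; wn062 stands for
the actual node name, and momctl is Torque's mom query tool):

    pbsnodes wn062        # the server still reports state = free for the node
    momctl -d 2 -h wn062  # query the pbs_mom directly; a sick mom often fails to answer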
Just this morning I tried setting the faulty node offline (pbsnodes -o
wnXXX) and all the W jobs were reassigned.
I'm not sure this works every time: in case of some "persistent" W jobs,
a massive qdel or the brute-force recipe by Maarten (quoted below) is
better.
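For reference, the offline/online cycle looks roughly like this (a
sketch; wn062 and 1234 are placeholder node and job ids):

    pbsnodes -o wn062     # mark the bad node offline so the scheduler skips it
    pbsnodes -l           # verify: lists the nodes that are down or offline
    qstat | grep " W "    # find jobs still stuck in the W state
    qdel 1234             # remove a job that stays stuck
    pbsnodes -c wn062     # clear the offline flag once the node is healthy again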
Cheers
Alessandra
On 21/10/2011 11:07, Maarten Litmaath wrote:
> Hi Felix,
>
>> I have torque/maui.
>>
>> All jobs on the WN that is not working are in the W status, and they
>> all keep trying to reach the next functional node.
>>
>> I have 66 WNs; nr 66, 65, 64, 63 are full with jobs, while nr 62 is
>> not functioning but is online. Jobs sent to nr 62 get stuck in the W
>> status and stay there. Isn't it possible for a job to jump to the
>> next node? Does it have to stay in W status only because the WN is
>> started?
>>
>> Is there any way of skipping this node and moving to another one?
> It has often been reported that Torque/Maui can get stuck when a single WN
> gets into a bad state. A brute-force recipe to deal with that:
>
> 1. remove the WN from /var/spool/pbs/server_priv/nodes
> 2. remove the corresponding jobs from /var/spool/pbs/server_priv/jobs
> 3. restart the PBS/Torque daemons
>
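For completeness, a minimal sketch of that recipe as concrete commands,
assuming the stock SL init scripts (pbs_server, maui) and the
/var/spool/pbs layout from the recipe; wn062 and 1234 are placeholder
node and job ids:

    service pbs_server stop   # stop the server before touching server_priv
    service maui stop
    # 1. remove the bad WN from the nodes file
    sed -i '/^wn062/d' /var/spool/pbs/server_priv/nodes
    # 2. remove the corresponding job files
    rm -f /var/spool/pbs/server_priv/jobs/1234.*
    # 3. restart the PBS/Torque daemons
    service pbs_server start
    service maui start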