Chiming in here, as we see exactly the same problem from time to
time with some Atlas jobs. On occasion, a single blocked job is
enough to bring the whole scheduling process to a halt, although
this does not happen regularly.

Like Piotr, we could do with some help diagnosing this problem.
Currently the only workable fix seems to be to qdel the offending
job(s). We are running torque 2.1.5-1 and maui 3.2.6p16-2.

cheers,
gianfranco

On 15 May 2007, at 08:27, Piotr Siwczak wrote:

> Hi,
>
> I've been experiencing a strange behaviour from torque/maui for  
> some time. Some jobs get blocked permanently (showq shows them in  
> the "blocked" group) and cannot be run.
>
> The checkjob command shows an error like this:
>
> Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot  
> execute at specified host because of checkpoint or stagein files'
>
> This happens for random nodes and random jobs. pbs_server/mom/maui
> restarts do not help. The nodes the jobs are intended to run on are
> reported as "healthy". They process other jobs successfully. I also
> tried to release the hold and rerun the job on a different host - no
> luck either.
>
> Any hints about this problem are greatly appreciated.
>
> Thanks for your efforts,
> -Piotr