On May 15, 2007, at 10:11 AM, Antun Balaz wrote:
> Hi,
>
> Something like this:
>
> https://gus.fzk.de/pages/ticket_details.php?ticket=18353
>
> We have been experiencing a similar problem for months, but no luck...
>
This problem has been reported by at least a dozen sites and is probably
happening elsewhere. Alas, no one has ever provided a suggestion. Various
people have also asked on the torque users mailing list.

A recent changelog
http://www.clusterresources.com/pipermail/torqueusers/2007-March/005282.html
included,

  Requeued jobs will now get exec_host and session_id cleared. This has
  long been a frustration for some 3rd party utils.

but sites that have tried new builds still see the problem.

Steve
> Regards, Antun
>
> -----
> Antun Balaz
> Research Assistant
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3713152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade, Serbia
> -----
>
> ---------- Original Message -----------
> From: Piotr Siwczak <[log in to unmask]>
> To: [log in to unmask]
> Sent: Tue, 15 May 2007 09:27:29 +0200
> Subject: [LCG-ROLLOUT] Torque/maui problem - some jobs get blocked permanently
>
>> Hi,
>>
>> I've been experiencing strange behaviour from torque/maui for some
>> time. Some jobs get blocked permanently (showq shows them in the
>> "blocked" group) and cannot be run.
>>
>> The checkjob command shows an error like this:
>>
>> Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot
>> execute at specified host because of checkpoint or stagein files'
>>
>> This happens for random nodes and random jobs. pbs_server/mom/maui
>> restarts do not help. The nodes the jobs are intended to run on are
>> reported as "healthy". They process other jobs successfully. I also
>> tried to release the hold and rerun the job on a different host - no
>> luck either.
>>
>> Any hints about this problem are greatly appreciated.
>>
>> Thanks for your efforts,
>> -Piotr
> ------- End of Original Message -------
--
Steve Traylen
[log in to unmask]
CERN, IT-GD-OPS.