Hi Steve,
I quite often spot batches of jobs that have gone into the W state and
been hanging around for days. I haven't checked in enough detail to know
whether the problem is proxy-related, but I'll look into that next time.
The only solution I know of is to qdel them, but I look forward to
hearing from anyone with more understanding.
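
A minimal sketch of that cleanup, in case it is useful: it picks the W-state jobs out of qstat output and, unless told otherwise, only prints the qdel commands it would run. It assumes the default qstat column layout (job id, name, user, time use, state, queue), with the state in the second-to-last column. The function names are made up for illustration.

```python
import subprocess


def waiting_job_ids(qstat_output):
    """Return the ids of jobs whose state column is 'W'.

    Assumes default qstat output, where the state is the
    second-to-last whitespace-separated column of each job line.
    """
    ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Header and separator lines never have 'W' in that position.
        if len(fields) >= 6 and fields[-2] == "W":
            ids.append(fields[0])
    return ids


def qdel_waiting_jobs(dry_run=True):
    """Run qstat, then qdel every W-state job (or just report it)."""
    result = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    )
    ids = waiting_job_ids(result.stdout)
    for job_id in ids:
        if dry_run:
            print("would run: qdel", job_id)
        else:
            subprocess.run(["qdel", job_id], check=True)
    return ids
```

Calling qdel_waiting_jobs() with the default dry_run=True gives a list to check by hand before anything is deleted. It does not look at job age or at the proxy file, so it would also catch jobs that are in W for other reasons.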
Cheers,
Ben
On 08/07/11 11:15, Stephen Jones wrote:
> Hi all,
>
> I've got a few grid jobs on our TORQUE cluster in W state, which means
> that TORQUE has failed to start them and has put them aside for a while
> before trying again.
>
> This happens now and again. When I have a good look at the jobs, I find
> that the file that holds their proxy is missing. I also find that the
> jobs are quite old (some number of days).
>
> I assume the jobs arrived with a proxy that was either (a) already stale
> or (b) almost stale. Some part of the CE flushes proxy files once they
> get stale, so the job is then missing its proxy file. When the scheduler
> runs the job, stagein fails because the proxy is gone. TORQUE puts the
> job in W, and it goes round in the queue forever, 'til I qdel it. It
> seems to me that, if a job has a proxy with a finite lifetime, and if
> the "travel time" from submission to execution host is indeterminately
> long, then some jobs may arrive with stale proxies, and this error will
> occur. Has anyone else seen this phenomenon, and how should it be
> handled?
>
> Cheers,
>
> Steve
>
>
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT