Hi all,

I've got a few grid jobs on our TORQUE cluster in the W (waiting) state,
which means that TORQUE has failed to start them and has put them aside
for a while before trying again.

This happens now and again. When I look closely at the jobs, I find that
the file that holds their proxy is missing. I also find that the jobs
are quite old (a number of days).

I assume the jobs arrived with a proxy that was either (a) already
expired or (b) about to expire. Some part of the CE removes proxy files
once they expire, so the job is left without its proxy file. When the
scheduler runs the job, stage-in fails because the proxy is gone. TORQUE
puts the job in W, and it goes round the queue forever, until I qdel it.

It seems to me that if a job's proxy has a finite lifetime, and the
"travel time" from submission to execution host is indeterminately long,
then some jobs will arrive with expired proxies, and this error will
occur.

Has anyone else seen this phenomenon, and how should it be handled?

Cheers,
Steve
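
For anyone who wants to look for the same symptom, the check described
above (jobs in W state whose proxy file has gone) could be sketched
roughly as follows. This is only a sketch under assumptions that are not
from the post: it assumes the proxy path shows up in each job's
Variable_List as X509_USER_PROXY, and that the text being parsed has the
layout of `qstat -f` output. The function names are made up for
illustration.

```python
import os


def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None   # attributes of the job currently being read
    last = None    # name of the attribute a continuation line extends
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs = jobs.setdefault(line.split(":", 1)[1].strip(), {})
            last = None
        elif attrs is None or not line.strip():
            continue
        elif line.startswith("\t") and last is not None:
            # Long values are wrapped onto tab-indented continuation lines.
            attrs[last] += line.strip()
        elif " = " in line:
            name, value = line.split(" = ", 1)
            last = name.strip()
            attrs[last] = value.strip()
    return jobs


def proxy_path(attrs, var="X509_USER_PROXY"):
    """Return the proxy path from the job's Variable_List, or None.

    ASSUMPTION: the CE exports the proxy location to the job under `var`.
    """
    for item in attrs.get("Variable_List", "").split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip() == var:
            return value
    return None


def waiting_jobs_without_proxy(text, exists=os.path.exists):
    """List (job_id, proxy_path) for W-state jobs whose proxy file is absent."""
    result = []
    for job_id, attrs in parse_qstat_full(text).items():
        if attrs.get("job_state") != "W":
            continue
        path = proxy_path(attrs)
        if path is not None and not exists(path):
            result.append((job_id, path))
    return result
```

The text to parse would come from running `qstat -f` on the server. The
sketch only reports candidates; whether to qdel them is left to the
admin.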
--
Steve Jones
System Administrator            office: 220
High Energy Physics Division    tel (int): 42334
Oliver Lodge Laboratory         tel (ext): +44 (0)151 794 2334
University of Liverpool         http://www.liv.ac.uk/physics/hep/