On 8 Jul 2011, at 11:15, Stephen Jones wrote:
> Hi all,
>
> I've got a few grid jobs on our TORQUE cluster in W state, which means that torque has failed to start them, and has put them aside for a while before trying again.
>
> This happens now and again. When I have a good look at the jobs, I find that the file that holds their proxy is missing/no longer there. I also find that the jobs are quite old (some number of days).
>
> I assume the jobs arrived with a proxy that was either (a) already stale or (b) almost stale. Some part of the CE flushes proxy files once they get stale, so the job is then missing its proxy file. When the scheduler runs the job, stagein fails because the proxy is gone. TORQUE puts the job in W, and it goes round in the queue forever, 'til I qdel them.
>
> It seems to me that, if a job has a finite proxy, and if "travel time" from submission to execution host is indeterminately long, then some jobs may arrive with stale proxies, and this error will occur. Has anyone else seen this phenomenon, and how should it be handled?
>
> Cheers,
There are other causes for this, but stale proxies are the unavoidable one.
Fundamentally, once a job has 'failed' in Torque once, blah reports this failure and CREAM deletes the job's staging area.
If the job was submitted with rerunable=true, then it still exists in Torque.
The MOMs on the worker nodes then can't stage in the required files, hence the perpetual waiting state.
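For reference, a quick way to spot such jobs is to filter plain qstat output on the state column. This is only a sketch: the sample output and job names below are made up, and it assumes the classic Torque qstat column layout (Job id, Name, User, Time Use, S, Queue).

```shell
# Hypothetical sample of plain `qstat` output; in real use, pipe the
# output of `qstat` itself instead of this printf.
sample='Job id    Name      User    Time Use  S  Queue
--------------------------------------------------
101.ce01  cream_10  atlas   00:00:00  W  long
102.ce01  cream_11  lhcb    12:34:56  R  long'

# Print the IDs of jobs whose state (column 5) is W:
printf '%s\n' "$sample" | awk '$5 == "W" {print $1}'
```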
There are other reasons a job may enter the waiting state - failure to stage in data due to overload at a CE is a classic here.
Once a job is older than the total length of the queue [0] and still waiting, I delete it from Torque. Generally we get a couple a month, except when there's some other problem, when we can get thousands (blah submission errors to Torque are a favourite for that). It's normally a couple of days after such an incident that I do a clean.
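The clean itself can be sketched as a loop over those job IDs. This is a dry run (it only echoes the qdel commands), and the sample qstat output is again hypothetical - swap the printf for a real qstat call and drop the echo to actually delete.

```shell
# Hypothetical `qstat` sample (columns: Job id, Name, User, Time Use,
# S, Queue); replace with real `qstat` output in practice.
sample='201.ce01  cream_20  atlas  00:00:00  W  long
202.ce01  cream_21  lhcb   01:02:03  R  long
203.ce01  cream_22  atlas  00:00:00  W  long'

printf '%s\n' "$sample" | awk '$5 == "W" {print $1}' | while read -r job; do
    echo qdel "$job"   # dry run: prints the command instead of running it
done
```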
If you want a precise test: if the job's staging directory on the CE is gone, then it is safe to delete - CREAM already believes the job is dead. (I suppose I should work out how to dig out the blah state of the job more directly at some point.)
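That test could look something like the following. The sandbox path here is a made-up example, so adjust it to wherever your CE actually keeps its staging directories, and the job ID is likewise hypothetical.

```shell
# Sketch: if the CREAM staging directory for a job is gone, CREAM
# already thinks the job is dead, so it is safe to qdel it in Torque.
JOBID="101.ce01"
STAGE_DIR="/var/cream_sandbox/$JOBID"   # hypothetical, site-specific path

if [ ! -d "$STAGE_DIR" ]; then
    echo "staging dir gone; safe to qdel $JOBID"
else
    echo "staging dir still present; leave $JOBID alone"
fi
```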
[0] Or: older than the queue's walltime limit, which covers most of them, except in cases where there's a long latency because the queue is full.