On 14 Jan 2011, at 12:00, Stephen Jones wrote:
> Hi all. This applies to only 1 VO.
>
> At the moment, I've got a small number of NGS test jobs in wait state on our batch system. They've been there for a few days. Qstat shows this wait state: 599922.hammer STDIN ngs003 0 W long
>
> and diagnose says this: WARNING: job '599922' has failed to start 179 times
>
> and tracejob says this: 01/14/2011 00:27:39 S post_modify_req: PBSE_UNKJOBID for job 599922.hammer.ph.liv.ac.uk in state RUNNING-STAGEGO, dest = r16-n02.ph.liv.ac.uk
>
> And the pbs_mom logs say things like this: 20110114:01/14/2011 11:50:36;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=r16-n02.ph.liv.ac.uk MSG=modify job failed, unknown job 619859.hammer.ph.liv.ac.uk), aux=0, type=ModifyJob, from [log in to unmask]
>
> Anyway, the upshot of all this is: the stagein file (as seen in qstat -f) is not present on the lcg-CE. In other words, a job arrives from some WMS, specifying certain stageins. They are not actually planted down (or they are removed etc.). The job gets to the PBS system, and ends up on a node. The job asks for the stagein, it fails to get it, and the job goes into W for another long time then tries again, over and over. Does anyone know what this is all about? How can a stagein file be “missing”? Isn't that checked when the job arrives?
I've seen this in a couple of cases; here's what I'd be checking:
Check passwordless SSH works for those users, starting from a worker node and going back to the CE. (Maybe they got missed off the passwd file on one node?) [This is the most likely cause, from what you've described.]
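A rough way to script that check, as a sketch only: run it on the worker node as the affected pool account. The target `ngs003@ce.example.org` is a made-up placeholder, and the helper names are mine, not anything from Torque. `BatchMode=yes` makes ssh fail rather than prompt for a password, which is the behaviour you want to detect.

```python
import subprocess


def ssh_check_command(user_host):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never fall back to interactive prompts
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        user_host,
        "true",                     # trivial remote command
    ]


def passwordless_ssh_works(user_host):
    """Return True if ssh to user_host succeeds without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(user_host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical target: replace with the real account and CE host.
    target = "ngs003@ce.example.org"
    print(target, "OK" if passwordless_ssh_works(target) else "FAILED")
```

If it fails for the NGS accounts but works for another VO's accounts on the same node, that would fit the "only 1 VO" symptom.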
If /dev/null is incorrect (for example, it has been replaced by a regular file or has the wrong permissions), that can cause it. (That doesn't explain the one VO thing, unless you have the rest of the jobs coming in differently somehow.)
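For the /dev/null check, something like the following would do as a quick sanity test on each worker node. It is a sketch that assumes "correct" means a character device that everyone can read and write; the function name is just for illustration.

```python
import os
import stat


def dev_null_problems(path="/dev/null"):
    """Return a list of human-readable problems found with path."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]

    problems = []
    # /dev/null should be a character device, not a regular file.
    if not stat.S_ISCHR(st.st_mode):
        problems.append(f"{path} is not a character device")
    # Assumed expectation: readable and writable by user, group and other.
    if stat.S_IMODE(st.st_mode) & 0o666 != 0o666:
        problems.append(f"{path} is not readable and writable by everyone")
    return problems


if __name__ == "__main__":
    found = dev_null_problems()
    print("\n".join(found) if found else "/dev/null looks fine")
```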
Unless I miss my guess, those are likely to be the INCA tests for the NGS, and I think they're coming in via globus GRAM submission, rather than via WMS. So it might be that the whole submission route is iffy, somehow, rather than anything else. Maybe try a globus-job-run for yourself, and see if that works for a different user.
Run a test job as an NGS user (you might have to apply for that), and monitor its progress. One thing that can happen is that the job queues for so long that the CE assumes it's dead, and deletes things. Another is that the job starts, the CE cleans up the stage-in files, and then the job fails and Torque re-submits it. It might be some weirdness with either too many or (alternatively) no stage-in files. The thing to check is that the stage-in files turn up before the job starts.
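To check whether the stage-in files are actually there, you could take the stagein value from qstat -f and test for the files on the CE. The sketch below assumes the usual Torque form `local_file@hostname:remote_file`, with multiple entries separated by commas, and is meant to be run on the host named in the spec. The function names and the example path are hypothetical.

```python
import os


def parse_stagein(spec):
    """Split one 'local_file@hostname:remote_file' entry into its parts."""
    local, at, rest = spec.strip().partition("@")
    host, colon, remote = rest.partition(":")
    if not (at and colon and local and host and remote):
        raise ValueError(f"unrecognised stagein entry: {spec!r}")
    return local, host, remote


def missing_stagein_files(stagein_value):
    """Return the remote paths from a stagein value that do not exist here.

    Assumes this runs on the host that holds the remote files (the CE).
    """
    missing = []
    for entry in stagein_value.split(","):
        if not entry.strip():
            continue
        _local, _host, remote = parse_stagein(entry)
        if not os.path.exists(remote):
            missing.append(remote)
    return missing


if __name__ == "__main__":
    # Hypothetical value: paste the real one from 'qstat -f <jobid>'.
    value = "in.dat@ce.example.org:/tmp/example-job/in.dat"
    for path in missing_stagein_files(value):
        print("missing:", path)
```

Running that while the job is still queued, and again once it has gone into W, should show whether the files never arrived or were cleaned up later.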
As for those particular jobs, qdel them; they're dead, without hope of resurrection.
It's not terribly direct I'm afraid, but hopefully something would fall out of those analyses, to point you in the right direction.