Hi,
Manchester has jobs in W state too (we have had them for a while,
actually). I don't think the problem is with the proxy. The proxy might
affect a job either before it arrives at the CE or once it is already
on the WN, not in between. Yesterday we had 20 jobs queued on one node,
all in W state, and looking at them with tracejob they had all
stopped at this point:
06/24/2010 12:31:57 S post_modify_req: PBSE_UNKJOBID for job
2181832.ce01.tier2.hep.manchester.ac.uk in state
RUNNING-STAGEGO, dest = bohr3130.tier2.hep.manchester.ac.uk
Looking at pbs_mom with momctl, it hadn't received any message
from the pbs server for quite a while and there were no jobs running, so
I restarted pbs_mom, but this didn't have any effect. The node was
eventually put offline as it was sucking in too many jobs, and jobs
started to run normally again on other nodes. I put the node back online
today and it is working fine.
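
In case it is useful, here is a rough sketch of how such tracejob
output could be tallied per destination node, to spot a node that is
swallowing jobs. It assumes the message has exactly the format of the
line quoted above (other Torque versions may word it differently), and
the function name count_unkjobid_by_dest is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the single log line quoted in this mail.
# Assumption: other Torque versions may format the message differently.
UNKJOBID_RE = re.compile(
    r"PBSE_UNKJOBID for job\s+(?P<job>\S+)\s+in state\s+(?P<state>\S+),"
    r"\s*dest\s*=\s*(?P<dest>\S+)"
)


def count_unkjobid_by_dest(log_text):
    """Count PBSE_UNKJOBID messages per destination node.

    log_text is the saved text output of tracejob (or server log lines).
    """
    # Collapse all whitespace so entries that were wrapped across lines
    # (as in a mail) still match the pattern.
    flat = " ".join(log_text.split())
    return Counter(m.group("dest") for m in UNKJOBID_RE.finditer(flat))
```

Feeding it the saved output for all the stuck jobs and calling
most_common() on the result would show whether one node accounts for
most of them.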
I'm starting to think it is a pbs_server problem, with the server
getting into some funny state. Two mails I found on the PBS mailing
lists about this problem blame it on the stage-in part.
cheers
alessandra
Stephen Jones wrote:
> Winnie,
>
> We had a problem with two different VOs. And now you add a third.
> In all, the X509_USER_PROXY variable, from qstat -f, points to thin air.
>
> In the job script, there is this:
> #PBS -W
> [log in to unmask]:/home/dzero017/.lcgjm/globus-cache-export.S20589/globus-cache-export.S20589.gpg
>
>
> The gpg file is a symbolic link, that points to the (missing) proxy
> file. I don't know whether the proxy file was ever put in place, or if
> it was put in place then removed. In any case, the job is broken.
>
> Steve
>
>
> Winnie Lacesso wrote:
>>> When I look at the jobs with qstat -f, I find that their
>>> X509_USER_PROXY variable points to a proxy file that does not exist.
>>
>> Exactly the case here. It looks like there was a flood of about 300
>> jobs in the wee hours of Monday to this CE & the ones that've been
>> queued since then are in this state.
>>
>> Thanks very much for your kind advice. Will cancel them & apologize
>> to submitter.
>
>
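
For the proxy check described in the quoted mails, a minimal sketch that
pulls X509_USER_PROXY out of saved `qstat -f` output and reports the jobs
whose proxy file does not exist. It assumes the usual Torque layout
(a "Job Id:" line per job, long attributes wrapped onto continuation
lines that start with a tab) and that it runs on a host where the proxy
paths are visible. The names proxy_paths and missing_proxies are
hypothetical.

```python
import os
import re

PROXY_RE = re.compile(r"X509_USER_PROXY=([^,\s]+)")


def proxy_paths(qstat_full_text):
    """Map job id -> X509_USER_PROXY path found in `qstat -f` style text."""
    # Assumption: wrapped attribute values continue on lines starting
    # with a tab, so removing "newline + tab" rejoins them.
    unfolded = qstat_full_text.replace("\n\t", "")
    result = {}
    job_id = None
    for line in unfolded.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            continue
        match = PROXY_RE.search(line)
        if match and job_id is not None:
            result[job_id] = match.group(1)
    return result


def missing_proxies(qstat_full_text, exists=os.path.exists):
    """Return the jobs whose proxy path does not exist.

    os.path.exists follows symbolic links, so a link pointing at a
    removed file counts as missing too.
    """
    paths = proxy_paths(qstat_full_text)
    return {job: path for job, path in paths.items() if not exists(path)}
```

The exists argument is only there so the check can be tried out without
touching the filesystem.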
--
The most effective way to do it, is to do it. (Amelia Earhart)
Northgrid Tier2 Technical Coordinator
http://www.hep.manchester.ac.uk/computing/tier2