On Mon, 5 Sep 2005, Mark Nelson wrote:
> I have a number of lhcb jobs stuck in wait state, these jobs are trying
> to run on several worker nodes. We have a shared file system and each
> machine is able to mount the directories. I am getting the following
> error via e-mail and have been since 09:50 yesterday. I also have a
> number of globus-job-manager processes running on the CE (see below). I
> have restarted pbs, maui and globus on the ce and I can ssh to the CE
> from a worker node as lhcb001
>
> PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
> Job Name: STDIN
> File stage in failed, see below.
> Job will be retried later, please investigate and correct problem.
> Post job file processing error; job 24610.helmsley.dur.scotgrid.ac.uk on
> host wn07.dur.scotgrid.ac.uk/1
>
> Unable to copy file 24610.helms.OU to
> helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
>
> >>>>>> error from copy
>
> helmsley.dur.scotgrid.ac.uk: Connection refused
That error would have been reported by an "scp" from the WN to the CE, so:
Were there any complaints from/about the sshd on the CE around that time?
Were there a large number of ssh connections to the CE?
Were there any changes in the ssh firewall rules on the CE?
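
One way to narrow this down is to try a plain TCP connect to the sshd port
on the CE from one of the affected WNs, which separates "connection refused"
from a timeout. A minimal sketch only: it assumes sshd on the CE listens on
the default port 22 (not confirmed in this thread), and probe_tcp is a
hypothetical helper name, not part of any grid or PBS tooling.

```python
import socket


def probe_tcp(host, port=22, timeout=5.0):
    """Try one TCP connect and classify the result.

    Returns 'open', 'refused', 'timeout', or 'error: <reason>'.
    Port 22 is an assumption (default sshd port).
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote end actively rejected the connection attempt
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable network, and so on
        return "error: %s" % exc


if __name__ == "__main__":
    # Host name taken from the error mail above; run this on a WN.
    print(probe_tcp("helmsley.dur.scotgrid.ac.uk"))
```

If this prints "refused" from wn07 while an interactive ssh from another WN
works, that would suggest something intermittent or specific to that node,
for example sshd not listening at that moment or a firewall rule that rejects
the connection. A "timeout" would more likely point at packets being silently
dropped.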