-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Roberto SANTINELLI wrote:
Hi Roberto
So what is the easiest fix? Should I just delete the jobs?
Mark.
> Hi Mark,
>
> these hung jobs are related to the daily test cron job that lhcb runs to
> check the sanity of its production sites.
>
> The globus-job-run command tries to check several gridftp functions
> (get, ls, ...) on the CE.
>
> We have found out that under some (unknown) circumstances the
> execution of these commands may hang forever on the tested machine. This
> has probably been the case on your CE.
>
> Sorry for the inconvenience.
>
> R.
>
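One way to stop a test command from hanging forever is to run it under a hard time limit. The following is only a sketch of that idea, not the actual lhcb test: the helper name run_with_timeout and the 60-second default are assumptions, and the demo at the bottom uses the Python interpreter itself as a stand-in for the real commands.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd (a list of strings) and return (status, output).

    status is the exit code, or None if the command was killed
    because it ran longer than timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this;
        # processes spawned by that child are not covered here.
        return None, b""
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Stand-in command: sleeps longer than the limit, so it is killed.
    status, _ = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=1
    )
    print("timed out" if status is None else "exit code %d" % status)
```

With a wrapper like this, a hung check is reported as a timeout instead of leaving a process behind on the tested machine.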
> On Mon, 5 Sep 2005, Mark Nelson wrote:
>
>
> Hello
>
> I have a number of lhcb jobs stuck in the wait state; these jobs are trying
> to run on several worker nodes. We have a shared file system and each
> machine is able to mount the directories. I have been getting the following
> error via e-mail since 09:50 yesterday. I also have a
> number of globus-job-manager processes running on the CE (see below). I
> have restarted pbs, maui and globus on the CE, and I can ssh to the CE
> from a worker node as lhcb001.
>
> PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
> Job Name: STDIN
> File stage in failed, see below.
> Job will be retried later, please investigate and correct problem.
> Post job file processing error; job 24610.helmsley.dur.scotgrid.ac.uk on
> host wn07.dur.scotgrid.ac.uk/1
>
> Unable to copy file 24610.helms.OU to
> helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
>
>
>>>>>>>error from copy
>
> helmsley.dur.scotgrid.ac.uk: Connection refused
> .rhnqO5/batch.out: No such file or directory
>
>
>>>>>>>end error output
>
> Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.OU
>
> Unable to copy file 24610.helms.ER to
> helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.err
>
>
>>>>>>>error from copy
>
> helmsley.dur.scotgrid.ac.uk: Connection refused
> .rhnqO5/batch.err: No such file or directory
>
>
>>>>>>>end error output
>
> Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.ER
>
> ----------------------------------------------------------------------------------------
>
> lhcb001 12712 0.0 0.1 5520 3556 ? S Sep04 0:01
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 20695 0.0 0.1 5516 3552 ? S Sep04 0:01
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 28818 0.0 0.1 5516 3548 ? S Sep04 0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 29193 0.0 0.1 5520 3556 ? S Sep04 0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 29198 0.0 0.1 5516 3548 ? S Sep04 0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 12973 0.0 0.1 5516 3544 ? S 10:44 0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 13002 0.0 0.1 4348 2796 ? S 10:44 0:00 perl
> /mt/home/lhcb001/.globus/.gass_cache/local/md5/4a/7df403a165c3ad81cfa6f459c5ae23/md5/82/0b2913e8e51abb8bea5d721ae8c439/data
> --dest-url=https://lxn1177.cern.ch:20106/tmp/condor_g_scratch.0xab6e928.1638/helmsley.dur.scotgrid.ac.uk:2119.0x959bbb8/grid-mon
> lhcb001 13245 0.0 0.3 8052 6516 ? S 10:45 0:00 perl
> /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> --maxtime=3540s
> lhcb001 13397 0.0 0.1 5520 3548 ? S 10:45 0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001 13428 0.0 0.1 4320 2772 ? S 10:45 0:00 perl
> /mt/home/lhcb001/.globus/.gass_cache/local/md5/27/01d6922efb71f243a80db333e4afe4/md5/9f/3ca83510506bf23ce4307675b8c52c/data
> --dest-url=https://gdrb07.cern.ch:20001/tmp/condor_g_scratch.0x86bd2e8.31353/helmsley.dur.scotgrid.ac.uk:2119.0x9daf520/grid-mon
> lhcb001 31300 0.0 0.3 8052 6528 ? S 11:38 0:00 perl
> /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> --maxtime=3540s
> root 31724 0.0 0.0 4760 672 pts/2 S 11:40 0:00 grep lhcb
> [root@helmsley root]# ps axuw |grep
>
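The listing above mixes job managers started on Sep04 with ones started today. As a rough sketch, the older ones could be picked out by the START column. This assumes the `ps axuw` column layout shown above (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), one process per unwrapped line, and that a START value without a colon (such as Sep04) means the process was not started today. The function name stale_job_managers is hypothetical.

```python
def stale_job_managers(ps_lines, user="lhcb001"):
    """Return PIDs of globus-job-manager processes owned by `user`
    whose START field is a date (e.g. Sep04) rather than a clock time.
    """
    pids = []
    for line in ps_lines:
        # Split into the first ten columns plus the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # header, blank, or truncated line
        owner, pid, start, command = fields[0], fields[1], fields[8], fields[10]
        if owner != user or not pid.isdigit():
            continue
        if "globus-job-manager" not in command:
            continue
        # Assumption: recent processes show HH:MM, older ones show a date.
        if ":" not in start:
            pids.append(int(pid))
    return pids
```

The function only lists candidate PIDs; whether those processes can safely be removed is a separate decision.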
> --
> -------------------------------------------------------------
> Mark Nelson
>
> IPPP, Department of Physics, University of Durham,
> Science Laboratories, South Road, Durham, DH1 3LE
> Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
>
> PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
> This mail is for the addressee only
- --
- -------------------------------------------------------------
Mark Nelson
IPPP, Department of Physics, University of Durham,
Science Laboratories, South Road, Durham, DH1 3LE
Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
This mail is for the addressee only
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org
iD8DBQFDHD5OlzM++u0MgcERAl5TAJ9z6MoIdpEZbhGgZB5L7A0psJ9VSACdFEhd
ISCjZTg7W3SYKYHEu0pY6VM=
=3Z00
-----END PGP SIGNATURE-----