-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Roberto SANTINELLI wrote:
Hi Roberto

So what is the easiest fix? Just delete the jobs?

Mark.
> Hi Mark,
> 
> These hung jobs are related to the daily test cron job that LHCb runs to
> check the sanity of its production sites.
> 
> The globus-job-run test checks several gridftp functionalities
> (get, ls, ...) on the CE.
> 
> We have found that under some (unknown) circumstances the execution of
> these commands may hang forever on the tested machine. This has probably
> been the case on your CE.
> 
> Sorry for the inconvenience.
> 
> R.
> 
> On Mon, 5 Sep 2005, Mark Nelson wrote:
> 
> 
> Hello
> 
> I have a number of LHCb jobs stuck in the wait state; these jobs are
> trying to run on several worker nodes. We have a shared file system and
> each machine is able to mount the directories. I have been getting the
> following error via e-mail since 09:50 yesterday. I also have a number
> of globus-job-manager processes running on the CE (see below). I have
> restarted pbs, maui and globus on the CE, and I can ssh to the CE from a
> worker node as lhcb001.
> 
> PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
> Job Name:   STDIN
> File stage in failed, see below.
> Job will be retried later, please investigate and correct problem.
> Post job file processing error; job 24610.helmsley.dur.scotgrid.ac.uk on
> host wn07.dur.scotgrid.ac.uk/1
> 
> Unable to copy file 24610.helms.OU to
> helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
> 
> 
>>>>>>>error from copy
> 
> helmsley.dur.scotgrid.ac.uk: Connection refused
> .rhnqO5/batch.out: No such file or directory
> 
> 
>>>>>>>end error output
> 
> Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.OU
> 
> Unable to copy file 24610.helms.ER to
> helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.err
> 
> 
>>>>>>>error from copy
> 
> helmsley.dur.scotgrid.ac.uk: Connection refused
> .rhnqO5/batch.err: No such file or directory
> 
> 
>>>>>>>end error output
> 
> Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.ER
> 
> ----------------------------------------------------------------------------------------
> 
> lhcb001  12712  0.0  0.1  5520 3556 ?        S    Sep04   0:01
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  20695  0.0  0.1  5516 3552 ?        S    Sep04   0:01
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  28818  0.0  0.1  5516 3548 ?        S    Sep04   0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  29193  0.0  0.1  5520 3556 ?        S    Sep04   0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  29198  0.0  0.1  5516 3548 ?        S    Sep04   0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  12973  0.0  0.1  5516 3544 ?        S    10:44   0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  13002  0.0  0.1  4348 2796 ?        S    10:44   0:00 perl
> /mt/home/lhcb001/.globus/.gass_cache/local/md5/4a/7df403a165c3ad81cfa6f459c5ae23/md5/82/0b2913e8e51abb8bea5d721ae8c439/data
> --dest-url=https://lxn1177.cern.ch:20106/tmp/condor_g_scratch.0xab6e928.1638/helmsley.dur.scotgrid.ac.uk:2119.0x959bbb8/grid-mon
> lhcb001  13245  0.0  0.3  8052 6516 ?        S    10:45   0:00 perl
> /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> --maxtime=3540s
> lhcb001  13397  0.0  0.1  5520 3548 ?        S    10:45   0:00
> globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> lhcb001  13428  0.0  0.1  4320 2772 ?        S    10:45   0:00 perl
> /mt/home/lhcb001/.globus/.gass_cache/local/md5/27/01d6922efb71f243a80db333e4afe4/md5/9f/3ca83510506bf23ce4307675b8c52c/data
> --dest-url=https://gdrb07.cern.ch:20001/tmp/condor_g_scratch.0x86bd2e8.31353/helmsley.dur.scotgrid.ac.uk:2119.0x9daf520/grid-mon
> lhcb001  31300  0.0  0.3  8052 6528 ?        S    11:38   0:00 perl
> /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> --maxtime=3540s
> root     31724  0.0  0.0  4760  672 pts/2    S    11:40   0:00 grep lhcb
> [root@helmsley root]# ps axuw |grep
> 
> --
> -------------------------------------------------------------
> Mark Nelson - [log in to unmask]
> 
> IPPP, Department of Physics, University of Durham,
> Science Laboratories, South Road, Durham, DH1 3LE
> Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
> 
> PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
> This mail is for the addressee only

- --
- -------------------------------------------------------------
Mark Nelson - [log in to unmask]

IPPP, Department of Physics, University of Durham,
Science Laboratories, South Road, Durham, DH1 3LE
Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653

PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
This mail is for the addressee only
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org

iD8DBQFDHD5OlzM++u0MgcERAl5TAJ9z6MoIdpEZbhGgZB5L7A0psJ9VSACdFEhd
ISCjZTg7W3SYKYHEu0pY6VM=
=3Z00
-----END PGP SIGNATURE-----
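
Roberto's explanation in the thread is that the test commands can hang forever on the tested machine. One common way to keep a probe like that from lingering is to run it under a hard timeout. The following is only a minimal sketch of that idea, not the actual LHCb test script: the function name `run_probe` and the stand-in commands are hypothetical, and the real test would invoke its own gridftp checks instead.

```python
import subprocess
import sys


def run_probe(cmd, timeout_s):
    """Run cmd with a time limit (hypothetical helper, not from the thread).

    Returns (status, returncode), where status is "ok", "failed" or
    "timeout". On timeout the return code is None.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before raising.
        # Processes that the child itself started are not tracked here.
        return ("timeout", None)
    status = "ok" if result.returncode == 0 else "failed"
    return (status, result.returncode)


if __name__ == "__main__":
    # Stand-in commands: one that finishes quickly and one that would
    # otherwise block for 30 seconds.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_probe(quick, timeout_s=10))
    print(run_probe(stuck, timeout_s=1))
```

A timeout of this kind only bounds the probe on the submitting side; processes already stuck on a CE would still have to be cleaned up by the site admin.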