Hi Mark,
I don't think you'll cause any serious problems by killing
these test jobs. Anyway, I would rather let Ricardo reply for me.
Nevertheless, it might be worth trying to understand why these jobs get stuck,
and just keeping them running would help you understand that...
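
If it helps with that investigation, here is a minimal sketch (my own
illustration, not part of the LHCb test suite) that picks out the
globus-job-manager processes left over from a previous day in `ps axuw`
output. The function name stale_job_managers and the rule "a START column
that is not HH:MM means the process is old" are assumptions, and it expects
one process per line, as ps prints it before any mail wrapping.

```python
import re


def stale_job_managers(ps_output, user="lhcb001"):
    """Return PIDs of globus-job-manager processes owned by `user`
    whose START column is a date (e.g. Sep04) rather than a clock time."""
    pids = []
    for line in ps_output.splitlines():
        # ps axuw columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        owner, pid, start, command = fields[0], fields[1], fields[8], fields[10]
        if owner != user or not command.startswith("globus-job-manager"):
            continue
        # Assumption: a clock time (HH:MM) means a recent start, so skip it.
        if re.fullmatch(r"\d{1,2}:\d{2}", start):
            continue
        pids.append(int(pid))
    return pids


# Sample lines taken from the ps listing quoted below, one process per line.
JM = ("globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf "
      "-type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs")
SAMPLE = "\n".join([
    f"lhcb001 12712 0.0 0.1 5520 3556 ? S Sep04 0:01 {JM}",
    f"lhcb001 20695 0.0 0.1 5516 3552 ? S Sep04 0:01 {JM}",
    f"lhcb001 12973 0.0 0.1 5516 3544 ? S 10:44 0:00 {JM}",
    "root 31724 0.0 0.0 4760 672 pts/2 S 11:40 0:00 grep lhcb",
])

print(stale_job_managers(SAMPLE))  # [12712, 20695]
```

The sketch only lists candidates. Whether to kill them or leave them running
for debugging is still a decision for the site admin.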
R.
On Mon, 5 Sep 2005, Mark Nelson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Roberto SANTINELLI wrote:
> Hi Roberto
>
> So what is the easiest fix? Just delete the jobs?
>
> Mark.
> > Hi Mark,
> >
> > These hung jobs are related to the daily test cron job that LHCb runs to
> > check the sanity of its production sites.
> >
> > The globus-job-run command tries to check several gridftp functions
> > (get, ls, ...) on the CE.
> >
> > We have found out that under some (unknown) circumstances the
> > execution of these commands may hang forever on the tested machine. This
> > has probably been the case on your CE.
> >
> > Sorry for the inconvenience.
> >
> > R.
> >
> > On Mon, 5 Sep 2005, Mark Nelson wrote:
> >
> >
> > Hello
> >
> > I have a number of lhcb jobs stuck in the wait state. These jobs are trying
> > to run on several worker nodes. We have a shared file system and each
> > machine is able to mount the directories. I have been getting the following
> > error via e-mail since 09:50 yesterday. I also have a
> > number of globus-job-manager processes running on the CE (see below). I
> > have restarted pbs, maui and globus on the CE and I can ssh to the CE
> > from a worker node as lhcb001.
> >
> > PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
> > Job Name: STDIN
> > File stage in failed, see below.
> > Job will be retried later, please investigate and correct problem.
> > Post job file processing error; job 24610.helmsley.dur.scotgrid.ac.uk on
> > host wn07.dur.scotgrid.ac.uk/1
> >
> > Unable to copy file 24610.helms.OU to
> > helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
> >
> >
> >>>>>>>error from copy
> >
> > helmsley.dur.scotgrid.ac.uk: Connection refused
> > .rhnqO5/batch.out: No such file or directory
> >
> >
> >>>>>>>end error output
> >
> > Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.OU
> >
> > Unable to copy file 24610.helms.ER to
> > helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.err
> >
> >
> >>>>>>>error from copy
> >
> > helmsley.dur.scotgrid.ac.uk: Connection refused
> > .rhnqO5/batch.err: No such file or directory
> >
> >
> >>>>>>>end error output
> >
> > Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.ER
> >
> > -
> > ----------------------------------------------------------------------------------------
> >
> > lhcb001 12712 0.0 0.1 5520 3556 ? S Sep04 0:01
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 20695 0.0 0.1 5516 3552 ? S Sep04 0:01
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 28818 0.0 0.1 5516 3548 ? S Sep04 0:00
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 29193 0.0 0.1 5520 3556 ? S Sep04 0:00
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 29198 0.0 0.1 5516 3548 ? S Sep04 0:00
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 12973 0.0 0.1 5516 3544 ? S 10:44 0:00
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 13002 0.0 0.1 4348 2796 ? S 10:44 0:00 perl
> > /mt/home/lhcb001/.globus/.gass_cache/local/md5/4a/7df403a165c3ad81cfa6f459c5ae23/md5/82/0b2913e8e51abb8bea5d721ae8c439/data
> > --dest-url=https://lxn1177.cern.ch:20106/tmp/condor_g_scratch.0xab6e928.1638/helmsley.dur.scotgrid.ac.uk:2119.0x959bbb8/grid-mon
> > lhcb001 13245 0.0 0.3 8052 6516 ? S 10:45 0:00 perl
> > /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> > --maxtime=3540s
> > lhcb001 13397 0.0 0.1 5520 3548 ? S 10:45 0:00
> > globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type
> > fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > lhcb001 13428 0.0 0.1 4320 2772 ? S 10:45 0:00 perl
> > /mt/home/lhcb001/.globus/.gass_cache/local/md5/27/01d6922efb71f243a80db333e4afe4/md5/9f/3ca83510506bf23ce4307675b8c52c/data
> > --dest-url=https://gdrb07.cern.ch:20001/tmp/condor_g_scratch.0x86bd2e8.31353/helmsley.dur.scotgrid.ac.uk:2119.0x9daf520/grid-mon
> > lhcb001 31300 0.0 0.3 8052 6528 ? S 11:38 0:00 perl
> > /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self
> > --maxtime=3540s
> > root 31724 0.0 0.0 4760 672 pts/2 S 11:40 0:00 grep lhcb
> > [root@helmsley root]# ps axuw |grep
> >
> > --
> > -------------------------------------------------------------
> > Mark Nelson - [log in to unmask]
> >
> > IPPP, Department of Physics, University of Durham,
> > Science Laboratories, South Road, Durham, DH1 3LE
> > Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
> >
> > PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
> > This mail is for the addressee only
>
> - --
> - -------------------------------------------------------------
> Mark Nelson - [log in to unmask]
>
> IPPP, Department of Physics, University of Durham,
> Science Laboratories, South Road, Durham, DH1 3LE
> Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
>
> PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
> This mail is for the addressee only
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.6 (GNU/Linux)
> Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org
>
> iD8DBQFDHD5OlzM++u0MgcERAl5TAJ9z6MoIdpEZbhGgZB5L7A0psJ9VSACdFEhd
> ISCjZTg7W3SYKYHEu0pY6VM=
> =3Z00
> -----END PGP SIGNATURE-----
>
--
EUROPEAN LABORATORY FOR PARTICLE PHYSICS -- CERN
Roberto Santinelli
IT/GD Division
Building: 28 Office: R-019
Phone: +41 22 767 1925
Mobile: +41 76 487 0443
Fax: +41 22 767 4900
Email: [log in to unmask]