Hi Mark,
You can safely kill those fork jobs if they are causing trouble.
For "unknown reasons" sometimes these jobs hang on the CE side while the
UI side has already exited.
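For example, something along these lines on the CE (just a sketch; the
PIDs are the ones from your ps listing below, where the five managers
running since Sep04 look like the stuck ones, so do check against a
fresh listing first):

   # list the fork job managers left over from the test jobs
   # ([j] keeps the grep from matching itself)
   ps axuw | grep '[j]obmanager-fork'
   # kill the stuck ones by PID (PIDs taken from your listing)
   kill 12712 20695 28818 29193 29198
   # any that survive a plain TERM can be forced with kill -9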
Regards
Ricardo
=======================================================================
Ricardo Graciani Diaz
Dept. Estructura i Constituents de la Materia
Facultat de Fisica           Tel: +34 93 403 9183
Universitat de Barcelona     Fax: +34 93 402 1198
Diagonal, 647
E-08028 Barcelona
=======================================================================
> -----Original Message-----
> From: Roberto SANTINELLI [mailto:[log in to unmask]]
> Sent: Monday, 05 September 2005 14:54
> To: LHC Computer Grid - Rollout
> CC: Ricardo Graciani Diaz
> Subject: Re: [LCG-ROLLOUT] Jobs stuck in Wait State
>
>
> Hi Mark,
> I don't think that you'll cause any really important problems by
> killing these test jobs. Anyway, I would rather let Ricardo reply for
> me.
>
> Nevertheless, it might be worth trying to understand why these jobs
> get stuck, and just keeping them running would help you in
> understanding...
>
> R.
>
>
>
> On Mon, 5 Sep 2005, Mark Nelson wrote:
>
> > Roberto SANTINELLI wrote:
> > Hi Roberto
> >
> > So is the easiest fix just to delete the jobs?
> >
> > Mark.
> > > Hi Mark,
> > >
> > > these hung jobs are related to the daily test cron job that lhcb
> > > runs to check the sanity of its production sites.
> > >
> > > The globus-job-run tries to check several gridftp functionalities
> > > (get, ls, ...) on the CE.
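> > >
> > > Roughly, the test runs something like the following against your
> > > CE (a sketch from memory; the exact file names and options may
> > > differ):
> > >
> > >    # run a trivial command through the fork job manager
> > >    globus-job-run helmsley.dur.scotgrid.ac.uk/jobmanager-fork /bin/pwd
> > >    # list a directory over gridftp
> > >    edg-gridftp-ls gsiftp://helmsley.dur.scotgrid.ac.uk/tmp
> > >    # fetch a file over gridftp (the "get" check)
> > >    globus-url-copy gsiftp://helmsley.dur.scotgrid.ac.uk/tmp/test-file \
> > >        file:///tmp/test-file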
> > >
> > > We have found out that under some (unknown) circumstances the
> > > execution of these commands may hang forever on the tested
> > > machine. This has probably been the case on your CE.
> > >
> > > Sorry for the inconvenience.
> > >
> > > R.
> > >
> > > On Mon, 5 Sep 2005, Mark Nelson wrote:
> > >
> > >
> > > Hello
> > >
> > > I have a number of lhcb jobs stuck in wait state; these jobs are
> > > trying to run on several worker nodes. We have a shared file
> > > system, and each machine is able to mount the directories. I have
> > > been getting the following error via e-mail since 09:50 yesterday.
> > > I also have a number of globus-job-manager processes running on
> > > the CE (see below). I have restarted pbs, maui and globus on the
> > > CE, and I can ssh to the CE from a worker node as lhcb001.
> > >
> > > PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
> > > Job Name: STDIN
> > > File stage in failed, see below.
> > > Job will be retried later, please investigate and correct problem.
> > > Post job file processing error; job
> > > 24610.helmsley.dur.scotgrid.ac.uk on host wn07.dur.scotgrid.ac.uk/1
> > >
> > > Unable to copy file 24610.helms.OU to
> > > helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
> > >
> > > >>> error from copy
> > > helmsley.dur.scotgrid.ac.uk: Connection refused
> > > .rhnqO5/batch.out: No such file or directory
> > > >>> end error output
> > >
> > > Output retained on that host in:
> > > /var/spool/pbs/undelivered/24610.helms.OU
> > >
> > > Unable to copy file 24610.helms.ER to
> > > helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.err
> > >
> > > >>> error from copy
> > > helmsley.dur.scotgrid.ac.uk: Connection refused
> > > .rhnqO5/batch.err: No such file or directory
> > > >>> end error output
> > >
> > > Output retained on that host in:
> > > /var/spool/pbs/undelivered/24610.helms.ER
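> > >
> > > (Presumably the failing copy can be reproduced by hand from the
> > > worker node with the undelivered file above; this is only an
> > > approximation of what pbs_mom does, since whether it uses rcp or
> > > scp depends on how it is configured:
> > >
> > >    # copy the undelivered output back to the CE by hand
> > >    scp /var/spool/pbs/undelivered/24610.helms.OU \
> > >        helmsley.dur.scotgrid.ac.uk:/tmp/stageout-test
> > >
> > > though plain ssh to the CE does work for me, as noted above.)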
> > >
> > > --------------------------------------------------------------------------------------------
> > >
> > > lhcb001  12712  0.0  0.1  5520 3556 ?  S  Sep04  0:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  20695  0.0  0.1  5516 3552 ?  S  Sep04  0:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  28818  0.0  0.1  5516 3548 ?  S  Sep04  0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  29193  0.0  0.1  5520 3556 ?  S  Sep04  0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  29198  0.0  0.1  5516 3548 ?  S  Sep04  0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  12973  0.0  0.1  5516 3544 ?  S  10:44  0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  13002  0.0  0.1  4348 2796 ?  S  10:44  0:00 perl /mt/home/lhcb001/.globus/.gass_cache/local/md5/4a/7df403a165c3ad81cfa6f459c5ae23/md5/82/0b2913e8e51abb8bea5d721ae8c439/data - --dest-url=https://lxn1177.cern.ch:20106/tmp/condor_g_scratch.0xab6e928.1638/helmsley.dur.scotgrid.ac.uk:2119.0x959bbb8/grid-mon
> > > lhcb001  13245  0.0  0.3  8052 6516 ?  S  10:45  0:00 perl /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self --maxtime=3540s
> > > lhcb001  13397  0.0  0.1  5520 3548 ?  S  10:45  0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > lhcb001  13428  0.0  0.1  4320 2772 ?  S  10:45  0:00 perl /mt/home/lhcb001/.globus/.gass_cache/local/md5/27/01d6922efb71f243a80db333e4afe4/md5/9f/3ca83510506bf23ce4307675b8c52c/data - --dest-url=https://gdrb07.cern.ch:20001/tmp/condor_g_scratch.0x86bd2e8.31353/helmsley.dur.scotgrid.ac.uk:2119.0x9daf520/grid-mon
> > > lhcb001  31300  0.0  0.3  8052 6528 ?  S  11:38  0:00 perl /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self --maxtime=3540s
> > > root     31724  0.0  0.0  4760  672 pts/2  S  11:40  0:00 grep lhcb
> > > [root@helmsley root]# ps axuw |grep
> > >
> >
> > --
> > -------------------------------------------------------------
> > Mark Nelson - [log in to unmask]
> >
> > IPPP, Department of Physics, University of Durham,
> > Science Laboratories, South Road, Durham, DH1 3LE
> > Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
> >
> > PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
> > This mail is for the addressee only
> >
>
> --
> EUROPEAN LABORATORY FOR PARTICLE PHYSICS -- CERN
> Roberto Santinelli
> IT/GD Division
> Building: 28 Office: R-019
> Phone: +41 22 767 1925
> Mobile: +41 76 487 0443
> Fax: +41 22 767 4900
> Email: [log in to unmask]