Ricardo Graciani wrote:
Hi
I've killed those jobs; I hope lhcb will be able to use the site again. Can
you send a test job?
Mark.
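For anyone hitting the same problem, a minimal sketch of one way to find and kill such stuck jobmanager-fork processes. This mirrors the ps listing quoted below, but the exact commands are an assumption, not necessarily what was actually run:

```shell
# Sketch only: find the PIDs of any jobmanager-fork processes from a
# BSD-style ps listing and terminate them. The [j] bracket trick stops
# the grep process from matching its own command line.
pids=$(ps axuw | grep '[j]obmanager-fork' | awk '{print $2}')
for pid in $pids; do
    kill "$pid"   # send SIGTERM first; escalate to kill -9 only if needed
done
echo "killed: ${pids:-none}"
```

If the job managers respawn, the CE-side job records may also need cleaning up, but that is site-specific.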
> Hi Mark,
>
> You can safely kill those fork jobs if they are causing trouble.
> For "unknown reasons" sometimes these jobs hang on the CE side while the
> UI side has already exited.
>
> Regards
>
> Ricardo
>
> ================================================================================
>
> Ricardo Graciani Diaz
>
> Dept. Estructura i Constituents de la Materia
> Facultat de Fisica Tel: +34 93 403 9183
> Universitat de Barcelona Fax: +34 93 402 1198
>
> Diagonal, 647
> E-08028 Barcelona
>
> ================================================================================
>
>
>
>
>>-----Original Message-----
>>From: Roberto SANTINELLI [mailto:[log in to unmask]]
>>Sent: Monday, 05 September 2005 14:54
>>To: LHC Computer Grid - Rollout
>>CC: Ricardo Graciani Diaz
>>Subject: Re: [LCG-ROLLOUT] Jobs stuck in Wait State
>>
>>
>>Hi Mark,
>>I don't think that you will cause any serious problems by killing
>>these test jobs. Anyway, I would rather let Ricardo reply for me.
>>
>>Nevertheless, it might be worth trying to understand why these jobs get
>>stuck, and just keeping them running would help you in understanding...
>>
>>R.
>>
>>
>>
>>On Mon, 5 Sep 2005, Mark Nelson wrote:
>>
>>
> Roberto SANTINELLI wrote:
> Hi Roberto
>
> So what is the easiest fix: just delete the jobs?
>
> Mark.
>
>>Hi Mark,
>>these hung jobs are related to the daily test cron job that lhcb runs
>>to check the sanity of its production sites.
>>The globus-job-run tries to check several gridftp functionalities:
>>get, ls, on the CE ...
>>We have found out that under some (unknown) circumstances the
>>execution of these commands may hang forever on the tested machine.
>>This has probably been the case on your CE.
>>Sorry for the inconvenience.
>>R.
>
>>On Mon, 5 Sep 2005, Mark Nelson wrote:
>
>
>>Hello
>>I have a number of lhcb jobs stuck in wait state; these jobs are
>>trying to run on several worker nodes. We have a shared file system
>>and each machine is able to mount the directories. I have been getting
>>the following error via e-mail since 09:50 yesterday. I also have a
>>number of globus-job-manager processes running on the CE (see below).
>>I have restarted pbs, maui and globus on the CE, and I can ssh to the
>>CE from a worker node as lhcb001.
>
>>PBS Job Id: 24610.helmsley.dur.scotgrid.ac.uk
>>Job Name: STDIN
>>File stage in failed, see below.
>>Job will be retried later, please investigate and correct problem.
>>Post job file processing error; job 24610.helmsley.dur.scotgrid.ac.uk on
>>host wn07.dur.scotgrid.ac.uk/1
>>
>>Unable to copy file 24610.helms.OU to
>>helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.out
>>>>>>>>error from copy
>>helmsley.dur.scotgrid.ac.uk: Connection refused
>>.rhnqO5/batch.out: No such file or directory
>>>>>>>>end error output
>>Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.OU
>>
>>Unable to copy file 24610.helms.ER to
>>helmsley.dur.scotgrid.ac.uk:/mt/home/lhcb001/.lcgjm/globus-cache-export.rhnqO5/batch.err
>>>>>>>>error from copy
>>helmsley.dur.scotgrid.ac.uk: Connection refused
>>.rhnqO5/batch.err: No such file or directory
>>>>>>>>end error output
>>Output retained on that host in: /var/spool/pbs/undelivered/24610.helms.ER
>>--------------------------------------------------------------------------------------------
>>[root@helmsley root]# ps axuw | grep lhcb
>>lhcb001 12712 0.0 0.1 5520 3556 ? S Sep04 0:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 20695 0.0 0.1 5516 3552 ? S Sep04 0:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 28818 0.0 0.1 5516 3548 ? S Sep04 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 29193 0.0 0.1 5520 3556 ? S Sep04 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 29198 0.0 0.1 5516 3548 ? S Sep04 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 12973 0.0 0.1 5516 3544 ? S 10:44 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 13002 0.0 0.1 4348 2796 ? S 10:44 0:00 perl /mt/home/lhcb001/.globus/.gass_cache/local/md5/4a/7df403a165c3ad81cfa6f459c5ae23/md5/82/0b2913e8e51abb8bea5d721ae8c439/data --dest-url=https://lxn1177.cern.ch:20106/tmp/condor_g_scratch.0xab6e928.1638/helmsley.dur.scotgrid.ac.uk:2119.0x959bbb8/grid-mon
>>lhcb001 13245 0.0 0.3 8052 6516 ? S 10:45 0:00 perl /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self --maxtime=3540s
>>lhcb001 13397 0.0 0.1 5520 3548 ? S 10:45 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>>lhcb001 13428 0.0 0.1 4320 2772 ? S 10:45 0:00 perl /mt/home/lhcb001/.globus/.gass_cache/local/md5/27/01d6922efb71f243a80db333e4afe4/md5/9f/3ca83510506bf23ce4307675b8c52c/data --dest-url=https://gdrb07.cern.ch:20001/tmp/condor_g_scratch.0x86bd2e8.31353/helmsley.dur.scotgrid.ac.uk:2119.0x9daf520/grid-mon
>>lhcb001 31300 0.0 0.3 8052 6528 ? S 11:38 0:00 perl /tmp/grid_manager_monitor_agent.lhcb001.13002.1000 --delete-self --maxtime=3540s
>>root 31724 0.0 0.0 4760 672 pts/2 S 11:40 0:00 grep lhcb
>
>>--
>>-------------------------------------------------------------
>>Mark Nelson - [log in to unmask]
>
>>IPPP, Department of Physics, University of Durham,
>>Science Laboratories, South Road, Durham, DH1 3LE
>>Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
>
>>PGP Key: http://www.ippp.dur.ac.uk/~mn/pgp_key.txt
>>This mail is for the addressee only
>
>>
>>--
>>EUROPEAN LABORATORY FOR PARTICLE PHYSICS -- CERN
>>Roberto Santinelli
>>IT/GD Division
>>Building: 28 Office: R-019
>>Phone: +41 22 767 1925
>>Mobile: +41 76 487 0443
>>Fax: +41 22 767 4900
>>Email: [log in to unmask]