Emanuele,
I've checked our firewall config and it appears ports 9001 and 9002 are
blocked - this must have happened recently - probably when the new firewall
was installed and the rulesets were transferred.
I have asked for an urgent change to the rulesets to open 9001/2 inbound.
I'll let you know when I get notification it's been done.
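
For anyone who wants to repeat the outside check once the change is in, here
is a minimal sketch of the same test as the telnet sessions quoted below,
done with a plain TCP connect. The host names and ports are the ones from
this thread; the function names and the 5-second timeout are only
illustrative assumptions, not part of any LCG or EDG tool.

```python
import socket

# Hosts and ports taken from the telnet tests quoted in this thread.
# The names below and the default timeout are illustrative assumptions.
RB_HOSTS = [
    "lxshare0380.cern.ch",
    "gtbcg16.ifca.unican.es",
    "lcgrb01.gridpp.rl.ac.uk",
]
PORTS = [9001, 9002]


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. a firewall silently
        # dropping packets) and name-resolution failures.
        return False


def check_all(hosts=RB_HOSTS, ports=PORTS, timeout=5.0):
    """Return {(host, port): bool} for every host/port combination."""
    return {(h, p): port_is_open(h, p, timeout) for h in hosts for p in ports}
```

Calling check_all() from a machine outside the firewall and printing the
result should show whether 9001 and 9002 are reachable. A connect that hangs
until the timeout (as in the "Trying 130.246.183.184..." case) is reported as
False here, the same as a refused connection.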
Martin.
--
-------------------------------------------------------
Martin Bly | +44 1235 446981 | [log in to unmask]
Systems Admin, Tier 1/A Service, RAL PPD CSG
-------------------------------------------------------
> -----Original Message-----
> From: Emanuele LEONARDI [mailto:[log in to unmask]]
> Sent: Tuesday, December 09, 2003 9:53 AM
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Globus error 3
>
>
> Hi Trevor.
>
> From CERN I see:
>
> (leonardi@it-adc-pc02) ~/grid/recipes> telnet lxshare0380.cern.ch 9001
> Trying 137.138.145.208...
> Connected to lxshare0380.cern.ch.
> Escape character is '^]'.
> ^]
> telnet> quit
> Connection closed.
>
> (leonardi@it-adc-pc02) ~/grid/recipes> telnet gtbcg16.ifca.unican.es 9001
> Trying 193.144.209.116...
> Connected to gtbcg16.ifca.unican.es.
> Escape character is '^]'.
> ^]
> telnet> quit
> Connection closed.
>
> (leonardi@it-adc-pc02) ~/grid/recipes> telnet lcgrb01.gridpp.rl.ac.uk 9001
> Trying 130.246.183.184...
>
> i.e. port 9001 is accessible on CERN and IFCA RBs but not on RAL RB
> (did not test the others). Same thing for port 9002.
>
> As the same test done inside RAL works, this looks really like a
> firewall problem...
>
> Emanuele
>
> Daniels, T (Trevor) wrote:
> > I checked these ports at each of the RBs:
> >
> >              9000     9001
> >
> > CERN(0380)   open     open
> > CERN(0381)   closed   closed
> > IFCA         open     open
> > IFIC         open     open
> > KFKI         closed   closed
> > NIKHEF       open     open
> > PIC          open     open
> > RAL          open     open
> > SINP         open     open
> > SINICA       closed   closed
> >
> > The test of the RAL RB may not reflect the external view since the tests
> > were made from inside the RAL firewall.
> >
> > Trevor
> >
> > Dr Trevor Daniels
> > c/o CCLRC eSC Department Phone: (+44)|(0) 1235 778093
> > Rutherford Appleton Laboratory Fax: (+44)|(0) 1235 446626
> > Chilton, DIDCOT, Oxon, OX11 0QX, UK Email: [log in to unmask]
> > The contents of this email are sent in confidence for the use of the
> > intended recipient only. If you are not one of the intended recipients do
> > not take action on it or show it to anyone else, but return this email to
> > the sender and delete your copy of it.
> >
> >
> >
> >>-----Original Message-----
> >>From: Bly, MJ (Martin) [mailto:[log in to unmask]]
> >>Sent: Tuesday, December 09, 2003 9:31 AM
> >>To: [log in to unmask]
> >>Subject: Re: [LCG-ROLLOUT] Globus error 3
> >>
> >>
> >>We're on to it...
> >>
> >>RB is currently unhappy too.
> >>
> >>M.
> >>--
> >> -------------------------------------------------------
> >> Martin Bly | +44 1235 446981 | [log in to unmask]
> >> Systems Admin, Tier 1/A Service, RAL PPD CSG
> >> -------------------------------------------------------
> >>
> >>
> >>>-----Original Message-----
> >>>From: Gonzalo Merino [mailto:[log in to unmask]]
> >>>Sent: Tuesday, December 09, 2003 9:24 AM
> >>>To: [log in to unmask]
> >>>Subject: Re: [LCG-ROLLOUT] Globus error 3
> >>>
> >>>
> >>>Hello,
> >>>
> >>>I have been asking people from the EDG WP1 about this behaviour and
> >>>apparently this is due to a memory-leaking bug in edg-wl-interlogd. This
> >>>problem is still not fixed in the current rpms; they are working on it.
> >>>
> >>>So, there is indeed a problem in the code that needs to be solved.
> >>>However, it seems that there is also a configuration problem in LCG-1
> >>>that has amplified the effect of the bug. This would not have shown up
> >>>that much without edg-wl-interlogd in the CEs being unable to contact
> >>>the bookkeeping server in lcgrb01.gridpp.rl.ac.uk, port 9001 (9000 is
> >>>the default bookkeeping server's port for queries, 9001 for event
> >>>reception). This could point to a firewall setup problem at RAL.
> >>>
> >>>We have observed this "inflating edg-wl-interlogd" problem in our CE
> >>>(grid-w1.ifae.es), and it turns out that there are lots of log files
> >>>/var/tmp/dg20logd_.* on this machine, all of them pointing to undelivered
> >>>bookkeeping information back to lcgrb01.gridpp.rl.ac.uk.
> >>>
> >>>Could the system administrator at RAL check the firewall settings for
> >>>accessing port 9001 on the RB machine?
> >>>
> >>>cheers,
> >>>Gonzalo
> >>>
> >>>
> >>>Francisco Javier Rodriguez Calonge wrote:
> >>>
> >>>>Jiri Kosina wrote:
> >>>>
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>From time to time we encounter problems submitting jobs to our farm;
> >>>>>edg-job-status reports
> >>>>>
> >>>>>*************************************************************
> >>>>>BOOKKEEPING INFORMATION:
> >>>>>
> >>>>>Printing status info for the Job :
> >>>>>https://lxshare0380.cern.ch:9000/scW9jsIq8INJjBeOaPVgLA
> >>>>>Current Status: Done (Cancelled)
> >>>>>Exit code: 0
> >>>>>Status Reason: Got a job held event, reason: Globus error 3: an I/O operation failed
> >>>>>Destination: golias25.farm.particle.cz:2119/jobmanager-lcgpbs-short
> >>>>>reached on: Thu Nov 27 15:53:27 2003
> >>>>>*************************************************************
> >>>>>
> >>>>>I have tried restarting pbs, mds and gatekeeper, but the problem persists.
> >>>>>The only solution I've found that works is a reboot of the CE.
> >>>>>
> >>>>>Has anyone ever met this problem? Is there anything I should verify?
> >>>>>
> >>>>>Thanks.
> >>>>>
> >>>>>--
> >>>>>Jiri Kosina
> >>>>>Institute of physics, Academy of Sciences of the Czech Republic
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>Hi Jiri,
> >>>>
> >>>>we have noticed that problem here in CIEMAT and you can find it
> >>>>reported in the rollout archives (just search for "Globus error 3" in
> >>>>http://www.listserv.rl.ac.uk/cgi-bin/wa.exe?S1=lcg-rollout).
> >>>>It is related to the /opt/edg/sbin/edg-wl-interlogd process. This process
> >>>>exhausts all memory available in the CE. Below 2% free memory it's not
> >>>>possible to submit any job. The only solution we know is to restart the
> >>>>daemon edg-wl-locallogger (we have put in a cron task that looks at free
> >>>>memory and restarts this daemon when it falls under 10% or so).
> >>>>
> >>>>Cheers, Javier
> >>>>
> >>>>--
> >>>>F.Javier Rodriguez Calonge mailto:[log in to unmask]
> >>>>Tfno: +34 91 346 60 00 Ext: 68 02
> >>>
> >>>--
> >>>Gonzalo Merino ([log in to unmask])
> >>>Institut de Física d'Altes Energies (UAB)
> >>>08193 Bellaterra (Barcelona) SPAIN
> >>>Tel: +34 93 5813322 / Fax: +34 93 5814110
> >>>
> >>
>
>
> --
> /------------------- Emanuele Leonardi -------------------\
> | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> \---------------------------------------------------------/
>