On Fri, Sep 30, 2005 at 03:25:36PM +0200 or thereabouts, Ahmed Beriache wrote:
> Hi all,
>
> We have the following problem at CGG-LCG2: when jobs containing
> lcg-xx commands arrive on our Worker Nodes, some lcg-xx commands
> complete successfully, but very often they hang for a very long time,
> until the job proxy expires or the job is deleted by the LRMS. But when we
> log on to the worker node where an lcg command is hanging and rerun it
> manually, it works correctly and the job continues running (until the
> next lcg command).
>
> We tried unsetting the GLOBUS_TCP_PORT_RANGE variable on the worker nodes,
> but this did not help.
Hi Ahmed,
Are you sure you actually did unset GLOBUS_TCP_PORT_RANGE?
If you only unset it in the profile on the WNs, that is not enough,
since the profile is carried over from the CE and then extended with
the WN's profile.
Run a real job and check its environment. Also find one of the hung lcg-
commands and look in /proc/<pid>/environ to check that it really is unset.
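A minimal sketch of that /proc check, assuming a Linux WN with `pgrep` (procps) installed and that it is run as root or as the pool account that owns the job, since a process's environ file is only readable by its owner. The helper name `check_port_range` is hypothetical. Entries in /proc/<pid>/environ are NUL-separated, hence the `tr`.

```shell
#!/bin/sh
# Hypothetical helper: report GLOBUS_TCP_PORT_RANGE as seen by process $1.
# /proc/<pid>/environ holds the environment the process was exec'd with,
# as NUL-separated VAR=value entries; turn NULs into newlines to grep it.
check_port_range() {
    pid=$1
    # "|| true" keeps a no-match grep from aborting under "set -e"
    val=$(tr '\0' '\n' < "/proc/$pid/environ" | grep '^GLOBUS_TCP_PORT_RANGE=' || true)
    if [ -n "$val" ]; then
        echo "$pid: $val"
    else
        echo "$pid: GLOBUS_TCP_PORT_RANGE unset"
    fi
}

# Check every running process whose full command line contains "lcg-".
for pid in $(pgrep -f 'lcg-'); do
    # Skip processes that have exited or whose environ we cannot read.
    [ -r "/proc/$pid/environ" ] || continue
    check_port_range "$pid"
done
```

Any line showing `GLOBUS_TCP_PORT_RANGE=...` for a hung lcg- command means the variable is still reaching the job, whatever the WN profile says.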
To really unset it, we explicitly unset it in the lcgpbs.in job manager
script.
Steve
>
> Worker Nodes are on a private network and use the SE host as their router
> (se1.egee.fr.cgg.com).
> We configured masquerading with this line in the iptables configuration
> file:
> -A POSTROUTING -s 10.0.0.0/8 -o eth1 -j SNAT --to-source 84.14.104.242
> We have had this problem since we installed the LCG2.6.0 middleware.
>
> Is there any explanation for this? Has anyone experienced a similar
> situation?
>
> Thanks in advance for your help.
>
> Cheers.
>
> Gerald and Ahmed
--
Steve Traylen
http://www.gridpp.ac.uk/