On Thu, Jun 09, 2005 at 10:58:57PM +0200 or thereabouts, Maarten Litmaath, CERN wrote:
> On Thu, 9 Jun 2005, Stephen Childs wrote:
>
> > I am trying to debug a particularly troublesome site where I just cannot get
> > job submission via an LCG resource broker to work. I've been trying with a
> > simple test job, but it seems to stick in the running state without ever
> > completing. If I log on to the relevant worker node to see what processes
> > are running, I can see various bash and perl processes related to the
> > running job. There is a globus-url-copy to our resource broker
> > (cagraidsvr18.cs.tcd.ie) that seems to be waiting for something to happen:
> >
> >
> > test001 6493 0.0 0.7 5528 3048 ? S 18:04 0:00
> > globus-url-copy
> > file:///home/test001/globus-tmp.gridmon.5948.0/WMS_gridmon_06418_https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/std.err
> > gsiftp://cagraidsvr18.cs.tcd.ie/var/edgwl/SandboxDir/vA/https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/output/std.err
> >
> > If I attach an strace to it, I get the following:
> >
> > select(9, [3 8], [], [], {0, 490155}) = 0 (Timeout)
> > gettimeofday({1118337453, 673900}, NULL) = 0
> > gettimeofday({1118337453, 673935}, NULL) = 0
> > gettimeofday({1118337453, 673969}, NULL) = 0
> > gettimeofday({1118337453, 674000}, NULL) = 0
> >
> > If I use the technique described on the GOC wiki [1] to copy the proxy from
> > the UI to the WN and then su into one of the pool accounts, I can execute
> > the same globus-url-copy command just fine.
> >
> > Any ideas? I suspect network problems, perhaps in the site's firewall. One
> > other symptom is that ssh sessions left open to machines on the site tend to
> > hang. We don't get this happening at other sites.
>
> I vote for a firewall problem. Try unsetting GLOBUS_TCP_PORT_RANGE on the WNs:
> there are firewalls that cannot handle the same ports being immediately reused
> for a new connection (e.g. WN:20000 --> RB:2811 or RB:20000).
> The firewall may explicitly refuse such "dubious" connections, or it may
> silently drop the associated traffic. If the connection is refused,
> globus-url-copy is quite likely to hang indefinitely:
As Maarten says, this sounds very plausible:
http://goc.grid.sinica.edu.tw/gocwiki/gridftp_works_only_once_within_a_minute_or_so
Steve
>
> -----------------------------------------------------------------------------
> $ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
> error: a system call failed (Connection refused)
> $ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
> error: a system call failed (Connection refused)
> $ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
> error: a system call failed (Connection refused)
> $ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
> Cancelling copy...
> -----------------------------------------------------------------------------
>
> The last one I had to ^C out of.
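
For anyone wanting to try the suggestion without touching the WN's global
environment, here is a rough sketch. It assumes GNU coreutils timeout(1) is
available on the node; the function name copy_without_port_range and the
60-second limit are made up for illustration, and the URLs are simply the
ones from the transcript above. It unsets GLOBUS_TCP_PORT_RANGE for one
transfer only, and gives up after the limit rather than waiting for a ^C.

```shell
#!/bin/sh
# Sketch only: copy_without_port_range is a hypothetical helper, and the
# default 60-second limit is an arbitrary assumption.

# Run one copy without the fixed port range, and give up after a time
# limit instead of hanging until someone interrupts it.
copy_without_port_range() {
    src=$1
    dst=$2
    limit=${3:-60}
    (
        # Unset only inside this subshell; the caller's environment is untouched.
        unset GLOBUS_TCP_PORT_RANGE
        timeout "$limit" globus-url-copy "$src" "$dst"
    )
}

# Only attempt the transfer where the client is actually installed.
if command -v globus-url-copy >/dev/null 2>&1; then
    copy_without_port_range file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
    # timeout(1) exits with 124 when it had to kill the command.
    echo "exit status: $?"
fi
```

If repeated copies succeed this way but fail or hang with the variable set,
that points towards the firewall and port reuse rather than the job itself.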
--
Steve Traylen
http://www.gridpp.ac.uk/