On Thu, 9 Jun 2005, Stephen Childs wrote:
> I am trying to debug a particularly troublesome site where I just cannot get
> job submission via an LCG resource broker to work. I've been trying with a
> simple test job, but it seems to stick in the running state without ever
> completing. If I log on to the relevant worker node to see what processes
> are running, I can see various bash and perl processes related to the
> running job. There is a globus-url-copy to our resource broker
> (cagraidsvr18.cs.tcd.ie) that seems to be waiting for something to happen:
>
>
> test001 6493 0.0 0.7 5528 3048 ? S 18:04 0:00
> globus-url-copy
> file:///home/test001/globus-tmp.gridmon.5948.0/WMS_gridmon_06418_https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/std.err
> gsiftp://cagraidsvr18.cs.tcd.ie/var/edgwl/SandboxDir/vA/https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/output/std.err
>
> If I attach an strace to it, I get the following:
>
> select(9, [3 8], [], [], {0, 490155}) = 0 (Timeout)
> gettimeofday({1118337453, 673900}, NULL) = 0
> gettimeofday({1118337453, 673935}, NULL) = 0
> gettimeofday({1118337453, 673969}, NULL) = 0
> gettimeofday({1118337453, 674000}, NULL) = 0
>
> If I use the technique described on the GOC wiki [1] to copy the proxy from
> the UI to the WN and then su into one of the pool accounts, I can execute
> the same globus-url-copy command just fine.
>
> Any ideas? I suspect network problems, perhaps in the site's firewall. One
> other symptom is that ssh sessions left open to machines on the site tend to
> hang. We don't get this happening at other sites.

I vote for a firewall problem. Try unsetting GLOBUS_TCP_PORT_RANGE on the WNs:
some firewalls cannot handle the same ports being immediately reused for a new
connection (e.g. WN:20000 --> RB:2811 or RB:20000).
The firewall may explicitly refuse such "dubious" connections, or it may
silently drop the associated traffic. If the connection is refused,
globus-url-copy has a significant chance of hanging indefinitely:
-----------------------------------------------------------------------------
$ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
error: a system call failed (Connection refused)
$ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
error: a system call failed (Connection refused)
$ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
error: a system call failed (Connection refused)
$ globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
Cancelling copy...
-----------------------------------------------------------------------------
The last one I had to ^C out of.
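
To try this by hand on a WN without touching the site configuration, a small
wrapper along the following lines might help. It is only a sketch: the function
name is made up, and it assumes that timeout(1) and "env -u" (both from GNU
coreutils) are available on the node. It runs a command with
GLOBUS_TCP_PORT_RANGE removed from its environment and gives up after a fixed
number of seconds rather than hanging.

```shell
#!/bin/sh
# Hypothetical helper (not part of Globus): run a command with
# GLOBUS_TCP_PORT_RANGE removed from its environment, and give up after
# a fixed number of seconds instead of hanging indefinitely.
# Usage: run_without_port_range SECONDS COMMAND [ARGS...]
run_without_port_range() {
    limit=$1
    shift
    # "env -u" drops the variable for the child process only; timeout(1)
    # kills the child and returns 124 if the limit is exceeded.
    timeout "$limit" env -u GLOBUS_TCP_PORT_RANGE "$@"
}

# Example with the transfer from the transcript above (the 60 s limit is
# an arbitrary choice):
#   run_without_port_range 60 globus-url-copy file:/etc/group gsiftp://lxplus071.cern.ch/tmp/bug
```

An exit status of 124 means the copy was killed after the time limit, which
would correspond to the hang shown above. If the copies stop hanging once the
variable is out of the environment, that points at the port reuse.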