I am trying to debug a particularly troublesome site where I just cannot get
job submission via an LCG resource broker to work. I've been trying with a
simple test job, but it seems to stick in the running state without ever
completing. If I log on to the relevant worker node to see what processes
are running, I can see various bash and perl processes related to the
running job. There is a globus-url-copy to our resource broker
(cagraidsvr18.cs.tcd.ie) that seems to be waiting for something to happen:
  test001 6493 0.0 0.7 5528 3048 ? S 18:04 0:00 globus-url-copy
    file:///home/test001/globus-tmp.gridmon.5948.0/WMS_gridmon_06418_https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/std.err
    gsiftp://cagraidsvr18.cs.tcd.ie/var/edgwl/SandboxDir/vA/https_3a_2f_2fcagraidsvr18.cs.tcd.ie_3a9000_2fvAVfCJ3MwtqFkpgRHvH-yg/output/std.err
If I attach strace to that process, I get the following:
select(9, [3 8], [], [], {0, 490155}) = 0 (Timeout)
gettimeofday({1118337453, 673900}, NULL) = 0
gettimeofday({1118337453, 673935}, NULL) = 0
gettimeofday({1118337453, 673969}, NULL) = 0
gettimeofday({1118337453, 674000}, NULL) = 0
If I use the technique described on the GOC wiki [1] to copy the proxy from
the UI to the WN and then su into one of the pool accounts, I can execute
the same globus-url-copy command just fine.
Any ideas? I suspect network problems, perhaps in the site's firewall. One
other symptom is that ssh sessions left open to machines at the site tend to
hang, which does not happen at other sites.
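
To narrow down the firewall suspicion, a basic reachability check could be
run from the worker node. The following is only a minimal sketch: the helper
name can_connect is made up for illustration, and it assumes the resource
broker's GridFTP server listens on the standard control port 2811. It tests
the control channel only; the data channels use other ports (commonly
restricted with GLOBUS_TCP_PORT_RANGE), so a successful result here would not
rule out the firewall.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it only checks that the TCP
    handshake completes and does not speak GridFTP or GSI.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example use from the worker node (assumes the standard GridFTP control port):
#   can_connect("cagraidsvr18.cs.tcd.ie", 2811)
```

Since the manual globus-url-copy succeeds, a check like this is most useful
when repeated while a job is actually stuck, to see whether the result differs.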
Stephen
[1] http://goc.grid.sinica.edu.tw/gocwiki/submit-helper_script_%2e%2e%2e_gave_error%3a_cache_export_dir_%2e%2e%2e
--
Dr. Stephen Childs,
Research Fellow, EGEE Project, phone: +353-1-6081797
Computer Architecture Group, email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland web: http://www.cs.tcd.ie/Stephen.Childs