Dear Colleagues,
We are running the most recent glite_WN:
...
vdt_globus_essentials-VDT1.10.1x86_64_rhap_5-4
lcg_util-1.11.16-3.sl5
We have recently noticed a problem with the lcg-cp program.
It appears that under some circumstances it can get stuck in an endless
select() loop.
While investigating low job efficiency (confirmed for atlas and biomed) we
spotted hundreds of hanging processes.
Example stack traces from the client side:
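For anyone who wants to check their own worker nodes, here is a minimal sketch of how such long-running transfer processes could be listed. It assumes a Python 3 interpreter and a procps-style ps that supports the etime column; the function names and the six-hour threshold are our own illustrative choices, not part of lcg_util or Globus.

```python
import subprocess


def parse_etime(text):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:          # pad missing hours (or hours and minutes)
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale(ps_text, names=("lcg-cp", "globus-url-copy"), min_age_s=6 * 3600):
    """Return (pid, age_in_seconds, command) for matching old processes.

    ps_text is expected to hold 'pid etime comm' lines without a header.
    The 6 h default threshold is an assumption; tune it per site.
    """
    stale = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, etime, comm = fields
        age = parse_etime(etime)
        if comm.strip() in names and age >= min_age_s:
            stale.append((int(pid), age, comm.strip()))
    return stale


if __name__ == "__main__":
    # Separate -o options with empty headers suppress the header line.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, age, comm in find_stale(out):
        print(pid, age, comm)
```

A process that shows up here is only a candidate; running pstack on it, as in the traces in this mail, confirms whether it is really sitting in select().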
[root@n18-3-4 ~]# pstack 31258
#0 0x000000372e8cc223 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002afd576e65d4 in globus_l_xio_system_poll () from
/opt/globus/lib/libglobus_xio_gcc64dbg.so.0
#2 0x00002afd593e7534 in globus_callback_space_poll () from
/opt/globus/lib/libglobus_common_gcc64dbg.so.0
#3 0x00002afd566b50c2 in copyfilex () from /opt/lcg/lib64/liblcg_util.so.1
#4 0x00002afd566ada8a in lcg_cp5 () from /opt/lcg/lib64/liblcg_util.so.1
#5 0x0000000000401a58 in main ()
[root@n18-3-4 ~]# pstack 11988
#0 0x000000372e8cc223 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002b36e18ed5d4 in globus_l_xio_system_poll () from
/opt/globus/lib/libglobus_xio_gcc64dbg.so.0
#2 0x00002b36e35ee534 in globus_callback_space_poll () from
/opt/globus/lib/libglobus_common_gcc64dbg.so.0
#3 0x00002b36e08bc0c2 in copyfilex () from /opt/lcg/lib64/liblcg_util.so.1
#4 0x00002b36e08b4a8a in lcg_cp5 () from /opt/lcg/lib64/liblcg_util.so.1
#5 0x0000000000401a58 in main ()
[root@n1-3-8 ~]# pstack 31494
#0 0x00000036606ccd63 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002ac5157375d4 in globus_l_xio_system_poll () from
/opt/globus/lib/libglobus_xio_gcc64dbg.so.0
#2 0x00002ac517438534 in globus_callback_space_poll () from
/opt/globus/lib/libglobus_common_gcc64dbg.so.0
#3 0x00002ac5147070c2 in copyfilex () from /opt/lcg/lib64/liblcg_util.so.1
#4 0x00002ac5146ffa8a in lcg_cp5 () from /opt/lcg/lib64/liblcg_util.so.1
#5 0x0000000000401a58 in main ()
This is how it looks on the server side:
#0 0x00002b64be9f5223 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002b64bb79f5d4 in globus_l_xio_system_poll () from
/opt/globus/lib/libglobus_xio_gcc64dbg.so.0
#2 0x00002b64bd8a5534 in globus_callback_space_poll () from
/opt/globus/lib/libglobus_common_gcc64dbg.so.0
#3 0x00000000004048ae in main ()
I have only found this server-related bug:
http://bugzilla.globus.org/globus/show_bug.cgi?id=6215
It seems that the lcg-cp timeouts are not respected in this case.
This happens with both lcg-cp and globus-url-copy.
globus-url-copy:
[root@n14-2-10 ~]# pstack 25683
#0 0x0000003aedccc223 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002ba69f2b65d4 in globus_l_xio_system_poll () from
/opt/globus/lib/libglobus_xio_gcc64dbg.so.0
#2 0x00002ba6a0fb6534 in globus_callback_space_poll () from
/opt/globus/lib/libglobus_common_gcc64dbg.so.0
#3 0x000000000040417a in globus_l_guc_transfer_files ()
#4 0x0000000000405d9d in globus_l_guc_expand_urls ()
#5 0x0000000000403296 in main ()
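Since the tools' own timeouts do not seem to fire in this state, one possible stopgap is an external watchdog that enforces a hard wall-clock limit on the transfer command. The sketch below is only an illustration under the assumption that Python 3 is available on the node; run_with_deadline is a hypothetical helper name, and the command line to run is whatever the job would normally invoke.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd and kill it if it is still running after deadline_s seconds.

    Returns the exit code, or None if the deadline was hit.
    subprocess.run() kills the child itself when the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=deadline_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Usage: watchdog.py <seconds> <command> [args...]
    limit = float(sys.argv[1])
    code = run_with_deadline(sys.argv[2:], limit)
    if code is None:
        print("transfer killed after %s s" % limit, file=sys.stderr)
        sys.exit(124)
    sys.exit(code)
```

This does not fix the underlying hang; it only keeps a stuck transfer from occupying a job slot indefinitely.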
The problem appears for atlas, biomed, hone, lhcb and other VOs.
Have you noticed anything like this at your sites?
Thanks to our recent DPM pool node network problem we were able to see
this on a large scale.
Best Regards
--
Lukasz Flis