Hi Lukasz,
> We are running the most recent glite_WN:
> ...
> vdt_globus_essentials-VDT1.10.1x86_64_rhap_5-4
> lcg_util-1.11.16-3.sl5
>
> We have recently noticed a problem with the lcg-cp program.
> It appears that in some circumstances it may enter an endless select loop.
>
> While investigating low job efficiency (confirmed for atlas and biomed) we
> spotted hundreds of hanging processes.
>
> Example stack traces from client side:
>
> [root@n18-3-4 ~]# pstack 31258
> #0 0x000000372e8cc223 in __select_nocancel () from /lib64/libc.so.6
> #1 0x00002afd576e65d4 in globus_l_xio_system_poll () from
> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
> #2 0x00002afd593e7534 in globus_callback_space_poll () from
> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
> #3 0x00002afd566b50c2 in copyfilex () from /opt/lcg/lib64/liblcg_util.so.1
> #4 0x00002afd566ada8a in lcg_cp5 () from /opt/lcg/lib64/liblcg_util.so.1
> #5 0x0000000000401a58 in main ()
>
> [...]
>
> This is how it looks on the server side:
> #0 0x00002b64be9f5223 in __select_nocancel () from /lib64/libc.so.6
> #1 0x00002b64bb79f5d4 in globus_l_xio_system_poll () from
> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
> #2 0x00002b64bd8a5534 in globus_callback_space_poll () from
> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
> #3 0x00000000004048ae in main ()
>
>
> I have only found this server-related bug:
> http://bugzilla.globus.org/globus/show_bug.cgi?id=6215
>
> It seems that lcg-cp timeouts are not respected in this case.
> This happens with both lcg-cp and globus-url-copy.
>
> globus-url-copy:
> [root@n14-2-10 ~]# pstack 25683
> #0 0x0000003aedccc223 in __select_nocancel () from /lib64/libc.so.6
> #1 0x00002ba69f2b65d4 in globus_l_xio_system_poll () from
> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
> #2 0x00002ba6a0fb6534 in globus_callback_space_poll () from
> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
> #3 0x000000000040417a in globus_l_guc_transfer_files ()
> #4 0x0000000000405d9d in globus_l_guc_expand_urls ()
> #5 0x0000000000403296 in main ()
>
> The problem appears for atlas, biomed, hone, lhcb and other VOs.
> Have you noticed anything like this at your sites?
>
> Thanks to a recent network problem on our dpm-pool node, we were able to
> see this on a large scale.
Does the client exit when you kill the corresponding server process?
You may want to run a cron job to kill hanging globus-gridftp-server
processes, as deployed e.g. on the CERN WMS (!) nodes:
http://eticssoft.web.cern.ch/eticssoft/repository/org.glite/kill-stale-ftp/1.0.0/noarch/kill-stale-ftp-1.0.0-3.noarch.rpm
The EMI-adapted version is available here:
http://emisoft.web.cern.ch/emisoft/dist/EMI/2/sl5/x86_64/base/kill-stale-ftp-1.0.1-1.sl5.noarch.rpm
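
For illustration only, here is a minimal sketch of what such a cron job could do. It is not the code shipped in the kill-stale-ftp RPM. The age threshold, the function names and the choice of SIGTERM are my own assumptions, and it relies on a `ps` that supports the `etimes` field (newer procps versions; older ones may only offer `etime`). Note that age alone does not prove a process is hung, so the threshold should be well above your longest legitimate transfer.

```python
import os
import signal
import subprocess

# Assumed values for this sketch; adjust for your site.
SERVER_NAME = "globus-gridftp-server"
MAX_AGE_SECONDS = 6 * 3600


def find_stale(ps_output, name=SERVER_NAME, max_age=MAX_AGE_SECONDS):
    """Return PIDs of processes called `name` that are older than `max_age`.

    `ps_output` is expected to hold one "PID ELAPSED_SECONDS COMMAND..." entry
    per line, as printed by `ps -eo pid=,etimes=,args=`.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # blank or malformed line
        pid, age, args = fields
        if not (pid.isdigit() and age.isdigit()):
            continue
        # Compare only the executable's base name, ignoring path and options.
        executable = os.path.basename(args.split()[0])
        if executable == name and int(age) > max_age:
            stale.append(int(pid))
    return stale


def kill_stale(dry_run=True):
    """List candidate processes and, unless dry_run is set, send them SIGTERM."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,args="],
        capture_output=True, text=True, check=True,
    )
    for pid in find_stale(result.stdout):
        if dry_run:
            print("would terminate", pid)
        else:
            os.kill(pid, signal.SIGTERM)
```

Calling `kill_stale(dry_run=True)` first only prints the candidate PIDs, which makes it easy to check the threshold before letting cron run it with `dry_run=False`.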