Hi Marteen,
Unfortunately, clients are still hanging even after the corresponding
server has been killed.
Thank you for the server-side solution - we'll give it a try.
I wonder if the problem is also present on EMI-2-based WNs.
In our case the bigger problem is on the client side: 800 stuck
processes mean a significant waste of resources.
Have you observed client-side issues?
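
In the meantime we are considering a simple client-side watchdog on the
WNs, along the lines sketched below. This is only an idea, not something
we have deployed yet; the 4-hour threshold and the exact `ps` fields are
assumptions that would need tuning per site.

```shell
#!/bin/sh
# Sketch of a watchdog for stuck data-movement clients on a WN.
# MAX_AGE is an example threshold (4 hours), not a recommended value.
MAX_AGE=${MAX_AGE:-14400}

# Return success (0) if the elapsed time in seconds exceeds MAX_AGE.
is_stale() {
    [ "$1" -gt "$MAX_AGE" ]
}

# Print the PIDs of lcg-cp / globus-url-copy processes older than MAX_AGE.
# 'ps -o etimes=' requires a procps that supports the etimes field; on
# older systems the DD-HH:MM:SS etime output would have to be parsed.
find_stale_clients() {
    for pid in $(pgrep -x lcg-cp; pgrep -x globus-url-copy); do
        age=$(ps -o etimes= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -n "$age" ] && is_stale "$age" && echo "$pid"
    done
}
```

A cron job could then run something like
`find_stale_clients | xargs -r kill` to reap the hung clients, similar in
spirit to the kill-stale-ftp approach on the server side.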
Cheers
--
Lukasz Flis
>
>> We are running the most recent glite_WN:
>> ...
>> vdt_globus_essentials-VDT1.10.1x86_64_rhap_5-4
>> lcg_util-1.11.16-3.sl5
>>
>> We have recently noticed a problem with lcg-cp program.
>> It appears that in some circumstances it may go into endless select-loop
>> state.
>>
>> While investigating low job efficiency (confirmed for atlas and biomed)
>> we spotted hundreds of hanging processes.
>>
>> Example stack traces from client side:
>>
>> [root@n18-3-4 ~]# pstack 31258
>> #0 0x000000372e8cc223 in __select_nocancel () from /lib64/libc.so.6
>> #1 0x00002afd576e65d4 in globus_l_xio_system_poll () from
>> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
>> #2 0x00002afd593e7534 in globus_callback_space_poll () from
>> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
>> #3 0x00002afd566b50c2 in copyfilex () from
>> /opt/lcg/lib64/liblcg_util.so.1
>> #4 0x00002afd566ada8a in lcg_cp5 () from
>> /opt/lcg/lib64/liblcg_util.so.1
>> #5 0x0000000000401a58 in main ()
>>
>> [...]
>>
>> This is how it looks on the server side:
>> #0 0x00002b64be9f5223 in __select_nocancel () from /lib64/libc.so.6
>> #1 0x00002b64bb79f5d4 in globus_l_xio_system_poll () from
>> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
>> #2 0x00002b64bd8a5534 in globus_callback_space_poll () from
>> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
>> #3 0x00000000004048ae in main ()
>>
>>
>> So far I have only found this server-related bug:
>> http://bugzilla.globus.org/globus/show_bug.cgi?id=6215
>>
>> It seems the lcg-cp timeouts are not respected in this case.
>> The hang occurs with both lcg-cp and globus-url-copy.
>>
>> globus-url-copy:
>> [root@n14-2-10 ~]# pstack 25683
>> #0 0x0000003aedccc223 in __select_nocancel () from /lib64/libc.so.6
>> #1 0x00002ba69f2b65d4 in globus_l_xio_system_poll () from
>> /opt/globus/lib/libglobus_xio_gcc64dbg.so.0
>> #2 0x00002ba6a0fb6534 in globus_callback_space_poll () from
>> /opt/globus/lib/libglobus_common_gcc64dbg.so.0
>> #3 0x000000000040417a in globus_l_guc_transfer_files ()
>> #4 0x0000000000405d9d in globus_l_guc_expand_urls ()
>> #5 0x0000000000403296 in main ()
>>
>> The problem appears for atlas, biomed, hone, lhcb and other VOs.
>> Have you noticed anything like this at your sites?
>>
>> Thanks to a recent network problem on our dpm-pool nodes we were able
>> to observe this on a large scale.
>
> Does the client exit when you kill the corresponding server process?
>
> You may want to run a cron job to kill hanging globus-gridftp-server
> processes, as deployed e.g. on the CERN WMS (!) nodes:
>
> http://eticssoft.web.cern.ch/eticssoft/repository/org.glite/kill-stale-ftp/1.0.0/noarch/kill-stale-ftp-1.0.0-3.noarch.rpm
>
>
> The EMI-adapted version is available here:
>
> http://emisoft.web.cern.ch/emisoft/dist/EMI/2/sl5/x86_64/base/kill-stale-ftp-1.0.1-1.sl5.noarch.rpm
>