On 28/10/13 00:24, Elena Korolkova wrote:
> Hi Chris
>
> As I mentioned, the problematic disk server is the last disk server which was put into production. It has the lowest weight in DPM at the moment, but this doesn't solve the problem.
> I don't have room to drain this disk server and other disk servers. How did you solve this problem?
>
What I'm doing to move data onto new disk servers is:
lfs find -type f --mtime +1 /mnt/lustre_0/storm_3 | grep -v ops |
  awk '{foo=rand(); if (foo<0.1) print $0 }' | lfs_migrate -y
The idea being to move a random 10% of the data - at least some of which
will end up on the new servers.
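
A minimal sketch of the sampling step as a reusable shell function, in case it helps. The name sample_fraction and the optional seed argument are made up for illustration; the only change from the one-liner is an explicit srand(), because gawk and some other awk implementations return the same rand() sequence on every run unless a seed is set.

```shell
#!/bin/sh
# sample_fraction FRACTION [SEED]   (hypothetical helper, not part of Lustre)
# Reads lines (e.g. file paths) on stdin and prints each one with
# probability FRACTION. The seed defaults to the current epoch time so
# that repeated runs select different files.
sample_fraction() {
    fraction=$1
    seed=${2:-$(date +%s)}
    # srand(s) seeds the generator; the bare pattern prints matching lines.
    awk -v f="$fraction" -v s="$seed" 'BEGIN { srand(s) } rand() < f'
}

# Assumed usage, mirroring the pipeline above:
#   lfs find -type f --mtime +1 /mnt/lustre_0/storm_3 | grep -v ops |
#     sample_fraction 0.1 | lfs_migrate -y
```

Passing a fixed seed makes a run repeatable, which can be handy for checking what would be selected before piping into lfs_migrate.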
If the new server is 10% of your capacity and you could move 90% of the
files off it (replacing each one with a random file from the other disk
servers), you'd be in better shape, I think. This assumes file sizes are
the same - which I suspect they are, but...
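
To make the arithmetic concrete, here is a rough sketch under the same equal-file-size assumption. The function name swap_fraction and the capacities in the example are hypothetical; the point is only that the share of files left on the new server should match its share of total capacity.

```shell
#!/bin/sh
# swap_fraction NEW_SERVER_CAPACITY TOTAL_CAPACITY   (hypothetical helper)
# Prints the fraction of files to swap off the new server so that what
# remains on it is proportional to its share of total capacity.
# Both arguments must use the same unit; equal file sizes are assumed.
swap_fraction() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 1 - n / t }'
}

# Example with made-up capacities: a 10 TB server in a 100 TB pool.
#   swap_fraction 10 100    prints 0.90
```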
How to actually do this with DPM, I have no idea.
Chris
>
> Thank you very much
>
> Elena
>
> On 27 Oct 2013, at 14:47, Christopher J. Walker wrote:
>
>> On 25/10/13 15:27, Wahid Bhimji wrote:
>>> I would imagine they are FTS transfers rather than SAM tests (not sure
>>> why they run as sgmatl - guess that depends on the FTS server proxy).
>>>
>>> Maybe it could be because the transfers are slow and get backed up.
>>> Do you have the TCP tunings from
>>> https://www.gridpp.ac.uk/wiki/UKTcpTuning
>>>
>>> in particular the larger default value for the TCP buffer size?
>>>
>>
>> Did you bring this disk server online with lots of others, or on its own? If on its own, then perhaps it is a problem of poor data distribution. Maybe it has filled with a particular dataset. We've certainly experienced that with one disk server. What I did in that case was take it offline, drain it (and several others in fact), and put them back online together - along with migrating some data from the other disk servers.
>>
>>
>> Chris
>>
>>
>>> wahid
>>>
>>> On 25 Oct 2013, at 14:41, Elena Korolkova wrote:
>>>
>>>> Hi
>>>>
>>>> Wahid has switched Sheffield to use xrootd (many thanks) and things
>>>> look better.
>>>>
>>>> I still have one disk server overloaded (it's always the same disk
>>>> server) with lots (104 atm) of globus-gridftp-server processes run by
>>>> sgmatl (SAM tests???)
>>>>
>>>> Your help is much appreciated.
>>>>
>>>> Elena
>>>>
>>>>
>>>> sgmatl50 28733 7.9 0.0 118796 6956 ? D 11:10 16:45
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 28873 8.6 0.0 122932 8748 ? R 11:31 16:29
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29025 8.5 0.0 122944 11120 ? R 11:50 14:31
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29134 7.7 0.0 118800 6964 ? R 12:06 12:02
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29174 8.8 0.0 122928 8672 ? R 12:16 12:54
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29214 7.8 0.0 118800 6960 ? R 12:20 11:03
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29216 8.4 0.0 122936 8740 ? R 12:21 11:49
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>> sgmatl50 29382 7.8 0.0 118800 6960 ? D 12:27 10:26
>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>> -inetd -config-base-path /
>>>>
>>>>
>>>> lsof |grep globus |wc -l
>>>> 480
>>>>
>>>> TCP lcgse10.shef.ac.uk:gsiftp->lcgfts09.gridpp.rl.ac.uk:43977 (ESTABLISHED)
>>>> globus-gr 468 sgmatl50 17u IPv4 7384601 TCP lcgse10.shef.ac.uk:24404->dpool18.triumf.ca:52672 (ESTABLISHED)
>>>> globus-gr 468 sgmatl50 18u IPv4 7384602 TCP lcgse10.shef.ac.uk:24404->dpool18.triumf.ca:52678 (ESTABLISHED)
>>>> globus-gr 468 sgmatl50 19u IPv4 7384603 TCP lcgse10.shef.ac.uk:24404->dpool18.triumf.ca:52675 (ESTABLISHED)
>>>> ......................
>>>> globus-gr 468 sgmatl50 25u IPv4 7384609 TCP lcgse10.shef.ac.uk:24404->dpool18.triumf.ca:52680 (ESTABLISHED)
>>>> globus-gr 493 sgmatl50 0u IPv4 7384772 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts06.gridpp.rl.ac.uk:45037 (CLOSE_WAIT)
>>>> globus-gr 494 sgmatl50 0u IPv4 7384794 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts08.gridpp.rl.ac.uk:36046 (ESTABLISHED)
>>>> globus-gr 494 sgmatl50 17u IPv4 7385039 TCP lcgse10.shef.ac.uk:24414->dcdoor14.usatlas.bnl.gov:48440 (ESTABLISHED)
>>>> globus-gr 494 sgmatl50 18u IPv4 7385040 TCP lcgse10.shef.ac.uk:24414->dcdoor14.usatlas.bnl.gov:48445 (ESTABLISHED)
>>>> .................................
>>>> globus-gr 1031 sgmatl50 0u IPv4 7386025 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:41167 (ESTABLISHED)
>>>> globus-gr 1031 sgmatl50 17u IPv4 7386168 TCP lcgse10.shef.ac.uk:24416->dpool21.triumf.ca:40546 (ESTABLISHED)
>>>> globus-gr 1031 sgmatl50 18u IPv4 7386169 TCP lcgse10.shef.ac.uk:24416->dpool21.triumf.ca:40548 (ESTABLISHED)
>>>> globus-gr 1031 sgmatl50 19u IPv4 7386170 TCP lcgse10.shef.ac.uk:24416->dpool21.triumf.ca:40554 (ESTABLISHED)
>>>> ......................
>>>> globus-gr 1101 sgmatl50 0u IPv4 7386163 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:41242 (ESTABLISHED)
>>>> globus-gr 1101 sgmatl50 17u IPv4 7386955 TCP lcgse10.shef.ac.uk:24428->dpool29.triumf.ca:39636 (ESTABLISHED)
>>>> ....................
>>>> globus-gr 1763 sgmatl50 0u IPv4 7387191 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts09.gridpp.rl.ac.uk:49176 (ESTABLISHED)
>>>> globus-gr 1763 sgmatl50 17u IPv4 7387608 TCP lcgse10.shef.ac.uk:24422->dpool26.triumf.ca:49830 (ESTABLISHED)
>>>> globus-gr 1763 sgmatl50 18u IPv4 7387609 TCP lcgse10.shef.ac.uk:24422->dpool26.triumf.ca:49831 (ESTABLISHED)
>>>> ................
>>>> globus-gr 3043 sgmatl50 0u IPv4 7394899 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts10.gridpp.rl.ac.uk:59123 (CLOSE_WAIT)
>>>> globus-gr 3583 sgmatl50 0u IPv4 7395759 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts07.gridpp.rl.ac.uk:35544 (ESTABLISHED)
>>>> globus-gr 3583 sgmatl50 17u IPv4 7395928 TCP lcgse10.shef.ac.uk:24398->se03.esc.qmul.ac.uk:52134 (ESTABLISHED)
>>>> .........................
>>>> globus-gr 3584 sgmatl50 17u IPv4 7395975 TCP lcgse10.shef.ac.uk:24411->dpool18.triumf.ca:36199 (ESTABLISHED)
>>>> globus-gr 3584 sgmatl50 18u IPv4 7395976 TCP lcgse10.shef.ac.uk:24411->dpool18.triumf.ca:36205 (ESTABLISHED)
>>>> ...................
>>>> globus-gr 3605 sgmatl50 0u IPv4 7396013 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts09.gridpp.rl.ac.uk:51746 (CLOSE_WAIT)
>>>> globus-gr 3624 sgmatl50 0u IPv4 7396090 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts07.gridpp.rl.ac.uk:36218 (CLOSE_WAIT)
>>>> globus-gr 3645 sgmatl50 0u IPv4 7396314 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts10.gridpp.rl.ac.uk:32929 (ESTABLISHED)
>>>> globus-gr 3645 sgmatl50 17u IPv4 7396338 TCP lcgse10.shef.ac.uk:24394->se03.esc.qmul.ac.uk:47784 (ESTABLISHED)
>>>> .....
>>>> globus-gr 3671 sgmatl50 0u IPv4 7396603 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:46809 (CLOSE_WAIT)
>>>> globus-gr 3717 sgmatl50 0u IPv4 7396978 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:48513 (ESTABLISHED)
>>>> globus-gr 3717 sgmatl50 17u IPv4 7397169 TCP lcgse10.shef.ac.uk:24418->lpnhe-ds22.in2p3.fr:52933 (ESTABLISHED)
>>>> globus-gr 3717 sgmatl50 18u IPv4 7397170 TCP lcgse10.shef.ac.uk:24418->lpnhe-ds22.in2p3.fr:52934 (ESTABLISHED)
>>>> ...........
>>>> globus-gr 3739 sgmatl50 17u IPv4 7397425 TCP lcgse10.shef.ac.uk:24388->se04.esc.qmul.ac.uk:59821 (ESTABLISHED)
>>>> globus-gr 3739 sgmatl50 18u IPv4 7397426 TCP lcgse10.shef.ac.uk:24388->se04.esc.qmul.ac.uk:59824 (ESTABLISHED)
>>>> ............
>>>> globus-gr 3739 sgmatl50 26u IPv4 7397434 TCP lcgse10.shef.ac.uk:24388->s
>>>> globus-gr 3740 sgmatl50 0u IPv4 7397337 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts08.gridpp.rl.ac.uk:46984 (ESTABLISHED)
>>>> globus-gr 3740 sgmatl50 17u IPv4 7397508 TCP lcgse10.shef.ac.uk:24424->dcdoor10.usatlas.bnl.gov:36545 (ESTABLISHED)
>>>> globus-gr 3740 sgmatl50 18u IPv4 7397509 TCP lcgse10.shef.ac.uk:24424->dcdoor10.usatlas.bnl.gov:36548 (ESTABLISHED)
>>>> globus-gr 3740 sgmatl50 19u IPv4 7397510 TCP lcgse10.shef.ac.uk:24424->dcdoor10.usatlas.bnl.gov:36553 (ESTABLISHED)
>>>> globus-gr 3740 sgmatl50 20u IPv4 7397511 TCP lcgse10.shef.ac.uk:24424->dcdoor10.usatlas.bnl.gov:36554 (ESTABLISHED)
>>>> .........
>>>> globus-gr 3768 sgmatl50 25u IPv4 7397725 TCP lcgse10.shef.ac.uk:24390->disk058.gla.scotgrid.ac.uk:44257 (ESTABLISHED)
>>>> globus-gr 3768 sgmatl50 26u IPv4 7397726 TCP lcgse10.shef.ac.uk:24390->disk058.gla.scotgrid.ac.uk:44260 (ESTABLISHED)
>>>> globus-gr 3795 sgmatl50 0u IPv4 7397980 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts06.gridpp.rl.ac.uk:59871 (CLOSE_WAIT)
>>>> ...........
>>>> globus-gr 3805 sgmatl50 26u IPv4 7398608 TCP lcgse10.shef.ac.uk:24410->disk056.gla.scotgrid.ac.uk:55130 (ESTABLISHED)
>>>> globus-gr 3806 sgmatl50 0u IPv4 7398072 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:52855 (CLOSE_WAIT)
>>>> globus-gr 3828 sgmatl50 0u IPv4 7398354 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts10.gridpp.rl.ac.uk:40646 (CLOSE_WAIT)
>>>> globus-gr 3835 sgmatl50 0u IPv4 7398485 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts05.gridpp.rl.ac.uk:53275 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 17u IPv4 7398636 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35237 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 18u IPv4 7398646 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35241 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 19u IPv4 7398647 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35240 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 20u IPv4 7398648 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35245 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 21u IPv4 7398649 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35239 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 22u IPv4 7398650 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35238 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 23u IPv4 7398651 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35243 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 24u IPv4 7398652 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35244 (ESTABLISHED)
>>>> globus-gr 3835 sgmatl50 25u IPv4 7398653 TCP lcgse10.shef.ac.uk:24407->lpnhe-ds10.in2p3.fr:35242 (ESTABLISHED)
>>>> globus-gr 3836 sgmatl50 0u IPv4 7398495 TCP lcgse10.shef.ac.uk:gsiftp->lcgfts10.gridpp.rl.ac.uk:40654 (ESTABLISHED)
>>>> globus-gr 3836 sgmatl50 17u IPv4 7398562 TCP lcgse10.shef.ac.uk:24401->disk077.gla.scotgrid.ac.uk:60831 (ESTABLISHED)
>>>>
>>>> On 24 Oct 2013, at 09:53, Sam Skipsey wrote:
>>>>
>>>>> In other news, there's a blog entry (finally) on the GridPP Storage
>>>>> blog about what we did at Glasgow to reduce load. (xrootd direct IO
>>>>> and slowing the rate of analysis job starts).
>>>>>
>>>>> Sam
>>>>>
>>>>> On 24 October 2013 08:10, Wahid Bhimji wrote:
>>>>>> Hi
>>>>>>
>>>>>> Sorry you are still having load issues. A few thoughts:
>>>>>>
>>>>>> 1. Do you have the tuning from this page applied on these disk
>>>>>> servers - specifically the block device readahead?
>>>>>> https://www.gridpp.ac.uk/wiki/Performance_and_Tuning#Tuning_block_device_readahead
>>>>>>
>>>>>> 2. I would move to xrootd instead of rfio for ATLAS jobs - at least
>>>>>> we can then, if it still happens, ask the DPM core team - otherwise
>>>>>> they will just suggest it. If you have xrootd running on all the
>>>>>> machines it is a simple switch in AGIS.
>>>>>>
>>>>>> 3. Do others see this number of globus-gridftp-server processes? I
>>>>>> looked on my disk servers and I currently only see one on each, which
>>>>>> has been running for over a month… this may be a sign of something
>>>>>> strange, or a symptom, or it may be just me. Also mine doesn't have
>>>>>> the -inetd option - but this might all be a red herring:
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path /
>>>>>>
>>>>>> 4. Unbalanced datasets - maybe Sam can help you there - at least to
>>>>>> see if there are such datasets, even if not to redistribute them.
>>>>>>
>>>>>> Wahid
>>>>>>
>>>>>> On 24 Oct 2013, at 00:42, Elena Korolkova wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> A close look at an overloaded disk server shows 130
>>>>>> globus-gridftp-server processes:
>>>>>>
>>>>>> log/dpm-gsiftp/gridftp.log -Z /var/log/dpm-gsiftp/dpm-gsiftp.log
>>>>>> -no-detach
>>>>>> -config-base-path / -inetd -config-base-path /
>>>>>> prdatl88 32448 0.8 0.0 118768 6936 ? D 17:15 0:29
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32467 0.7 0.0 118052 6196 ? D 17:16 0:24
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32488 0.8 0.0 118768 6932 ? D 17:17 0:28
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32505 0.8 0.0 118768 6928 ? D 17:18 0:25
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32530 0.6 0.0 118768 6932 ? D 17:19 0:19
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32574 1.3 0.0 122920 11100 ? S 17:21 0:40
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32575 0.7 0.0 118768 6928 ? D 17:21 0:22
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32591 0.7 0.0 118052 6192 ? S 17:22 0:22
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32598 1.0 0.0 121900 8132 ? D 17:23 0:30
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32632 0.6 0.0 118768 6936 ? D 17:25 0:17
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32701 1.2 0.0 122936 8696 ? S 17:28 0:31
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32713 1.6 0.0 122936 11136 ? S 17:29 0:40
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32717 0.6 0.0 118768 6928 ? D 17:30 0:17
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32740 0.7 0.0 118768 6932 ? D 17:31 0:18
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>> prdatl88 32763 0.6 0.0 118768 6932 ? D 17:32 0:14
>>>>>> /usr/sbin/globus-gridftp-server -d all -p 2811 -auth-level 0 -dsi dpm
>>>>>> -disable-usage-stats -l /var/log/dpm-gsiftp/gridftp.log -Z
>>>>>> /var/log/dpm-gsiftp/dpm-gsiftp.log -no-detach -config-base-path / -inetd
>>>>>> -config-base-path /
>>>>>>
>>>>>>
>>>>>>
>>>>>> In /var/log/rfio/log I see error messages like this:
>>>>>>
>>>>>> Oct 23 21:17:59 rfiod[7854]: Waiting for end of child 10737, status 1
>>>>>> Oct 23 21:17:59 rfiod[10738]: doit(5): connection from [143.167.3.102]
>>>>>> (lcgse0.shef.ac.uk)
>>>>>> Oct 23 21:17:59 rfiod[10739]: doit(5): connection from [143.167.3.102]
>>>>>> (lcgse0.shef.ac.uk)
>>>>>> Oct 23 21:17:59 rfiod[10741]: doit(5): connection from [192.168.42.130]
>>>>>> (wn030.hep)
>>>>>> Oct 23 21:17:59 rfiod[10740]: doit(5): connection from [143.167.3.102]
>>>>>> (lcgse0.shef.ac.uk)
>>>>>> Oct 23 21:17:59 rfiod[10739]: Could not establish an authenticated
>>>>>> connection: server_establish_context_ext: Could not receive token;
>>>>>> _Csec_recv_token: Connection closed; Csec_server_set_service_name:
>>>>>> Could not
>>>>>> set service name; Csec_get_peer_service_name: Could not
>>>>>> Cgetnetaddress: BAD
>>>>>> ERROR NUMBER: 0 !
>>>>>> Oct 23 21:17:59 rfiod[10742]: doit(5): connection from [143.167.3.102]
>>>>>> (lcgse0.shef.ac.uk)
>>>>>> Oct 23 21:17:59 rfiod[10740]: Could not establish an authenticated
>>>>>> connection: server_establish_context_ext: Could not receive token;
>>>>>> _Csec_recv_token: Connection closed; Csec_server_set_service_name:
>>>>>> Could not
>>>>>> set service name; Csec_get_peer_service_name: Could not
>>>>>> Cgetnetaddress: BAD
>>>>>> ERROR NUMBER: 0 !
>>>>>> Oct 23 21:17:59 rfiod[10742]: Could not establish an authenticated
>>>>>> connection: server_establish_context_ext: Could not receive token;
>>>>>> _Csec_recv_token: Connection closed; Csec_server_set_service_name:
>>>>>> Could not
>>>>>> set service name; Csec_get_peer_service_name: Could not
>>>>>> Cgetnetaddress: BAD
>>>>>> ERROR NUMBER: 0 !
>>>>>>
>>>>>> ice_name: Could not set service name; Csec_get_peer_service_name:
>>>>>> Could not
>>>>>> Cgetnetaddress: BAD ERROR NUMBER: 0 !
>>>>>> Oct 23 21:17:59 rfiod[10741]: Could not establish an authenticated
>>>>>> connection: _Csec_recv_token: Connection closed;
>>>>>> Csec_server_set_service_name: Could not set service name;
>>>>>> Csec_get_peer_service_name: Could not Cgetnetaddress: BAD ERROR
>>>>>> NUMBER: 0 !
>>>>>>
>>>>>>
>>>>>> /var/log/dpm-gsiftp/gridftp.log :
>>>>>>
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: GFork functionality not enabled.:
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: Configuration read from
>>>>>> /etc/gridftp.conf.
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: Server started in inetd mode.
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: New connection from:
>>>>>> lcgfts08.gridpp.rl.ac.uk:51809
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: USER :globus-mapping:
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 331 Password required for :globus-mapping:.
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: PASS
>>>>>> [11177] Wed Oct 23 21:47:49 2013 :: request by /DC=ch/DC=cern/OU=Organic
>>>>>> Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
>>>>>> from
>>>>>> lcgfts08.gridpp.rl.ac.uk
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: DN /DC=ch/DC=cern/OU=Organic
>>>>>> Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
>>>>>> successfully authorized.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: User prdatl88 successfully
>>>>>> authorized.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: PASS
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 230 User prdatl88 logged in.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: SITE HELP
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 214-The following commands are recognized:
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: FEAT
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 211-Extensions supported
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: SITE CLIENTINFO
>>>>>> scheme=gsiftp;appname="libglobus_ftp_client";appver="7.4 (gcc64,
>>>>>> 1340810069-83) [Globus Toolkit 5.2.1]";
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 250 OK.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: TYPE I
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 200 Type set to I.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: PBSZ 1048576
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 200 PBSZ=1048576
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: DELE
>>>>>> /lcgse9.shef.ac.uk:/storage1/atlas/2013-10-23/AOD.01328071._004259.pool.root.1.63161335.0
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 500 Command failed : unlink error: Permission denied
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [CLIENT]: QUIT
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: lcgfts08.gridpp.rl.ac.uk:51809:
>>>>>> [SERVER]: 221 Goodbye.
>>>>>> [11177] Wed Oct 23 21:47:50 2013 :: Closed connection from
>>>>>> lcgfts08.gridpp.rl.ac.uk:51809
>>>>>> [7891] Wed Oct 23 21:47:50 2013 :: Child process 11177 ended with rc = 0
>>>>>>
>>>>>> [10443] Wed Oct 23 22:11:00 2013 :: Failure attempting to transfer
>>>>>> "/lcgse9.shef.ac.uk:/storage1/atlas/2013-10-23/EVNT.01361822._000004.pool.root.1.63152845.0".
>>>>>> [10443] Wed Oct 23 22:11:00 2013 :: Transfer failure:
>>>>>> [10443] Wed Oct 23 22:11:00 2013 :: force_close:
>>>>>> [10443] Wed Oct 23 22:11:00 2013 :: lcgfta02.gridpp.rl.ac.uk:57242:
>>>>>> [SERVER]: 500-Command failed. : globus_xio: System error in send: Broken
>>>>>> pipe
>>>>>> [10443] Wed Oct 23 22:11:00 2013 :: Closed connection from
>>>>>> lcgfta02.gridpp.rl.ac.uk:57242
>>>>>> [7891] Wed Oct 23 22:11:00 2013 :: Child process 10443 ended with rc = 0
>>>>>>
>>>>>>
>>>>>> It looks like "Permission denied" errors are seen when disk servers are
>>>>>> overloaded:
>>>>>> [root@lcgse0 ~]# for i in `seq 10`; do ssh -tx root@se`printf %01d $i`
>>>>>> 'grep "Permission denied" /var/log/dpm-gsiftp/gridftp.log|wc -l';done
>>>>>> 29
>>>>>> Connection to se1 closed.
>>>>>> 26
>>>>>> Connection to se2 closed.
>>>>>> 0
>>>>>> Connection to se3 closed.
>>>>>> 0
>>>>>> Connection to se4 closed.
>>>>>> 108
>>>>>> Connection to se5 closed.
>>>>>> 129
>>>>>> Connection to se6 closed.
>>>>>> 62
>>>>>> Connection to se7 closed.
>>>>>> 56
>>>>>> Connection to se8 closed.
>>>>>> 87
>>>>>> Connection to se9 closed.
>>>>>> 11
>>>>>> Connection to se10 closed.
>>>>>>
>>>>>>
>>>>>> Any help/ideas are greatly appreciated.
>>>>>>
>>>>>> Elena
>>>>>>
>>>>>>
>>>>>>
>>>>>> __________________________________________________
>>>>>> Dr Elena Korolkova
>>>>>> Email: [log in to unmask]
>>>>>> Tel.: +44 (0)114 2223553
>>>>>> Fax: +44 (0)114 2223555
>>>>>> Department of Physics and Astronomy
>>>>>> University of Sheffield
>>>>>> Sheffield, S3 7RH, United Kingdom
>>>>>>
>>>>>>
>>>>>>
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>