From Glasgow:
svr018:~# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_rmem = 131072 1048576 2097152
net.ipv4.tcp_wmem = 131072 1048576 2097152
net.ipv4.tcp_mem = 131072 1048576 2097152
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 10000
Which is identical, AFAICS...
I'm still deeply puzzled by the error message "[CONNECTION] failed".
Is a SYN packet reaching your DPM and then failing somewhere down in
TCP stack land?
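
One quick way to check (a sketch only -- the interface name eth0 is a
guess, and the FTS agent host may not be the lcgfts0424 one that shows
up in Brian's log) would be to run tcpdump on the DPM head node:

  tcpdump -n -i eth0 'host lcgfts0424.gridpp.rl.ac.uk and tcp port 8443 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'

If SYNs arrive but no SYN/ACK goes back, the problem is on the DPM box
itself; if nothing arrives at all, it's being dropped upstream.
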
Cheers
Graeme
On 10 Mar 2008, at 20:22, Peter Love wrote:
> Gory details welcome! Then we have something to approach the network
> bods with. I know yaim tweaks sysctl.conf, but the default kernel 2.6
> settings have window scaling enabled, right? Other DPMs, please compare
> with ours:
>
> [root@fal-pygrid-30 ~]# sysctl -p
> net.ipv4.ip_forward = 0
> net.ipv4.conf.default.rp_filter = 1
> net.ipv4.conf.default.accept_source_route = 0
> kernel.sysrq = 0
> kernel.core_uses_pid = 1
> net.ipv4.tcp_rmem = 131072 1048576 2097152
> net.ipv4.tcp_wmem = 131072 1048576 2097152
> net.ipv4.tcp_mem = 131072 1048576 2097152
> net.core.rmem_default = 1048576
> net.core.wmem_default = 1048576
> net.core.rmem_max = 2097152
> net.core.wmem_max = 2097152
> net.ipv4.tcp_dsack = 0
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
> net.core.netdev_max_backlog = 10000
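>
> (Window scaling isn't set in sysctl.conf itself, so it's whatever the
> kernel defaults to -- 1 on 2.6, as far as I know. A quick check, just
> a sketch:
>
>   sysctl net.ipv4.tcp_window_scaling
>
> which should report net.ipv4.tcp_window_scaling = 1 if scaling is on.)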
>
>
> Simon George ([log in to unmask]) wrote:
>> Hi everyone,
>>
>> the RHUL issue is not yet solved, but the investigation by David Smith
>> implied that gridftp packets were being dropped in a perimeter router.
>> We passed it on to our network guys, who have reproduced the problem
>> and are taking it up with the hardware manufacturer.
>>
>> Some technical info: David found that outgoing TCP segments with
>> sequence numbers more than 65536 away from the last successfully
>> transmitted outgoing segment were dropped. The behaviour seems to imply
>> that there is a maximum TCP window size of 65536 bytes, as if window
>> scaling is disabled.
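>>
>> (If anyone wants to check their own link, capturing the handshake of a
>> gridftp data connection and looking for the wscale option should be
>> enough. A sketch only -- it assumes tcpdump on the disk server and
>> 20000-25000 as the gridftp data port range (GLOBUS_TCP_PORT_RANGE):
>>
>>   tcpdump -n -v 'tcp portrange 20000-25000 and tcp[tcpflags] & tcp-syn != 0'
>>
>> If "wscale" appears in both the SYN and the SYN/ACK, scaling was
>> negotiated end to end; if a middlebox strips it, the window is capped
>> at 65536 bytes.)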
>>
>> I can give the full gory details to anyone who is interested, just
>> ask.
>>
>> Cheers,
>> Simon
>>
>> Greig Alan Cowan wrote:
>>> Could it be a networking problem on your end? Are transfers to the
>>> dCache affected?
>>>
>>> Greig
>>>
>>> On 10/03/08 15:12, brian davies wrote:
>>>> So it now appears to be working... some of the time
>>>>
>>>> So Channel parameters are:
>>>> glite-transfer-channel-list -x RALLCG2-UKINORTHGRIDLANCSHEP
>>>> Channel: RALLCG2-UKINORTHGRIDLANCSHEP
>>>> Between: RAL-LCG2 and UKI-NORTHGRID-LANCS-HEP
>>>> State: Active
>>>> Contact: [log in to unmask]
>>>> Bandwidth: 0
>>>> Nominal throughput: 0
>>>> Number of files: 8, streams: 1
>>>> TCP buffer size: default
>>>> Message: Activating alls channel; SRM services restored following
>>>> power failure
>>>> Last modification by: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=derek ross
>>>> Last modification time: 2008-02-08 19:06:25
>>>> Number of VO shares: 6
>>>> VO 'alice' share is: 1 and is limited to 1 transfers
>>>> VO 'atlas' share is: 4 and is limited to 4 transfers
>>>> VO 'cms' share is: 0 and is not limited
>>>> VO 'lhcb' share is: 1 and is limited to 1 transfers
>>>> VO 'ops' share is: 1 and is limited to 1 transfers
>>>> VO 'dteam' share is: 1 and is limited to 1 transfers
>>>>
>>>> Successful srmv1 log for the working job follows; this came from the
>>>> job submitted by:
>>>> glite-transfer-submit
>>>> srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
>>>> srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
>>>>
>>>> 7e4a1081-eeb0-11dc-ba63-8ad671e66de1
>>>> [lcgui0357] /home/csf/daviesbg > glite-transfer-status --verbose -l
>>>> 7e4a1081-eeb0-11dc-ba63-8ad671e66de1
>>>> Request ID: 7e4a1081-eeb0-11dc-ba63-8ad671e66de1
>>>> Status: Finished
>>>> Channel: RALLCG2-UKINORTHGRIDLANCSHEP
>>>> Client DN: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian
>>>> davies
>>>> Reason: <None>
>>>> Submit time: 2008-03-10 14:44:31.475
>>>> Files: 1
>>>> Priority: 3
>>>> VOName: atlas
>>>> Done: 0
>>>> Active: 0
>>>> Pending: 0
>>>> Ready: 0
>>>> Canceled: 0
>>>> Failed: 0
>>>> Finishing: 0
>>>> Finished: 1
>>>> Submitted: 0
>>>> Hold: 0
>>>> Waiting: 0
>>>> Source:
>>>> srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
>>>> Destination:
>>>> srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
>>>> State: Finished
>>>> Retries: 0
>>>> Reason: (null)
>>>> Duration: 13
>>>>
>>>>
>>>>
>>>> 03/10 14:44:37 11841,0 ping: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:37 11841,0 ping: returns 0
>>>> 03/10 14:44:38 11841,0 getFileMetaData: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:38 11841,0 getFileMetaData: SRM98 - getFileMetaData
>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
>>>> 03/10 14:44:38 11841,0 getFileMetaData: returns 12
>>>> 03/10 14:44:38 11841,0 srmv1: SRM02 - soap_serve error : No such file or directory
>>>> 03/10 14:44:38 11841,0 put: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:38 11841,0 put: SRM98 - put 3500 3500
>>>> 03/10 14:44:38 11841,0 put: SRM98 - put 0
>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
>>>> 03/10 14:44:38 11841,0 put: returns 0
>>>> 03/10 14:44:40 11841,0 getRequestStatus: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:40 11841,0 getRequestStatus: SRM98 - getRequestStatus
>>>> 3500
>>>> 03/10 14:44:40 11841,0 getRequestStatus: returns 0
>>>> 03/10 14:44:40 11841,0 setFileStatus: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:40 11841,0 setFileStatus: SRM98 - setFileStatus 3500 0
>>>> Running
>>>> 03/10 14:44:40 11841,0 setFileStatus: returns 0
>>>> 03/10 14:44:40 11841,0 getRequestStatus: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:40 11841,0 getRequestStatus: SRM98 - getRequestStatus
>>>> 3500
>>>> 03/10 14:44:40 11841,0 getRequestStatus: returns 0
>>>> 03/10 14:44:48 11841,0 getRequestStatus: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:48 11841,0 getRequestStatus: SRM98 - getRequestStatus
>>>> 3500
>>>> 03/10 14:44:48 11841,0 getRequestStatus: returns 0
>>>> 03/10 14:44:48 11841,0 setFileStatus: request by
>>>> /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
>>>> lcgfts0424.gridpp.rl.ac.uk
>>>> 03/10 14:44:48 11841,0 setFileStatus: SRM98 - setFileStatus 3500
>>>> 0 Done
>>>> 03/10 14:44:49 11841,0 setFileStatus: returns 0
>>>>
>>>>
>>>>
>>>> However, when I then try running again with:
>>>>
>>>> glite-transfer-submit
>>>> srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
>>>> srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-2
>>>>
>>>> 11828d5e-eeb2-11dc-ba63-8ad671e66de1
>>>>
>>>> I get:
>>>> glite-transfer-status --verbose -l 11828d5e-eeb2-11dc-ba63-8ad671e66de1
>>>> Request ID: 11828d5e-eeb2-11dc-ba63-8ad671e66de1
>>>> Status: Active
>>>> Channel: RALLCG2-UKINORTHGRIDLANCSHEP
>>>> Client DN: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian
>>>> davies
>>>> Reason: <None>
>>>> Submit time: 2008-03-10 14:55:47.967
>>>> Files: 1
>>>> Priority: 3
>>>> VOName: atlas
>>>> Done: 0
>>>> Active: 0
>>>> Pending: 0
>>>> Ready: 0
>>>> Canceled: 0
>>>> Failed: 0
>>>> Finishing: 0
>>>> Finished: 0
>>>> Submitted: 0
>>>> Hold: 0
>>>> Waiting: 1
>>>> Source:
>>>> srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
>>>> Destination:
>>>> srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-2
>>>> State: Waiting
>>>> Retries: 2
>>>> Reason: DESTINATION error during PREPARATION phase:
>>>> [CONNECTION] failed to contact on remote SRM
>>>> [httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1]. Givin' up
>>>> after 3 tries
>>>> Duration: 0
>>>>
>>>> Interestingly, nothing has gone into the srmv1 log for this transfer.
>>>>
>>>> It seems that whatever the problem is, it's intermittent :(
>>>>
>>>> Brian
>>>>
>>>>
>>>> On 10/03/2008, Greig Alan Cowan <[log in to unmask]> wrote:
>>>>> Hi Brian,
>>>>>
>>>>> First of all, FTS doesn't use lcg-cp.
>>>>>
>>>>> Can you find out from Matt Hodges how the FTS channel is
>>>>> configured? I
>>>>> think it will be in urlcopy mode (not srmCopy). Also, can you
>>>>> tell me
>>>>> what the channel parameters are, like number of streams?
>>>>>
>>>>> What are the DPM srmv1 log files saying?
>>>>>
>>>>>
>>>>> Greig
>>>>>
>>>>>
>>>>> On 10/03/08 14:31, brian davies wrote:
>>>>>> So our new DPM is seeing similar issues to RHUL, in that
>>>>>> FTS-controlled transfers from CASTOR to DPM are failing with:
>>>>>>
>>>>>> DESTINATION error during PREPARATION phase: [CONNECTION] failed to
>>>>>> contact on remote SRM [httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1].
>>>>>> Givin' up after 3 tries
>>>>>>
>>>>>> This is only seen in the direction from RAL to Lancs, and only on
>>>>>> FTS-controlled transfers (i.e. lcg-cp works fine from a UI).
>>>>>>
>>>>>> What was the outcome of the RHUL issues, and what steps are done by
>>>>>> FTS before it initiates the lcg-cp? Are we being hit by a srmCopy vs
>>>>>> g-u-c issue?
>>>>>>
>>>>>> Brian
>>>>>