Just to close this thread: the problem is a routing issue at RAL whereby
outgoing packets went via the UKLight router onto SJ5 and thereby bypassed
the RAL firewall. Packets on SJ5 from lancs->ral hit the firewall and
were dropped because the f/w hadn't seen the initial connection.
Switching back to the lightpath will avoid this, although we still need to
fix the SJ5 'backup' route with the RAL networkers.
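
For anyone trying to picture the failure, here is a minimal sketch of the
asymmetric-route case. It assumes the firewall does simple connection
tracking, which is a guess: the actual RAL firewall rules are not described
in this thread. The class, function and host names are hypothetical and
purely illustrative.

```python
class StatefulFirewall:
    """Toy connection-tracking firewall (illustrative only).

    Inbound packets are accepted only if a matching outbound packet
    was seen first.
    """

    def __init__(self):
        # Flows observed leaving the site, stored as (inside, outside).
        self.seen = set()

    def outbound(self, src, dst):
        # Record the flow so that replies can be matched later.
        self.seen.add((src, dst))
        return True

    def inbound(self, src, dst):
        # Accept a reply only if the reverse flow was recorded.
        return (dst, src) in self.seen


def reply_accepted(firewall, outbound_via_firewall):
    """Return True if the reply to a connection opened from inside gets in."""
    inside, outside = "ral", "lancs"
    if outbound_via_firewall:
        firewall.outbound(inside, outside)
    # Otherwise the outgoing packet took a route that bypasses the
    # firewall, so no state is recorded for the flow.
    return firewall.inbound(outside, inside)


symmetric = reply_accepted(StatefulFirewall(), outbound_via_firewall=True)
asymmetric = reply_accepted(StatefulFirewall(), outbound_via_firewall=False)
print("reply accepted on symmetric path: ", symmetric)
print("reply accepted on asymmetric path:", asymmetric)
```

In this toy model the reply is dropped only when the outgoing packet
bypasses the firewall, which matches the behaviour described above.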
Thanks for the help,
Peter
Peter Love ([log in to unmask]) wrote:
> Gory details welcome! Then we have something to approach the network
> bods with. I know yaim tweaks sysctl.conf but the default kernel 2.6
> settings have window scaling enabled, right? Other DPMs please compare with ours:
>
> [root@fal-pygrid-30 ~]# sysctl -p
> net.ipv4.ip_forward = 0
> net.ipv4.conf.default.rp_filter = 1
> net.ipv4.conf.default.accept_source_route = 0
> kernel.sysrq = 0
> kernel.core_uses_pid = 1
> net.ipv4.tcp_rmem = 131072 1048576 2097152
> net.ipv4.tcp_wmem = 131072 1048576 2097152
> net.ipv4.tcp_mem = 131072 1048576 2097152
> net.core.rmem_default = 1048576
> net.core.wmem_default = 1048576
> net.core.rmem_max = 2097152
> net.core.wmem_max = 2097152
> net.ipv4.tcp_dsack = 0
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
> net.core.netdev_max_backlog = 10000
>
>
> Simon George ([log in to unmask]) wrote:
> > Hi everyone,
> >
> > the RHUL issue is not yet solved, but the investigation by David Smith
> > implied that gridftp packets were being dropped in a perimeter router.
> > We passed it on to our network guys who have reproduced the problem and
> > are taking it up with the hardware manufacturer.
> >
> > Some technical info: David found that outgoing TCP segments with
> > sequence numbers more than 65536 away from the last successfully
> > transmitted outgoing segment were dropped. The behaviour seems to imply
> > that there is a maximum TCP window size of 65536 bytes, as if window
> > scaling is disabled.
> >
> > I can give the full gory details to anyone who is interested, just ask.
> >
> > Cheers,
> > Simon
> >
> > Greig Alan Cowan wrote:
> > >Could it be a networking problem on your end? Are transfers to the
> > >dCache affected?
> > >
> > >Greig
> > >
> > >On 10/03/08 15:12, brian davies wrote:
> > >>So it now appears to be working... some of the time
> > >>
> > >>So Channel parameters are:
> > >>glite-transfer-channel-list -x RALLCG2-UKINORTHGRIDLANCSHEP
> > >>Channel: RALLCG2-UKINORTHGRIDLANCSHEP
> > >>Between: RAL-LCG2 and UKI-NORTHGRID-LANCS-HEP
> > >>State: Active
> > >>Contact: [log in to unmask]
> > >>Bandwidth: 0
> > >>Nominal throughput: 0
> > >>Number of files: 8, streams: 1
> > >>TCP buffer size: default
> > >>Message: Activating alls channel; SRM services restored following
> > >>power failure
> > >>Last modification by: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=derek ross
> > >>Last modification time: 2008-02-08 19:06:25
> > >>Number of VO shares: 6
> > >>VO 'alice' share is: 1 and is limited to 1 transfers
> > >>VO 'atlas' share is: 4 and is limited to 4 transfers
> > >>VO 'cms' share is: 0 and is not limited
> > >>VO 'lhcb' share is: 1 and is limited to 1 transfers
> > >>VO 'ops' share is: 1 and is limited to 1 transfers
> > >>VO 'dteam' share is: 1 and is limited to 1 transfers
> > >>
> > >>Successful srmv1 log for a working job follows. This came from the job
> > >>submitted by:
> > >>glite-transfer-submit
> > >>srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
> > >>
> > >>srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
> > >>
> > >>7e4a1081-eeb0-11dc-ba63-8ad671e66de1
> > >>[lcgui0357] /home/csf/daviesbg > glite-transfer-status --verbose -l
> > >>7e4a1081-eeb0-11dc-ba63-8ad671e66de1
> > >>Request ID: 7e4a1081-eeb0-11dc-ba63-8ad671e66de1
> > >>Status: Finished
> > >>Channel: RALLCG2-UKINORTHGRIDLANCSHEP
> > >>Client DN: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies
> > >>Reason: <None>
> > >>Submit time: 2008-03-10 14:44:31.475
> > >>Files: 1
> > >>Priority: 3
> > >>VOName: atlas
> > >> Done: 0
> > >> Active: 0
> > >> Pending: 0
> > >> Ready: 0
> > >> Canceled: 0
> > >> Failed: 0
> > >> Finishing: 0
> > >> Finished: 1
> > >> Submitted: 0
> > >> Hold: 0
> > >> Waiting: 0
> > >> Source:
> > >>srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
> > >>
> > >> Destination:
> > >>srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
> > >>
> > >> State: Finished
> > >> Retries: 0
> > >> Reason: (null)
> > >> Duration: 13
> > >>
> > >>
> > >>
> > >>03/10 14:44:37 11841,0 ping: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:37 11841,0 ping: returns 0
> > >>03/10 14:44:38 11841,0 getFileMetaData: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:38 11841,0 getFileMetaData: SRM98 - getFileMetaData
> > >>srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
> > >>03/10 14:44:38 11841,0 getFileMetaData: returns 12
> > >>03/10 14:44:38 11841,0 srmv1: SRM02 - soap_serve error : No such file
> > >>or directory
> > >>03/10 14:44:38 11841,0 put: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:38 11841,0 put: SRM98 - put 3500 3500
> > >>03/10 14:44:38 11841,0 put: SRM98 - put 0
> > >>srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-1
> > >>
> > >>03/10 14:44:38 11841,0 put: returns 0
> > >>03/10 14:44:40 11841,0 getRequestStatus: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:40 11841,0 getRequestStatus: SRM98 - getRequestStatus 3500
> > >>03/10 14:44:40 11841,0 getRequestStatus: returns 0
> > >>03/10 14:44:40 11841,0 setFileStatus: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:40 11841,0 setFileStatus: SRM98 - setFileStatus 3500 0
> > >>Running
> > >>03/10 14:44:40 11841,0 setFileStatus: returns 0
> > >>03/10 14:44:40 11841,0 getRequestStatus: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:40 11841,0 getRequestStatus: SRM98 - getRequestStatus 3500
> > >>03/10 14:44:40 11841,0 getRequestStatus: returns 0
> > >>03/10 14:44:48 11841,0 getRequestStatus: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:48 11841,0 getRequestStatus: SRM98 - getRequestStatus 3500
> > >>03/10 14:44:48 11841,0 getRequestStatus: returns 0
> > >>03/10 14:44:48 11841,0 setFileStatus: request by
> > >>/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies from
> > >>lcgfts0424.gridpp.rl.ac.uk
> > >>03/10 14:44:48 11841,0 setFileStatus: SRM98 - setFileStatus 3500 0 Done
> > >>03/10 14:44:49 11841,0 setFileStatus: returns 0
> > >>
> > >>
> > >>
> > >>However, when I then try running again with
> > >>
> > >>glite-transfer-submit
> > >>srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
> > >>
> > >>srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-2
> > >>
> > >>11828d5e-eeb2-11dc-ba63-8ad671e66de1
> > >>
> > >>I get:
> > >>glite-transfer-status --verbose -l
> > >>11828d5e-eeb2-11dc-ba63-8ad671e66de1
> > >>Request ID: 11828d5e-eeb2-11dc-ba63-8ad671e66de1
> > >>Status: Active
> > >>Channel: RALLCG2-UKINORTHGRIDLANCSHEP
> > >>Client DN: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=brian davies
> > >>Reason: <None>
> > >>Submit time: 2008-03-10 14:55:47.967
> > >>Files: 1
> > >>Priority: 3
> > >>VOName: atlas
> > >> Done: 0
> > >> Active: 0
> > >> Pending: 0
> > >> Ready: 0
> > >> Canceled: 0
> > >> Failed: 0
> > >> Finishing: 0
> > >> Finished: 0
> > >> Submitted: 0
> > >> Hold: 0
> > >> Waiting: 1
> > >> Source:
> > >>srm://ralsrmc.rl.ac.uk:8443/castor/ads.rl.ac.uk/prod/atlas/stripInput/bgedtesting/sourcefiles/bged-sample-evgen-event-062k-01
> > >>
> > >> Destination:
> > >>srm://fal-pygrid-30.lancs.ac.uk:8443/dpm/lancs.ac.uk/home/atlas/bged/mcdisk/ftstestaadvark-2
> > >>
> > >> State: Waiting
> > >> Retries: 2
> > >> Reason: DESTINATION error during PREPARATION phase:
> > >>[CONNECTION] failed to contact on remote SRM
> > >>[httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1]. Givin' up
> > >>after 3 tries
> > >> Duration: 0
> > >>
> > >>Interestingly, nothing has gone into the srmv1 log for this transfer.
> > >>
> > >>It seems that whatever the problem is, it is intermittent :(
> > >>
> > >>Brian
> > >>
> > >>
> > >>On 10/03/2008, Greig Alan Cowan <[log in to unmask]> wrote:
> > >>>Hi Brian,
> > >>>
> > >>> First of all, FTS doesn't use lcg-cp.
> > >>>
> > >>> Can you find out from Matt Hodges how the FTS channel is configured? I
> > >>> think it will be in urlcopy mode (not srmCopy). Also, can you tell me
> > >>> what the channel parameters are, like number of streams?
> > >>>
> > >>> What are the DPM srmv1 log files saying?
> > >>>
> > >>>
> > >>> Greig
> > >>>
> > >>>
> > >>> On 10/03/08 14:31, brian davies wrote:
> > >>> > So our new DPM is seeing similar issues to RHUL, in that FTS
> > >>> > controlled transfers from CASTOR to DPM are failing with
> > >>> >
> > >>> > DESTINATION error during PREPARATION phase: [CONNECTION] failed to
> > >>> > contact on remote SRM
> > >>> > [httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1]. Givin' up
> > >>> > after 3 tries
> > >>> >
> > >>> > This is only seen in the direction from RAL to LANCS and only on FTS
> > >>> > controlled transfers (i.e. lcg-cp works fine from a UI).
> > >>> >
> > >>> > What was the outcome of the RHUL issues, and what steps are done by
> > >>> > FTS before it initiates the lcg-cp? Are we being hit by a srmCopy vs
> > >>> > g-u-c issue?
> > >>> >
> > >>> > Brian
> > >>>