Hi,
Firstly, I would like to apologise for sending such pleading emails
about this and sincerely thank the people who are trying to help.
Ok, so I finally got a chance to look at this again today (just what I
need on a Friday afternoon). I had several off-list replies that
consistently echoed Stephen and said that the problem was that the QMUL
gridftp server was unable to contact the RAL source. Alex claims there was
no firewall at QMUL, so does anybody have any ideas as to what could have
been blocking this?
I say "could have been" because today I see different behaviour.
Still trying to transfer
Source:
srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/cms/phedex_loadtest/LoadTest_T1_RAL_020
Destination:
srm://se01.esc.qmul.ac.uk:8443/srm/managerv1?SFN=/dpm/esc.qmul.ac.uk/home/cms//LoadTest/DJC_test_2
Initially, it seemed to start off OK and an srm-get-metadata showed the
file to exist, with zero bytes. I thought "Progress!", but then it
failed with a timeout suggesting a network error. Thereafter, the
subsequent retries gave
State: Failed
Retries: 4
Reason: Getting filesize failed. a system call failed
(Connection refused)
Duration: 0
So my two questions are:
1. Does anybody know of anything that might have changed overnight?
2. What do the current errors/behaviour mean?
Next tries...
OK so then I tried transferring a ~40MB file from the Imperial dCache
installation to QMUL using FTS ... worked like a dream. As did a 120MB
file from Imperial (so quickly, in fact, that I assumed there was an
error... however srm-get-metadata tells me otherwise)... As did a ~2GB
file.
The ganglia network plots show that the transfers happened at about
20MB/s, not massive but OK.
So the situation that I am in is that I can transfer files happily from
Imperial to QMUL but not a sausage is going between RAL and QMUL.
What can be causing this? Could it be the different channels or what?
How do I tell if it is the RAL end or the QMUL end or a mismatch between
the two?
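One rough way to start narrowing this down might be a plain TCP connect
probe, run once from a QMUL node and once from a RAL node, to see which
side answers "refused" and which just times out. The sketch below is
only that, a sketch: the hostnames are lifted from the SRM URLs above,
port 2811 is the default gridftp control port, and the real gridftp
doors may well sit on other pool nodes, so the ENDPOINTS list is an
assumption to be edited. The probe() helper is a made-up name, not part
of any grid tool.

```python
import socket

# ASSUMPTION: hosts come from the SRM URLs in this mail; the actual
# gridftp doors may be on different pool nodes at either site.
ENDPOINTS = [
    ("dcache.gridpp.rl.ac.uk", 8443),  # RAL SRM port from the source URL
    ("dcache.gridpp.rl.ac.uk", 2811),  # default gridftp control port
    ("se01.esc.qmul.ac.uk", 8443),     # QMUL SRM port from the destination URL
    ("se01.esc.qmul.ac.uk", 2811),     # default gridftp control port
]


def probe(host, port, timeout=5.0):
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered with a reset: host reachable, nothing listening
        # (or a firewall configured to reject rather than drop).
        return "refused"
    except socket.timeout:
        # No answer at all: typical of a silently dropping firewall or a
        # routing problem.
        return "timeout"
    except socket.gaierror as exc:
        return "dns error: %s" % exc
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    for host, port in ENDPOINTS:
        print("%s:%d -> %s" % (host, port, probe(host, port)))
```

This only exercises the control ports. In a third-party transfer the
data channel runs directly between the two gridftp servers on whatever
port range they are configured with (GLOBUS_TCP_PORT_RANGE, if set), so
a clean result here would not rule out a block on the data ports.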
Never in all my years of HEP computing have I found anything so
frustrating to pin down and debug. It really seems as though there are
little packet gremlins sitting on the network connections or perhaps
ghosts in the disk servers ... maybe I have been working too hard ;-)
Anyway, any glimmer of understanding that anybody can throw our way
would be very useful. We are trying hard to make this site available for
the SC (currently for CMS but hopefully Atlas as well soon) and, with
QMUL being a large site, making this work really is important.
Thanks again to those people who are trying to help ...
All the best,
david