Hi,
On Thu, Jun 01, 2006 at 07:02:48PM +0100, Greig A Cowan wrote:
> > I would like to understand what was the outcome of the tests you did
> > with Mona concerning the fts/dcache problem. I remember that the last
> > test was to set fts to use srm rather than gsiftp third party copy. Was
> > the iowait ok at that time ?
>
> Correct, we configured the STAR-IC FTS channel to use srmCopy rather than
> 3rd party urlcopy. We observed a few things during the tests:
>
> 1. With FTS in urlcopy mode, a 50GB transfer from Edinburgh to IC gave a
> rate of 138Mb/s. With srmCopy mode, the rate was 188Mb/s so a significant
> boost in transfer rate was achieved.
Did anyone check that srmCopy worked as expected? We still don't know
whether the problem is caused by whatever FTS uses by default to copy data
or by something else inside FTS. Do you know if Mona collected iostat
output and other useful traces during that time?
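If no iostat logs were kept, something like the small sampler below could be
run alongside future tests. This is just a minimal sketch of my own (the
helper names are mine, not from any tool we use); it assumes the standard
Linux /proc/stat layout, where field 5 of the aggregate "cpu" line is the
cumulative iowait tick count:

```python
#!/usr/bin/env python
# Sketch: measure the fraction of CPU time spent in iowait between two
# snapshots of the aggregate "cpu" line from /proc/stat (Linux).
# Line layout: cpu user nice system idle iowait irq softirq ...

def iowait_fraction(before, after):
    """Fraction of elapsed CPU time spent in iowait between two snapshots."""
    b = [int(x) for x in before.split()[1:]]
    a = [int(x) for x in after.split()[1:]]
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    # deltas[4] is the iowait field; guard against a zero-length interval
    return deltas[4] / float(total) if total else 0.0

def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (the aggregate 'cpu' entry)."""
    with open(path) as f:
        return f.readline()
```

Calling read_cpu_line() twice with a sleep in between and logging
iowait_fraction() with a timestamp would give a trace we could line up
against the transfer logs.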
188Mb/s seems a bit slow to me; how many concurrent files were you
transferring at a time? For PhEDEx (srmcp) we get the best results
with >8 files.
The fact that the speed was higher doesn't necessarily mean that everything
was OK: the network path is shared with other people, after all, and dCache
could have been under load from other users as well. (In fact the CMS people
complained that their transfers were slow during that time, so their
transfers did affect the measurements.)
Remember also that the biggest problem is when FTS is used between
dCache hosts, not between DPM and dCache. Did you test from a dCache
site as well?
> 2. The inter-node traffic was significantly reduced when using srmCopy
> compared to urlcopy. This is because the data is transferred directly to
> the pool that it will be stored on, rather than being routed to a pool via
> a gridFTP door.
I think that you are confused here: there is no direct copy with srmCopy
or urlcopy. You always talk to a gridFTP door (the one that the SRM returned)
to upload the files. I have no idea how the SRM decides which door to use, but
I suspect that it uses the same calculations as dCache uses to decide which
pool it will use. After you connect, dCache calculates *again* which pool
to use and, depending on the load, free space, etc., it might not be the local
one. From tests it seems that as long as the load is low and doesn't change
fast, the SRM and dCache reach the same decision, so you get a "local" copy.
When the pools are under load it is a lot more likely that you'll end up
with a different pool than the door. Tuning the transfer costs in dCache
will probably help there, but I never had the time to play with them and
the dCache documentation is almost non-existent :(
I've seen this when PhEDEx hits our disks hard (>400Mb/sec, >10 files)
with srmCopy, and we *do* get inter-node traffic at that time; srmCopy doesn't
give you direct transfers and it doesn't reduce inter-node traffic. You can
say that FTS with the default copy method increases inter-node traffic
because it causes high load, but that is a different matter...
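By "transfer costs" I mean the PoolManager cost factors that weight free
space against load when dCache picks a pool. Something like the following in
the PoolManager admin cell is what I had in mind, though I'm writing the
syntax from memory and it may differ between dCache versions, so treat it as
an assumption rather than a recipe:

```
# dCache admin shell, PoolManager cell (syntax from memory, version-dependent)
(PoolManager) set pool decision -spacecostfactor=1.0 -cpucostfactor=0.5
(PoolManager) save
```

Lowering the CPU cost factor relative to the space cost factor should, in
theory, make a loaded local pool less likely to be passed over for a remote
one, but I haven't verified this on our setup.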
> 3. Your disk servers were still showing iowait. Mona grabbed some ganglia
> screenshots taken during the transfers and I have attached them to this
> mail.
>
> * ED-dCache-DPM-IC-dCache-urlcopy.jpg
No files are attached :( I would really like to have a look at them, since
I was away on holiday during that time.
Cheers,
Kostas