Hi all,
Unfortunately, I missed the first part of the meeting today, where
some of the questions/comments below may already have been discussed,
but it probably doesn't hurt to move the discussion to the list too.
(Some of the questions may already have been discussed in the past
when I wasn't here. If there is any documentation on the different
tests and their results, I would appreciate it if someone could point
me to it.)
If I understood correctly, the issue with a single endpoint for data
access would be site accounting and also network I/O. Is a single
endpoint for data access (as is easily done in xrootd) even possible
for DPM? If so, wouldn't it be possible to keep track of outgoing
transfers from the storage elements and use those as accounting
metrics for sites?
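As a rough illustration of that accounting idea, here is a minimal sketch that sums outgoing bytes per site from transfer records. The "site,bytes" record format and the site names are made-up example values for illustration, not any actual DPM or xrootd log layout:

```python
from collections import defaultdict

def bytes_out_per_site(log_lines):
    """Sum outgoing transfer volume per site from simple
    'site,bytes' records (a hypothetical log format)."""
    totals = defaultdict(int)
    for line in log_lines:
        site, nbytes = line.strip().split(",")
        totals[site] += int(nbytes)
    return dict(totals)

# Example records: two transfers from one site, one from another
log = ["UKI-SCOTGRID-ECDF,1048576",
       "UKI-SCOTGRID-GLASGOW,524288",
       "UKI-SCOTGRID-ECDF,2097152"]
print(bytes_out_per_site(log))
```

In practice the per-transfer records would have to come from the storage elements themselves, which is exactly the open question above.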
For data access over the WAN and network I/O: was this already tested
in some configuration within GridPP (by us, not by VOs)? If there is a
need for some testing of it, I would be interested in doing this.
For VOs not using ROOT files, it could still be interesting to access
the data over the network instead of copying it to the local worker
node: the required transfer rate is much lower when the file is read
over the network, since the data is processed as it arrives, which is
not the case when the data is first transferred and only then
processed. For ROOT files it could be even more efficient. (For BaBar
I found previously that when reading over the network and processing
the events as they were read, the required transfer rate dropped to
about 50 kB/s per job, sustained over the runtime of the job.) That is
of course different for non-ROOT files and for random access to the
file. However, if the file is accessed only once anyway, there
shouldn't be much difference between copying + processing locally and
processing while reading over the network, should there? Caching would
probably be most effective for something like conditions files, which
need to be read by all jobs no matter which data is processed later in
the job.
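To put numbers on the streaming argument, a back-of-the-envelope sketch: the 50 kB/s figure is the BaBar observation above, while the job count, file size, and copy time are made-up example values. The point is the shape of the load, sustained trickle versus copy-phase burst:

```python
def aggregate_stream_rate(n_jobs, per_job_rate_kbs):
    """Total sustained site bandwidth (kB/s) if every job streams its input."""
    return n_jobs * per_job_rate_kbs

def copy_rate(file_size_mb, copy_time_s):
    """Average rate (kB/s) needed to copy a whole file in copy_time_s seconds."""
    return file_size_mb * 1024 / copy_time_s

# 1000 jobs streaming at the ~50 kB/s observed for BaBar:
print(aggregate_stream_rate(1000, 50))  # sustained, spread over the job runtime
# One job copying a (hypothetical) 2 GB file in 5 minutes instead:
print(copy_rate(2048, 300))             # concentrated in the copy phase
```

The same total volume moves over the network either way; streaming just spreads it over the job runtime instead of concentrating it at job start, which is what the thousands-of-jobs measurement below would need to confirm.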
Anyway, we would need to test and measure the effect on local storage
and bandwidth when thousands of jobs try to stream or access data
remotely, for the different types of files (ROOT vs non-ROOT). I am
not sure whether this is already ongoing or was done in GridPP before.
If there are already any results on this, I would appreciate it if
someone could send me a link.
Just a thought about how to make data access easier for new, non-HEP
VOs: was gfalfs or xrootdfs ever tested within GridPP for accessing
the data? With these, VOs wouldn't need to worry about how to transfer
the data to the local node; they could simply mount the storage within
their job, access it as if it were local, and let the underlying
protocol do the actual data transfer (this could be especially
interesting if a single endpoint for accessing data across all sites
were possible).
If there is no negative experience with using them (especially for
read-only access), I would like to test whether this could be helpful
for LSST usage.
Any thoughts on all the above would be welcome!
Cheers,
Marcus
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.