Hello Stephen,
On 5 Oct 2005, at 18:50, Burke, S (Stephen) wrote:
> Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Tim Barrass said:
>> Taking the example of PhEDEx, our local CMS guys take time fiddling
>> with small glue scripts that allow PhEDEx to access whatever local
>> combination of storage tech and transfer tool exists. We find that
>> there is significant variation in site setups, which means that we
>> can't simply provide 2 or 3 out-of-the-box solutions.
>>
>> I just raise this as it might be a complicating issue for easy
>> installation of services.
>
> Just getting back to this, I find this reply slightly worrying. Firstly
> it seems to conflict with the idea of "no backdoor access" for VO boxes
> - when you say "to access whatever local combination of storage tech
> and transfer tool exist" does this imply that you aren't just using the
> standard SRM and gridftp interfaces? The whole point of having those
> interfaces is that you shouldn't need any tweaking ...
I'm glad you find it worrying. The reality of working within an
experiment is that we still need to do an awful lot of work to hold
things together enough to get useful work done.
To reassure you though, we are using the standard interfaces and tools.
Typically however we find we have to bolster the functionality of these
because they either don't work, or lack features
(srm-advisory-delete!!!), or are unreliable.
I should point out that we would like to regard what I'm about to
describe as transitional.
A good example is the transfer tools: globus-url-copy, srmcp and the lcg
tools. We found that these tools exhibited behaviour that crippled our
attempts to run large-scale transfers: they return unhelpful errors;
they return success when they have actually failed; they wait for a
timeout before telling you they can't overwrite an existing file. We
therefore found that, in order to cope with millions of transfers a day,
we had to pre-delete the file (rm..); transfer; verify (ls/posix check,
cksum) (this is the basic workflow of the PhEDEx transfer agents, btw).
Ideally the tools would handle a lot of this for us.
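To make the pre-delete / transfer / verify cycle concrete, here is a
minimal sketch in Python. It is only an illustration, not the actual
PhEDEx agent code: robust_transfer, checksum and the copy argument are
hypothetical names, plain local file operations stand in for the real
rm / transfer-tool / ls calls, and SHA-256 stands in for cksum.

```python
import hashlib
import os
import shutil


def checksum(path):
    """Return a hex digest of the file, read in 1 MiB chunks.

    SHA-256 is used here purely as a stand-in for cksum.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def robust_transfer(source, dest, copy=shutil.copyfile, retries=3):
    """Pre-delete, transfer, verify. Returns the attempt number that worked.

    `copy` is a hypothetical hook standing in for whatever transfer tool
    is used; its return value is deliberately ignored, because the tool
    may report success when it has actually failed.
    """
    expected_size = os.path.getsize(source)
    expected_sum = checksum(source)
    for attempt in range(1, retries + 1):
        # 1. Pre-delete, so the tool never stalls on an existing file.
        if os.path.exists(dest):
            os.remove(dest)
        # 2. Transfer; an exception simply counts as a failed attempt.
        try:
            copy(source, dest)
        except Exception:
            continue
        # 3. Verify independently: existence, size, then checksum.
        if (os.path.exists(dest)
                and os.path.getsize(dest) == expected_size
                and checksum(dest) == expected_sum):
            return attempt
    raise RuntimeError(
        f"transfer of {source} failed after {retries} attempts")
```

The point of the design is that the verification step, not the tool's
own exit status, decides whether a transfer counts as done.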
Another example is the various flavours of SRM. Castor SRM, for
instance, handles tape stages etc VERY poorly, so we have to go in
behind the scenes and pre-stage files, check that they've migrated (we
don't assume they have just because they make it to stage disk at the
moment) etc. Yes, SRM should handle all the magic sorting-by-tape
efficiently etc for you. In fact, I intend to let the RAL SRM do just
that; we do have that option.
Let me emphasise that we've developed very strong relationships with
other developers. We really want to see a lot of the stuff we do as
transitional, and hope to replace chunks of our system when suitable
replacements become production-ready. We've contributed a lot of
experience to FTS, dCache, Castor, srmcp, and are beginning to see the
fruits of that now.
PhEDEx may well be the worst offender of all the CMS services in this
regard, but this is because it plugs a huge gap in the LCG suite of
solutions which is only now getting partially filled.
> Secondly, this seems to imply a large amount of time for someone to
> set this up - are you really happy to do that at maybe 200 separate
> sites, even if you do have access? And maintain it all when problems
> occur? This seems to be more like a traditional model where people had
> more-or-less direct access at a handful of major computer centres, each
> with its own customised system.
It's really *not our ideal model* of how this should be working. It is
hard, although not typically as hard as everyone having to code
something unique for their site, and in some cases we can give up some
robustness when we need to handle things remotely (MC creation on your
average LCG site, for example). It's a large drain on our resources,
which is why we spend so much time providing feedback on what we
actually need, and get fed up in meetings when someone says everything
is fine.
Currently, the wonderful variety of technologies used in even
the main LCG sites means that we need to deal with lots of local
variation (e.g. RAL would like to use gsidcap file access rather than
dcap; no-one else does, except Imperial. This is entirely fair enough,
a good, justifiable decision. But because it's the only site doing it
we come across problems that no-one else has seen).
We want to devolve as much responsibility as possible. Sorry if I've
rambled a bit, but hopefully that helps? If not, feel free to ask more
questions.
Thanks,
Tim