On Wed, 2005-09-21 at 16:18 +0100, David McBride wrote:
> And the upshot is that, assuming that CMS's
> problem is not uncharacteristic of the needs of the other experiments,
> this is a resolvable problem. I'll put the details in a separate email
> shortly.
Hello all,
Well, "shortly" wasn't quite as imminent as I had hoped -- conference
proceedings, a dead laptop hard drive and a house move (!) have
conspired against this email getting out in a timely fashion.
To recap: the root problem that we are trying to solve is that the LCG
software distribution doesn't provide all of the functionality that the
experiments require to get their work done. The experiments know this,
and the LCG developers should know this, but in any case the core LCG
distribution will not be adapted quickly enough to add the missing
functionality before various important deadlines expire.
The experiments have, in some cases, implemented their own services to
make up for the deficiencies in LCG. CMS, for example, has implemented
PhEDEx[1][2], which fills the gaps in LCG's data staging and
replica management capabilities. (As I've only had a chance to talk to
Tim Barrass from CMS, I'm not familiar with the other experiments'
equivalent issues.)
However, as these solutions are not part of the standard LCG
distribution they will not be deployed at the sites where the
experiments most need them. Thus, they are looking to negotiate with
individual sites to run these "extra" services to make up for the
shortfall.
Somewhere along the line the idea of a "VOBOX" appeared -- a separate,
dedicated machine to support VO-specific services such as the PhEDEx
service above. However, the suggested nature of how the VOBOX would be
installed and maintained triggered complaints from systems
administrators that (as presented) VOBOXes would not be a safe,
scalable, or secure solution to the specific- and general-case
problems they were purported to resolve -- and as such, they would
refuse to run them. I did (and still do) fall into this category.
There has, however, been a breakdown in accurate, timely communications
between the sites and the experiments. From speaking with Tim Barrass
(one of the PhEDEx developers) it seems clear that the reality of the
situation is quite different; the experiments appear to be quite happy
(indeed, eager) to negotiate some appropriate policy with each site on an
individual basis so that they can get the extra functionality they need.
Sites will ask for various constraints to be met, but from what I see
the experiments would prefer that the practical concerns raised by the
site administrators are properly resolved now so that they can avoid
serious operational issues further down the line.
Assuming that this positive and co-operative attitude demonstrated by
Tim at AHM is representative of all of the experiments, and the
operational requirements of each experiment are similarly modest, then I
think that their operational concerns _can_ be addressed quite easily to
the satisfaction of all concerned.
In the meantime, someone needs to be lighting a fire under the
management responsible for LCG software development -- we should _never_
have gotten into this state in the first place. (You know you're doing
something wrong when your end-users are having to invest time and energy
to build their own services that don't suffer from your shortcomings.)
It is vital that the development process takes account of the
experiments' operational requirements and verifies that they are being
met as they evolve and are refined throughout the lifetime of the
project. It is clear that this isn't happening -- and that needs to
change, sharpish.
Cheers,
David
[1] http://cms-project-phedex.web.cern.ch/cms-project-phedex/
[2] http://www.nesc.ac.uk/talks/ahm2005/337.ppt
--
David McBride <[log in to unmask]>
Department of Computing, Imperial College, London