Hi,
the problem of what is available on a site, or what RHXX or SLXX
actually means, has been with us from the first days of EDG.
The initial approach there was to ask the few VOs that had been active
what they wanted. The result was a minimalist node: almost all the
dependencies that the OS had to satisfy were driven by the middleware
and the experiment software, which meant that all the systems needed
permanent updates to satisfy the requirements of the VOs as they became
active or changed their software.
After some very emotional meetings (WE WANT THE WHOLE REDHAT!!!!!), EDG
and later the early LCG took a more pragmatic approach. A reference
system was defined that was used during integration and testing and
that was forced on all the sites. This was not without conflicts with
the experiments, but the situation was at least somewhat predictable
for them. The sites, of course, could hardly accept a world with all
the systems running identical versions. This collided with local users'
requirements and will collide with the reality of multiple (read: MANY)
VOs.

During the 2004 data challenges the experiments (at least some of them)
became very pragmatic. They started to ship almost everything they
needed with their software. This is not always efficient, but it
certainly gives predictable results. If you want, they did a kind of
poor man's user-mode virtualization of the resources.

If one looks a bit harder at this, it becomes clearer why at least this
kind of control is needed, even for a single VO in a real production
environment. A typical use case is that inside a collaboration (a VO,
for the non-HEP readers) the researchers can't all switch to new
versions of their analysis code at the same time. Until a paper or
thesis is finished it can be very confusing to switch.
This means that at any given time the VO will require several lists of
library versions to be present on the WNs. Multiply this by the number
of VOs that sites already allow to use their resources and it becomes
clear that even publishing the list of versions (ignoring for a while
the security implications) is a nightmare; managing a site like this is
just far too time consuming.

In my view the sites should only have to install a minimal set of
software. (In an ideal world gLite would have only trivial
dependencies.)
The VOs then distribute, independently from the application software,
releases of their preferred environment. These environments should be
tagged and the tag names can be published. On sites with a strong
affiliation with a VO these environments can be added to the WNs
directly, but they can also be installed in the same way as application
software.
The VO publishes a compatibility matrix between their different
environment versions and the different releases of their software.
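
To make this concrete, such a matrix could look like the following (the
tag and release names are invented, just for illustration):

   environment tag       compatible software releases
   VO-myvo-ENV-1.0       analysis-3.1, analysis-3.2
   VO-myvo-ENV-1.1       analysis-3.2, analysis-4.0
   VO-myvo-ENV-2.0       analysis-4.0, analysis-4.1

A job that needs analysis-3.2 could then run on any WN that provides
either ENV-1.0 or ENV-1.1.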

What is important for this lightweight virtualization to work:
We have to improve the ease with which the VOs' software managers can
distribute their software to the sites. This includes the packaging of
"environments". These packaged environments have to be made available
to interested sites to allow them to install the software locally.
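
As a sketch (nothing here is fixed, the layout is only meant to
illustrate the idea), such a packaged environment could be a versioned
tarball with a well-known entry point:

   VO-myvo-ENV-1.1.tar.gz
      lib/        the shared libraries the environment provides
      bin/        helper tools
      setup.sh    sets PATH, LD_LIBRARY_PATH, etc. for this environment

The site (or the VO's software manager) unpacks it in the shared
software area and the corresponding tag gets published once a probe job
has verified the installation.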

We will need sufficient space to install all the versions in the shared
space/locally.

A mechanism has to be put in place to select the wanted versions of
environments and software in the JDL. This has to be used not only in
the matchmaking process; in addition it has to control the correct
setup of the environment variables when the job starts.
A system with quite similar functionality, but for the selection of
different middleware releases/flavors, will be part of LCG-2-4-0. It
should be possible to extend this for use by the VOs.
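
For the matchmaking part this could look roughly like the following JDL
fragment (the tag name is invented; the attribute is the Glue
RuntimeEnvironment field that such tags are published in):

   Executable   = "run_analysis.sh";
   Requirements = Member("VO-myvo-ENV-1.1",
      other.GlueHostApplicationSoftwareRunTimeEnvironment);

The other half, setting up the environment variables on the WN once the
job has been matched, is the part that still has to be provided.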

I am convinced that in the end we have to go even one step further and
provide some virtual machine concept, or the computing culture on the
grid has to go back to where it was in the golden days of F77, when the
applications had almost no individual dependencies...
But this is not a practical solution for the near future (despite the
fact that there are already some systems available, CHOS to name one).

As for running on completely different platforms (I mean as different
as IRIX, Windows, Mac OS X, etc.): certainly possible to handle, but
not as burning an issue as the handling of different Linux
distributions.


      markus


On Mar 1, 2005, at 4:22 PM, Laurence wrote:

> Hi,
>
> We touched upon this issue during a recent Glue Schema discussion.
>
> The common consensus is that the tag published by the information
> system
> should be defined to be the output of a command such as the one
> Steve T suggested: "/usr/bin/lsb_release -d".
> The value published in the information system should not be used to try
> and work out if your application can run at a site.  With most VOs, the
> software manager will install the VO software on the site and publish a
> tag in the RuntimeEnvironment so that the VO's jobs can then be
> steered to that site.  The information could be used to help the VO
> manager install the software, but they should run some kind of probe
> job at the site to check if what they require is on the worker node.
>
> Laurence
>
>

************************************************************************
Markus Schulz
CERN IT