Hi Yves,
Sorry for bringing your name into this discussion, but thank you for
responding with all the details.
I think this has provoked a unified response from the UK and something
may get done about it!

Cheers Pete 


----------------------------------------------------------------------
Peter Gronbech   Unix Systems Manager and       Tel No. : 01865 273389
              SouthGrid Technical Co-ordinator  Fax No. : 01865 273418

Department of Particle Physics,                          
University of Oxford,                  
Keble Road, Oxford  OX1 3RH, UK  E-mail : [log in to unmask]
----------------------------------------------------------------------

-----Original Message-----
From: Testbed Support for GridPP member institutes
[mailto:[log in to unmask]] On Behalf Of Yves Coppens
Sent: 19 May 2007 11:02
To: [log in to unmask]
Subject: Update 24

Hello,

From my experience of updates 23 and 24, I would not recommend that
anyone upgrade their site until all the current problems have been
resolved. In particular, do not upgrade your DPM!
However, do install the latest lcg-voms.cern.ch!

If you do wish to upgrade some components, below is my experience of
what works and what does not, and some of the changes to be aware of.

The new VO style support in yaim seems to work fine.

Yaim provides support for software manager and production pool accounts
(e.g. atlassgm01, atlassgm02, ...). Unfortunately, this new feature
either does not work or is fiddly, so you should keep your old
users.conf and groups.conf files.


Depending on which version of gLite you are upgrading from, you'll need
to apply some of the following changes to your site-info.def:

Add and set the SITE_SUPPORT_EMAIL variable.
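For example, with a placeholder address standing in for your own site's
support list (site-info.def is sourced by the shell, so this is plain
shell syntax):

```shell
# site-info.def fragment. The address is a placeholder, not a real list;
# replace it with your site's support contact.
SITE_SUPPORT_EMAIL="grid-support@example.ac.uk"
```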

There is a new way to define queues, so you will need to add something
along the lines of:

ALICE_GROUP_ENABLE="alice"
ATLAS_GROUP_ENABLE="atlas"
...
SHORT_GROUP_ENABLE="atlas alice babar biomed cms dteam hone ilc lhcb ngs ops zeus calice"

This is for the case where each VO has its own queue and one short
queue is shared by all VOs. The VO_${VO}_QUEUES variables are no longer
used; as expected, keeping them does not crash yaim.
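With many VOs, typing these lines by hand is error-prone. Below is a
small sketch of a helper that prints them; gen_group_enable is a
hypothetical name (not part of yaim), and it assumes each VO's queue is
named after the VO and that the variable name is simply the upper-cased
queue name, as in the example above:

```shell
#!/bin/sh
# Hypothetical helper: print one <QUEUE>_GROUP_ENABLE line per VO queue,
# then a SHORT_GROUP_ENABLE line listing every VO, for site-info.def.
# Assumes each VO's queue carries the VO's name.
gen_group_enable() {
    all=""
    for vo in "$@"; do
        # Upper-case the VO name to build the variable name.
        q=$(printf '%s' "$vo" | tr '[:lower:]' '[:upper:]')
        printf '%s_GROUP_ENABLE="%s"\n' "$q" "$vo"
        # Append to the space-separated list for the short queue.
        all="${all:+$all }$vo"
    done
    printf 'SHORT_GROUP_ENABLE="%s"\n' "$all"
}

gen_group_enable atlas alice babar biomed cms dteam hone ilc lhcb ngs ops zeus calice
```

Redirect the output to a file and check it before pasting it into
site-info.def.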

There is a new, harmless YAIM_LOGGING_LEVEL variable. I found that
setting it to NONE was equivalent to setting it to WARNING.
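Given that, it is probably clearer to set the level you actually get,
e.g. this site-info.def line (a sketch based only on the behaviour
observed above):

```shell
# site-info.def fragment. NONE behaved like WARNING in the tests
# described above, so spelling out WARNING avoids surprises.
YAIM_LOGGING_LEVEL=WARNING
```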

If you read the release notes carefully, you'll find that there is a
new way to run yaim. One can now configure a CE as follows:

/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n CE 

Once the configuration has finished, yaim waits for a CTRL-C, so you had
better not use this on workers. Shortly after I ran yaim on my CE as
above, my gatekeeper ended up in a locked state:

$ service globus-gatekeeper status
edg-gatekeeper dead but subsys locked
$ 

The gatekeeper ended up in the same state after I reran yaim. I have no
idea what caused this, and I had never encountered this problem before,
so it is something to watch out for and to investigate.
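Until the cause is understood, it may be worth checking for this state
automatically, e.g. from cron. A minimal sketch, assuming the status
output contains the phrase shown above; is_stale_lock is a hypothetical
helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) if an init-script status
# message reports a dead daemon whose subsys lock is still held, as in
# "edg-gatekeeper dead but subsys locked".
is_stale_lock() {
    case "$1" in
        *"dead but subsys locked"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the CE (not run here):
# msg=$(service globus-gatekeeper status 2>&1)
# if is_stale_lock "$msg"; then
#     echo "gatekeeper needs attention: $msg" >&2
# fi
```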

There is an ugly bug (or rather a design flaw) in config_sw_dir. When
yaim runs on workers, it tries to do a recursive chmod on your software
area. In my case, this produced a "permission denied" error message for
every single file in my 50+ GB /software directory :( Now imagine big
sites running this automatically on all their workers! For more on
this, see the "NOTICE - VOBOX vs. VO software area ownership" message on
ROLLOUT. I really think an EGEE broadcast should have been issued, not
simply an email to ROLLOUT.
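Before letting yaim loose on a worker, it may be worth checking how much
a recursive chmod would touch and who currently owns the VO directories.
A rough sketch: the function names are hypothetical, /software is just
the path from my site, and show_top_owners relies on GNU find's -printf:

```shell
#!/bin/sh
# Hypothetical pre-upgrade checks for a VO software area.

# Count every file and directory below $1 (excluding $1 itself),
# i.e. roughly the number of entries a recursive chmod would visit.
count_entries() {
    find "$1" -mindepth 1 | wc -l | tr -d ' '
}

# List owner, group, octal mode and path of each top-level directory,
# typically one per VO. -printf is a GNU find extension.
show_top_owners() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -printf '%u %g %m %p\n'
}

# Example use on a worker (not run here):
# count_entries /software
# show_top_owners /software
```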

You can upgrade your VOBOX provided you have version 3.0.1-15 of yaim
(from release 24; the release 23 version crashed) and you take the
precautions mentioned in Marteen's email.
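A quick way to check the installed version first. This is a sketch,
assuming the yaim package is called glite-yaim (adjust to whatever
rpm -qa shows on your VOBOX) and a GNU sort that supports -V;
version_at_least is a hypothetical helper:

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2. Relies on GNU sort -V
# ordering versions numerically (so 3.0.1-9 sorts before 3.0.1-15).
version_at_least() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On the VOBOX (package name glite-yaim is an assumption; not run here):
# have=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' glite-yaim)
# version_at_least "$have" 3.0.1-15 || echo "yaim $have is too old" >&2
```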

When you do upgrade DPM, do not forget to change the port in
BDII_SE_URL: the new DPM uses a BDII rather than a GRIS.
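For illustration only, with a placeholder hostname. The sketch assumes
the usual ports (2135 for a GRIS, 2170 for a resource BDII) and that the
base DN changes from mds-vo-name=local to mds-vo-name=resource along
with the port; check this against the release notes for your version:

```shell
# site-info.def fragment; se.example.ac.uk is a placeholder hostname.
# Old GRIS-based value (assumed port 2135):
# BDII_SE_URL="ldap://se.example.ac.uk:2135/mds-vo-name=local,o=grid"
# New BDII-based value (assumed port 2170):
BDII_SE_URL="ldap://se.example.ac.uk:2170/mds-vo-name=resource,o=grid"

# To check that the SE's BDII answers (OpenLDAP client; not run here):
# ldapsearch -x -H ldap://se.example.ac.uk:2170 -b mds-vo-name=resource,o=grid
```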

I'm aware there is a middleware certification process, some testing at
CERN and (unfortunately) only limited testing on the PPS, but when I see
the kind of horrors that reach production sites, I'm led to conclude
that the certification and testing are totally ineffective, highly
superficial or even nonexistent. I am one of the culprits for the lack
of testing on the PPS, but we have very little time to perform any
tests at all! There are plans to do more testing in the PPS, but I still
wish we had more than one day for this, particularly when major new
functionality is introduced, and that more time would pass before
releases move from the PPS to production.

Yves