Dear All,
As advertised several times on the CIC portal, and as explained in
https://twiki.cern.ch/twiki/bin/view/LCG/VomsLdapServer
voms.cern.ch is today a replica of lcg-voms.cern.ch, because they share
one common database.
lcg-voms.cern.ch itself runs on LinuxHA (High Availability) hardware,
with a slave host ready to take over in case of problems, so, in theory,
its availability is 'ensured'.
These are the reasons why we did not give high priority to the provision
of a VOMS server replica outside the CERN site.
The occasional errors in the gridmap file update are caused by
voms-admin and/or Tomcat problems that we have been trying to
debug for more than 1.5 years.
https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=16843
contains part of this saga.
Nevertheless, we do have a collaboration with the VOMS developers and
Oracle experts from CNAF and CERN for off-site data replication.
We will make an announcement when we have news that can be used in
production.
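In the meantime, a common node-side mitigation for such update failures is the pattern behind edg-mkgridmap's --safe option: never overwrite the live mapfile unless the update actually succeeded. The sketch below is a generic illustration of that idea, not the actual edg-mkgridmap implementation; the function name and the stdout-based interface are our own assumptions.

```shell
# Sketch (hypothetical helper, not part of edg-mkgridmap): run a
# generator command, write its stdout to a temp file, and install it
# over the live mapfile only if the command succeeded and produced
# non-empty output. A VOMS/Tomcat hiccup then leaves the previous
# grid-mapfile untouched instead of wiping it.
safe_update() {
    cmd="$1"     # command whose stdout is the new mapfile content
    out="$2"     # live mapfile path, e.g. /etc/grid-security/grid-mapfile
    tmp="${out}.tmp.$$"
    if $cmd > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$out"             # replace the live file only on success
    else
        rm -f "$tmp"                 # discard partial or empty output
        return 1                     # caller sees the failure; old file survives
    fi
}
```

Run from cron as, e.g., safe_update "some_generator" /etc/grid-security/grid-mapfile; on failure the exit status is non-zero and the previous mapfile remains in place.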
Yours
maria
Maarten Litmaath, CERN wrote:
> On Tue, 21 Nov 2006, Mario David wrote:
>
>> Hello
>>
>> I just tried to run the edg-mkgridmap on some nodes and...
>>
>> [root@se02
>> ~]# /usr/bin/perl /opt/edg/libexec/edg-mkgridmap/edg-mkgridmap.pl
>> --output=/etc/grid-security/grid-mapfile --safe
>> voms search(https://lcg-voms.cern.ch:8443/voms/alice/services/VOMSCompatibility?method=getGridmapUsers&container=%2Falice%2FRole%3Dlcgadmin): Connect failed: connect: timeout; Operation now in progress
>>
>> I did a ping and
>> [root@ui01 root]# ping lcg-voms.cern.ch
>> PING prod-voms.cern.ch (128.142.160.91) 56(84) bytes of data.
>> 64 bytes from prod-voms.cern.ch (128.142.160.91): icmp_seq=0 ttl=51
>> time=36.9 ms
>>
>>
>> so my question is, of course:
>> for such a core/central/unique service, isn't there a replicated server
>> that could take over when there are such problems?
>
> A replicated server probably would not help here. See this bug:
>
> http://savannah.cern.ch/bugs/?16843
>
>> Two comments:
>> one of my wishes in a wishlist to the ROC managers meeting was,
>> specifically, "All the services should not be single points of failure";
>> that's why it's nice to have as many RBs as we want, or as many BDIIs
>> (although the UI is only able to use one BDII, but that's the other side
>> of the coin).
>
> Indeed, e.g. LCG_GFAL_INFOSYS could be obsoleted by a variable that
> lists a set of equivalent BDIIs.
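What such a multi-valued variable could buy the client can be sketched as simple failover: try each endpoint in the list and use the first one that answers. Everything here is an assumption for illustration (the variable, function, and probe interface do not exist in gLite); a real probe would be something like an ldapsearch against the BDII port.

```shell
# Sketch of client-side failover over an assumed multi-valued variable
# (a hypothetical LCG_GFAL_INFOSYS_LIST): print the first endpoint in
# the list for which the probe command succeeds.
first_working_bdii() {
    list="$1"    # space-separated endpoints, e.g. "bdii1:2170 bdii2:2170"
    probe="$2"   # command invoked as: $probe <endpoint>; exit 0 if it answers
    for ep in $list; do
        if $probe "$ep" >/dev/null 2>&1; then
            echo "$ep"               # first responsive endpoint wins
            return 0
        fi
    done
    return 1                         # none of the endpoints answered
}
```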
>
>> Central LFCs are another type of service that is a single point of
>> failure.
>
> LHCb already replicate their central LFC to off-site read-only copies.
>
>> Note that single points of failure are not solved with 20 machines in
>> round-robin behind the lcg-voms DNS alias at the same site:
>> of course, if there are network connection problems at the site,
>> the extra machines surely will not help.
>>
>> In the EELA project, one of the efforts we made was indeed to have the
>> VOMS server DB replicated at two different sites, and so far things
>> seem to be working OK.
>
> That might be worth pursuing for the LHC experiments as well, though it
> should be noted that replication comes at a price that could be higher
> than the expected benefits.
--
Maria Dimou-Zacharova http://cern.ch/dimou
CERN, CH-1211 Geneva 23, Switzerland
[log in to unmask], Tel:+41227673356, Fax:+41227669820,+41227674900