Hello Fotis,
Can you please point us to the "rm" tests that failed because they
could not access the LFCs at CERN?
Thanks, Sophie.
> Hi,
>
> LCG-ROLLOUT automatic digest system wrote:
>
>> Date: Wed, 22 Nov 2006 14:05:05 +0100
>> From: Maria Dimou-Zacharova <[log in to unmask]>
>> Subject: CERN VOMS Servers' replication Was: [LCG-ROLLOUT]
>> lcg-voms.cern.ch problem
>
> [...]
>
>> lcg-voms.cern.ch itself runs on LinuxHA (High Availability) hardware
>> with a slave host ready to take over in case of a problem, so, in
>> theory, its availability is 'ensured'.
>>
>> These are the reasons why we didn't give high priority to the
>> provision of a voms server replica outside the CERN site.
>
>
> These two solutions address two distinct classes of problems:
>
> * There are problems which are system-wide (e.g. power-supply failure,
> hardware or network connectivity failure, system overload/crash etc).
>
> * There are also problems which are site-wide (e.g. generic power outage,
> generic network outage, air-co problems and other more complicated stories,
> e.g. security incidents which influence the service of many nodes at once).
>
> Many "soft" problems of networks fall into the latter category,
> and there are plenty of router-related stories about it.
>
> A LAN-High-Availability solution should not be considered complete,
> if there is a requirement to provide services to a WAN-area,
> and it seems both LCG & EGEE projects -and friends- have it anyhow.
>
> To convince you further, look at the many "rm" SFT/SAM errors:
> They are mostly caused by the inability of sites to access the LFCs@CERN!
> There is nothing wrong with sites, this is just a design issue,
> in the sense that such services should be ideally WAN-replicated
> and hence, less influenced by transient network errors/downtimes.
>
> It's even funnier that, as site-admins, we get a bad mark for it; as many
> as 200 site-admins can get an alert for something they can do nothing
> about!
>
>> Nevertheless, we do have a collaboration with the VOMS developers and
>> Oracle experts from CNAF and CERN for off-site data replication.
>
>
> This is a step in the right direction. In fact, it would be great
> if all Tier-1s jumped in, to provide some kind of redundancy support
> in the various critical/useful services (VOs/LFCs, gocdb, gstat, BDIIs,
> SFT/SAM, there are more...). Ideally, both fail-over & load-balanced.
> At least R/O for the ones that can be done in the Master/Slave fashion,
> and R/W for some which could be fully replicated (Hint: MySQL v5,++).
>
> We should accept that not everything is possible (or obviously possible).
>
> Good luck (the only meaningful & constructive way to sum up this letter!),
>
> Fotis