Dear lcg-rollout members,
there are several ways to make a CE "highly available" (*):
1) high availability on the OS - Linux side (e.g. DRBD+heartbeat)
2) load balancing-like with multiple lcg-CEs
3) combined lcg-CE & cream-CE solution (unavoidable in 2009?)
4) something you know better
Needless to say, (2) and (3) will need extra work for sBDII & LRMS,
but they give more than just improved availability and can at times
capture/absorb/mask more complex and unforeseen failure modes.
Questions:
* Has anyone of you been happy for long with any one of these setups?
* What are the complexities involved with each one, wrt. m/w upgrades?
* Do you foresee any kind of trouble with any of the aforementioned ideas?
thank you in advance for your time,
Fotis
(*)
Here follows the long explanation of why CE High Availability is needed.
As you are perhaps already aware, sites are judged by uptime metrics,
as defined in "Definition of Availability & Reliability terms";
for more, see the following document:
https://lcg.web.cern.ch/LCG/MB/Explanation%20of%20Availability%20and%20Reliability%20terms%20used%20in%20the%20MoU%20and%20Site%20Availability%20Metrics.pdf
There are at least three parameters to optimize in a potential grid SLA,
(i) uptime/downtime windows and ratios thereof
(ii) cpu*hour product over a time window
(iii) success rate of incoming jobs
Currently, Tier2 sites are expected to be at the 95% level for parameter (i)
according to MoU - Annex 3.3 ( http://lcg.web.cern.ch/LCG/mou.htm )
and to provide an adequate level of (ii) on a per-year basis as defined in
the TDRs. (iii) can be monitored at either LRMS, CE, WMS or User/App level
(in fact the latter is happening through the dashboard for some VOs).
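To put numbers on (i) and (iii), here is a small Python sketch. It is only
an illustration under simplifying assumptions: availability is taken as
plain uptime divided by the length of the window (the official definitions
in the document above are more detailed), and the function names and the
30-day window are mine, not taken from the MoU or any m/w tool.

```python
def availability(uptime_hours, window_hours):
    """Parameter (i), simplified: fraction of the window the CE was up."""
    return uptime_hours / window_hours


def allowed_downtime_hours(availability_target, window_days):
    """Hours of downtime a site can afford in a window at a given target."""
    window_hours = window_days * 24
    return (1.0 - availability_target) * window_hours


def job_success_rate(succeeded, submitted):
    """Parameter (iii): fraction of incoming jobs that succeeded.

    Returns None when no jobs were submitted, since the rate is undefined.
    """
    if submitted == 0:
        return None
    return succeeded / submitted


if __name__ == "__main__":
    # Hypothetical 30-day window at the 95% Tier2 target.
    budget = allowed_downtime_hours(0.95, 30)
    print("downtime budget (hours):", round(budget, 1))
    # Hypothetical counts: 684 hours up out of 720, 940 of 1000 jobs OK.
    print("availability:", round(availability(684, 720), 3))
    print("job success rate:", job_success_rate(940, 1000))
```

At a 95% target, a 30-day window leaves roughly 36 hours of total
downtime, so a single long CE outage can use up the whole budget.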
Even without running a proper FMECA analysis for a grid site -
http://en.wikipedia.org/wiki/Failure_Mode,_Effects,_and_Criticality_Analysis
- it is apparent that loss of the Computing Element service has
a very high impact on (i), (ii) & (iii). Worse, degradations of
CE-related services and failures of components related to them have
the most severe effects "return-on-investment-wise" - and vice versa.
To make a long story short, hardening the CE service is imperative
for any grid site of some size and will eventually repay its cost
in money and time, so it requires attention.
There are a few different ways to implement such an improvement,
each solution with different advantages and disadvantages.
The answer is not clear, but the need is common among grid sites,
hence this email... ;)