JISCMail - LCG-ROLLOUT Archives

On Wed, 4 Oct 2006, Laurence Field wrote:

> This could have happened with any of the distributions below. Redhat, Suse, 
> Debian, Ubuntu etc. will do updates and it is understandable that a site will 
> want to install security updates as soon as possible.  If this update causes 
> an interaction with other software which creates a problem then that is 
> unfortunate.  This problem was spotted by the SFTs and also showed up on the 
> Testbed, however, I don't think we can realistically prevent this kind of 
> problem.

In fact, the problem of half the grid going down after one of these 
distributions messed up is probably a sign that more heterogeneity would be 
good. Best practice for security upgrades is to make sure no semantics 
change, but mistakes do happen. And if there had been a perfect spread 
of distributions in use (out of those 14), an SLC mistake would only be 
able to take out 7% of the sites.
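
The 7% figure is just 1/14. A minimal sketch of that arithmetic follows, 
with made-up site counts: the per-distribution numbers and the "distroN" 
names are hypothetical, chosen only so that the skewed case matches the 
"half the grid" situation described above.

```python
def outage_fraction(sites_per_distro, broken):
    """Fraction of all sites lost if every site running `broken` fails."""
    total = sum(sites_per_distro.values())
    return sites_per_distro[broken] / total


# Hypothetical even spread: 14 distributions, 10 sites each.
even = {f"distro{i}": 10 for i in range(1, 14)}
even["SLC"] = 10

# Hypothetical skewed spread: SLC runs on half of all sites.
skewed = {f"distro{i}": 5 for i in range(1, 14)}
skewed["SLC"] = 65

print(f"even spread:   {outage_fraction(even, 'SLC'):.1%}")    # 7.1%
print(f"skewed spread: {outage_fraction(skewed, 'SLC'):.1%}")  # 50.0%
```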

Unfortunately, the spread of OSes on the CE and other infrastructure nodes 
is probably narrower than those figures (which are for the WNs) suggest. A 
heterogeneous environment might be a bit harder to handle and support, but 
in the long run it is likely to be more resilient, in that many mistakes 
(and perhaps even vulnerabilities) would be present at only a small 
percentage of the sites.

/Mattias Wadenstein

> ldapsearch -x -h lcg-bdii -p 2170 -b o=grid | grep GlueHostOperatingSystemName | sort -u
> GlueHostOperatingSystemName: CentOS
> GlueHostOperatingSystemName: Debian
> GlueHostOperatingSystemName: linux-rhel-3
> GlueHostOperatingSystemName: linux-rocks-3.3
> GlueHostOperatingSystemName: linux-rocks-4.1
> GlueHostOperatingSystemName: linux-sl-fermi-3.0
> GlueHostOperatingSystemName: Redhat
> GlueHostOperatingSystemName: RedHatEnterpriseAS
> GlueHostOperatingSystemName: Scientific Linux
> GlueHostOperatingSystemName: Scientific Linux CERN
> GlueHostOperatingSystemName: Scientific Linux SL
> GlueHostOperatingSystemName: ScientificSL
> GlueHostOperatingSystemName: SUSE LINUX
> GlueHostOperatingSystemName: Ubuntu
>
>
>
> Kalman Kovari wrote:
>> Hi,
>>
>> 
>>> don't forget to be real, people: the problem was caused by an interaction 
>>> between a *bug fix* in ssh and an *unfixed, dormant bug* in YAIM.  These 
>>> kinds of situations are rather difficult to detect, until they are 
>>> triggered.
>>> 
>> 
>> Yep. That's why one would need a
>> "yaim-installed-slc3-based-gLite-running-small-test-gridsite" to test
>> the release candidates before approving them for the grid. Is that so
>> unreal? Plus 5 machines for a testbed?
>> 
>> K
>>
>>
>>> 	J "or did you forget the apostrophe in the comment story" T
>>> 
>>> Kalman Kovari wrote:
>>> 
>>>> Hi Nicholas,
>>>>
>>>> 
>>>>> The update was an OS update, not a middleware update, therefore it's out
>>>>> of the control of EGEE and WLCG.  If gLite ran on Windows, would we
>>>>> expect Microsoft to give us (EGEE grid) an individual warning of a
>>>>> security patch?
>>>>> 
>>>> Would we be the 'biggest consumer' of Microsoft? In that case, I would
>>>> expect them to consider our needs...
>>>> 
>>>> If we want to avoid another issue like this, the choices in the long
>>>> run are either to set up our own (gLite or EGEE based) committee to
>>>> control the repository updates (by setting up our own repo, or by
>>>> advising sysadmins to upgrade only on the committee's approval of the
>>>> new sw), OR to convince those responsible for SLC3 releases to RESPECT
>>>> the needs of our services, and to trust them. The first case would be
>>>> a lot of work, and a lot of delay on security updates. In the latter
>>>> case their testing team would have a bit more work (another testing
>>>> environment maybe), and we could even trust the auto-updates.
>>>> 
>>>> Best Regards,
>>>>  Kalman Kovari
>>>>