On Wed, 4 Oct 2006, Laurence Field wrote:
> This could have happened with any of the distributions below. Redhat, Suse,
> Debian, Ubuntu etc. will do updates and it is understandable that a site will
> want to install security updates as soon as possible. If this update causes
> an interaction with other software which creates a problem then that is
> unfortunate. This problem was spotted by the SFTs and also showed up on the
> Testbed, however, I don't think we can realistically prevent this kind of
> problem.
In fact, half the grid going down after one of these distributions
messed up is probably a sign that more heterogeneity would be good. Best
practice for security upgrades is to make sure no semantics change, but
mistakes do happen. And if there had been a perfect spread of the
distributions in use (the 14 listed below), an SLC mistake would only have
been able to take out about 7% of the sites (1 in 14).
Unfortunately, the spread of OSes on the CE and other infrastructure nodes
is probably narrower than those figures suggest (they are for the WNs). A
heterogeneous environment might be a bit harder to handle and support, but
in the long run it is likely to be more resilient, in that many mistakes
(and perhaps even vulnerabilities) would only be present in a small
percentage of the sites.
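
To make the arithmetic concrete, here is a minimal sketch of the "how many
sites can one bad update take out" estimate. It is only an illustration: the
function name `distro_share` and both site lists are assumptions made up for
the example (one hypothetical site per reported name for the even case, and
an invented half-on-SLC split for the skewed case), not measured numbers from
the BDII.

```python
from collections import Counter

# The 14 distinct GlueHostOperatingSystemName values quoted in this thread.
DISTROS = [
    "CentOS", "Debian", "linux-rhel-3", "linux-rocks-3.3",
    "linux-rocks-4.1", "linux-sl-fermi-3.0", "Redhat",
    "RedHatEnterpriseAS", "Scientific Linux", "Scientific Linux CERN",
    "Scientific Linux SL", "ScientificSL", "SUSE LINUX", "Ubuntu",
]


def distro_share(site_distros):
    """Return {distribution: fraction of sites running it}."""
    counts = Counter(site_distros)
    total = len(site_distros)
    return {name: n / total for name, n in counts.items()}


# Assumption: a perfectly even spread, one hypothetical site per name.
even_sites = list(DISTROS)
even_share = distro_share(even_sites)
worst_even = max(even_share.values())

# Assumption: an invented skewed grid where half the sites run SLC and
# the other half are spread over seven other distributions.
skewed_sites = ["Scientific Linux CERN"] * 7 + DISTROS[:7]
skewed_share = distro_share(skewed_sites)
worst_skewed = max(skewed_share.values())

print(f"even spread, worst case:   {worst_even:.1%} of sites")
print(f"skewed spread, worst case: {worst_skewed:.1%} of sites")
```

With real data one would feed in one entry per site rather than one per
distinct name, since the sorted unique list says nothing about how many sites
sit behind each name.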
/Mattias Wadenstein
> ldapsearch -x -h lcg-bdii -p 2170 -b o=grid | grep GlueHostOperatingSystemName | sort -u
> GlueHostOperatingSystemName: CentOS
> GlueHostOperatingSystemName: Debian
> GlueHostOperatingSystemName: linux-rhel-3
> GlueHostOperatingSystemName: linux-rocks-3.3
> GlueHostOperatingSystemName: linux-rocks-4.1
> GlueHostOperatingSystemName: linux-sl-fermi-3.0
> GlueHostOperatingSystemName: Redhat
> GlueHostOperatingSystemName: RedHatEnterpriseAS
> GlueHostOperatingSystemName: Scientific Linux
> GlueHostOperatingSystemName: Scientific Linux CERN
> GlueHostOperatingSystemName: Scientific Linux SL
> GlueHostOperatingSystemName: ScientificSL
> GlueHostOperatingSystemName: SUSE LINUX
> GlueHostOperatingSystemName: Ubuntu
>
>
>
> Kalman Kovari wrote:
>> Hi,
>>
>>
>>> don't forget to be real, people: the problem was caused by an interaction
>>> between a *bug fix* in ssh and an *unfixed, dormant bug* in YAIM. These
>>> kinds of situations are rather difficult to detect, until they are
>>> triggered.
>>>
>>
>> Yep. That's why one would need a
>> "yaim-installed-slc3-based-gLite-running-small-test-gridsite" to test
>> the release candidates before approving it towards the grid. Is that so
>> unreal? Plus 5 machines to a testbed?
>>
>> K
>>
>>
>>> J "or did you forget the apostrophe in the comment story" T
>>>
>>> Kalman Kovari wrote:
>>>
>>>> Hi Nicholas,
>>>>
>>>>
>>>>> The update was an OS update, not a middleware update, therefore it's out
>>>>> of the control of EGEE and WLCG. If gLite ran on Windows, would we
>>>>> expect Microsoft to give us (EGEE grid) an individual warning of a
>>>>> security patch?
>>>>>
>>>> Would we be the 'biggest consumer' of Microsoft? In that case, I would
>>>> expect them to consider our needs...
>>>>
>>>> If we want to avoid another issue like this, the choices in the long
>>>> run are either to set up our own (gLite or EGEE based) committee to control
>>>> the repository updates (by setting up our own repo, or by advising
>>>> sysadmins only to upgrade on the committee's approval of the new sw), OR
>>>> to convince those responsible for SLC3 releases to RESPECT the needs of our
>>>> services, and to trust them. The first option would be a lot of work, and a
>>>> lot of delay on security updates. In the latter case their testing team
>>>> would have a bit more work (another testing environment maybe), and we
>>>> could even trust the auto-updates.
>>>>
>>>> Best Regards,
>>>> Kalman Kovari
>>>>