On Wed, 4 Oct 2006, Laurence Field wrote:

> This could have happened with any of the distributions below. Redhat, Suse,
> Debian, Ubuntu etc. will do updates and it is understandable that a site will
> want to install security updates as soon as possible. If this update causes
> an interaction with other software which creates a problem then that is
> unfortunate. This problem was spotted by the SFTs and also showed up on the
> Testbed, however, I don't think we can realistically prevent this kind of
> problem.

In fact, the problem of half the grid going down after one of these
distributions messed up is probably a sign that more heterogeneity would be
good.

Best practice for security upgrades is to make sure no semantics change, but
mistakes do happen. And if there had been a perfect spread of distributions
in use (out of those 14), an SLC mistake would only be able to take out about
7% of the sites (1 in 14). Unfortunately, the spread of OSes on the CE and
other infrastructure nodes is probably narrower than those figures suggest
(they are for the WNs).

A heterogeneous environment might be a bit harder to handle and support, but
in the long run it is likely to be more resilient, in that many mistakes (and
perhaps even vulnerabilities) are only present in a small percentage of the
sites.
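To put rough numbers on this, something like the following could be piped after the quoted query. It is only a sketch: the function names os_share and even_share are made up for illustration, it counts published GlueHostOperatingSystemName entries rather than sites, and it makes no attempt to merge the different spellings of the same distribution.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any gLite/LCG tool.

# Reads "GlueHostOperatingSystemName: <name>" lines on stdin, one per
# published entry (i.e. the quoted pipeline WITHOUT the final "sort -u"),
# and prints each name with its share of the total, largest share first.
os_share() {
    LC_ALL=C awk -F': ' '
        /^GlueHostOperatingSystemName: / { count[$2]++; total++ }
        END {
            for (name in count)
                printf "%.1f%% %s\n", 100 * count[name] / total, name
        }
    ' | LC_ALL=C sort -rn
}

# Share each distribution would have under a perfectly even spread of N names.
even_share() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.1f%%\n", 100 / n }'
}

# Intended use, assuming the BDII host from the quoted query is reachable:
#   ldapsearch -x -h lcg-bdii -p 2170 -b o=grid \
#       | grep GlueHostOperatingSystemName | os_share
#   even_share 14
```

The first line of os_share output is then the worst case for a single bad update, to be compared with the even_share figure for the ideal spread.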
/Mattias Wadenstein

> ldapsearch -x -h lcg-bdii -p 2170 -b o=grid | grep GlueHostOperatingSystemName | sort -u
> GlueHostOperatingSystemName: CentOS
> GlueHostOperatingSystemName: Debian
> GlueHostOperatingSystemName: linux-rhel-3
> GlueHostOperatingSystemName: linux-rocks-3.3
> GlueHostOperatingSystemName: linux-rocks-4.1
> GlueHostOperatingSystemName: linux-sl-fermi-3.0
> GlueHostOperatingSystemName: Redhat
> GlueHostOperatingSystemName: RedHatEnterpriseAS
> GlueHostOperatingSystemName: Scientific Linux
> GlueHostOperatingSystemName: Scientific Linux CERN
> GlueHostOperatingSystemName: Scientific Linux SL
> GlueHostOperatingSystemName: ScientificSL
> GlueHostOperatingSystemName: SUSE LINUX
> GlueHostOperatingSystemName: Ubuntu
>
> Kalman Kovari wrote:
>> Hi,
>>
>>> don't forget to be real, people: the problem was caused by an interaction
>>> between a *bug fix* in ssh and an *unfixed, dormant bug* in YAIM. These
>>> kinds of situations are rather difficult to detect, until they are
>>> triggered.
>>
>> Yep. That's why one would need a
>> "yaim-installed-slc3-based-gLite-running-small-test-gridsite" to test
>> the release candidates before approving it towards the grid. Is that so
>> unreal? Plus 5 machines to a testbed?
>>
>> K
>>
>>> J "or did you forget the apostrophe in the comment story" T
>>>
>>> Kalman Kovari wrote:
>>>
>>>> Hi Nicholas,
>>>>
>>>>> The update was an OS update, not a middleware update, therefore it's out
>>>>> of the control of EGEE and WLCG. If gLite ran on Windows, would we
>>>>> expect Microsoft to give us (EGEE grid) an individual warning of a
>>>>> security patch?
>>>>
>>>> Would we be the 'biggest consumer' of Microsoft? In that case, I would
>>>> expect them to consider our needs...
>>>>
>>>> If we want to avoid another issue like this, the choices in the long
>>>> run are either to set up our own (gLite or EGEE based) committee to control
>>>> the repository updates (by setting up our own repo, or by advising
>>>> sysadmins only to upgrade on the committee's approval of the new sw), OR
>>>> to convince those responsible for the SLC3 release to RESPECT the needs of
>>>> our services, and to trust them. The first case would be a lot of work, and
>>>> would mean a lot of delay on security updates. In the latter case their
>>>> testing team would have a bit more work (another testing environment
>>>> maybe), and we could even trust the auto-updates.
>>>>
>>>> Best Regards,
>>>> Kalman Kovari
>>>>