Hi,
We’ve now drained and rebooted all our worker nodes and have not seen any problems; we’re re-running HEPSPEC06 on one of each generation. Ian will post numbers when he’s collated them, but the headline seems to be that we don’t see significant degradation in the HEPSPEC06 numbers.
However, what I’m now more concerned about is yesterday’s WLCG mail and the microcode availability checker script it linked. According to the mail and the script, updated microcode is only available for our latest generation of CPUs, and everything we bought before our 2016 purchase is still vulnerable to CVE-2017-5715 until the software fix is available (in a few weeks).
Until the “retpolines” fix is available, it looks like we need to put about 80% of our cluster offline. Is that what other sites are doing?
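For comparing notes between sites, a per-node check could look something like the minimal sketch below. It assumes the running kernel exposes the /sys/devices/system/cpu/vulnerabilities directory (newer and patched kernels do, older ones may not), where the spectre_v2 entry corresponds to CVE-2017-5715. The function names classify and read_vulnerabilities are made up for illustration, and this is not the checker script referred to in the WLCG mail.

```python
from pathlib import Path

# Assumption: the kernel exposes per-vulnerability status files here.
# On kernels without this interface the directory is simply absent.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def classify(status: str) -> str:
    """Map the kernel's status string onto a coarse state (hypothetical helper)."""
    s = status.strip()
    if s.startswith("Not affected"):
        return "not_affected"
    if s.startswith("Mitigation"):
        return "mitigated"
    if s.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_vulnerabilities(base: Path = SYSFS_DIR) -> dict:
    """Return {name: (state, raw_text)} for every status file under base.

    Returns an empty dict when the directory does not exist, which means
    the kernel gives no information, not that the node is safe.
    """
    if not base.is_dir():
        return {}
    report = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            raw = entry.read_text().strip()
            report[entry.name] = (classify(raw), raw)
    return report


if __name__ == "__main__":
    report = read_vulnerabilities()
    if not report:
        print("kernel does not expose vulnerability status")
    for name, (state, raw) in report.items():
        print(f"{name}: {state} ({raw})")
```

Run on each generation of worker node, this would show at a glance which ones the kernel still reports as vulnerable. An empty result only means the interface is missing, so those nodes would need checking another way.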
Yours,
Chris.
On 10/01/2018, 00:23, "Testbed Support for GridPP member institutes on behalf of Alastair Dewhurst" wrote:
Hi All
Jeremy mentioned in his email regarding OPs meetings that:
“There seems to be sufficient discussion on other lists to address the security topic that arose over the holiday period”
I assume this was a reference to Spectre and Meltdown. I am on quite a few mailing lists but I haven’t seen much useful discussion.
What I have seen:
- Emails where people repeatedly forward the OSG and EGI recommendations around, slowly obscuring the contents of the message with headers and indentation.
- Emails where people argue over what the OSG and EGI recommendations actually mean.
- Emails where people ask/confirm if mitigation is available for certain machine types (which, to be fair, is perfectly reasonable and useful).
- Emails where people have patched a test machine, run a quick benchmark and then made a wild extrapolation… (sometimes the first person just reports the result and lets someone else decide to take it as scientific fact and make the wild extrapolation).
What I haven’t seen are any emails where people have said they have patched certain services and, after a day or two, it’s fine and all the fears about performance hits did not materialise.
At RAL we currently have 50% of the batch farm draining; it will be patched and rebooted around lunchtime on Wednesday, and the other half will be done on Thursday. There is a bit of a flu epidemic going around RAL at the moment, so we haven’t started patching other services yet, but we intend to start on Wednesday.
Has anyone done an ARC CE (+ HTCondor Schedd)? They aren’t massively loaded, but they are constantly busy machines and could potentially be impacted by performance slowdowns. Our intention is to do one on Wednesday and see if anything happens before doing the rest over the next few days.
What about other services run by many sites: Squids, MyProxy, BDII, perfSONAR, Argus etc.? Most of these aren’t heavily used so I don’t expect problems, but it’s always nice to know if other people have done it. Performance hits on squids could potentially cause a problem if they are loaded, but I guess we could just add another machine or two.
Thanks
Alastair