Hi
Can I explain the logic behind this proposal from the VO perspective?
We don’t really care about the difference between unscheduled and scheduled downtimes. In general what we care about is overall availability (as measured by things like ASAP). Occasionally a site going into downtime will have an impact on a VO’s ongoing work, but 24 hours isn’t nearly enough time to react to this. We want sites to declare downtimes as far in advance as possible, and while most of the time it won’t make any difference to our actions, it will increase the chance of us preventing potential problems due to downtime declarations.
With regard to critical security patching, I feel that exceptions should be granted and that what is written on slide 5 is just wrong. Under the current system, if there is a critical security patch and you are in a position to patch immediately, then it is clearly better to do this than to wait 24 hours simply so that the downtime can be considered scheduled. Making it 5 days highlights quite how silly / artificial the downtime declaration mechanism is. We definitely don’t want sites to wait 5 days before patching; we want people to patch as soon as reasonably possible.
The 5 days is a compromise: 5 days is sufficiently short that most genuinely planned interventions (e.g. a storage upgrade) can be accurately scheduled. 5 days is sufficiently long that if some incident does cause a problem, the site just deals with it, rather than trying to hobble along simply so the downtime can be considered scheduled.
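To make the threshold concrete, here is a minimal sketch of the rule as I understand it from the thread: a downtime counts as scheduled only if it is declared at least the minimum notice period before it starts. The function name, the constants and the treatment of the exact boundary are my own assumptions for illustration, not anything taken from the slides or from the actual declaration tooling.

```python
from datetime import datetime, timedelta

# Assumed notice periods, taken from the figures discussed in this thread.
CURRENT_NOTICE = timedelta(hours=24)
PROPOSED_NOTICE = timedelta(days=5)


def classify_downtime(declared_at, starts_at, min_notice=PROPOSED_NOTICE):
    """Return 'scheduled' if declared at least min_notice before the start.

    Hypothetical helper: treating notice exactly equal to the threshold
    as scheduled is an assumption, not a documented rule.
    """
    notice = starts_at - declared_at
    return "scheduled" if notice >= min_notice else "unscheduled"


# A patching intervention declared two days ahead.
declared = datetime(2017, 5, 16, 9, 0)
starts = datetime(2017, 5, 18, 9, 0)

print(classify_downtime(declared, starts, CURRENT_NOTICE))   # scheduled
print(classify_downtime(declared, starts, PROPOSED_NOTICE))  # unscheduled
```

The same intervention flips from scheduled to unscheduled purely because the threshold moved, which is why the label on its own should not be read as a measure of site performance.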
Alastair
> On 16 May 2017, at 17:14, Stephen Jones <[log in to unmask]> wrote:
>
> Hi,
>
> It's interesting. We can't predict when some scary exploit will jump up at us, so we can't pledge to give much notice. So perhaps we should say that some types of unscheduled downtimes are perfectly acceptable and are no reflection on site performance. As others say, if unscheduled downtimes are always taken to be a reflection on site performance, then the temptation is to delay the adoption of patches in order to give plenty of notice. And this would be a case of throwing out the baby with the bathwater.
>
> (On a related issue: it's easy to patch a cluster without taking it down for a significant time, by doing batches of nodes in series. I have a little script, called snakey.pl, that I bring out when I need to put a new kernel on our Condor cluster.)
>
> Cheers,
>
> Ste
>
> On 16/05/17 15:11, Ian Collier wrote:
>> There should be an opportunity to discuss this at the Manchester Workshop.
>>
>> —Ian
>>
>>> On 16 May 2017, at 11:56, Peter Gronbech <[log in to unmask]> wrote:
>>>
>>> I noticed that there is a proposal to change the downtime declaration notice period from 24 hours to 5 days before an intervention.
>>> See slide 6 of https://indico.egi.eu/indico/event/3237/contribution/4/material/slides/0.pdf
>>>
>>> This would mean interventions declared less than 5 days in advance will be considered unscheduled.
>>> I'm not sure if this has been approved yet though.
>>>
>>> Cheers Pete
>>>
>>>
>>> --
>>> ----------------------------------------------------------------------
>>> Peter Gronbech GridPP Project Manager Tel No. : 01865 273389
>>>
>>> Department of Particle Physics,
>>> University of Oxford,
>>> Keble Road, Oxford OX1 3RH, UK E-mail : [log in to unmask]
>>> ----------------------------------------------------------------------
>
>
> --
> Steve Jones [log in to unmask]
> Grid System Administrator office: 220
> High Energy Physics Division tel (int): 43396
> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
> University of Liverpool http://www.liv.ac.uk/physics/hep/