Hi Martin
At most, a VO would want to check the downtime calendar once a week to look for any big downtimes that could affect its work. The VO would then need a few days to react if necessary. Therefore, in the worst-case scenario (the downtime is entered shortly after the VO does its weekly check), you are looking at around 10 days' notice. As I said, 5 days is a compromise.
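The worst-case arithmetic above can be sketched roughly as follows. The function name is hypothetical, and reading "a few days" as 3 days is an assumption made only to reproduce the "around 10 days" figure.

```python
from datetime import timedelta


def worst_case_notice(check_interval: timedelta, reaction_time: timedelta) -> timedelta:
    """Notice a site must give so that a VO still has `reaction_time` left,
    even if the downtime is entered just after the VO's periodic check."""
    # Worst case: the VO only sees the downtime at its next check, one full
    # interval later, and then still needs its reaction time.
    return check_interval + reaction_time


# Weekly check (from the text) plus an assumed 3 days to react.
needed = worst_case_notice(timedelta(days=7), timedelta(days=3))
print(needed.days)
```

Under those assumptions, a 5-day threshold sits at about half of the worst-case figure, which is consistent with it being described as a compromise.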
There are many ways that the downtime notification could be improved, but these all require increasing the complexity. The system works quite well. There have been a few cases in the last year where sites entered long downtimes with barely more than 24 hours' notice, which caused problems for VOs. This is what we are trying to discourage.
Alastair
> On 17 May 2017, at 09:19, Martin Bly <[log in to unmask]> wrote:
>
> Alastair,
>
> If one were to call all downtimes 'Unscheduled' that had not been announced in advance by some interval T(i), what is the minimum value of T(i) that would allow the VOs to be happy and able to reschedule their activities to cope with the outage?
>
> Perhaps there need to be three classes;
>
> Scheduled, for the usual service stuff, with notification time >=T(i)
> Unscheduled, for the oops stuff including power failures,
> and Emergency, for patching 'patch or it's game over' vulnerabilities.
>
> Only the Unscheduled outages count against sites' Availability/Reliability.
>
> I seem to recall most 'Emergency' class outages of late being subject to an 'it won't count against you' declaration anyway.
>
> Martin.
> --
> Martin Bly
> RAL Tier1 Fabric Manager
>
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>> [log in to unmask]] On Behalf Of Alastair Dewhurst
>> Sent: Tuesday, May 16, 2017 10:35 PM
>> To: [log in to unmask]
>> Subject: Re: Down time declarations
>>
>> Hi
>>
>> Can I explain the logic behind this proposal from the VO perspective.
>>
>> We don’t really care about the difference between unscheduled and scheduled
>> downtimes. In general what we care about is overall availability (as measured
>> by things like ASAP). Occasionally a site going into downtime will have an impact
>> on a VO's ongoing work, but 24 hours isn’t nearly enough time to react to this.
>> We want sites to declare downtimes as far in advance as possible, and while
>> most of the time it won’t make any difference to our actions, it will increase the
>> chance of us preventing potential problems due to downtime declarations.
>>
>> With regards to critical security patching, I feel that exceptions should be
>> granted and that what is written on slide 5 is just wrong. Under the current
>> system, if there is a critical security patch and you are in a position to
>> immediately patch, then it is clearly better to do this, than to wait 24 hours
>> simply so that the downtime could be considered scheduled. By making it 5
>> days, it highlights quite how silly / artificial the downtime declaration
>> mechanism is. We definitely don’t want sites to wait 5 days before patching, we
>> want people to patch as soon as reasonably possible.
>>
>> The 5 days is a compromise: 5 days is sufficiently short that most genuinely
>> planned interventions (e.g. a storage upgrade) can be accurately scheduled. 5
>> days is sufficiently long that if some incident does cause a problem, the site just
>> deals with it, rather than trying to hobble along simply so the downtime can be
>> considered scheduled.
>>
>> Alastair
>>
>>
>>> On 16 May 2017, at 17:14, Stephen Jones <[log in to unmask]> wrote:
>>>
>>> Hi,
>>>
>>> It's interesting. We can't predict when some scary exploit will jump up at us, so
>>> we can't pledge to give much notice. So perhaps we should say that some types
>>> of unscheduled downtimes are perfectly acceptable and are no reflection on site
>>> performance. As others say, if unscheduled downtimes are always taken to be a
>>> reflection on site performance, then the temptation is to delay the adoption of
>>> patches in order to give plenty of notice. And this would be a case of throwing
>>> out the baby with the bathwater.
>>>
>>> (On a related issue: it's easy to patch a cluster without taking it down for a
>>> significant time, by doing batches of nodes in series. I have a little script, called
>>> snakey.pl, that I bring out when I need to put a new kernel on our Condor
>>> cluster.)
>>>
>>> Cheers,
>>>
>>> Ste
>>>
>>>
>>>
>>>
>>>
>>> On 16/05/17 15:11, Ian Collier wrote:
>>>> There should be an opportunity to discuss this at the Manchester Workshop.
>>>>
>>>> —Ian
>>>>
>>>>> On 16 May 2017, at 11:56, Peter Gronbech <[log in to unmask]> wrote:
>>>>>
>>>>> I noticed that there is a proposal to change the downtime declaration to 5
>>>>> days before an intervention, up from 24 hours.
>>>>> See slide 6 of
>>>>> https://indico.egi.eu/indico/event/3237/contribution/4/material/slides/0.pdf
>>>>>
>>>>> This would mean interventions declared less than 5 days in advance will be
>>>>> considered unscheduled.
>>>>> I'm not sure if this has been approved yet though.
>>>>>
>>>>> Cheers Pete
>>>>>
>>>>>
>>>>> --
>>>>> ----------------------------------------------------------------------
>>>>> Peter Gronbech GridPP Project Manager Tel No. : 01865 273389
>>>>>
>>>>> Department of Particle Physics,
>>>>> University of Oxford,
>>>>> Keble Road, Oxford OX1 3RH, UK E-mail : [log in to unmask]
>>>>> ----------------------------------------------------------------------
>>>
>>>
>>> --
>>> Steve Jones [log in to unmask]
>>> Grid System Administrator office: 220
>>> High Energy Physics Division tel (int): 43396
>>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
>>> University of Liverpool http://www.liv.ac.uk/physics/hep/
>