I am not sure if I speak for every Tier-2 here but: Tier-2s are
strongly encouraged to keep their Reliability high. As Reliability is
the inverse of "time a site fails tests, while not being in
downtime", Tier-2s are effectively encouraged to never be in a state
where a test can fail and they aren't in downtime. This means that a
Tier-2 will generally try to fit both the time needed to drain queues
and the work itself inside its scheduled downtimes, so as to maintain
reliability.
So, yes, the metric as it stands means that a Tier-2 "should" schedule
a 48 hour downtime prior to a short downtime of an SE.
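
To put rough numbers on that incentive, here is a minimal sketch. It
assumes a simplified reading of the metric as described above (hours
passing tests divided by hours not in scheduled downtime); the
official calculation may differ in detail. The function name, the
30-day month and the 2 hour SE outage are illustrative assumptions,
not taken from any real monitoring tool.

```python
def reliability(total_h, scheduled_downtime_h, failed_outside_downtime_h):
    """Fraction of non-downtime hours in which the site passed its tests.

    Simplified, assumed form of the metric; not the official computation.
    """
    in_service_h = total_h - scheduled_downtime_h
    if in_service_h <= 0:
        raise ValueError("no in-service time in this period")
    return (in_service_h - failed_outside_downtime_h) / in_service_h


MONTH_H = 30 * 24  # 720 hours in an assumed 30-day month

# "Winging it": no downtime declared, tests fail for the 2 hour outage.
undeclared = reliability(MONTH_H, 0, 2)

# Declaring the 48 hour drain plus the 2 hour outage as scheduled downtime.
declared = reliability(MONTH_H, 50, 0)

print(f"undeclared: {undeclared:.4f}")
print(f"declared:   {declared:.4f}")
```

Under these assumptions the declared case scores a perfect 1.0 while
the undeclared one loses 2 hours out of 720, even though the site was
unusable for far longer in the declared case.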
As Daniela notes, however, the fact that this is all very unwieldy
means that for very short service outages, a site may simply avoid
draining queues or setting any downtime at all, "winging it".
This behaviour is a consequence of the metrics, and not necessarily
what anyone here would do under other circumstances.
Sam
On 21 November 2013 15:59, Alastair Dewhurst <[log in to unmask]> wrote:
> Hi
>
> It seems that the Tier 2s do not work in the same way as the Tier 1 with regard to downtimes. For the Tier 1, a downtime means the service is down. If you need to upgrade your storage element and this requires a short downtime of the service, are you saying you would schedule a 48 hour downtime and drain all your queues just for this?!
>
> At Tier 2s the only jobs that can only run at that site would be specific user analysis work. The assumption is that users can't think for themselves, so in advance of a downtime their work would be moved elsewhere, hence the long drain time for analysis jobs. The work that can run elsewhere would continue to run at the site and if it fails then it can always be run elsewhere.
>
> Alastair
>
>
>
> On 21 Nov 2013, at 15:26, Ewan MacMahon <[log in to unmask]>
> wrote:
>
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>> [log in to unmask]] On Behalf Of Alastair Dewhurst
>>
>>> I feel this is excessive and could cause problems for users who would like
>>> to use their local site but are blocked because their site is going into
>>> downtime next week.
>>>
>> Leaving aside the other points for a moment, surely the sensible logic
>> here would be to send jobs that can run elsewhere, elsewhere, but still
>> submit jobs that can only run at that one site? After all, if you submit
>> something and it doesn't work, you're no worse off than if you'd not
>> submitted it.
>>
>> Ewan