As Daniela notes, most sites schedule their long downtimes to include
the expected "draining" time (or, at a minimum, the time needed for
running pilots to complete). ATLAS' proposal will essentially
double-count this time.

Sam

On 21 November 2013 15:02, Daniela Bauer
<[log in to unmask]> wrote:
> If it's a site wide outage, I only start draining the site once the
> downtime has started - I can't do it before, as users (and the
> dashboard - I'd get a ticket!) would rightly complain of not being
> able to submit to a site that's allegedly up and running.
> I find the ATLAS proposal hilarious, but each to their own. If you do
> preemptive stopping of submissions, surely the parameter to be used
> should be "max queue length".
> Given that we have 48 h queues, our scheduled downtimes are at least 2
> days by construction if it's an invasive procedure.
>
> Cheers,
> Daniela
>
> On 21 November 2013 14:56, Alastair Dewhurst <[log in to unmask]> wrote:
>> Hi
>>
>> You may be aware that ATLAS have a system known as the switcher, which automatically drains ATLAS jobs from sites before they go into a scheduled downtime.  This has worked fairly well, although in a few cases of long site downtimes users were caught out and "important" physics work was delayed.
>>
>> Work has been in progress to improve the switcher and one additional feature was to treat long downtimes (> 24 hours) differently.  At the ATLAS meeting on Tuesday the following presentation was given:
>> https://indico.cern.ch/getFile.py/access?contribId=10&resId=1&materialId=slides&confId=283853
>>
>> A summary of the talk is that for a scheduled downtime (on the SE) lasting more than 24 hours, the site would stop receiving analysis jobs 5 days beforehand and would not receive any new ATLAS jobs 24 hours beforehand.  I feel this is excessive and could cause problems for users who would like to use their local site but are blocked because their site is going into downtime next week.
>>
>> I would be interested in hearing feedback from sites.  In particular, I would like to know:
>> - Do sites pay any attention to how ATLAS drain their work beforehand?
>> - How often do you schedule a downtime longer than 24 hours, and how long do these actually end up lasting?
>> - How do you plan your downtimes?  Do you factor in a drain yourself?  Are you cautious when declaring downtimes, knowing that it is easier to end early than to extend into an unscheduled downtime?
>>
>> The aim should be that the ATLAS system works with the way that (the majority of) sites work, rather than sites having to work around the ATLAS system.
>>
>> Thank you.
>>
>> Alastair
>
>
>
> --
> Sent from the pit of despair
>
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> London, SW7 2BW
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/