Hi
Last week at the ADC weekly there was a discussion of ATLAS analysis job durations, and the max time limit was set to 12 h. A user can increase this limit, but 95 % of jobs will run for < 12 h.
Setting a site queue offline 120 h before the downtime (compared with a 12 h max job duration) looks like too much to me.
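
To make the comparison concrete, here is a rough sketch of the proposed cut-offs using only the figures quoted in this thread (5 days and 24 h lead times for downtimes longer than 24 h, 12 h max analysis job duration). The function and constant names and the example date are made up for illustration; this is not the switcher's actual code or configuration.

```python
from datetime import datetime, timedelta

# Figures quoted in this thread; all names here are illustrative assumptions.
LONG_DOWNTIME = timedelta(hours=24)     # downtimes longer than this get the early drain
ANALYSIS_LEAD = timedelta(days=5)       # analysis jobs stop 5 days (120 h) before
ALL_JOBS_LEAD = timedelta(hours=24)     # all new ATLAS jobs stop 24 h before
MAX_ANALYSIS_JOB = timedelta(hours=12)  # default max analysis job duration


def drain_cutoffs(start, duration):
    """Return (analysis_cutoff, all_jobs_cutoff) for a long downtime, else None."""
    if duration <= LONG_DOWNTIME:
        # Handling of shorter downtimes is not described in this thread.
        return None
    return start - ANALYSIS_LEAD, start - ALL_JOBS_LEAD


start = datetime(2013, 12, 2, 9, 0)  # made-up downtime start
cutoffs = drain_cutoffs(start, timedelta(hours=48))
# Dividing two timedeltas gives a plain float: lead time in units of max job length.
lead_ratio = ANALYSIS_LEAD / MAX_ANALYSIS_JOB
print(cutoffs, lead_ratio)
```

With these numbers the analysis cut-off comes ten maximum-length analysis jobs ahead of the downtime.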
Elena
On 21 Nov 2013, at 15:23, Alastair Dewhurst wrote:
> Hi
>
> I am glad I was able to cheer you up.
>
> A drain of max queue length would only be sensible if the aim was to allow every job to finish. In most cases a failed job can be automatically retried or re-assigned to another site. ATLAS want to maximise throughput while reducing the chance of important data being stuck at a site. A pre-emptive drain allows the high priority work to finish and be copied away from the site.
>
> Alastair
>
>
>
> On 21 Nov 2013, at 15:02, Daniela Bauer <[log in to unmask]>
> wrote:
>
>> If it's a site wide outage, I only start draining the site once the
>> downtime has started - I can't do it before, as users (and the
>> dashboard - I'd get a ticket !) would rightly complain of not being
>> able to submit to a site that's allegedly up and running.
>> I find the ATLAS proposal hilarious, but each to their own. If you do
>> preemptive stopping of submissions, surely the parameter to be used
>> should be "max queue length".
>> Given that we have 48 h queues, our scheduled downtimes are at least 2
>> days by construction if it's an invasive procedure.
>>
>> Cheers,
>> Daniela
>>
>> On 21 November 2013 14:56, Alastair Dewhurst <[log in to unmask]> wrote:
>>> Hi
>>>
>>> You may be aware that ATLAS have a system known as the switcher, which automatically drains ATLAS jobs from sites before they go into a scheduled downtime. This has worked fairly well, although in a few cases of long site downtimes users were caught out and "important" physics work was delayed.
>>>
>>> Work has been in progress to improve the switcher and one additional feature was to treat long downtimes (> 24 hours) differently. At the ATLAS meeting on Tuesday the following presentation was given:
>>> https://indico.cern.ch/getFile.py/access?contribId=10&resId=1&materialId=slides&confId=283853
>>>
>>> A summary of the talk is that for a scheduled downtime (on the SE) lasting more than 24 hours, the site would stop receiving analysis jobs 5 days beforehand and would not receive any new ATLAS jobs 24 hours beforehand. I feel this is excessive and could cause problems for users who would like to use their local site but are blocked because their site is going into downtime next week.
>>>
>>> I would be interested in hearing feedback from sites, in particular:
>>> - Do sites pay any attention to how ATLAS drain their work beforehand?
>>> - How often do you schedule a downtime longer than 24 hours, and how long do these actually end up lasting?
>>> - How do you plan your downtimes? Do you factor in a drain yourself? Are you cautious when declaring downtimes, knowing that it is easier to end early than to extend into an unscheduled downtime?
>>>
>>> The aim should be that the ATLAS system works with the way that (the majority of) sites work, rather than sites having to work around the ATLAS system.
>>>
>>> Thank you.
>>>
>>> Alastair
>>
>>
>>
>> --
>> Sent from the pit of despair
>>
>> -----------------------------------------------------------
>> [log in to unmask]
>> HEP Group/Physics Dep
>> Imperial College
>> London, SW7 2BW
>> Tel: +44-(0)20-75947810
>> http://www.hep.ph.ic.ac.uk/~dbauer/
__________________________________________________
Dr Elena Korolkova
Email: [log in to unmask]
Tel.: +44 (0)114 2223553
Fax: +44 (0)114 2223555
Department of Physics and Astronomy
University of Sheffield
Sheffield, S3 7RH, United Kingdom