Hi Alastair,

I think at some point you promised a document on the decision to 
move to ARC/HTCondor with pros and cons. I can't remember if you ever 
sent it. If you did, could you send it again please?

cheers
alessandra


On 19/09/2013 11:56, Alastair Dewhurst wrote:
> Hi
>
> AthenaMP, the ATLAS multi-process software, is being designed to use any 
> number of cores.  However, all the ATLAS multi-core queues set up across 
> the grid so far have been configured to specify 8 cores.
>
> While I am afraid I cannot find the documentation to back it up, I 
> believe the WLCG 'agreed' that experiments should be able to request 
> 4n cores (where n is an integer).  Even if this wasn't agreed, I 
> believe it is what ATLAS have adopted, and for the moment they are 
> happy with 8 cores.  I cannot predict what ATLAS will do with 
> certainty, but for those sites that primarily support ATLAS, if they 
> were going to look into multi-core jobs, I would suggest working on 
> ways to dynamically allocate 8-core jobs when ATLAS occasionally need 
> them, rather than set up a dedicated whole-node queue with dedicated 
> resources.  Dynamic job allocation was one of the requirements we 
> looked at when choosing HTCondor.
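>
> As a rough sketch (illustrative values only, not necessarily the exact 
> RAL setup), this kind of dynamic allocation relies on HTCondor 
> partitionable slots: each worker node advertises one large slot, which 
> HTCondor carves up to fit whatever mix of single-core and 8-core jobs 
> arrives.
>
>     # Worker-node configuration: one partitionable slot covering the
>     # whole machine.  HTCondor splits off dynamic slots sized to each
>     # job's request_cpus / request_memory.
>     NUM_SLOTS = 1
>     NUM_SLOTS_TYPE_1 = 1
>     SLOT_TYPE_1 = 100%
>     SLOT_TYPE_1_PARTITIONABLE = TRUE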
>
> Anyway it is certainly something to discuss at a technical meeting!
>
> Alastair
>
>
>
> On 19 Sep 2013, at 11:21, Sam Skipsey <[log in to unmask]> wrote:
>
>>
>>
>>
>> On 19 September 2013 10:48, Christopher J. Walker 
>> <[log in to unmask]> wrote:
>>
>>     On 19/09/13 10:12, Andrew Lahiff wrote:
>>
>>         Hi,
>>
>>         For the record, for the new batch system at RAL we have
>>         multicore queues
>>         on our CREAM CEs (currently configured to use 8 cores).
>>         However, on the
>>         ARC CEs jobs can request exactly how many cores (and how much
>>         memory)
>>         they need rather than having to use a specific queue.
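>>
>>         As a purely illustrative example (the script name and the 
>>         numbers below are made up), an ARC job description in xRSL 
>>         carries the core and memory request directly:
>>
>>             &(executable="run_athena.sh")
>>              (count=8)
>>              (memory=2000)
>>
>>         where count is the number of cores requested and memory is 
>>         given in MB.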
>>
>>
>>     I believe it is perfectly possible to do this with CREAM too
>>     (though I'm not sure we have it set up on all CEs).
>>
>>
>> Sure, it's basically how our MPI support works at Glasgow, IIRC, and 
>> we're entirely behind CREAM CEs.
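>>
>> (For reference, the relevant CREAM JDL attributes are CpuNumber and, 
>> for jobs that need all their cores on one node, SMPGranularity; the 
>> fragment below is only a sketch with made-up values.)
>>
>>     [
>>       Executable = "run_job.sh";
>>       CpuNumber = 8;        // total cores requested
>>       SMPGranularity = 8;   // require all 8 on one node; omit for an MPI-style spread
>>     ]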
>>
>>     IIRC, when I asked the experiments why this wasn't sufficient and
>>     why they wanted to auto-discover how big the slot they had got was,
>>     it was because on a 12 core node, if you have jobs requesting 8
>>     slots, they may actually end up on a 12 slot machine - and if
>>     they know this they can make use of the extra slots they discover
>>     they have.
>>
>>
>> Well, this is the difference between "whole node queues" and 
>> "multicore queues" (and between shared memory multicore queues and 
>> message passing multicore queues). Our MPI support provides precisely 
>> that - so a job can request any number of cores and it'll get them, 
>> but almost never all on the same node, as they don't need to be for MPI.
>> (This seems sufficient for biomed, and indeed any other entity that 
>> writes MPI-based code.)
>>
>> We don't support OpenMP style shared-memory parallelism where you 
>> require N slots all on the same node. (This makes the scheduling 
>> problem harder as has been discussed before.)
>>
>> ATLAS/CMS seem to want whole node queues, in which case, if you 
>> assume a whole node queue is sensible a priori, it is reasonable for 
>> them to want to know how big the node they'll get is in advance.
>> (This would be particularly pessimal for an 8 slot job arriving on a 
>> 64 slot node, for example.)
>> We did have a whole node queue at Glasgow for testing (Andy Washbrook 
>> used this), but the scheduling was exceedingly pessimal (as running 
>> 10 jobs would offline 10 nodes, without checking how big the nodes 
>> were first... so if it hit a 64 core node...) so we turned it off.
>>
>> Sam
>>
>>
>>         This is how both
>>         ATLAS and CMS are running multicore jobs at RAL now. Condor
>>         is then
>>         responsible for scheduling the mix of single and multicore jobs.
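>>
>>         On the submit side this just means each pilot asks for the 
>>         cores and memory it needs; a hypothetical HTCondor submit 
>>         file (names and values invented for illustration) would 
>>         contain something like:
>>
>>             universe       = vanilla
>>             executable     = atlas_mcore_pilot.sh
>>             request_cpus   = 8
>>             request_memory = 16000
>>             queue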
>>
>>
>>     Chris
>>
>>
>>
>>         Regards,
>>         Andrew.
>>
>>         -----Original Message-----
>>         From: Testbed Support for GridPP member institutes
>>         [mailto:[log in to unmask]] On Behalf Of Alessandra Forti
>>         Sent: 19 September 2013 09:57
>>         To: [log in to unmask]
>>         Subject: Re: Technical Meetings
>>
>>         Well, that's for testing. If it goes into production it will be
>>         more than one node, and most of the things that are keeping
>>         this back are operational, i.e. how not to waste resources and
>>         how to do the accounting if a job requests a certain number
>>         of CPUs or the whole node.
>>
>>         cheers
>>         alessandra
>>
>>         On 18/09/2013 18:18, Christopher J. Walker wrote:
>>
>>             On 18/09/13 17:22, David Colling wrote:
>>
>>                 Hi Alessandra,
>>
>>                 Yes they are indeedy. I only know of bits and
>>                 pieces in the LHC
>>                 world but do know, for example, that our T2K
>>                 colleagues make
>>                 extensive use of them. The Imperial T2K people code
>>                 and debug locally
>>                 and then run on the RAL resources. This is proving so
>>                 successful that
>>                 we are considering adding a bigger node - perhaps to
>>                 the GridPP
>>                 cloud so that others could use it via OpenStack.
>>                 These are at the
>>                 *ideas* stage at the moment, but if we did, would
>>                 there be any takers
>>                 or would we have just thrown away a chunk of money
>>                 (or rather given
>>                 it to T2K as I am sure that they would use them)?
>>
>>                 I guess that the question is what should GridPP be
>>                 doing about this?
>>                 I don't see it as our place to fund development in
>>                 the individual
>>                 experiments but should we be acting as a conduit for
>>                 best practice?
>>                 Organising GooFit tutorials? Interacting with EGI as
>>                 Stephen suggests?
>>                 What else? Is there a focus that we can develop with
>>                 very little money?
>>
>>                 I think that these are questions for next Tuesday
>>                 rather than Friday
>>                 but I will add a specific discussion to the
>>                 discussion agenda for this.
>>
>>             QMUL now has a single node MPI queue - Dan has more
>>             details. What more
>>             does one need?
>>
>>             Chris
>>
>>                 Best,
>>                 david
>>
>>                 On 18/09/13 14:41, Alessandra Forti wrote:
>>
>>                     Hi,
>>
>>                     multicore should become a reality at the end of
>>                     LS1. We should
>>                     definitely have it as an activity.
>>
>>                     cheers
>>                     alessandra
>>
>>                     On 18/09/2013 14:12, David Colling wrote:
>>
>>                         Hi,
>>
>>                         I am just drawing up an agenda; is it
>>                         worth having an item on
>>                         FTS3 (from Andrew L.)?
>>
>>                         Also the many- and multicore activity. Is
>>                         somebody able to describe
>>                         what has been happening in these two areas?
>>                         Is this something that
>>                         we want to have as an activity in GridPP5?
>>
>>                         Best,
>>                         david
>>
>>
>>
>>
>>         --
>>         Facts aren't facts if they come from the wrong people. (Paul
>>         Krugman)
>>
>>
>


-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)