Hi Alastair,

I think at a certain point you promised a document on the decision to move to ARC/HTCondor with pros and cons. I can't remember if you ever sent it. In case you did, could you send it again please?

cheers
alessandra

On 19/09/2013 11:56, Alastair Dewhurst wrote:
> Hi
>
> AthenaMP, the ATLAS multi-process software, is being designed to use any
> number of cores. However, all the ATLAS multi-core queues set up across
> the grid so far have been configured to specify 8 cores.
>
> While I am afraid I cannot find the documentation to back it up, I
> believe the WLCG 'agreed' that experiments should be able to request
> 4n cores (where n is an integer). Even if this wasn't agreed, I
> believe this is what ATLAS have adopted and for the moment are happy
> with 8 cores. I cannot predict what ATLAS will do with certainty, but
> for those sites that primarily support ATLAS, if they were going to
> look into multi-core jobs, I would suggest working on ways to
> dynamically allocate 8-core jobs when ATLAS occasionally need them,
> rather than set up a dedicated whole-node queue with dedicated
> resources. The dynamic job allocation was one of the requirements we
> looked at when choosing HTCondor.
>
> Anyway, it is certainly something to discuss at a technical meeting!
>
> Alastair
>
>
> On 19 Sep 2013, at 11:21, Sam Skipsey <[log in to unmask]> wrote:
>
>> On 19 September 2013 10:48, Christopher J. Walker
>> <[log in to unmask]> wrote:
>>
>> On 19/09/13 10:12, Andrew Lahiff wrote:
>>
>> Hi,
>>
>> For the record, for the new batch system at RAL we have multicore queues
>> on our CREAM CEs (currently configured to use 8 cores). However, on the
>> ARC CEs jobs can request exactly how many cores (and how much memory)
>> they need rather than having to use a specific queue.
>>
>> I believe it is perfectly possible to do this with CREAM too
>> (though I'm not sure we have it set up on all CEs).
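For illustration of the two mechanisms mentioned above: on the ARC CEs a job describes its own core and memory needs in xRSL rather than targeting a dedicated queue, and on the batch side HTCondor's partitionable slots let the negotiator carve each worker node into dynamic slots sized to whatever arrives. Both fragments below are hedged sketches, not RAL's actual configuration; the executable name and memory values are hypothetical, and exact knobs may vary by HTCondor/ARC version and site setup.

```
(* xRSL sketch: a job requesting exactly 8 cores and its own memory,
   submitted to an ARC CE with no multicore-specific queue *)
&(executable="run_athenamp.sh")
 (count=8)
 (memory=2000)

# condor_config sketch for a worker node: advertise one partitionable
# slot covering all resources; HTCondor splits off dynamic slots as
# single-core and 8-core jobs arrive, so nothing is pinned to a
# dedicated whole-node queue.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

A local HTCondor job would express the same request with `request_cpus = 8` (and `request_memory`) in its submit description, and be matched to a dynamic slot of that size.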
>>
>> Sure, it's basically how our MPI support works at Glasgow, IIRC, and
>> we're entirely behind CREAM CEs.
>>
>> IIRC, when I asked the experiments why this wasn't sufficient and
>> they wanted to auto-discover how big the slot they had got was,
>> it was because on a 12-core node, if you have jobs requesting 8
>> slots, they may actually end up on a 12-slot machine - and if
>> they know this they can make use of the extra slots they discover
>> they have.
>>
>> Well, this is the difference between "whole node queues" and
>> "multicore queues" (and between shared-memory multicore queues and
>> message-passing multicore queues). Our MPI support provides precisely
>> that - so a job can request any number of cores and it'll get them,
>> but almost never all on the same node, as they don't need to be for MPI.
>> (This seems sufficient for biomed, and indeed any other entity that
>> writes MPI-based code.)
>>
>> We don't support OpenMP-style shared-memory parallelism where you
>> require N slots all on the same node. (This makes the scheduling
>> problem harder, as has been discussed before.)
>>
>> ATLAS/CMS seem to want whole-node queues, in which case, if you
>> assume a whole-node queue is sensible a priori, it is reasonable for
>> them to want to know how big the node they'll get is in advance.
>> (This would be particularly pessimal for an 8-slot job arriving on a
>> 64-slot node, for example.)
>> We did have a whole-node queue at Glasgow for testing (Andy Washbrook
>> used this), but the scheduling was exceedingly pessimal (as running
>> 10 jobs would offline 10 nodes, without checking how big the nodes
>> were first... so if it hit a 64-core node...) so we turned it off.
>>
>> Sam
>>
>> This is how both ATLAS and CMS are running multicore jobs at RAL now.
>> Condor is then responsible for scheduling the mix of single and
>> multicore jobs.
>>
>> Chris
>>
>> Regards,
>> Andrew.
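The distinction Sam draws - cores anywhere (MPI), cores all on one node (shared memory), or a whole node - can be expressed in CREAM's JDL. The following is a sketch only; which of these attributes a given CREAM CE honours depends on its version and the site's batch-system integration:

```
// MPI-style: 8 cores, free to be spread over several worker nodes
CPUNumber = 8;

// Shared-memory (AthenaMP-style): 8 cores that must land on one node
CPUNumber = 8;
SMPGranularity = 8;

// Whole-node request: one entire node, whatever its core count
WholeNodes = true;
HostNumber = 1;
```

The shared-memory form is the one that makes scheduling harder, as discussed above, since the batch system must find (or drain towards) 8 free cores on a single machine rather than 8 free cores anywhere.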
>>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Alessandra Forti
>> Sent: 19 September 2013 09:57
>> To: [log in to unmask]
>> Subject: Re: Technical Meetings
>>
>> Well, that's for testing. If it goes into production it will be
>> more than one node, and most of the things that are keeping
>> this back are operational, i.e. how not to waste resources and
>> how to do the accounting if a job requests a certain number
>> of CPUs or the whole node.
>>
>> cheers
>> alessandra
>>
>> On 18/09/2013 18:18, Christopher J. Walker wrote:
>>
>> On 18/09/13 17:22, David Colling wrote:
>>
>> Hi Alessandra,
>>
>> Yes, they are indeed. I only know of bits and pieces in the LHC
>> world but do know, for example, that our T2K colleagues make
>> extensive use of them. The Imperial T2K people code and debug locally
>> and then run on the RAL resources. This is proving so successful that
>> we are considering adding a bigger node - perhaps to the GridPP
>> cloud so that others could use it via OpenStack. These are at the
>> *ideas* stage at the moment, but if we did, would there be any takers,
>> or would we have just thrown away a chunk of money (or rather given
>> it to T2K, as I am sure that they would use them)?
>>
>> I guess that the question is: what should GridPP be doing about this?
>> I don't see it as our place to fund development in the individual
>> experiments, but should we be acting as a conduit for best practice?
>> Organising GooFit tutorials? Interacting with EGI as Stephen suggests?
>> What else? Is there a focus that we can develop with very little money?
>>
>> I think that these are questions for next Tuesday rather than Friday,
>> but I will add a specific discussion to the discussion agenda for this.
>>
>> QMUL now has a single-node MPI queue - Dan has more details.
>> What more does one need?
>>
>> Chris
>>
>> Best,
>> david
>>
>> On 18/09/13 14:41, Alessandra Forti wrote:
>>
>> Hi,
>>
>> multicore should become a reality at the end of LS1. We should
>> definitely have it as an activity.
>>
>> cheers
>> alessandra
>>
>> On 18/09/2013 14:12, David Colling wrote:
>>
>> Hi,
>>
>> I am just drawing up an agenda - is it worth having an item on
>> FTS3 (from Andrew L.)?
>>
>> Also the many- and multi-core activity. Is somebody able to describe
>> what has been happening in these two areas? Is this something that
>> we want to have as an activity in GridPP5?
>>
>> Best,
>> david
>>
>> --
>> Facts aren't facts if they come from the wrong people. (Paul Krugman)
>
> --
> Facts aren't facts if they come from the wrong people. (Paul Krugman)