Hi All,
Many of you will be aware that ATLAS currently have multiple PanDA Queues (PQs) per site for different types of jobs (e.g. multi/single core, high/low memory, production/analysis, short/long, and many other combinations). Recently, ATLAS finished development work on what they are calling a Unified PQ (because it was first used to unify single- and multi-core queues, the name UCORE has stuck). Rather than having multiple PQs per site, only a single PQ per compute resource will be required. This Unified PQ will be able to submit the same range of jobs to a batch system as the current many-PQ setup.
This should be a significant improvement for ATLAS: not only does it make the configuration more straightforward, it also means that jobs can be prioritised much more accurately. This is because PanDA treats each PQ as a separate resource. At the moment, if there are a bunch of low-priority single-core jobs and high-priority multi-core jobs, the batch system might prefer to run the single-core jobs rather than draining some slots and running the multi-core jobs ATLAS wants. With a Unified PQ, PanDA can stop the submission of single-core jobs if all it actually needs to run at a site are high-priority multi-core jobs. Note: this could mean that once a site is switched over to UCORE, it experiences more “spikes” in the types of job ATLAS submits.
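To make the scheduling problem concrete, here is a toy sketch (this is not PanDA or batch-system code; all names and numbers are invented for illustration). It models a greedy backfilling scheduler: with separate PQs both job streams are always queued, so single-core jobs soak up free cores and the slots never drain for the multi-core job; with a unified PQ, PanDA can withhold the single-core jobs and let the cores drain.

```python
def backfill(free_cores, queued):
    """Greedy backfill: start the highest-priority jobs that fit right now."""
    started = []
    for job in sorted(queued, key=lambda j: -j["priority"]):
        if job["cores"] <= free_cores:
            free_cores -= job["cores"]
            started.append(job["name"])
    return started

# 4 cores are free; an 8-core high-priority job cannot start yet.
mcore = {"name": "mcore-hi", "cores": 8, "priority": 900}
score = [{"name": f"score-{i}", "cores": 1, "priority": 100} for i in range(4)]

# Separate PQs: low-priority single-core jobs are always queued, so
# backfill fills the free cores and the slots never drain for mcore.
print(backfill(4, [mcore] + score))   # four single-core jobs start

# Unified PQ: PanDA sees only high-priority multi-core demand, stops
# feeding single-core jobs, and the site drains until 8 cores are free.
print(backfill(8, [mcore]))           # the multi-core job starts
```

The same greedy logic produces opposite outcomes purely because of what PanDA chooses to put in the queue, which is the point of unifying the PQs.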
So why am I telling you all this?
1) To identify if there are any sites that would benefit from having their configuration changed at a specific time. Reasons for this could be that you are migrating to a different batch system, or that you are about to migrate to CC/SL7/containers.
2) To identify edge cases. I am sure there is a good reason your site has decided to run HTCondor inside jobs submitted to PBS, just so you can provide docker containers to run VAC jobs… but the ATLAS developers may not have thought of this.
3) To identify if we could improve the current settings for any particular site. If you would like a certain type of job submitted, but it was always too much effort to set up a new PQ for it, then this may be the opportunity.
4) To ask sites what their plans are regarding VAC. If a site is running VAC and a batch system, then we will still need multiple PQs. For simplicity ATLAS would obviously prefer to submit to the minimum number of independent CPU resources. More importantly, if the CPU resources are being partitioned by the site, then we run into the same problem as before: low-priority jobs in one queue keeping out high-priority jobs in another. ATLAS would strongly prefer to have only one method of accessing each site's CPU resources, but if there must be more, then we intend to submit only multi-core jobs (i.e. the work that will be high priority) to VAC queues. We wanted to check what impact this might have on sites.
At the moment we have a UCORE queue set up and working for RAL-LCG2-ECHO (there may be others, but I can only claim to understand the RAL config!) and I intend to set up the next one at ECDF, because they will have multiple DDM endpoints that require a UCORE PQ to work. I am happy to have suggestions for what to do next. Also, please let me know if you have any questions/comments about what I have just said.
Thanks
Alastair