On 3 April 2012 15:51, Alessandra Forti <[log in to unmask]> wrote:

I *think* that what ATLAS want, given the note about "sites that don't kill on vmem don't need to worry", is that they simply be allowed to run processes with 4GB of mapped address space.
yes, that's the final aim of course. However they have had enough problems in the past, either with batch systems configured to kill or with nodes without enough virtual memory.

Sure. The problem with the initial request is that it conflates these problems (and treats the second confusingly), and so leads to a mixed message for sites.
(The graph from the slide doesn't help, as it implies that ATLAS MC is fine with an average of 1GB / core of physical memory.)
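
(As an aside, a minimal sketch of the distinction, assuming a 64-bit Linux worker node and nothing beyond Python's standard mmap/resource modules; purely illustrative, not an ATLAS or batch-system tool:)

    # Map a large address space without touching the pages: the process's
    # vmem (mapped address space) grows by ~4 GB while its resident
    # (physical) memory stays small.
    import mmap
    import resource

    FOUR_GB = 4 * 1024 ** 3

    # Uncommenting the next line emulates a site that "kills on vmem":
    # with the address-space limit capped below 4 GB, the mmap call below
    # fails even though almost no physical memory is being used.
    # resource.setrlimit(resource.RLIMIT_AS, (2 * 1024 ** 3, 2 * 1024 ** 3))

    region = mmap.mmap(-1, FOUR_GB)  # anonymous mapping, pages untouched

    usage = resource.getrusage(resource.RUSAGE_SELF)
    print("mapped address space: ~4 GB")
    print("peak resident memory: %.1f MB" % (usage.ru_maxrss / 1024.0))  # ru_maxrss is in kB on Linux

    region.close()

Run under a batch system that enforces a vmem limit below 4 GB (emulated here by the commented-out RLIMIT_AS line), the mapping fails outright, even though the job's physical-memory footprint is only a few megabytes.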
 
That said, if the ATLAS historic dashboard can be trusted, all UK sites have run reco jobs in the past year to varying degrees and I cannot see specific memory error problems. Some have been killed by the batch system, though:

http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UK&activities=reco&sitesSort=8&start=2011-02-01&end=2012-04-03&timeRange=daily&sortBy=0&granularity=Monthly&generic=0&series=All&type=pfe

It'd be interesting to know why.

Certainly *some* jobs have been killed by Glasgow's batch system for exceeding the wallclock limit on a queue. Not many of them, and usually as a result of something else going wrong that actually broke the job...

Sam


cheers
alessandra