Hi Steve,

I thought you might have been a victim of the Heavy Ion production this morning, but looking at it that doesn't seem to be the case.

First, you should make sure you really are not killing jobs on vmem. Do you have cgroups enabled?
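
A quick way to check, assuming you can run condor_config_val on a worker node (parameter defaults vary between HTCondor versions, so treat this as a sketch rather than a definitive test):

# condor_config_val BASE_CGROUP
# condor_config_val CGROUP_MEMORY_LIMIT_POLICY

If BASE_CGROUP comes back undefined the starter isn't putting jobs into cgroups at all, and if CGROUP_MEMORY_LIMIT_POLICY is none there is no cgroup-based memory limit even where tracking is on.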

Taking a random job from among those that failed, it has the following memory values (in kB):

http://bigpanda.cern.ch/job?pandaid=2838623139

maxpss 12270782
maxrss 15953796
maxswap 327608
maxvmem 25073408

maxpss is the value that counts, and it is ~12 GB (it is what the RSS reported by cgroups would look like). The maxrss listed here is the "standard" one and double-counts the memory shared between the job's processes, though in this case it still seems contained. This is the RSS value that sites without cgroups see, so it may still be problematic. vmem is the only value that really went over the limit, and sites shouldn't cut on vmem.
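
If you want to see the double counting directly on a worker node, something along these lines does it (a rough sketch; "athena" is just a placeholder for whatever the payload processes are actually called):

# for pid in $(pgrep athena); do awk '/^Pss:/{p+=$2} /^Rss:/{r+=$2} END{print p, r}' /proc/$pid/smaps; done | awk '{p+=$1; r+=$2} END{print "PSS total:", p, "kB   RSS sum:", r, "kB"}'

The summed RSS counts every page shared between the worker processes once per process, while PSS divides each shared page among the processes mapping it, which is why maxrss comes out so much larger than maxpss for an 8-core job.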

Taking a different job with wilder values:

http://bigpanda.cern.ch/job?pandaid=2838658355

maxpss 11884292
maxrss 21732960
maxswap 0
maxvmem 34439872

Again maxpss is around ~12 GB, but a maxrss of that size (~21 GB) would create problems at any site that implements memory cuts without cgroups. vmem is through the roof as usual.

So, looking at these two jobs, I'd say you don't have cgroups enabled [1]. If I'm wrong and you do have them, you may want to review the configuration.
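
For reference, the core of what the wiki page below describes is roughly this in the worker node condor configuration (a sketch, not a drop-in config; check it against your HTCondor version and the wiki itself):

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

plus creating the htcondor cgroup on the node (via cgconfig, if I remember the wiki correctly) and restarting condor on the WNs. With the soft policy jobs are only pushed back towards their requested memory when the node is actually short of RAM.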

These jobs shouldn't be killed.


cheers
alessandra

[1] https://www.gridpp.ac.uk/wiki/Enable_Cgroups_in_HTCondor


On 28/04/2016 17:56, Stephen Jones wrote:
Hi guys,

I have a ticket open that I have to fix.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=121092


Has anybody seen this problem, especially anybody who runs ARC or HTCondor (or, especially, anybody who runs ARC AND HTCondor)? This is the rundown.

At Liverpool, MCORE jobs fail because they build up a ResidentSetSize that is more than their declared JobMemoryLimit. So condor gets rid of them. Look at this output. First, I get the Finished record of a job that failed (found from bigpanda.cern.ch):

# grep JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go /var/log/arc/gm-jobs.log

And it tells me LRMS error: (271) job killed: vmem:

> Finished - job id: JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go, ..
> owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1",
> lrms: condor, queue: grid, lrmsid: 2637369.hepgrid2.ph.liv.ac.uk, failure: "LRMS error: (271) job killed: vmem."

So I get the history of that job, and it tells me it had a JobMemoryLimit of 16384000, and a ResidentSetSize_RAW of 16405624. Which is more...

# condor_history -long  2637369.hepgrid2.ph.liv.ac.uk | grep -e JobMemoryLimit -e PeriodicRemove -e ResidentSetSize_RAW
> JobMemoryLimit = 16384000          # Note: limit is 2000 MiB/core * 1024 KiB/MiB * 8 cores = 16384000 KiB
> ResidentSetSize_RAW = 16405624

Since 16405624 is more than 16384000, the job is killed off (with the slightly misleading message about vmem). The reason is given in the history log:

# condor_history -long  2637369.hepgrid2.ph.liv.ac.uk | grep -e RemoveReason

> RemoveReason = "The job attribute PeriodicRemove expression 'false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit' evaluated to TRUE"

I take this to mean that ResidentSetSize exceeded JobMemoryLimit, so Condor's PeriodicRemove expression killed the job.
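
In case it helps anyone to see how widespread this is at their own site, something like the following should list recently finished jobs where the RSS went over the limit (assuming a condor_history new enough to understand -constraint, -limit and -af):

# condor_history -constraint 'ResidentSetSize_RAW > JobMemoryLimit' -limit 50 -af ClusterId RequestCpus JobMemoryLimit ResidentSetSize_RAW

That also shows whether jobs are only marginally over the limit, as above, or well past it.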

This is happening to a lot of jobs, including single-core ones. Is anyone else seeing anything like this? What's it all about? Let me know what you think if you see this issue, or (especially) if you don't see it.

Cheers,

Ste

-- 
Respect is a rational process. \\//
Fatti non foste a viver come bruti (Dante)