Hi guys,
I have a ticket open that I have to fix.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=121092
Has anybody seen this problem, esp. anybody who runs arc or condor (or
esp. esp. anybody who runs arc AND condor)? This is the rundown.
At Liverpool, MCORE jobs fail because they build up a ResidentSetSize
that is more than their declared JobMemoryLimit. So condor gets rid of
them. Look at this output. First, I get the Finished record of a job
that failed (found from bigpanda.cern.ch):
# grep JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go /var/log/arc/gm-jobs.log
And it tells me LRMS error: (271) job killed: vmem:
> Finished - job id: JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go, ..
> owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1",
> lrms: condor, queue: grid, lrmsid: 2637369.hepgrid2.ph.liv.ac.uk, failure: "LRMS error: (271) job killed: vmem."
So I get the history of that job, and it tells me it had a
JobMemoryLimit of 16384000 and a ResidentSetSize_RAW of 16405624, which
is more:
# condor_history -long 2637369.hepgrid2.ph.liv.ac.uk | grep -e JobMemoryLimit -e PeriodicRemove -e ResidentSetSize_RAW
> JobMemoryLimit = 16384000 # Note mem limit is 2000 MiB * 1024 KiB/MiB * 8 cores
> ResidentSetSize_RAW = 16405624
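As a sanity check on those two numbers, here is a small sketch of the arithmetic. It assumes both attributes are expressed in KiB (that is the only reading under which a 2000 MiB per core request on 8 cores gives exactly 16384000); the variable names are mine, not Condor's.

```shell
# Assumption: JobMemoryLimit and ResidentSetSize_RAW are both in KiB.
mem_per_core_mib=2000   # per-core memory request, in MiB
cores=8                 # MCORE job

# 2000 MiB * 1024 KiB/MiB * 8 cores
limit_kib=$((mem_per_core_mib * 1024 * cores))

rss_kib=16405624        # ResidentSetSize_RAW from the job history
over_kib=$((rss_kib - limit_kib))

echo "limit=${limit_kib} KiB rss=${rss_kib} KiB over=${over_kib} KiB"
```

If the KiB assumption holds, the job was over its limit by 21624 KiB, i.e. roughly 21 MiB across all 8 cores.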
Since 16405624 is more than 16384000, the job is killed off (with the
slightly misleading message about vmem). The reason is given in the
history log:
# condor_history -long 2637369.hepgrid2.ph.liv.ac.uk | grep -e RemoveReason
> RemoveReason = "The job attribute PeriodicRemove expression 'false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit' evaluated to TRUE"
I take this to mean that ResidentSetSize > JobMemoryLimit, so Condor's
periodic remove function killed the job.
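To make that reading concrete, here is a rough shell imitation of the expression. It is only a sketch: `periodic_remove` is a made-up helper, not anything Condor provides, the CPU and wall-clock figures are placeholders kept below their limits (I have not quoted the real ones), and I feed it the ResidentSetSize_RAW value as a stand-in for ResidentSetSize.

```shell
# Imitates: false || cpu > cpu_limit || wall > wall_limit || rss > mem_limit
# Arguments: cpu cpu_limit wall wall_limit rss mem_limit (integers)
periodic_remove() {
    cpu=$1; cpu_limit=$2; wall=$3; wall_limit=$4; rss=$5; mem_limit=$6
    if [ "$cpu" -gt "$cpu_limit" ] || [ "$wall" -gt "$wall_limit" ] || [ "$rss" -gt "$mem_limit" ]; then
        echo TRUE
    else
        echo FALSE
    fi
}

# CPU and wall-clock values are placeholders, deliberately under their limits,
# so only the memory clause can trigger the removal.
result=$(periodic_remove 100 1000 100 1000 16405624 16384000)
echo "$result"
```

With the other two clauses held under their limits, the memory comparison alone is enough to make the whole expression TRUE, which matches the RemoveReason.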
This is happening to a lot of jobs, inc. single core ones. Is anyone
else seeing anything like this? What's it all about? Let me know what
you think if you see this issue, or (esp.) if you Don't see this issue.
Cheers,
Ste
--
Steve Jones
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/