Hi Steve,
I thought you might be a victim of the Heavy Ion production this
morning, but looking at it that doesn't seem to be the case.
First you should make sure you really are not killing on vmem. Do
you have cgroups enabled?
Taking a random job among those that failed, it has the following
memory values:
http://bigpanda.cern.ch/job?pandaid=2838623139
maxpss  | 12270782
maxrss  | 15953796
maxswap |   327608
maxvmem | 25073408
maxpss is the value that counts and it is ~12GB (the value that
cgroups RSS would report). The maxrss listed here is the "standard"
one and includes double counting, though in this case it still seems
contained. These are the RSS values that sites without cgroups see,
so it may still be problematic. vmem is the only value that really
went over, and sites shouldn't cut on vmem.
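For a quick sanity check, here is a small sketch (assuming bigpanda
reports these values in KiB, which matches the ~12GB reading of
maxpss) that converts them to GiB:

```python
# Convert the bigpanda memory values to GiB, assuming they are in KiB
# (an assumption, but consistent with reading maxpss as ~12GB).
def kib_to_gib(kib):
    return kib / 1024**2

job = {"maxpss": 12270782, "maxrss": 15953796,
       "maxswap": 327608, "maxvmem": 25073408}
for key, kib in job.items():
    print(f"{key}: {kib_to_gib(kib):.1f} GiB")
# maxpss comes out at ~11.7 GiB, maxvmem at ~23.9 GiB
```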
Taking a different job with wilder values
http://bigpanda.cern.ch/job?pandaid=2838658355
maxpss  | 11884292
maxrss  | 21732960
maxswap |        0
maxvmem | 34439872
Again maxpss is ~12GB; a maxrss of that size would create problems
at any site that implements cuts without cgroups. vmem is through
the roof as usual.
So looking at these two jobs I'd say you don't have cgroups [1]. If
I'm wrong and you do have them, you may want to review the
configuration.
These jobs shouldn't be killed.
cheers
alessandra
[1] https://www.gridpp.ac.uk/wiki/Enable_Cgroups_in_HTCondor
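For reference, a minimal sketch of the condor_config fragment that
enables cgroup-based tracking (knob names from HTCondor 8.x; the
`htcondor` base cgroup name is an assumption, check the wiki page
above and your local setup):

```
# Sketch only: enable cgroup tracking in condor_config (HTCondor 8.x knobs)
BASE_CGROUP = htcondor                 # parent cgroup the startd puts jobs under
CGROUP_MEMORY_LIMIT_POLICY = soft      # soft|hard|none; soft reclaims only under pressure
```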
On 28/04/2016 17:56, Stephen Jones wrote:
Hi guys,
I have a ticket open that I have to fix.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=121092
Has anybody seen this problem, esp. anybody who runs ARC or Condor
(or esp. esp. anybody who runs ARC AND Condor)? This is the
rundown.
At Liverpool, MCORE jobs fail because they build up a
ResidentSetSize that is more than their declared JobMemoryLimit.
So condor gets rid of them. Look at this output. First, I get the
Finished record of a job that failed (found from
bigpanda.cern.ch):
# grep JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go
/var/log/arc/gm-jobs.log
And it tells me LRMS error: (271) job killed: vmem:
> Finished - job id:
JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go, ..
> owner: "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1",
> lrms: condor, queue: grid, lrmsid:
2637369.hepgrid2.ph.liv.ac.uk, failure: "LRMS error: (271) job
killed: vmem."
So I get the history of that job, and it tells me it had a
JobMemoryLimit of 16384000 and a ResidentSetSize_RAW of 16405624,
which is more...
# condor_history -long 2637369.hepgrid2.ph.liv.ac.uk | grep -e
JobMemoryLimit -e PeriodicRemove -e ResidentSetSize_RAW
> JobMemoryLimit = 16384000 # Note mem limit is 2000 MiB * 1024 *
8 cores, in KiB
> ResidentSetSize_RAW = 16405624
Since 16405624 is more than 16384000, the job is killed off (with
the slightly misleading message about vmem). The reason is given
in the history log:
# condor_history -long 2637369.hepgrid2.ph.liv.ac.uk | grep -e
RemoveReason
> RemoveReason = "The job attribute PeriodicRemove expression
'false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit ||
RemoteWallClockTime > JobTimeLimit || ResidentSetSize >
JobMemoryLimit' evaluated to TRUE"
I take this to mean that ResidentSetSize > JobMemoryLimit, so
Condor's periodic remove function killed the job.
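The check reduces to the memory clause for this job. A minimal
sketch with the numbers from condor_history above (values in KiB):

```python
# Sketch of the clause in Condor's PeriodicRemove expression that fired
# for this job; both values are KiB, taken from the condor_history output.
ResidentSetSize_RAW = 16405624      # what the job actually used
JobMemoryLimit = 2000 * 1024 * 8    # 2000 MiB/core in KiB, times 8 cores = 16384000

remove = ResidentSetSize_RAW > JobMemoryLimit
print(remove)  # True: the memory clause alone is enough to kill the job
```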
This is happening to a lot of jobs, including single-core ones. Is
anyone else seeing anything like this? What's it all about? Let me
know what you think if you see this issue, or (esp.) if you don't
see this issue.
Cheers,
Ste
--
Respect is a rational process. \\//
Fatti non foste a viver come bruti (Dante)