Hi guys,

I have a ticket open that I have to fix.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=121092


Has anybody seen this problem, especially anybody who runs ARC or Condor
(or, better still, anybody who runs ARC AND Condor)? Here is the rundown.

At Liverpool, MCORE jobs fail because they build up a ResidentSetSize 
that is more than their declared JobMemoryLimit. So condor gets rid of 
them. Look at this output. First, I get the Finished record of a job 
that failed (found from bigpanda.cern.ch):

# grep JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go /var/log/arc/gm-jobs.log

And it tells me LRMS error: (271) job killed: vmem:

 > Finished - job id: JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go, ..
 > owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1",
 > lrms: condor, queue: grid, lrmsid: 2637369.hepgrid2.ph.liv.ac.uk, failure: "LRMS error: (271) job killed: vmem."
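
If the same check has to be repeated for many failed jobs, the lrmsid and the failure text can be pulled out of the Finished record with a few lines of Python. This is only a sketch: it assumes the record has been joined onto one line and that the "lrmsid:" and "failure:" fields always look like the ones above. The sample below is abridged (owner field dropped), and parse_finished is a made-up helper name, not part of ARC.

```python
import re

# Sample record copied from the output above, joined onto one line and
# abridged (the owner field is left out).
RECORD = (
    'Finished - job id: JxMNDm07vFonwOMCrq6pnv5nABFKDmABFKDmwfFKDmFBFKDmuQC3Go, .. '
    'lrms: condor, queue: grid, lrmsid: 2637369.hepgrid2.ph.liv.ac.uk, '
    'failure: "LRMS error: (271) job killed: vmem."'
)

def parse_finished(line):
    """Return (lrmsid, failure) from a Finished record, or None if absent."""
    m = re.search(r'lrmsid: (\S+?), failure: "([^"]*)"', line)
    return (m.group(1), m.group(2)) if m else None

lrmsid, failure = parse_finished(RECORD)
print(lrmsid)    # the id to hand to condor_history
print(failure)
```

Records without a failure field (jobs that finished cleanly) simply return None under this pattern.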

So I get the history of that job, and it tells me it had a 
JobMemoryLimit of 16384000, and a ResidentSetSize_RAW of 16405624. Which 
is more...

# condor_history -long 2637369.hepgrid2.ph.liv.ac.uk | grep -e JobMemoryLimit -e PeriodicRemove -e ResidentSetSize_RAW
 > JobMemoryLimit = 16384000          # Note: limit is 2000 MiB * 1024 KiB/MiB * 8 cores, in KiB
 > ResidentSetSize_RAW = 16405624
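
To put numbers on that, here is the arithmetic as a quick sketch. It assumes both attributes are in KiB (which is what makes 2000 MiB per core times 8 cores come out at 16384000); the variable names are only for illustration.

```python
KIB_PER_MIB = 1024

per_core_mib = 2000          # per-core request, from the note above
cores = 8                    # MCORE job
job_memory_limit = per_core_mib * KIB_PER_MIB * cores   # KiB (assumed unit)
rss_raw = 16405624           # ResidentSetSize_RAW from condor_history

overshoot_kib = rss_raw - job_memory_limit
print(job_memory_limit)                                   # 16384000
print(overshoot_kib)                                      # 21624
print(round(overshoot_kib / KIB_PER_MIB, 1))              # 21.1 (MiB)
print(round(100 * overshoot_kib / job_memory_limit, 2))   # 0.13 (%)
```

So under that assumption the job was only about 21 MiB (0.13%) over a 16000 MiB limit when it was removed.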

Since 16405624 is more than 16384000, the job is killed off (with the 
slightly misleading message about vmem). The reason is given in the 
history log:

# condor_history -long  2637369.hepgrid2.ph.liv.ac.uk | grep -e RemoveReason

 > RemoveReason = "The job attribute PeriodicRemove expression 'false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit' evaluated to TRUE"

I take this to mean that ResidentSetSize > JobMemoryLimit, so Condor's 
periodic remove function killed the job.
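
To check which clause of that expression actually fired, the three real clauses (the leading 'false' never fires) can be evaluated one at a time. A minimal sketch in plain Python, not the HTCondor bindings: the memory numbers are the ones above, the CPU and wall-clock numbers are placeholders because they do not appear in the output quoted here, and ResidentSetSize_RAW is used as a stand-in for ResidentSetSize. tripped_clauses is a hypothetical helper name.

```python
def tripped_clauses(ad):
    """Return the names of the PeriodicRemove clauses that are true for ad."""
    clauses = {
        "cpu": ad["RemoteUserCpu"] + ad["RemoteSysCpu"] > ad["JobCpuLimit"],
        "walltime": ad["RemoteWallClockTime"] > ad["JobTimeLimit"],
        "memory": ad["ResidentSetSize"] > ad["JobMemoryLimit"],
    }
    return [name for name, hit in clauses.items() if hit]

job = {
    # From the condor_history output above.
    "JobMemoryLimit": 16384000,
    "ResidentSetSize": 16405624,   # _RAW value used as a stand-in (assumption)
    # Placeholder values, NOT taken from the real job.
    "RemoteUserCpu": 1000.0,
    "RemoteSysCpu": 10.0,
    "JobCpuLimit": 100000,
    "RemoteWallClockTime": 2000.0,
    "JobTimeLimit": 100000,
}

print(tripped_clauses(job))   # ['memory']
```

With the real CPU and wall-clock attributes from condor_history -long filled in, the same function would show whether memory is the only clause responsible.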

This is happening to a lot of jobs, including single-core ones. Is anyone
else seeing anything like this? What's it all about? Let me know what
you think if you see this issue, or (especially) if you don't see it.

Cheers,

Ste










-- 
Steve Jones
Grid System Administrator               office: 220
High Energy Physics Division            tel (int): 43396
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 3396
University of Liverpool                 http://www.liv.ac.uk/physics/hep/